metadata
license: apache-2.0
tags:
- text-classification
- language-identification
library_name: fasttext
datasets:
- cis-lmu/GlotSparse
- cis-lmu/GlotStoryBook
metrics:
- f1
GlotLID
Description
GlotLID is a fastText language identification (LID) model that supports more than 1600 languages.
- Demo: huggingface
- Repository: github
- Paper: paper (EMNLP 2023)
- Point of Contact: amir@cis.lmu.de
How to use
Here is how to use this model to detect the language of a given text:
>>> import fasttext
>>> from huggingface_hub import hf_hub_download
>>> model_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin")
>>> model = fasttext.load_model(model_path)
>>> model.predict("Hello, world!")
If you prefer not to use huggingface_hub, download the model directly:
$ wget https://huggingface.co/cis-lmu/glotlid/resolve/main/model.bin
>>> import fasttext
>>> model = fasttext.load_model("/path/to/model.bin")
>>> model.predict("Hello, world!")
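The predict call returns a pair: a tuple of labels and an array of probabilities. The sketch below shows one way to turn that pair into readable results. It assumes the labels carry fastText's default __label__ prefix followed by a language code and a script joined by an underscore (for example __label__eng_Latn). The helper names parse_prediction and identify are illustrative and are not part of the model's API.

```python
def parse_prediction(labels, probs, prefix="__label__"):
    """Convert fastText (labels, probs) output into a list of dicts."""
    results = []
    for label, prob in zip(labels, probs):
        # Strip the fastText label prefix if it is present.
        code = label[len(prefix):] if label.startswith(prefix) else label
        # Assumed label layout: <language>_<script>, e.g. "eng_Latn".
        language, _, script = code.partition("_")
        results.append(
            {
                "language": language,
                "script": script or None,
                "confidence": float(prob),
            }
        )
    return results


def identify(model, text, k=3):
    """Return up to k language guesses for a single piece of text."""
    # fastText predicts one line at a time and rejects embedded newlines,
    # so collapse the input onto a single line first.
    single_line = " ".join(text.splitlines())
    labels, probs = model.predict(single_line, k=k)
    return parse_prediction(labels, probs)
```

With a model loaded as shown above, identify(model, "Hello, world!") returns a list of up to k dictionaries, each holding a language code, a script, and a confidence value.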
License
The model is distributed under the Apache License, Version 2.0.
Version
We keep previous versions of GlotLID in our repository.
To access a specific version, append the version number to the filename.
- For v1: model_v1.bin (introduced in the GlotLID paper and used in all experiments).
- For v2: model_v2.bin (an edited version of v1, featuring more languages and cleaned of noisy corpora based on the analysis of v1).
model.bin always refers to the latest version (v2).
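The naming scheme above can be wrapped in a small helper. This is a minimal sketch: the function names glotlid_filename and load_glotlid are hypothetical, and it assumes the fasttext and huggingface_hub packages from the usage section are installed.

```python
def glotlid_filename(version=None):
    """Map a version number to a model filename; None means the latest."""
    if version is None:
        return "model.bin"
    return f"model_v{int(version)}.bin"


def load_glotlid(version=None, repo_id="cis-lmu/glotlid"):
    """Download (or reuse a cached copy of) a GlotLID version and load it."""
    # Imported here so the filename helper works without these packages.
    import fasttext
    from huggingface_hub import hf_hub_download

    model_path = hf_hub_download(
        repo_id=repo_id, filename=glotlid_filename(version)
    )
    return fasttext.load_model(model_path)
```

For example, load_glotlid(1) would fetch model_v1.bin, while load_glotlid() would fetch model.bin. Pinning an explicit version may help keep results reproducible, since model.bin follows the latest release.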
References
If you use this model, please cite the following paper:
@inproceedings{
kargaran2023glotlid,
title={{GlotLID}: Language Identification for Low-Resource Languages},
author={Kargaran, Amir Hossein and Imani, Ayyoob and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing},
year={2023},
url={https://openreview.net/forum?id=dl4e3EBz5j}
}