metadata

license: apache-2.0
tags:
  - text-classification
  - language-identification
library_name: fasttext
datasets:
  - cis-lmu/GlotSparse
  - cis-lmu/GlotStoryBook
metrics:
  - f1

GlotLID

Description

GlotLID is a fastText language identification (LID) model that supports more than 1600 languages.

Demo: huggingface
Repository: github
Paper: paper (EMNLP 2023)
Point of Contact: amir@cis.lmu.de

How to use

Here is how to use this model to detect the language of a given text:

>>> import fasttext
>>> from huggingface_hub import hf_hub_download

>>> model_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin")
>>> model = fasttext.load_model(model_path)
>>> model.predict("Hello, world!")

If you prefer not to use huggingface_hub, download the model directly:

>>> ! wget https://huggingface.co/cis-lmu/glotlid/resolve/main/model.bin

>>> import fasttext

>>> model = fasttext.load_model("/path/to/model.bin")
>>> model.predict("Hello, world!")
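`model.predict` returns a pair: a tuple of label strings and an array of matching probabilities. The sketch below shows one way to turn that pair into readable records. It assumes the labels follow the pattern `__label__<ISO 639-3 code>_<script>` (for example `__label__eng_Latn`); `parse_prediction` is a hypothetical helper written for illustration, not part of fastText or GlotLID.

```python
from typing import List, Sequence, Tuple

LABEL_PREFIX = "__label__"  # fastText's default label prefix


def parse_prediction(
    labels: Sequence[str], probs: Sequence[float]
) -> List[Tuple[str, str, float]]:
    """Convert fastText output into (language, script, probability) tuples.

    Assumes each label looks like '__label__eng_Latn'; a label without an
    underscore after the prefix yields an empty script string.
    """
    results = []
    for label, prob in zip(labels, probs):
        # Drop the prefix if present, leaving e.g. 'eng_Latn'.
        code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        # Split on the first underscore into language code and script.
        lang, _, script = code.partition("_")
        results.append((lang, script, float(prob)))
    return results


# Usage with a loaded model (k=3 requests the three most likely labels):
#   labels, probs = model.predict("Hello, world!", k=3)
#   print(parse_prediction(labels, probs))
```

Note that fastText's `predict` processes one line at a time and raises an error if the input contains a newline, so replace newlines with spaces before calling it on multi-line text.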

License

The model is distributed under the Apache License, Version 2.0.

Version

We always maintain the previous version of GlotLID in our repository.

To access a specific version, add the version suffix to the filename, before the extension:

For v1: model_v1.bin (introduced in the GlotLID paper and used in all experiments).
For v2: model_v2.bin (an edited version of v1 that covers more languages and was cleaned of noisy corpora identified in the analysis of v1).

model.bin always refers to the latest version (v2).
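Because model.bin moves with each release, pinning a versioned filename keeps results reproducible. A minimal sketch of the naming scheme described above; `model_filename` and `load_glotlid` are hypothetical helpers, and the sketch assumes the versioned files sit in the same `cis-lmu/glotlid` repository as model.bin.

```python
from typing import Optional

REPO_ID = "cis-lmu/glotlid"


def model_filename(version: Optional[int] = None) -> str:
    """Return the filename for a GlotLID version; None means the latest."""
    if version is None:
        return "model.bin"
    return f"model_v{version}.bin"


def load_glotlid(version: Optional[int] = None):
    """Download (or reuse the cached copy of) the requested model and load it."""
    # Imported here so the naming helper works without these packages installed.
    import fasttext
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(repo_id=REPO_ID, filename=model_filename(version))
    return fasttext.load_model(path)


# Usage: load the model used in the paper's experiments.
#   model = load_glotlid(version=1)
#   model.predict("Hello, world!")
```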

References

If you use this model, please cite the following paper:

@inproceedings{
  kargaran2023glotlid,
  title={{GlotLID}: Language Identification for Low-Resource Languages},
  author={Kargaran, Amir Hossein and Imani, Ayyoob and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
  booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing},
  year={2023},
  url={https://openreview.net/forum?id=dl4e3EBz5j}
}