---
license: apache-2.0
tags:
- text-classification
- language-identification
library_name: fasttext
datasets:
- cis-lmu/GlotSparse
- cis-lmu/GlotStoryBook
metrics:
- f1
---

# GlotLID

[![GlotLID](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/spaces/cis-lmu/glotlid-space)

## Description

**GlotLID** is a fastText language identification (LID) model that supports more than **1600 languages**.

- **Demo:** [huggingface](https://huggingface.co/spaces/cis-lmu/glotlid-space)
- **Repository:** [github](https://github.com/cisnlp/GlotLID)
- **Paper:** [paper](https://arxiv.org/abs/2310.16248) (EMNLP 2023)
- **Point of Contact:** amir@cis.lmu.de



### How to use

Here is how to use this model to detect the language of a given text:

```python
>>> import fasttext
>>> from huggingface_hub import hf_hub_download

>>> model_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin")
>>> model = fasttext.load_model(model_path)
>>> model.predict("Hello, world!")
```

If you prefer not to use `huggingface_hub`, download the model directly:

```bash
wget https://huggingface.co/cis-lmu/glotlid/resolve/main/model.bin
```

```python
>>> import fasttext

>>> model = fasttext.load_model("/path/to/model.bin")
>>> model.predict("Hello, world!")
```
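
fastText's `predict` returns a pair of label strings and probabilities, and it raises an error if the input contains a newline. The sketch below shows one way to clean the input and unpack the result. The helper names `clean_text` and `parse_prediction` are hypothetical, and the `__label__<language>_<script>` label layout (for example `__label__eng_Latn`) is an assumption about the model's label set, so check it against your own predictions.

```python
LABEL_PREFIX = "__label__"  # prefix fastText puts on every label


def clean_text(text):
    """Collapse all whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


def parse_prediction(prediction):
    """Convert fastText's (labels, probabilities) pair into a list of dicts.

    Assumes labels look like "__label__eng_Latn" (language code, then script).
    """
    labels, probs = prediction
    results = []
    for label, prob in zip(labels, probs):
        code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        language, _, script = code.partition("_")
        results.append({"language": language, "script": script, "score": float(prob)})
    return results


# Illustrative input shaped like a fastText result (not real model output).
example = (("__label__eng_Latn",), [0.98])
print(parse_prediction(example))
```

With a loaded model, the same helpers could be applied as `parse_prediction(model.predict(clean_text(text), k=3))`, where `k` is fastText's parameter for the number of labels to return.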


## License

The model is distributed under the Apache License, Version 2.0.

## Version

We keep previous versions of GlotLID available in the repository.

To load a specific version, use the versioned file name as the `filename` argument:

- For v1: `model_v1.bin` (introduced in the GlotLID [paper](https://arxiv.org/abs/2310.16248) and used in all experiments).
- For v2: `model_v2.bin` (an updated version of v1 that covers more languages and excludes noisy corpora identified in the analysis of v1).

`model.bin` always refers to the latest version (v2).
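
As a minimal sketch of the naming scheme above, the hypothetical helper `glotlid_filename` maps a version number to the file name, and `download_glotlid` passes it to `hf_hub_download`. Both function names are illustrative and not part of GlotLID or `huggingface_hub`; only the file names listed above are known to exist.

```python
def glotlid_filename(version=None):
    """Return the file name for a GlotLID version.

    None means "latest", which is published as model.bin.
    """
    if version is None:
        return "model.bin"
    return f"model_v{int(version)}.bin"


def download_glotlid(version=None):
    """Download the requested version and return its local path."""
    # Imported lazily so glotlid_filename works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id="cis-lmu/glotlid", filename=glotlid_filename(version))


print(glotlid_filename(1))  # file name for the version used in the paper
print(glotlid_filename())   # file name for the latest version
```

Pinning an explicit version this way keeps results reproducible, since `model.bin` changes whenever a new version is released.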


## References

If you use this model, please cite the following paper:

```
@inproceedings{kargaran2023glotlid,
  title={{GlotLID}: Language Identification for Low-Resource Languages},
  author={Kargaran, Amir Hossein and Imani, Ayyoob and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
  booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing},
  year={2023},
  url={https://openreview.net/forum?id=dl4e3EBz5j}
}
```