---
license: apache-2.0
language:
- aak
- aau
- aaz
- aba
- abk
- abn
- abq
- abt
- abx
- aby
- abz
- aca
- acd
- ace
- acf
- ach
- acm
- acn
- acq
- acr
- acu
- ada
- ade
- adh
- adi
- adj
- adl
- ady
- adz
- aeb
- aer
- aeu
- aey
- afb
- afh
- afr
- agd
- agg
- agm
- agn
- agr
- agt
- agu
- agw
- agx
- aha
- ahk
- aia
- aii
- aim
- ain
- ajg
- aji
- ajp
- ajz
- aka
- akb
- ake
- akh
- akl
- akp
- ald
- alj
- aln
- alp
- alq
- als
- alt
- aly
- alz
- ame
- amf
- amh
- ami
- amk
- amm
- amn
- amp
- amr
- amu
- amx
- ang
- anm
- ann
- anv
- any
- aoc
- aoi
- aoj
- aom
- aon
- aoz
- apb
- apc
- ape
- apn
- apr
- apt
- apu
- apw
- apy
- apz
- ara
- arb
- are
- arg
- arh
- arl
- arn
- arp
- arq
- ars
- ary
- arz
- asg
- asm
- aso
- ast
- ata
- atb
- atd
- atg
- ati
- att
- auc
- aui
- auy
- ava
- avk
- avt
- avu
- awa
- awb
- awi
- awx
- aym
- ayo
- ayr
- azb
- aze
- azg
- azj
- azz
- bak
- bal
- bam
- ban
- bao
- bar
- bas
- bav
- bba
- bbb
- bbc
- bbj
- bbr
- bcc
- bch
- bci
- bcl
- bco
- bcw
- bdd
- bdh
- bea
- bef
- bel
- bem
- ben
- beq
- ber
- bex
- bfd
- bfo
- bfz
- bgr
- bgs
- bgz
- bhg
- bhl
- bho
- bhp
- bhw
- bib
- big
- bih
- bik
- bim
- bin
- bis
- biu
- biv
- bjn
- bjp
- bjr
- bjv
- bkd
- bkq
- bku
- bkv
- bla
- blh
- blw
- blz
- bmb
- bmh
- bmk
- bmq
- bmr
- bmu
- bnj
- bnp
- boa
- bod
- boj
- bom
- bon
- bos
- bov
- box
- bpr
- bps
- bpy
- bqc
- bqj
- bqp
- bre
- bru
- brx
- bsc
- bsn
- bsp
- bsq
- bss
- btd
- btg
- bth
- bts
- btt
- btx
- bua
- bud
- bug
- buk
- bul
- bum
- bus
- bvr
- bvy
- bvz
- bwd
- bwi
- bwq
- bwu
- bxh
- bxr
- byr
- byv
- byx
- bzd
- bzh
- bzi
- bzj
- bzt
- caa
- cab
- cac
- caf
- cag
- cak
- cao
- cap
- caq
- car
- cas
- cat
- cav
- cax
- cay
- cbc
- cbi
- cbk
- cbr
- cbs
- cbt
- cbu
- cbv
- cce
- cco
- ceb
- ceg
- cek
- ces
- cfm
- cgc
- cgg
- cha
- chd
- che
- chf
- chg
- chj
- chk
- chn
- cho
- chq
- chr
- chu
- chv
- chw
- chz
- cjk
- cjo
- cjp
- cjs
- cjv
- cjy
- ckb
- cko
- ckt
- cle
- clu
- cly
- cme
- cmi
- cmn
- cmo
- cnh
- cni
- cnl
- cnt
- cnw
- coe
- cof
- cok
- con
- cop
- cor
- cos
- cot
- cpa
- cpb
- cpc
- cpi
- cpu
- cpy
- crh
- cri
- crk
- crm
- crn
- crq
- crs
- crt
- crx
- csb
- csk
- cso
- csw
- csy
- cta
- ctd
- cto
- ctp
- ctu
- cub
- cuc
- cui
- cuk
- cul
- cut
- cux
- cwd
- cwe
- cwt
- cya
- cym
- czt
- daa
- dad
- daf
- dah
- dak
- dan
- dar
- ddg
- ddo
- ded
- des
- deu
- dga
- dgc
- dgi
- dgr
- dgz
- dhg
- dhm
- dhv
- dig
- dik
- din
- dip
- diq
- dis
- diu
- div
- dje
- djk
- djr
- dks
- dng
- dnj
- dob
- dop
- dow
- drg
- drt
- dru
- dsb
- dtp
- dts
- dua
- due
- dug
- duo
- dur
- dwr
- dws
- dww
- dyi
- dyo
- dyu
- dzo
- ebk
- efi
- egl
- eka
- ekk
- eko
- ell
- emi
- eml
- emp
- emx
- enb
- eng
- enl
- enm
- enx
- epo
- eri
- ese
- esi
- esk
- est
- esu
- eto
- etr
- etu
- eus
- eve
- evn
- ewe
- ewo
- ext
- eza
- faa
- fad
- fai
- fal
- fan
- fao
- fas
- fat
- ffm
- fij
- fil
- fin
- fkv
- fmp
- fon
- for
- fra
- frm
- fro
- frr
- fry
- fub
- fuc
- fud
- fue
- fuf
- fuh
- fuq
- fur
- fuv
- gaa
- gag
- gah
- gai
- gam
- gaw
- gaz
- gba
- gbi
- gbm
- gbo
- gbr
- gcf
- gcr
- gde
- gdg
- gdn
- gdr
- geb
- gej
- gfk
- ghe
- ghs
- gid
- gil
- giz
- gjn
- gkn
- gkp
- gla
- gle
- glg
- glk
- glv
- gmv
- gnb
- gnd
- gng
- gnn
- gnw
- goa
- gof
- gog
- gom
- gor
- gos
- got
- gqr
- grc
- grn
- grt
- gso
- gsw
- gub
- guc
- gud
- gug
- guh
- gui
- guj
- guk
- gul
- gum
- gun
- guo
- guq
- gur
- guw
- gux
- guz
- gvc
- gvf
- gvl
- gvn
- gwi
- gxx
- gya
- gym
- gyr
- hae
- hag
- hak
- hat
- hau
- hav
- haw
- hay
- hbo
- hbs
- hch
- hdn
- heb
- heg
- heh
- her
- hif
- hig
- hil
- hin
- hix
- hla
- hlt
- hmn
- hmo
- hmr
- hne
- hnj
- hnn
- hns
- hoc
- hop
- hot
- hra
- hrv
- hrx
- hsb
- hsn
- hto
- hub
- hui
- hun
- hus
- huu
- huv
- hvn
- hwc
- hye
- hyw
- ian
- iba
- ibg
- ibo
- icr
- ido
- idu
- ifa
- ifb
- ife
- ifk
- ifu
- ify
- ige
- ign
- igs
- iii
- ijc
- ike
- ikk
- ikw
- ilb
- ile
- ilo
- imo
- ina
- inb
- ind
- ino
- iou
- ipi
- iqw
- iri
- irk
- iry
- isd
- ish
- isl
- iso
- ita
- its
- itv
- ium
- ivb
- ivv
- iws
- ixl
- izh
- izr
- izz
- jac
- jae
- jam
- jav
- jbo
- jbu
- jdt
- jic
- jiv
- jmc
- jmx
- jpa
- jpn
- jra
- jvn
- kaa
- kab
- kac
- kal
- kam
- kan
- kao
- kap
- kaq
- kas
- kat
- kaz
- kbc
- kbd
- kbh
- kbm
- kbp
- kbq
- kbr
- kck
- kdc
- kde
- kdh
- kdi
- kdj
- kdl
- kea
- kei
- kek
- ken
- ket
- kew
- kex
- kez
- kff
- kgf
- kgk
- kgp
- kha
- khk
- khm
- khs
- khy
- khz
- kia
- kik
- kin
- kir
- kiu
- kix
- kjb
- kje
- kjh
- kjs
- kkc
- kki
- kkj
- kkl
- klj
- kln
- klt
- klv
- kma
- kmb
- kmg
- kmh
- kmk
- kmm
- kmo
- kmr
- kms
- kmu
- knc
- kne
- knf
- kng
- knj
- knk
- kno
- knv
- knx
- kny
- kog
- koi
- kom
- kon
- koo
- kor
- kos
- kpf
- kpg
- kpj
- kpr
- kpv
- kpw
- kpx
- kpz
- kqc
- kqe
- kqf
- kql
- kqn
- kqo
- kqp
- kqs
- kqw
- kqy
- krc
- kri
- krj
- krl
- kru
- ksb
- ksc
- ksd
- ksf
- ksh
- ksj
- ksp
- ksr
- kss
- ksw
- ktb
- ktj
- ktm
- kto
- ktu
- kua
- kub
- kud
- kue
- kuj
- kum
- kup
- kus
- kvj
- kvn
- kwd
- kwf
- kwi
- kwj
- kwn
- kwy
- kxc
- kxm
- kxw
- kyc
- kyf
- kyg
- kyq
- kyu
- kyz
- kze
- kzf
- kzj
- kzn
- laa
- lac
- lad
- lai
- laj
- lam
- lao
- las
- lat
- lav
- lbb
- lbe
- lbj
- lbk
- lch
- lcm
- ldi
- ldn
- lea
- led
- lee
- lef
- leh
- lem
- leu
- lew
- lex
- lez
- lfn
- lgm
- lhi
- lhm
- lhu
- lia
- lid
- lif
- lij
- lim
- lin
- lip
- lir
- lit
- liv
- ljp
- lkt
- llb
- lld
- lln
- lmk
- lmo
- lmp
- lob
- loe
- log
- lol
- lom
- loq
- lou
- loz
- lsi
- lsm
- ltg
- ltz
- lua
- lub
- lue
- lug
- lun
- luo
- lus
- lut
- lvs
- lwo
- lww
- lzh
- lzz
- maa
- mad
- maf
- mag
- mah
- mai
- maj
- mak
- mal
- mam
- maq
- mar
- mas
- mau
- mav
- maw
- max
- maz
- mbb
- mbc
- mbd
- mbf
- mbh
- mbi
- mbj
- mbl
- mbs
- mbt
- mca
- mcb
- mcd
- mcf
- mck
- mcn
- mco
- mcp
- mcq
- mcu
- mda
- mdf
- mdy
- med
- mee
- meh
- mej
- mek
- men
- meq
- mer
- meu
- mev
- mfa
- mfe
- mfh
- mfi
- mfk
- mfq
- mfy
- mfz
- mgc
- mgh
- mgm
- mgo
- mgr
- mgv
- mhi
- mhl
- mhr
- mhw
- mhx
- mhy
- mib
- mic
- mie
- mif
- mig
- mih
- mik
- mil
- min
- mio
- miq
- mir
- mit
- miy
- miz
- mjc
- mjw
- mkd
- mkl
- mkn
- mks
- mkz
- mlg
- mlh
- mlp
- mlt
- mlu
- mmn
- mmo
- mmx
- mna
- mnb
- mnc
- mnf
- mni
- mnk
- mnr
- mnw
- mnx
- mny
- moa
- moc
- mog
- moh
- mon
- mop
- mor
- mos
- mox
- mpg
- mph
- mpm
- mpp
- mps
- mpt
- mpx
- mqb
- mqj
- mqy
- mrg
- mri
- mrj
- mrq
- mrv
- mrw
- msa
- msb
- msc
- mse
- msk
- msm
- msy
- mta
- mtg
- mti
- mtj
- mto
- mtp
- mua
- mug
- muh
- mur
- mus
- mux
- muy
- mva
- mvn
- mvp
- mvv
- mwc
- mwf
- mwl
- mwm
- mwn
- mwp
- mwq
- mwv
- mww
- mxb
- mxp
- mxq
- mxt
- mxv
- mya
- myb
- myk
- myu
- myv
- myw
- myx
- myy
- mza
- mzh
- mzk
- mzl
- mzm
- mzn
- mzw
- mzz
- nab
- naf
- nah
- nak
- nan
- nap
- naq
- nas
- nau
- nav
- naw
- nba
- nbc
- nbe
- nbl
- nbq
- nbu
- nca
- nch
- ncj
- ncl
- nct
- ncu
- ncx
- ndc
- nde
- ndh
- ndi
- ndj
- ndo
- ndp
- nds
- ndz
- neb
- nep
- new
- nfa
- nfr
- ngb
- ngc
- ngl
- ngp
- ngt
- ngu
- nhd
- nhe
- nhg
- nhi
- nhk
- nho
- nhr
- nhu
- nhw
- nhx
- nhy
- nia
- nif
- nii
- nij
- nim
- nin
- niq
- niu
- niv
- niy
- njb
- njm
- njn
- njo
- njz
- nka
- nki
- nko
- nla
- nlc
- nld
- nlv
- nma
- nmf
- nmh
- nmo
- nmw
- nmz
- nnb
- nng
- nnh
- nnl
- nno
- nnp
- nnq
- nnw
- noa
- nob
- nog
- non
- nop
- nor
- not
- nou
- nov
- nph
- npi
- npl
- npo
- npy
- nre
- nrf
- nri
- nsa
- nse
- nsm
- nsn
- nso
- nss
- nst
- nsu
- ntp
- ntr
- nus
- nuy
- nvm
- nwb
- nwi
- nwx
- nxd
- nya
- nyf
- nyk
- nyn
- nyo
- nyu
- nyy
- nzb
- nzi
- nzm
- oar
- obo
- oci
- ofs
- ogo
- ojb
- oji
- ojs
- oke
- okv
- old
- omw
- ong
- ons
- ood
- opm
- ori
- orm
- orv
- ory
- osp
- oss
- ota
- ote
- otm
- otn
- otq
- ots
- otw
- oym
- ozm
- pab
- pad
- pag
- pah
- pam
- pan
- pao
- pap
- pau
- pbb
- pbc
- pbi
- pbl
- pbt
- pcd
- pck
- pcm
- pdc
- pdt
- pem
- pes
- pfe
- pfl
- phm
- phn
- pib
- pid
- pio
- pir
- pis
- pjt
- pkb
- plg
- pli
- pls
- plt
- plu
- plw
- pma
- pmf
- pms
- pmx
- pnb
- pne
- poe
- poh
- poi
- pol
- pon
- por
- pot
- pov
- poy
- ppk
- ppl
- ppo
- pps
- prf
- prg
- pri
- prk
- prs
- pse
- ptp
- ptu
- pua
- pus
- pwg
- pww
- qub
- quc
- que
- quf
- qug
- quh
- qul
- qup
- qus
- quw
- quy
- quz
- qva
- qvc
- qve
- qvh
- qvi
- qvm
- qvn
- qvo
- qvs
- qvw
- qvz
- qwh
- qxh
- qxl
- qxn
- qxo
- qxq
- qxr
- qya
- rad
- rai
- rap
- rar
- rcf
- rhg
- ria
- rif
- rim
- rkb
- rmc
- rme
- rml
- rmn
- rmo
- rmq
- rmy
- rnd
- rng
- rnl
- roh
- rom
- ron
- roo
- rop
- rro
- rtm
- rub
- rue
- ruf
- run
- rup
- rus
- rwo
- ryu
- sab
- sag
- sah
- san
- sas
- sat
- sba
- sbd
- sbe
- sbl
- sbs
- sby
- sck
- scn
- sco
- sda
- sdh
- sdo
- seh
- ses
- sey
- sfw
- sgb
- sgh
- sgs
- sgw
- sgz
- shi
- shk
- shn
- shp
- shr
- shs
- shu
- shy
- sid
- sig
- sil
- sim
- sin
- sja
- sjn
- skg
- skr
- sld
- slk
- sll
- slv
- sma
- sme
- smk
- sml
- smo
- smt
- sna
- snc
- snd
- snf
- snn
- snp
- snw
- sny
- soe
- som
- sop
- soq
- sot
- soy
- spa
- spl
- spm
- spp
- sps
- spy
- sqi
- srd
- sri
- srm
- srn
- srp
- srq
- srr
- ssd
- ssg
- ssw
- ssx
- stn
- stp
- stq
- sua
- suc
- sue
- suk
- sun
- sur
- sus
- sux
- suz
- swa
- swb
- swc
- swe
- swg
- swh
- swk
- swp
- sxb
- sxn
- syb
- syc
- syl
- szb
- szl
- tab
- tac
- tah
- taj
- tam
- tap
- taq
- tar
- tat
- tav
- taw
- tbc
- tbg
- tbk
- tbl
- tbo
- tby
- tbz
- tca
- tcc
- tcf
- tcs
- tcy
- tcz
- tdt
- tdx
- ted
- tee
- tel
- tem
- teo
- ter
- tet
- tew
- tfr
- tgk
- tgl
- tgo
- tgp
- tha
- thk
- thv
- tif
- tig
- tih
- tik
- tim
- tir
- tiv
- tiy
- tke
- tkl
- tkr
- tku
- tlb
- tlf
- tlh
- tlj
- tll
- tly
- tmd
- tmr
- tmw
- tna
- tnc
- tnk
- tnn
- tnp
- tob
- toc
- tod
- tog
- toh
- toi
- toj
- tok
- ton
- too
- top
- tos
- tpa
- tpi
- tpm
- tpp
- tpt
- tpw
- tpz
- tqb
- trc
- trn
- tro
- trp
- trq
- tsc
- tsg
- tsn
- tso
- tsw
- tsz
- ttc
- tte
- ttj
- ttq
- tts
- tuc
- tue
- tuf
- tui
- tuk
- tum
- tuo
- tur
- tuv
- tvk
- tvl
- twi
- twu
- twx
- txq
- txu
- tyv
- tzh
- tzj
- tzl
- tzm
- tzo
- ubr
- ubu
- udm
- udu
- uig
- ukr
- umb
- upv
- ura
- urb
- urd
- urh
- uri
- urk
- urt
- usa
- usp
- uvh
- uvl
- uzb
- uzn
- vag
- vap
- var
- vec
- ven
- vep
- vgt
- vid
- vie
- viv
- vls
- vmk
- vmw
- vmy
- vol
- vro
- vun
- vut
- waj
- wal
- wap
- war
- wat
- way
- wba
- wbm
- wbp
- wca
- wed
- wer
- wes
- wew
- whg
- whk
- wib
- wim
- wiu
- wln
- wls
- wlv
- wmt
- wmw
- wnc
- wnu
- wob
- wol
- wos
- wrk
- wrs
- wsk
- wuu
- wuv
- wwa
- xal
- xav
- xbi
- xbr
- xed
- xho
- xla
- xmf
- xmv
- xnn
- xog
- xon
- xpe
- xqa
- xrb
- xsb
- xsi
- xsm
- xsr
- xsu
- xtd
- xtm
- xtn
- xuo
- yaa
- yad
- yal
- yam
- yan
- yao
- yap
- yaq
- ybb
- yby
- ycn
- ydd
- yid
- yim
- yka
- ykg
- yle
- yli
- yml
- yom
- yon
- yor
- yrb
- yre
- yrk
- yrl
- yss
- yua
- yue
- yuj
- yup
- yut
- yuw
- yuz
- yva
- zaa
- zab
- zac
- zad
- zae
- zai
- zam
- zao
- zar
- zas
- zat
- zav
- zaw
- zca
- zdj
- zea
- zgh
- zho
- zia
- ziw
- zlm
- zne
- zom
- zos
- zpa
- zpc
- zpd
- zpf
- zpg
- zpi
- zpj
- zpl
- zpm
- zpo
- zpq
- zpt
- zpu
- zpv
- zpz
- zsm
- zsr
- ztq
- zty
- zul
- zyb
- zyp
- zza
tags:
- text-classification
- language-identification
library_name: fasttext
datasets:
- cis-lmu/GlotSparse
- cis-lmu/GlotStoryBook
metrics:
- f1
---
# GlotLID
[![GlotLID](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/spaces/cis-lmu/glotlid-space)
## Description
**GlotLID** is a fastText language identification (LID) model that supports more than **1600 languages**.
- **Demo:** [huggingface](https://huggingface.co/spaces/cis-lmu/glotlid-space)
- **Repository:** [github](https://github.com/cisnlp/GlotLID)
- **Paper:** [paper](https://arxiv.org/abs/2310.16248) (EMNLP 2023)
- **Point of Contact:** amir@cis.lmu.de
### How to use
Here is how to use this model to detect the language of a given text:
```python
>>> import fasttext
>>> from huggingface_hub import hf_hub_download
>>> model_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin")
>>> model = fasttext.load_model(model_path)
>>> model.predict("Hello, world!")
```
If you prefer not to use `huggingface_hub`, download the model directly:
```bash
wget https://huggingface.co/cis-lmu/glotlid/resolve/main/model.bin
```
```python
>>> import fasttext
>>> model = fasttext.load_model("/path/to/model.bin")
>>> model.predict("Hello, world!")
```
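`model.predict` returns a pair `(labels, probabilities)`, where each label carries fastText's `__label__` prefix. The following is a minimal sketch of turning that output into plain records. The helper names `parse_prediction` and `identify` are hypothetical, and the `<language>_<script>` tag layout (for example `eng_Latn`) is an assumption about the label format, so the parser falls back to the whole tag when no underscore is present.

```python
def parse_prediction(labels, probs, prefix="__label__"):
    """Convert fastText's (labels, probs) pair into a list of dicts."""
    results = []
    for label, prob in zip(labels, probs):
        # Strip fastText's label prefix if it is present.
        tag = label[len(prefix):] if label.startswith(prefix) else label
        # Assumption: tags look like "<language>_<script>"; if there is no
        # underscore, the whole tag is treated as the language code.
        lang, _, script = tag.partition("_")
        results.append(
            {"language": lang, "script": script or None, "confidence": float(prob)}
        )
    return results


def identify(model, text, k=3):
    """Return up to k candidate languages for text using a loaded fastText model."""
    # fastText's predict() works on a single line, so collapse all whitespace
    # (including newlines) into single spaces first.
    cleaned = " ".join(text.split())
    labels, probs = model.predict(cleaned, k=k)
    return parse_prediction(labels, probs)
```

With a model loaded as shown above, `identify(model, "Hello, world!")` would give a list of candidates ordered by confidence; the exact labels and scores depend on the model version.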
## License
The model is distributed under the Apache License, Version 2.0.
## Version
Previous versions of GlotLID remain available in this repository.
To load a specific version, add the version suffix to the `filename` (for example, `model_v1.bin` instead of `model.bin`).
- For v1: `model_v1.bin` (introduced in the GlotLID [paper](https://arxiv.org/abs/2310.16248) and used in all experiments).
- For v2: `model_v2.bin` (a revised version of v1 that covers more languages and removes noisy corpora identified in the analysis of v1).
`model.bin` always refers to the latest version (v2).
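
As a rough sketch of the naming scheme above, the version can be mapped to a filename before downloading. The helpers `glotlid_filename` and `load_glotlid` are hypothetical names introduced here for illustration; only `hf_hub_download` and `fasttext.load_model` come from the libraries used earlier in this card.

```python
def glotlid_filename(version=None):
    """Map a version number to the model filename described above."""
    if version is None:
        # No version given: use the file that tracks the latest release.
        return "model.bin"
    return f"model_v{int(version)}.bin"


def load_glotlid(version=None, repo_id="cis-lmu/glotlid"):
    """Download the requested GlotLID version and load it with fastText."""
    # Imported lazily so the filename helper works without these packages.
    import fasttext
    from huggingface_hub import hf_hub_download

    model_path = hf_hub_download(repo_id=repo_id, filename=glotlid_filename(version))
    return fasttext.load_model(model_path)
```

For example, `load_glotlid(1)` would fetch `model_v1.bin`, which is useful when reproducing the experiments from the paper.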
## References
If you use this model, please cite the following paper:
```
@inproceedings{
kargaran2023glotlid,
title={{GlotLID}: Language Identification for Low-Resource Languages},
author={Kargaran, Amir Hossein and Imani, Ayyoob and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing},
year={2023},
url={https://openreview.net/forum?id=dl4e3EBz5j}
}
```