nb-nordic-lid / README.md
versae's picture
Update README.md
12401f3
|
raw
history blame
38.2 kB
metadata
license: openrail
language:
  - dan
  - eng
  - fao
  - fin
  - isl
  - nno
  - nob
  - sma
  - sme
  - smj
  - smn
  - sms
  - swe
library_name: fasttext
tags:
  - text-classification
  - language-identification
  - language-detection
datasets:
  - tatoeba

NB-NORDIC-LID

This repo contains models for the identification of language in text (also referred to as language detection). It is based on Fasttext and designed with the Nordic languages in mind, including several Sámi languages. It comes in two flavours, nb-nordic-lid, a model that identifies between the 12 most common languages in the Nordic countries (plus English), and nb-nordic-lid.159, a model that extends that list to 159 languages of the world. Moreover, each of them come in large and small (quantized) versions.

Model Size Precision Recall F1-Score Support
nb-nordic-lid.bin (large) 274 MB 0.9901 0.9900 0.9900 5500
nb-nordic-lid.ftz (small) 1.87 MB 0.9889 0.9890 0.9890 5500
nb-nordic-lid.159.bin (large) 9.63 GB 0.9434 0.9528 0.9476 44049
nb-nordic-lid.159.ftz (small) 11.2 MB 0.9275 0.9399 0.9327 44049

Usage

After download, the models can be used through the Fasttext library:

from huggingface_hub import hf_hub_download
import fasttext

model_name = "nb-nordic-lid.ftz"
model = fasttext.load_model(hf_hub_download("NbAiLab/nb-nordic-lid", model_name))

model.predict("Debatt er bra og sunt for demokratier, og en forutsetning for politikkutvikling.", threshold=0.25)
# (('__label__nob',), array([0.95482141]))

Alternatively, these models are also integrated into the the experimental nbailab CLI application:

$ echo "Jeg leser en bok" | nbailab langid --model-name  nb-nordic-lid.ftz
nob,0.9999788999557495

Languages

nb-nordic-lid.bin

Trained on sentences from the GiellaT's Translation Memories and Wortschatz's corpora.

ISO-639-3 Language Precision Recall F1-Score Support
dan Danish 0.9720 0.9838 0.9779 494
eng English 0.9980 0.9940 0.9960 502
fao Faroese 0.9920 0.9940 0.9930 499
fin Finnish 1.0000 1.0000 1.0000 500
isl Icelandic 0.9900 0.9920 0.9910 499
nno Norwegian Nynorsk 0.9920 0.9861 0.9890 503
nob Norwegian Bokmål 0.9840 0.9743 0.9791 505
sma Southern Sami 0.9800 0.9703 0.9751 101
sme Northern Sami 1.0000 0.9921 0.9960 504
smj Lule Sami 0.9920 0.9960 0.9940 498
smn Inari Sami 0.9950 1.0000 0.9975 199
sms Skolt Sami 0.9900 0.9950 0.9925 199
swe Swedish 0.9860 0.9920 0.9890 497
Accuracy 0.9905 5500
Weighted avg 0.9906 0.9905 0.9905 5500
Macro avg 0.9901 0.9900 0.9900 5500

nb-nordic-lid.159.bin

Scores for the 159 languages

Additionally trained on sentences from Taoteba.

ISO-639-3 Language Precision Recall F1-Score Support
afr Afrikaans 0.9634 0.9485 0.9558 194
ara Arabic 0.9771 0.9533 0.9650 492
arq Algerian Arabic 0.9478 0.9316 0.9397 117
arz Egyptian Arabic 0.7193 0.8542 0.7810 48
asm Assamese 0.9828 0.9884 0.9856 173
avk Kotava 0.9895 0.9844 0.9869 192
aze Azerbaijani 0.9707 0.9831 0.9768 236
bel Belarusian 0.9864 0.9785 0.9825 372
ben Bengali 0.9915 0.9873 0.9894 236
ber Berber 0.8991 0.8507 0.8742 576
bos Bosnian 0.1548 0.1781 0.1656 73
bre Breton 0.9613 0.9681 0.9647 282
bua Buryat 0.9333 0.9130 0.9231 46
bul Bulgarian 0.9530 0.9660 0.9595 441
cat Catalan 0.9604 0.9510 0.9557 306
cbk Chavacano 0.9627 0.9923 0.9773 130
ceb Cebuano 0.8974 0.9091 0.9032 77
ces Czech 0.9684 0.9665 0.9675 508
chv Chuvash 0.9878 0.9643 0.9759 84
ckb Central Kurdish (Soranî) 0.9751 0.9944 0.9846 354
ckt Chukchi 0.9615 1.0000 0.9804 25
cmn Mandarin Chinese 0.9726 0.8674 0.9170 573
cor Cornish 1.0000 0.9733 0.9864 187
csb Kashubian 0.9787 1.0000 0.9892 46
cym Welsh 0.9625 0.9625 0.9625 80
dan Danish 0.9401 0.9345 0.9373 1007
deu German 0.9908 0.9765 0.9836 553
dsb Lower Sorbian 0.8704 0.8246 0.8468 57
dtp Central Dusun 0.9161 0.9562 0.9357 137
ell Greek 1.0000 0.9979 0.9989 476
eng English 0.9914 0.9886 0.9900 1052
epo Esperanto 0.9817 0.9853 0.9835 544
est Estonian 0.9659 0.9770 0.9714 174
eus Basque 0.9883 0.9585 0.9732 265
fao Faroese 0.9840 0.9899 0.9870 497
fin Finnish 0.9932 0.9817 0.9874 1041
fkv Kven Finnish 0.5769 0.7500 0.6522 20
fra French 0.9890 0.9890 0.9890 544
frr North Frisian 0.9784 0.9784 0.9784 139
fry Frisian 0.7419 0.9200 0.8214 25
gcf Guadeloupean Creole French 0.9810 0.9904 0.9856 104
gla Scottish Gaelic 0.9608 0.9800 0.9703 50
gle Irish 0.9781 0.9853 0.9817 136
glg Galician 0.9198 0.9330 0.9264 209
gos Gronings 0.9631 0.9671 0.9651 243
grc Ancient Greek 0.9828 1.0000 0.9913 57
grn Guarani 0.9810 0.9936 0.9873 156
guc Wayuu 0.9556 1.0000 0.9773 43
hau Hausa 0.9930 0.9930 0.9930 431
heb Hebrew 1.0000 1.0000 1.0000 536
hin Hindi 1.0000 0.9974 0.9987 391
hoc Ho 0.9143 1.0000 0.9552 32
hrv Croatian 0.6085 0.5652 0.5861 253
hrx Hunsrik 0.8727 0.9231 0.8972 52
hsb Upper Sorbian 0.8533 0.8312 0.8421 77
hun Hungarian 0.9853 0.9889 0.9871 541
hye Armenian 1.0000 1.0000 1.0000 225
ido Ido 0.9731 0.9560 0.9645 341
ile Interlingue 0.9386 0.9450 0.9418 291
ilo Ilocano 0.9917 0.9677 0.9796 124
ina Interlingua 0.9602 0.9775 0.9688 444
ind Indonesian 0.8550 0.8305 0.8426 419
isl Icelandic 0.9874 0.9931 0.9902 869
ita Italian 0.9835 0.9746 0.9791 552
jav Javanese 0.9400 0.9792 0.9592 48
jbo Lojban 1.0000 1.0000 1.0000 402
jpn Japanese 0.9870 1.0000 0.9935 531
kab Kabyle 0.8382 0.9012 0.8686 506
kat Georgian 1.0000 0.9885 0.9942 260
kaz Kazakh 0.9896 0.9896 0.9896 192
kha Khasi 0.9038 0.9400 0.9216 100
khm Khmer 1.0000 1.0000 1.0000 75
kmr Northern Kurdish (Kurmancî) 0.9881 0.9793 0.9837 338
knc Central Kanuri 0.9775 1.0000 0.9886 174
kor Korean 1.0000 0.9806 0.9902 360
kzj Coastal Kadazan 0.9744 0.9580 0.9661 238
lad Ladino 0.8154 0.8281 0.8217 64
lat Latin 0.9756 0.9677 0.9717 496
lfn Lingua Franca Nova 0.9745 0.9768 0.9757 431
lij Ligurian 0.9556 0.9556 0.9556 90
lin Lingala 0.9859 0.9859 0.9859 213
lit Lithuanian 0.9903 0.9942 0.9922 513
ltz Luxembourgish 0.9773 0.9149 0.9451 47
lvs Latvian 0.9732 0.9797 0.9764 148
lzh Literary Chinese 0.7473 0.9444 0.8344 72
mal Malayalam 1.0000 1.0000 1.0000 44
mar Marathi 0.9961 1.0000 0.9980 509
mhr Meadow Mari 0.9899 0.9801 0.9850 201
mkd Macedonian 0.9630 0.9447 0.9538 524
mon Mongolian 0.9781 0.9710 0.9745 138
mus Muskogee (Creek) 0.9333 0.9655 0.9492 29
mya Burmese 1.0000 1.0000 1.0000 27
nds Low German (Low Saxon) 0.9829 0.9805 0.9817 410
nld Dutch 0.9681 0.9810 0.9745 526
nnb Nande 0.9896 0.9845 0.9870 387
nno Norwegian Nynorsk 0.9551 0.9685 0.9617 571
nob Norwegian Bokmål 0.9280 0.9168 0.9224 914
nst Naga (Tangshang) 1.0000 1.0000 1.0000 39
nus Nuer 0.9903 1.0000 0.9951 102
oci Occitan 0.9795 0.9598 0.9696 249
orv Old East Slavic 0.9846 1.0000 0.9922 64
oss Ossetian 0.9891 0.9963 0.9927 272
ota Ottoman Turkish 0.9469 0.9727 0.9596 110
pam Kapampangan 0.9865 0.9733 0.9799 75
pcd Picard 0.9552 0.9697 0.9624 66
pes Persian 0.9934 0.9956 0.9945 454
pms Piedmontese 0.9268 0.9744 0.9500 39
pol Polish 0.9886 0.9886 0.9886 525
por Portuguese 0.9669 0.9686 0.9677 542
prg Old Prussian 0.9800 0.9608 0.9703 51
rhg Rohingya 0.9890 1.0000 0.9945 180
rom Romani 0.9535 0.8913 0.9213 46
ron Romanian 0.9870 0.9785 0.9827 465
run Kirundi 0.9871 0.9746 0.9808 236
rus Russian 0.9671 0.9796 0.9733 540
sah Yakut 1.0000 1.0000 1.0000 48
sat Santali 1.0000 1.0000 1.0000 171
sdh Southern Kurdish 0.9808 0.9107 0.9444 56
shi Tashelhit 0.9779 0.9172 0.9466 145
slk Slovak 0.9235 0.9421 0.9327 397
slv Slovenian 0.7544 0.8958 0.8190 48
sma Southern Sami 0.9600 0.9600 0.9600 100
sme Northern Sami 1.0000 0.9901 0.9950 505
smj Lule Sami 0.9860 1.0000 0.9930 493
smn Inari Sami 0.9950 0.9950 0.9950 200
sms Skolt Sami 0.9850 0.9899 0.9875 199
spa Spanish 0.9779 0.9619 0.9698 551
sqi Albanian 0.9683 0.9839 0.9760 124
srp Serbian 0.8347 0.8313 0.8330 492
swc Congo Swahili 0.8750 0.8594 0.8671 448
swe Swedish 0.9809 0.9839 0.9824 991
swg Swabian 0.9898 0.9604 0.9749 101
swh Swahili 0.6946 0.7382 0.7157 191
tat Tatar 0.9817 0.9843 0.9830 382
tgl Tagalog 0.9830 0.9830 0.9830 412
tha Thai 1.0000 1.0000 1.0000 220
thv Tahaggart Tamahaq 0.7241 0.8400 0.7778 25
tig Tigre 1.0000 1.0000 1.0000 181
tlh Klingon 1.0000 1.0000 1.0000 439
tok Toki Pona 1.0000 1.0000 1.0000 495
tpw Old Tupi 0.8929 0.9615 0.9259 26
tuk Turkmen 0.9890 0.9711 0.9800 277
tur Turkish 0.9872 0.9659 0.9764 558
uig Uyghur 0.9966 0.9933 0.9950 299
ukr Ukrainian 0.9813 0.9850 0.9831 532
urd Urdu 1.0000 0.9914 0.9957 116
uzb Uzbek 0.8200 0.9762 0.8913 42
vie Vietnamese 0.9977 0.9977 0.9977 426
vol Volapük 0.9862 0.9908 0.9885 217
war Waray 0.9505 0.9796 0.9648 98
wuu Shanghainese 0.8364 0.9275 0.8796 193
xal Kalmyk 0.9302 0.9756 0.9524 41
xmf Mingrelian 0.7419 0.8519 0.7931 27
yid Yiddish 0.9971 1.0000 0.9986 348
yue Cantonese 0.9195 0.9877 0.9524 243
zgh Standard Moroccan Tamazight 0.9873 0.9873 0.9873 158
zlm Malay (Vernacular) 0.8605 0.9024 0.8810 82
zsm Malay 0.7782 0.7921 0.7851 279
zza Zaza 0.9294 0.9294 0.9294 85
Accuracy 0.9620 44049
Weighted avg 0.9627 0.9620 0.9621 44049
Macro avg 0.9434 0.9528 0.9476 44049

nb-nordic-lid.ftz

The small models are quantized versions of the large versions using a cutoff of 50,000 words and ngrams and quantizing the norm separately.

ISO-639-3 Language Precision Recall F1-Score Support
dan Danish 0.9700 0.9838 0.9768 493
eng English 0.9980 0.9940 0.9960 502
fao Faroese 0.9920 0.9920 0.9920 500
fin Finnish 1.0000 1.0000 1.0000 500
isl Icelandic 0.9880 0.9920 0.9900 498
nno Norwegian Nynorsk 0.9880 0.9841 0.9860 502
nob Norwegian Bokmål 0.9860 0.9705 0.9782 508
sma Southern Sami 0.9800 0.9703 0.9751 101
sme Northern Sami 1.0000 0.9921 0.9960 504
smj Lule Sami 0.9920 0.9940 0.9930 499
smn Inari Sami 0.9950 1.0000 0.9975 199
sms Skolt Sami 0.9850 0.9949 0.9899 198
swe Swedish 0.9820 0.9899 0.9859 496
Accuracy 0.9895 5500
Weighted avg 0.9895 0.9895 0.9895 5500
Macro avg 0.9889 0.9890 0.9890 5500

nb-nordic-lid.159.ftz

Scores for the 159 languages (compressed model)
ISO-639-3 Language Precision Recall F1-Score Support
afr Afrikaans 0.9529 0.9333 0.9430 195
ara Arabic 0.9708 0.9191 0.9443 507
arq Algerian Arabic 0.8783 0.8783 0.8783 115
arz Egyptian Arabic 0.5439 0.8378 0.6596 37
asm Assamese 0.9828 0.9448 0.9634 181
avk Kotava 0.9843 0.9792 0.9817 192
aze Azerbaijani 0.9582 0.9828 0.9703 233
bel Belarusian 0.9919 0.9683 0.9799 378
ben Bengali 0.9574 0.9868 0.9719 228
ber Berber 0.8495 0.7928 0.8202 584
bos Bosnian 0.1429 0.2264 0.1752 53
bre Breton 0.9507 0.9712 0.9609 278
bua Buryat 0.9333 0.9333 0.9333 45
bul Bulgarian 0.9351 0.9457 0.9404 442
cat Catalan 0.9406 0.9406 0.9406 303
cbk Chavacano 0.9552 0.9624 0.9588 133
ceb Cebuano 0.8718 0.8500 0.8608 80
ces Czech 0.9586 0.9548 0.9567 509
chv Chuvash 1.0000 0.9647 0.9820 85
ckb Central Kurdish (Soranî) 0.9640 0.9748 0.9694 357
ckt Chukchi 0.9615 1.0000 0.9804 25
cmn Mandarin Chinese 0.9667 0.8165 0.8853 605
cor Cornish 0.9780 0.9674 0.9727 184
csb Kashubian 0.9574 1.0000 0.9783 45
cym Welsh 0.9625 0.9506 0.9565 81
dan Danish 0.9281 0.9355 0.9318 993
deu German 0.9853 0.9781 0.9817 549
dsb Lower Sorbian 0.8889 0.8276 0.8571 58
dtp Central Dusun 0.8741 0.9470 0.9091 132
ell Greek 0.9958 0.9937 0.9947 476
eng English 0.9886 0.9876 0.9881 1050
epo Esperanto 0.9853 0.9818 0.9835 548
est Estonian 0.9489 0.9766 0.9625 171
eus Basque 0.9844 0.9583 0.9712 264
fao Faroese 0.9780 0.9819 0.9800 498
fin Finnish 0.9922 0.9724 0.9822 1050
fkv Kven Finnish 0.5385 0.7368 0.6222 19
fra French 0.9871 0.9728 0.9799 552
frr North Frisian 0.9640 0.9640 0.9640 139
fry Frisian 0.7097 0.8462 0.7719 26
gcf Guadeloupean Creole French 0.9714 0.9808 0.9761 104
gla Scottish Gaelic 0.9608 0.9608 0.9608 51
gle Irish 0.9489 0.9924 0.9701 131
glg Galician 0.8868 0.9082 0.8974 207
gos Gronings 0.9426 0.9544 0.9485 241
grc Ancient Greek 0.9483 0.9483 0.9483 58
grn Guarani 0.9684 0.9935 0.9808 154
guc Wayuu 0.9333 1.0000 0.9655 42
hau Hausa 0.9861 0.9884 0.9872 430
heb Hebrew 0.9981 0.9907 0.9944 540
hin Hindi 0.9974 0.9898 0.9936 393
hoc Ho 0.8571 1.0000 0.9231 30
hrv Croatian 0.6766 0.5911 0.6310 269
hrx Hunsrik 0.8545 0.9216 0.8868 51
hsb Upper Sorbian 0.8400 0.8182 0.8289 77
hun Hungarian 0.9816 0.9852 0.9834 541
hye Armenian 1.0000 1.0000 1.0000 225
ido Ido 0.9672 0.9501 0.9586 341
ile Interlingue 0.9352 0.9547 0.9448 287
ilo Ilocano 0.9917 0.9600 0.9756 125
ina Interlingua 0.9580 0.9558 0.9569 453
ind Indonesian 0.8231 0.8034 0.8131 417
isl Icelandic 0.9805 0.9885 0.9845 867
ita Italian 0.9817 0.9555 0.9684 562
jav Javanese 0.9400 0.9792 0.9592 48
jbo Lojban 1.0000 0.9975 0.9988 403
jpn Japanese 0.9684 0.9981 0.9830 522
kab Kabyle 0.7702 0.8516 0.8089 492
kat Georgian 1.0000 0.9847 0.9923 261
kaz Kazakh 0.9792 0.9843 0.9817 191
kha Khasi 0.8942 0.9300 0.9118 100
khm Khmer 1.0000 0.9868 0.9934 76
kmr Northern Kurdish (Kurmancî) 0.9791 0.9647 0.9719 340
knc Central Kanuri 0.9775 0.9943 0.9858 175
kor Korean 0.9972 0.9778 0.9874 360
kzj Coastal Kadazan 0.9658 0.9378 0.9516 241
lad Ladino 0.7538 0.8033 0.7778 61
lat Latin 0.9614 0.9594 0.9604 493
lfn Lingua Franca Nova 0.9722 0.9611 0.9666 437
lij Ligurian 0.8778 0.9753 0.9240 81
lin Lingala 0.9859 0.9677 0.9767 217
lit Lithuanian 0.9864 0.9864 0.9864 515
ltz Luxembourgish 0.9773 0.9149 0.9451 47
lvs Latvian 0.9597 0.9662 0.9630 148
lzh Literary Chinese 0.6593 0.8108 0.7273 74
mal Malayalam 1.0000 1.0000 1.0000 44
mar Marathi 0.9902 0.9980 0.9941 507
mhr Meadow Mari 0.9899 0.9752 0.9825 202
mkd Macedonian 0.9397 0.9253 0.9324 522
mon Mongolian 0.9781 0.9571 0.9675 140
mus Muskogee (Creek) 0.9000 0.9643 0.9310 28
mya Burmese 1.0000 1.0000 1.0000 27
nds Low German (Low Saxon) 0.9829 0.9687 0.9757 415
nld Dutch 0.9644 0.9735 0.9689 528
nnb Nande 0.9870 0.9896 0.9883 384
nno Norwegian Nynorsk 0.9499 0.9632 0.9565 571
nob Norwegian Bokmål 0.9324 0.9073 0.9197 928
nst Naga (Tangshang) 1.0000 0.9750 0.9873 40
nus Nuer 0.9903 1.0000 0.9951 102
oci Occitan 0.9631 0.9476 0.9553 248
orv Old East Slavic 0.9538 0.9254 0.9394 67
oss Ossetian 0.9818 0.9926 0.9872 271
ota Ottoman Turkish 0.9204 0.9455 0.9327 110
pam Kapampangan 0.9730 0.9600 0.9664 75
pcd Picard 0.9254 0.9688 0.9466 64
pes Persian 0.9846 0.9868 0.9857 454
pms Piedmontese 0.9024 0.9487 0.9250 39
pol Polish 0.9867 0.9885 0.9876 524
por Portuguese 0.9595 0.9577 0.9586 544
prg Old Prussian 0.9800 0.9423 0.9608 52
rhg Rohingya 0.9835 0.9835 0.9835 182
rom Romani 0.9302 0.8511 0.8889 47
ron Romanian 0.9783 0.9762 0.9772 462
run Kirundi 0.9871 0.9426 0.9644 244
rus Russian 0.9561 0.9757 0.9658 536
sah Yakut 0.9792 1.0000 0.9895 47
sat Santali 0.9942 1.0000 0.9971 170
sdh Southern Kurdish 0.8462 0.8627 0.8544 51
shi Tashelhit 0.9706 0.8980 0.9329 147
slk Slovak 0.9111 0.9318 0.9213 396
slv Slovenian 0.7018 0.9302 0.8000 43
sma Southern Sami 0.9500 0.9406 0.9453 101
sme Northern Sami 1.0000 0.9843 0.9921 508
smj Lule Sami 0.9840 0.9980 0.9909 493
smn Inari Sami 0.9850 0.9949 0.9899 198
sms Skolt Sami 0.9700 0.9848 0.9773 197
spa Spanish 0.9613 0.9560 0.9586 545
sqi Albanian 0.9603 0.9680 0.9641 125
srp Serbian 0.8122 0.8106 0.8114 491
swc Congo Swahili 0.8500 0.8367 0.8433 447
swe Swedish 0.9759 0.9778 0.9768 992
swg Swabian 0.9796 0.9320 0.9552 103
swh Swahili 0.6650 0.7068 0.6853 191
tat Tatar 0.9739 0.9816 0.9777 380
tgl Tagalog 0.9709 0.9732 0.9721 411
tha Thai 1.0000 1.0000 1.0000 220
thv Tahaggart Tamahaq 0.6552 0.7600 0.7037 25
tig Tigre 1.0000 1.0000 1.0000 181
tlh Klingon 0.9977 0.9955 0.9966 440
tok Toki Pona 1.0000 1.0000 1.0000 495
tpw Old Tupi 0.8214 0.8846 0.8519 26
tuk Turkmen 0.9779 0.9708 0.9744 274
tur Turkish 0.9780 0.9604 0.9691 556
uig Uyghur 0.9933 0.9900 0.9916 299
ukr Ukrainian 0.9682 0.9700 0.9691 533
urd Urdu 1.0000 0.9914 0.9957 116
uzb Uzbek 0.8000 0.9756 0.8791 41
vie Vietnamese 0.9977 0.9977 0.9977 426
vol Volapük 0.9862 0.9817 0.9840 219
war Waray 0.9208 0.9688 0.9442 96
wuu Shanghainese 0.8037 0.9053 0.8515 190
xal Kalmyk 0.9070 0.9512 0.9286 41
xmf Mingrelian 0.6774 0.8400 0.7500 25
yid Yiddish 0.9828 0.9942 0.9885 345
yue Cantonese 0.8314 0.9688 0.8948 224
zgh Standard Moroccan Tamazight 0.9873 0.9873 0.9873 158
zlm Malay (Vernacular) 0.8488 0.8588 0.8538 85
zsm Malay 0.7465 0.7544 0.7504 281
zza Zaza 0.8824 0.9146 0.8982 82
Accuracy 0.9513 44049
Weighted avg 0.9529 0.9513 0.9518 44049
Macro avg 0.9275 0.9399 0.9327 44049

Citing & Authors

The model was trained by Javier de la Rosa. Data was prepared by Per Egil Kummervold and Javier de la Rosa. Documentation written by Javier de la Rosa.