|
--- |
|
tags: |
|
- Transformers |
|
- text-classification |
|
- multi-class-classification |
|
languages: |
|
- af-ZA |
|
- am-ET |
|
- ar-SA |
|
- az-AZ |
|
- bn-BD |
|
- cy-GB |
|
- da-DK |
|
- de-DE |
|
- el-GR |
|
- en-US |
|
- es-ES |
|
- fa-IR |
|
- fi-FI |
|
- fr-FR |
|
- he-IL |
|
- hi-IN |
|
- hu-HU |
|
- hy-AM |
|
- id-ID |
|
- is-IS |
|
- it-IT |
|
- ja-JP |
|
- jv-ID |
|
- ka-GE |
|
- km-KH |
|
- kn-IN |
|
- ko-KR |
|
- lv-LV |
|
- ml-IN |
|
- mn-MN |
|
- ms-MY |
|
- my-MM |
|
- nb-NO |
|
- nl-NL |
|
- pl-PL |
|
- pt-PT |
|
- ro-RO |
|
- ru-RU |
|
- sl-SL |
|
- sq-AL |
|
- sv-SE |
|
- sw-KE |
|
- ta-IN |
|
- te-IN |
|
- th-TH |
|
- tl-PH |
|
- tr-TR |
|
- ur-PK |
|
- vi-VN |
|
- zh-CN |
|
- zh-TW |
|
multilinguality:

- multilingual
|
datasets: |
|
- qanastek/MASSIVE |
|
widget: |
|
- text: "wake me up at five am this week" |
|
- text: "je veux écouter la chanson de jacques brel encore une fois" |
|
- text: "quiero escuchar la canción de arijit singh una vez más" |
|
- text: "olly onde é que á um parque por perto onde eu possa correr" |
|
- text: "פרק הבא בפודקאסט בבקשה" |
|
- text: "亚马逊股价" |
|
- text: "найди билет на поезд в санкт-петербург" |
|
license: cc-by-4.0 |
|
--- |
|
|
|
**People Involved** |
|
|
|
* [LABRAK Yanis](https://www.linkedin.com/in/yanis-labrak-8a7412145/) (1) |
|
|
|
**Affiliations** |
|
|
|
1. [LIA, NLP team](https://lia.univ-avignon.fr/), Avignon University, Avignon, France. |
|
|
|
## Model |
|
|
|
XLM-RoBERTa: [https://huggingface.co/xlm-roberta-base](https://huggingface.co/xlm-roberta-base)



Paper: [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/pdf/1911.02116.pdf)
|
|
|
## Demo: How to use with the Hugging Face Transformers pipeline



Requires the [transformers](https://pypi.org/project/transformers/) library: `pip install transformers`
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline

model_name = 'qanastek/51-languages-classifier'

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Build a text-classification pipeline and identify the language of a Hebrew utterance
classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer)
res = classifier("פרק הבא בפודקאסט בבקשה")

print(res)
```
|
|
|
Outputs: |
|
|
|
```python |
|
[{'label': 'he-IL', 'score': 0.9998375177383423}] |
|
``` |
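

The pipeline also accepts a list of utterances, so several inputs can be classified in one call. A minimal sketch, reusing the `classifier` built above (the example sentences are taken from the widget list):

```python
# Classify several utterances at once; the pipeline returns one prediction per input.
texts = [
    "wake me up at five am this week",
    "quiero escuchar la canción de arijit singh una vez más",
    "найди билет на поезд в санкт-петербург",
]

for text, pred in zip(texts, classifier(texts)):
    print(f"{text} -> {pred['label']} ({pred['score']:.4f})")
```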
|
|
|
## Training data |
|
|
|
[MASSIVE](https://huggingface.co/datasets/qanastek/MASSIVE) is a parallel dataset of > 1M utterances across 51 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation. Utterances span 60 intents and include 55 slot types. MASSIVE was created by localizing the SLURP dataset, composed of general Intelligent Voice Assistant single-shot interactions. |
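

The dataset can be inspected directly with the `datasets` library. A minimal sketch, assuming one configuration per locale code (e.g. `en-US`) as listed on the dataset page; depending on your `datasets` version, script-based datasets may additionally require `trust_remote_code=True`:

```python
from datasets import load_dataset

# Load one locale of MASSIVE; the configuration name "en-US" is assumed to match the locale codes below.
massive_en = load_dataset("qanastek/MASSIVE", "en-US", split="train")

print(massive_en)     # number of rows and column names
print(massive_en[0])  # one annotated utterance (text, locale, intent, slots)
```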
|
|
|
### Languages |
|
|
|
The model can distinguish between the following 51 languages (the same label set can also be read from the model configuration, as shown in the sketch after this list):
|
|
|
- `Afrikaans - South Africa (af-ZA)` |
|
- `Amharic - Ethiopia (am-ET)` |
|
- `Arabic - Saudi Arabia (ar-SA)` |
|
- `Azeri - Azerbaijan (az-AZ)` |
|
- `Bengali - Bangladesh (bn-BD)` |
|
- `Chinese - China (zh-CN)` |
|
- `Chinese - Taiwan (zh-TW)` |
|
- `Danish - Denmark (da-DK)` |
|
- `German - Germany (de-DE)` |
|
- `Greek - Greece (el-GR)` |
|
- `English - United States (en-US)` |
|
- `Spanish - Spain (es-ES)` |
|
- `Farsi - Iran (fa-IR)` |
|
- `Finnish - Finland (fi-FI)` |
|
- `French - France (fr-FR)` |
|
- `Hebrew - Israel (he-IL)` |
|
- `Hungarian - Hungary (hu-HU)` |
|
- `Armenian - Armenia (hy-AM)` |
|
- `Indonesian - Indonesia (id-ID)` |
|
- `Icelandic - Iceland (is-IS)` |
|
- `Italian - Italy (it-IT)` |
|
- `Japanese - Japan (ja-JP)` |
|
- `Javanese - Indonesia (jv-ID)` |
|
- `Georgian - Georgia (ka-GE)` |
|
- `Khmer - Cambodia (km-KH)` |
|
- `Korean - Korea (ko-KR)` |
|
- `Latvian - Latvia (lv-LV)` |
|
- `Mongolian - Mongolia (mn-MN)` |
|
- `Malay - Malaysia (ms-MY)` |
|
- `Burmese - Myanmar (my-MM)` |
|
- `Norwegian - Norway (nb-NO)` |
|
- `Dutch - Netherlands (nl-NL)` |
|
- `Polish - Poland (pl-PL)` |
|
- `Portuguese - Portugal (pt-PT)` |
|
- `Romanian - Romania (ro-RO)` |
|
- `Russian - Russia (ru-RU)` |
|
- `Slovenian - Slovenia (sl-SL)`
|
- `Albanian - Albania (sq-AL)` |
|
- `Swedish - Sweden (sv-SE)` |
|
- `Swahili - Kenya (sw-KE)` |
|
- `Hindi - India (hi-IN)` |
|
- `Kannada - India (kn-IN)` |
|
- `Malayalam - India (ml-IN)` |
|
- `Tamil - India (ta-IN)` |
|
- `Telugu - India (te-IN)` |
|
- `Thai - Thailand (th-TH)` |
|
- `Tagalog - Philippines (tl-PH)` |
|
- `Turkish - Turkey (tr-TR)` |
|
- `Urdu - Pakistan (ur-PK)` |
|
- `Vietnamese - Vietnam (vi-VN)` |
|
- `Welsh - United Kingdom (cy-GB)` |
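

The same label set is stored in the model configuration on the Hub, so the supported locale codes can also be listed programmatically. A short sketch, assuming the usual `id2label` mapping is populated with these codes:

```python
from transformers import AutoConfig

# Only the configuration is needed to read the label mapping of the classification head.
config = AutoConfig.from_pretrained("qanastek/51-languages-classifier")
labels = sorted(config.id2label.values())

print(len(labels))  # expected: 51
print(labels[:5])   # e.g. ['af-ZA', 'am-ET', 'ar-SA', 'az-AZ', 'bn-BD']
```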
|
|
|
## Evaluation results |
|
|
|
```plain |
|
precision recall f1-score support |
|
|
|
af-ZA 0.9821 0.9805 0.9813 2974 |
|
am-ET 1.0000 1.0000 1.0000 2974 |
|
ar-SA 0.9809 0.9822 0.9815 2974 |
|
az-AZ 0.9946 0.9845 0.9895 2974 |
|
bn-BD 0.9997 0.9990 0.9993 2974 |
|
cy-GB 0.9970 0.9929 0.9949 2974 |
|
da-DK 0.9575 0.9617 0.9596 2974 |
|
de-DE 0.9906 0.9909 0.9908 2974 |
|
el-GR 0.9997 0.9973 0.9985 2974 |
|
en-US 0.9712 0.9866 0.9788 2974 |
|
es-ES 0.9825 0.9842 0.9834 2974 |
|
fa-IR 0.9940 0.9973 0.9956 2974 |
|
fi-FI 0.9943 0.9946 0.9945 2974 |
|
fr-FR 0.9963 0.9923 0.9943 2974 |
|
he-IL 1.0000 0.9997 0.9998 2974 |
|
hi-IN 1.0000 0.9980 0.9990 2974 |
|
hu-HU 0.9983 0.9950 0.9966 2974 |
|
hy-AM 1.0000 0.9993 0.9997 2974 |
|
id-ID 0.9319 0.9291 0.9305 2974 |
|
is-IS 0.9966 0.9943 0.9955 2974 |
|
it-IT 0.9698 0.9926 0.9811 2974 |
|
ja-JP 0.9987 0.9963 0.9975 2974 |
|
jv-ID 0.9628 0.9744 0.9686 2974 |
|
ka-GE 0.9993 0.9997 0.9995 2974 |
|
km-KH 0.9867 0.9963 0.9915 2974 |
|
kn-IN 1.0000 0.9993 0.9997 2974 |
|
ko-KR 0.9917 0.9997 0.9956 2974 |
|
lv-LV 0.9990 0.9950 0.9970 2974 |
|
ml-IN 0.9997 0.9997 0.9997 2974 |
|
mn-MN 0.9987 0.9966 0.9976 2974 |
|
ms-MY 0.9359 0.9418 0.9388 2974 |
|
my-MM 1.0000 0.9993 0.9997 2974 |
|
nb-NO 0.9600 0.9533 0.9566 2974 |
|
nl-NL 0.9850 0.9748 0.9799 2974 |
|
pl-PL 0.9946 0.9923 0.9934 2974 |
|
pt-PT 0.9885 0.9798 0.9841 2974 |
|
ro-RO 0.9919 0.9916 0.9918 2974 |
|
ru-RU 0.9976 0.9983 0.9980 2974 |
|
sl-SL 0.9956 0.9939 0.9948 2974 |
|
sq-AL 0.9936 0.9896 0.9916 2974 |
|
sv-SE 0.9902 0.9842 0.9872 2974 |
|
sw-KE 0.9867 0.9953 0.9910 2974 |
|
ta-IN 1.0000 1.0000 1.0000 2974 |
|
te-IN 1.0000 0.9997 0.9998 2974 |
|
th-TH 1.0000 0.9983 0.9992 2974 |
|
tl-PH 0.9929 0.9899 0.9914 2974 |
|
tr-TR 0.9869 0.9872 0.9871 2974 |
|
ur-PK 0.9983 0.9929 0.9956 2974 |
|
vi-VN 0.9993 0.9973 0.9983 2974 |
|
zh-CN 0.9812 0.9832 0.9822 2974 |
|
zh-TW 0.9832 0.9815 0.9823 2974 |
|
|
|
accuracy 0.9889 151674 |
|
macro avg 0.9889 0.9889 0.9889 151674 |
|
weighted avg 0.9889 0.9889 0.9889 151674 |
|
``` |
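

The table above follows the format of scikit-learn's `classification_report`. A minimal sketch of how such a report can be produced from gold and predicted locale codes (the three example pairs are illustrative placeholders, not the actual evaluation data):

```python
from sklearn.metrics import classification_report

# Gold and predicted locale codes for a held-out set (placeholders for illustration).
y_true = ["he-IL", "fr-FR", "zh-CN"]
y_pred = ["he-IL", "fr-FR", "zh-TW"]

print(classification_report(y_true, y_pred, digits=4))
```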
|
|
|
Keywords: language identification; multilingual; classification