<p align="center">
<img src="flores_logo.png" width="500">
</p>
# Flores101: Large-Scale Multilingual Machine Translation
## Introduction
Baseline pretrained models for the small and large tracks of the WMT 21 Large-Scale Multilingual Machine Translation competition.

Flores Task at WMT 21: http://www.statmt.org/wmt21/large-scale-multilingual-translation-task.html

Flores announcement blog post: https://ai.facebook.com/blog/flores-researchers-kick-off-multilingual-translation-challenge-at-wmt-and-call-for-compute-grants/
## Pretrained models
Model | Num layers | Embed dimension | FFN dimension | Vocab Size | #params | Download
---|---|---|---|---|---|---
`flores101_mm100_615M` | 12 | 1024 | 4096 | 256,000 | 615M | https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_615M.tar.gz
`flores101_mm100_175M` | 6 | 512 | 2048 | 256,000 | 175M | https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_175M.tar.gz

These models are trained similarly to [M2M-100](https://arxiv.org/abs/2010.11125), with additional support for the languages that are part of the WMT Large-Scale Multilingual Machine Translation track. The full list of languages is at the bottom.
## Example Generation code
### Download model, sentencepiece vocab
```bash
fairseq=/path/to/fairseq
cd "$fairseq"
# Download 615M param model.
wget https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_615M.tar.gz
# Extract
tar -xvzf flores101_mm100_615M.tar.gz
```
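The same steps should apply to the 175M model, since both download links in the table share one URL pattern. As a minimal sketch, the hypothetical helpers `model_url` and `check_model_dir` below (not part of fairseq) build the download URL from a model name and confirm that the extracted directory holds the four files the later commands in this README read. It assumes the archive unpacks into a directory named after the model, as the paths used in the following sections suggest.

```bash
# Hypothetical helpers for illustration; they are not part of fairseq.

# Print the download URL for a model name taken from the table above.
model_url() {
    printf 'https://dl.fbaipublicfiles.com/flores101/pretrained_models/%s.tar.gz\n' "$1"
}

# Return 0 if the directory holds every file the later steps read;
# otherwise name the missing files on stderr and return 1.
check_model_dir() {
    local dir=$1 missing=0 f
    for f in model.pt dict.txt sentencepiece.bpe.model language_pairs.txt; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage (commented out because it downloads a large archive):
# model=flores101_mm100_175M
# wget "$(model_url "$model")" && tar -xvzf "$model.tar.gz" && check_model_dir "$model"
```

Checking for the files up front gives a clearer message than a failure later inside `fairseq-preprocess` or `fairseq-generate`.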
### Encode using our SentencePiece Model
Note: Install SentencePiece from the [SentencePiece repository](https://github.com/google/sentencepiece).
```bash
fairseq=/path/to/fairseq
cd "$fairseq"
# Download example dataset from German to French
sacrebleu --echo src -l de-fr -t wmt19 | head -n 20 > raw_input.de-fr.de
sacrebleu --echo ref -l de-fr -t wmt19 | head -n 20 > raw_input.de-fr.fr
for lang in de fr ; do
    python scripts/spm_encode.py \
        --model flores101_mm100_615M/sentencepiece.bpe.model \
        --output_format=piece \
        --inputs=raw_input.de-fr.${lang} \
        --outputs=spm.de-fr.${lang}
done
```
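The encoding step is expected to write one output line per input line, and the two sides of the test set must stay parallel for binarization to pair them correctly. A quick sanity check, sketched here with a hypothetical helper `same_line_count` (not part of fairseq or SentencePiece), compares line counts across the raw and encoded files.

```bash
# Hypothetical helper: succeed only if all given files have the same
# number of lines. Returns 1 on a mismatch and 2 if a file is unreadable.
same_line_count() {
    local expected="" n f
    for f in "$@"; do
        n=$(wc -l < "$f") || return 2
        n=$((n))  # drop the whitespace padding some wc implementations add
        if [ -z "$expected" ]; then
            expected=$n
        elif [ "$n" -ne "$expected" ]; then
            echo "line count mismatch: $f has $n lines, expected $expected" >&2
            return 1
        fi
    done
    return 0
}

# Example usage after the encoding loop above:
# same_line_count raw_input.de-fr.de raw_input.de-fr.fr spm.de-fr.de spm.de-fr.fr
```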
### Binarization
```bash
fairseq-preprocess \
    --source-lang de --target-lang fr \
    --testpref spm.de-fr \
    --thresholdsrc 0 --thresholdtgt 0 \
    --destdir data_bin \
    --srcdict flores101_mm100_615M/dict.txt --tgtdict flores101_mm100_615M/dict.txt
```
### Generation
```bash
fairseq-generate \
    data_bin \
    --batch-size 1 \
    --path flores101_mm100_615M/model.pt \
    --fixed-dictionary flores101_mm100_615M/dict.txt \
    -s de -t fr \
    --remove-bpe 'sentencepiece' \
    --beam 5 \
    --task translation_multi_simple_epoch \
    --lang-pairs flores101_mm100_615M/language_pairs.txt \
    --decoder-langtok --encoder-langtok src \
    --gen-subset test \
    --fp16 \
    --dataset-impl mmap \
    --distributed-world-size 1 --distributed-no-spawn
```
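`fairseq-generate` prints hypotheses to standard output mixed with source, reference, and score lines, and not necessarily in test-set order. As a sketch, assuming the usual output layout in which each hypothesis line holds `H-<id>`, a score, and the text separated by tabs, the hypothetical helper `extract_hypotheses` below recovers the translations in input order so they can be compared with `raw_input.de-fr.fr`. The file names `gen.out` and `hyp.de-fr.fr` in the usage comments are placeholders.

```bash
# Hypothetical helper: read fairseq-generate output on stdin and print
# the hypotheses in test-set order, one per line.
extract_hypotheses() {
    grep '^H-' |          # keep hypothesis lines only
        sed 's/^H-//' |   # leave the numeric sentence id as the first field
        sort -n -k1,1 |   # restore the original sentence order
        cut -f3-          # drop the id and score columns, keep the text
}

# Example usage (assumes the generation command above was redirected to gen.out):
# extract_hypotheses < gen.out > hyp.de-fr.fr
# sacrebleu raw_input.de-fr.fr < hyp.de-fr.fr
```

Sorting numerically on the id matters because a plain lexical sort would place `H-10` before `H-2`.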
### Supported Languages and lang codes
Language | lang code | |
---|--- | |
Afrikaans | af
Amharic | am | |
Arabic | ar | |
Assamese | as | |
Asturian | ast | |
Aymara | ay | |
Azerbaijani | az | |
Bashkir | ba | |
Belarusian | be | |
Bulgarian | bg | |
Bengali | bn | |
Breton | br | |
Bosnian | bs | |
Catalan | ca | |
Cebuano | ceb | |
Chokwe | cjk | |
Czech | cs | |
Welsh | cy | |
Danish | da | |
German | de | |
Dyula | dyu
Greek | el | |
English | en | |
Spanish | es | |
Estonian | et | |
Persian | fa | |
Fulah | ff | |
Finnish | fi | |
French | fr | |
Western Frisian | fy | |
Irish | ga | |
Scottish Gaelic | gd | |
Galician | gl | |
Gujarati | gu | |
Hausa | ha | |
Hebrew | he | |
Hindi | hi | |
Croatian | hr | |
Haitian Creole | ht | |
Hungarian | hu | |
Armenian | hy | |
Indonesian | id | |
Igbo | ig | |
Iloko | ilo | |
Icelandic | is | |
Italian | it | |
Japanese | ja | |
Javanese | jv | |
Georgian | ka | |
Kachin | kac | |
Kamba | kam | |
Kabuverdianu | kea | |
Kongo | kg | |
Kazakh | kk | |
Central Khmer | km | |
Kimbundu | kmb | |
Northern Kurdish | kmr | |
Kannada | kn | |
Korean | ko | |
Kurdish | ku | |
Kyrgyz | ky | |
Luxembourgish | lb | |
Ganda | lg | |
Lingala | ln | |
Lao | lo | |
Lithuanian | lt | |
Luo | luo | |
Latvian | lv | |
Malagasy | mg | |
Maori | mi | |
Macedonian | mk | |
Malayalam | ml | |
Mongolian | mn | |
Marathi | mr | |
Malay | ms | |
Maltese | mt | |
Burmese | my | |
Nepali | ne | |
Dutch | nl | |
Norwegian | no | |
Northern Sotho | ns | |
Nyanja | ny | |
Occitan | oc | |
Oromo | om | |
Oriya | or | |
Punjabi | pa | |
Polish | pl | |
Pashto | ps | |
Portuguese | pt | |
Quechua | qu | |
Romanian | ro | |
Russian | ru | |
Sindhi | sd | |
Shan | shn | |
Sinhala | si | |
Slovak | sk | |
Slovenian | sl | |
Shona | sn | |
Somali | so | |
Albanian | sq | |
Serbian | sr | |
Swati | ss | |
Sundanese | su | |
Swedish | sv | |
Swahili | sw | |
Tamil | ta | |
Telugu | te | |
Tajik | tg | |
Thai | th | |
Tigrinya | ti | |
Tagalog | tl | |
Tswana | tn | |
Turkish | tr | |
Ukrainian | uk | |
Umbundu | umb | |
Urdu | ur | |
Uzbek | uz | |
Vietnamese | vi | |
Wolof | wo | |
Xhosa | xh | |
Yiddish | yi | |
Yoruba | yo | |
Chinese | zh
Zulu | zu | |
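
To translate a different direction, replace `de` and `fr` in the commands above with codes from this table. Before doing so it can help to confirm that the direction appears in the `language_pairs.txt` file shipped with the model. The sketch below uses a hypothetical helper `pair_supported` and assumes the file lists pairs as `src-tgt` separated by commas or newlines; that format is an assumption, not something this README documents.

```bash
# Hypothetical helper: succeed if "<src>-<tgt>" is listed in the pairs file.
# Assumes pairs are written as src-tgt and separated by commas or newlines.
pair_supported() {
    local pairs_file=$1 src=$2 tgt=$3
    tr ',' '\n' < "$pairs_file" |   # one pair per line
        tr -d ' \r' |               # drop stray spaces and carriage returns
        grep -xF -- "$src-$tgt" > /dev/null
}

# Example usage:
# pair_supported flores101_mm100_615M/language_pairs.txt de fr && echo "de-fr is listed"
```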