---
language:
- multilingual
- ar
- cs
- de
- en
- es
- et
- fi
- fr
- gu
- hi
- it
- ja
- kk
- ko
- lt
- lv
- my
- ne
- nl
- ro
- ru
- si
- tr
- vi
- zh
- af
- az
- bn
- fa
- he
- hr
- id
- ka
- km
- mk
- ml
- mn
- mr
- pl
- ps
- pt
- sv
- sw
- ta
- te
- th
- tl
- uk
- ur
- xh
- gl
- sl
tags:
- mbart-50
pipeline_tag: translation
---

# mBART-50 many to many multilingual machine translation

This model is a fine-tuned checkpoint of [mBART-large-50](https://huggingface.co/facebook/mbart-large-50). `mbart-large-50-many-to-many-mmt` is fine-tuned for multilingual machine translation. It was introduced in the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401).

The model can translate directly between any pair of 50 languages. To translate into a target language, the target language id must be forced as the first generated token. To do this, pass the `forced_bos_token_id` parameter to the `generate` method.

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

article_hi = "संयुक्त राष्ट्र के प्रमुख का कहना है कि सीरिया में कोई सैन्य समाधान नहीं है"
article_ar = "الأمين العام للأمم المتحدة يقول إنه لا يوجد حل عسكري في سوريا."

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

# translate Hindi to French
tokenizer.src_lang = "hi_IN"  # language of the input text
encoded_hi = tokenizer(article_hi, return_tensors="pt")
generated_tokens = model.generate(
    **encoded_hi,
    forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"]  # target language
)
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "Le chef de l 'ONU affirme qu 'il n 'y a pas de solution militaire dans la Syrie."

# translate Arabic to English
tokenizer.src_lang = "ar_AR"  # language of the input text
encoded_ar = tokenizer(article_ar, return_tensors="pt")
generated_tokens = model.generate(
    **encoded_ar,
    forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"]  # target language
)
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "The Secretary-General of the United Nations says there is no military solution in Syria."
```
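
The steps in the example above (set `src_lang`, tokenize, generate with `forced_bos_token_id`, decode) can be wrapped in a small helper. The following is a minimal sketch, not part of `transformers`: the function name `translate` and its signature are hypothetical, and `padding=True` is an assumption added so that several sentences of different lengths can be encoded as one batch. It expects an already loaded `model` and `tokenizer`.

```python
def translate(model, tokenizer, texts, src_lang, tgt_lang):
    """Translate a list of strings between two mBART-50 language codes.

    Hypothetical helper for illustration; `model` and `tokenizer` are
    assumed to be loaded as in the example above.
    """
    # Tell the tokenizer which language the input is written in.
    tokenizer.src_lang = src_lang
    # Assumption: pad so that inputs of different lengths fit in one batch.
    encoded = tokenizer(texts, return_tensors="pt", padding=True)
    # Force the target language id as the first generated token.
    generated = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],
    )
    # Drop language ids and other special tokens from the decoded text.
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

With the objects loaded above, `translate(model, tokenizer, [article_hi], "hi_IN", "fr_XX")` should correspond to the Hindi to French call, returning a list with one string per input.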

See the [model hub](https://huggingface.co/models?filter=mbart-50) to look for more fine-tuned versions.

## Languages covered

Arabic (ar_AR), Czech (cs_CZ), German (de_DE), English (en_XX), Spanish (es_XX), Estonian (et_EE), Finnish (fi_FI), French (fr_XX), Gujarati (gu_IN), Hindi (hi_IN), Italian (it_IT), Japanese (ja_XX), Kazakh (kk_KZ), Korean (ko_KR), Lithuanian (lt_LT), Latvian (lv_LV), Burmese (my_MM), Nepali (ne_NP), Dutch (nl_XX), Romanian (ro_RO), Russian (ru_RU), Sinhala (si_LK), Turkish (tr_TR), Vietnamese (vi_VN), Chinese (zh_CN), Afrikaans (af_ZA), Azerbaijani (az_AZ), Bengali (bn_IN), Persian (fa_IR), Hebrew (he_IL), Croatian (hr_HR), Indonesian (id_ID), Georgian (ka_GE), Khmer (km_KH), Macedonian (mk_MK), Malayalam (ml_IN), Mongolian (mn_MN), Marathi (mr_IN), Polish (pl_PL), Pashto (ps_AF), Portuguese (pt_XX), Swedish (sv_SE), Swahili (sw_KE), Tamil (ta_IN), Telugu (te_IN), Thai (th_TH), Tagalog (tl_XX), Ukrainian (uk_UA), Urdu (ur_PK), Xhosa (xh_ZA), Galician (gl_ES), Slovene (sl_SI)
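
The codes in parentheses are the strings used for `tokenizer.src_lang` and as keys of `tokenizer.lang_code_to_id` in the usage example. If your data is labelled with two-letter tags such as `hi` or `fr`, a lookup like the one below may help. This is only a sketch: `MBART50_CODES`, `PREFIX_TO_CODE` and `to_mbart_code` are hypothetical names introduced here for illustration, not part of `transformers`, and the code list is copied by hand from the list above.

```python
# Language codes copied from the "Languages covered" list.
MBART50_CODES = (
    "ar_AR cs_CZ de_DE en_XX es_XX et_EE fi_FI fr_XX gu_IN hi_IN it_IT ja_XX "
    "kk_KZ ko_KR lt_LT lv_LV my_MM ne_NP nl_XX ro_RO ru_RU si_LK tr_TR vi_VN "
    "zh_CN af_ZA az_AZ bn_IN fa_IR he_IL hr_HR id_ID ka_GE km_KH mk_MK ml_IN "
    "mn_MN mr_IN pl_PL ps_AF pt_XX sv_SE sw_KE ta_IN te_IN th_TH tl_XX uk_UA "
    "ur_PK xh_ZA gl_ES sl_SI"
).split()

# Map the two-letter prefix (e.g. "hi") to the full code (e.g. "hi_IN").
PREFIX_TO_CODE = {code.split("_")[0]: code for code in MBART50_CODES}


def to_mbart_code(lang):
    """Return the mBART-50 code for a two-letter tag, or raise ValueError."""
    try:
        return PREFIX_TO_CODE[lang.lower()]
    except KeyError:
        raise ValueError(f"unsupported language tag: {lang!r}") from None


print(to_mbart_code("hi"))  # hi_IN
```

The returned string can then be assigned to `tokenizer.src_lang` or used to look up the target language id.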

## BibTeX entry and citation info

```
@article{tang2020multilingual,
    title={Multilingual Translation with Extensible Multilingual Pretraining and Finetuning},
    author={Yuqing Tang and Chau Tran and Xian Li and Peng-Jen Chen and Naman Goyal and Vishrav Chaudhary and Jiatao Gu and Angela Fan},
    year={2020},
    eprint={2008.00401},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```