opus-mt-tc-bible-big-deu_eng_fra_por_spa-iir

Model Details

Neural machine translation model for translating from German, English, French, Portuguese and Spanish (deu+eng+fra+por+spa) to Indo-Iranian languages (iir).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages in the world. All models are originally trained using the amazing framework of Marian NMT, an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS and training pipelines use the procedures of OPUS-MT-train.

Model Description:

  • Developed by: Language Technology Research Group at the University of Helsinki
  • Model Type: Translation (transformer-big)
  • Release: 2024-05-30
  • License: Apache-2.0
  • Language(s):
    • Source Language(s): deu eng fra por spa
    • Target Language(s): anp asm awa bal ben bho bpy ckb diq div dty fas gbm glk guj hif hin hne hns jdt kas kmr kok kur lah lrc mag mai mar mzn nep npi ori oss pal pan pes pli prs pus rhg rmy rom san sdh sin skr snd syl tgk tly urd zza
    • Valid Target Language Labels: >>aee<< >>aeq<< >>aiq<< >>anp<< >>anr<< >>ask<< >>asm<< >>atn<< >>avd<< >>ave<< >>awa<< >>bal<< >>bal_Latn<< >>bdv<< >>ben<< >>bfb<< >>bfy<< >>bfz<< >>bgc<< >>bgd<< >>bge<< >>bgw<< >>bha<< >>bhb<< >>bhd<< >>bhe<< >>bhh<< >>bhi<< >>bho<< >>bht<< >>bhu<< >>bjj<< >>bjm<< >>bkk<< >>bmj<< >>bns<< >>bpx<< >>bpy<< >>bqi<< >>bra<< >>bsg<< >>bsh<< >>btv<< >>ccp<< >>cdh<< >>cdi<< >>cdj<< >>cih<< >>ckb<< >>clh<< >>ctg<< >>dcc<< >>def<< >>deh<< >>dhn<< >>dho<< >>diq<< >>div<< >>dmk<< >>dml<< >>doi<< >>dry<< >>dty<< >>dub<< >>duh<< >>dwz<< >>emx<< >>esh<< >>fas<< >>fay<< >>gas<< >>gbk<< >>gbl<< >>gbm<< >>gbz<< >>gdx<< >>ggg<< >>ghr<< >>gig<< >>gjk<< >>glh<< >>glk<< >>goz<< >>gra<< >>guj<< >>gwc<< >>gwf<< >>gwt<< >>gzi<< >>hac<< >>haj<< >>haz<< >>hca<< >>hif<< >>hif_Latn<< >>hii<< >>hin<< >>hin_Latn<< >>hlb<< >>hne<< >>hns<< >>hrz<< >>isk<< >>jdg<< >>jdt<< >>jdt_Cyrl<< >>jml<< >>jnd<< >>jns<< >>jpr<< >>kas<< >>kas_Arab<< >>kas_Deva<< >>kbu<< >>keq<< >>key<< >>kfm<< >>kfr<< >>kfs<< >>kft<< >>kfu<< >>kfv<< >>kfx<< >>kfy<< >>kgn<< >>khn<< >>kho<< >>khw<< >>kjo<< >>kls<< >>kmr<< >>kok<< >>kra<< >>ksy<< >>ktl<< >>kur<< >>kur_Arab<< >>kur_Cyrl<< >>kur_Latn<< >>kvx<< >>kxp<< >>kyw<< >>lah<< >>lbm<< >>lhl<< >>lki<< >>lmn<< >>lrc<< >>lrl<< >>lsa<< >>lss<< >>luv<< >>luz<< >>mag<< >>mai<< >>mar<< >>mby<< >>mjl<< >>mjz<< >>mkb<< >>mke<< >>mki<< >>mnj<< >>mvy<< >>mwr<< >>mzn<< >>nag<< >>nep<< >>nhh<< >>nli<< >>nlx<< >>noe<< >>noi<< >>npi<< >>ntz<< >>nyq<< >>odk<< >>okh<< >>omr<< >>oos<< >>ori<< >>ort<< >>oru<< >>oss<< >>pal<< >>pan<< >>pan_Guru<< >>paq<< >>pcl<< >>peo<< >>pes<< >>pgg<< >>phd<< >>phl<< >>phv<< >>pli<< >>plk<< >>plp<< >>pmh<< >>prc<< >>prn<< >>prs<< >>psh<< >>psi<< >>psu<< >>pus<< >>pwr<< >>raj<< >>rat<< >>rdb<< >>rei<< >>rhg<< >>rhg_Latn<< >>rjs<< >>rkt<< >>rmi<< >>rmq<< >>rmt<< >>rmy<< >>rom<< >>rtw<< >>san<< >>san_Deva<< >>saz<< >>sbn<< >>sck<< >>scl<< >>sdb<< >>sdf<< >>sdg<< >>sdh<< >>sdr<< >>sgh<< >>sgl<< >>sgr<< >>sgy<< >>shd<< >>shm<< >>sin<< >>siy<< >>sjp<< >>skr<< >>smm<< >>smv<< >>smy<< >>snd<< >>snd_Arab<< >>sog<< >>soi<< >>soj<< >>sqo<< >>srh<< >>srx<< >>srz<< >>ssi<< >>sts<< >>syl<< >>syl_Sylo<< >>tdb<< >>tgk<< >>tgk_Cyrl<< >>tgk_Latn<< >>the<< >>thl<< >>thq<< >>thr<< >>tkb<< >>tks<< >>tkt<< >>tly<< >>tly_Latn<< >>tnv<< >>tov<< >>tra<< >>trm<< >>trw<< >>ttt<< >>urd<< >>ush<< >>vaa<< >>vaf<< >>vah<< >>vas<< >>vav<< >>ved<< >>vgr<< >>vmh<< >>wbk<< >>wbl<< >>wne<< >>wsv<< >>wtm<< >>xbc<< >>xco<< >>xka<< >>xkc<< >>xkj<< >>xkp<< >>xpr<< >>xsc<< >>xtq<< >>xvi<< >>xxx<< >>yah<< >>yai<< >>ydg<< >>zum<< >>zza<<
  • Original Model: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip
  • Resources for more information:

This is a multilingual translation model with multiple target languages. A sentence-initial language token is required, in the form >>id<< (where id is a valid target language ID), e.g. >>anp<<.
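As a quick illustration, the token can be prepended to raw sentences before tokenization. This is a minimal sketch in plain Python; tag_for_target is a hypothetical helper name introduced here for illustration, not part of the model's API:

def tag_for_target(sentences, target_lang):
    # target_lang must be one of the valid target language labels listed
    # above, e.g. "anp" or "hin" (hypothetical helper, illustration only)
    return [f">>{target_lang}<< {s}" for s in sentences]

print(tag_for_target(["Guten Morgen!"], "hin"))
# ['>>hin<< Guten Morgen!']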

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing or offensive, and that can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

How to Get Started With the Model

A short code example:

from transformers import MarianMTModel, MarianTokenizer

# Each input sentence must start with a valid target language token.
src_text = [
    ">>anp<< Replace this with text in an accepted source language.",
    ">>zza<< This is the second sentence."
]

model_name = "Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-iir"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

You can also use OPUS-MT models with the transformers pipelines, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-iir")
print(pipe(">>anp<< Replace this with text in an accepted source language."))
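Because the target language is selected only by the sentence-initial token, a single loaded pipeline can serve several target languages. A minimal sketch, assuming the model ID above and an arbitrary choice of example target languages:

from transformers import pipeline

pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-iir")
src = "Replace this with text in an accepted source language."
for lang in ["hin", "ben", "urd"]:
    # Only the language token changes; model and tokenizer stay loaded.
    out = pipe(f">>{lang}<< {src}")
    print(lang, out[0]["translation_text"])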

Training

Evaluation

| langpair | testset | chr-F | BLEU | #sent | #words |
|----------|---------|-------|------|-------|--------|
| deu-fas | tatoeba-test-v2021-08-07 | 0.45763 | 20.3 | 3185 | 24941 |
| deu-kur_Latn | tatoeba-test-v2021-08-07 | 1.027 | 0.6 | 223 | 1249 |
| eng-ben | tatoeba-test-v2021-08-07 | 0.47927 | 17.6 | 2500 | 11654 |
| eng-fas | tatoeba-test-v2021-08-07 | 0.40192 | 17.1 | 3762 | 31110 |
| eng-hin | tatoeba-test-v2021-08-07 | 0.52525 | 28.4 | 5000 | 32904 |
| eng-kur_Latn | tatoeba-test-v2021-08-07 | 0.493 | 0.0 | 290 | 1682 |
| eng-mar | tatoeba-test-v2021-08-07 | 0.52549 | 24.4 | 10396 | 61140 |
| eng-pes | tatoeba-test-v2021-08-07 | 0.40401 | 17.3 | 3757 | 31044 |
| eng-urd | tatoeba-test-v2021-08-07 | 0.45764 | 18.1 | 1663 | 12155 |
| fra-fas | tatoeba-test-v2021-08-07 | 0.42414 | 18.9 | 376 | 3217 |
| deu-npi | flores101-devtest | 3.082 | 0.2 | 1012 | 19762 |
| eng-ben | flores101-devtest | 0.51055 | 17.0 | 1012 | 21155 |
| eng-ckb | flores101-devtest | 0.45337 | 7.1 | 1012 | 21159 |
| eng-guj | flores101-devtest | 0.53972 | 22.3 | 1012 | 23840 |
| eng-hin | flores101-devtest | 0.57980 | 33.4 | 1012 | 27743 |
| eng-mar | flores101-devtest | 0.48206 | 14.3 | 1012 | 21810 |
| eng-urd | flores101-devtest | 0.48050 | 20.5 | 1012 | 28098 |
| fra-ben | flores101-devtest | 0.43806 | 10.9 | 1012 | 21155 |
| fra-ckb | flores101-devtest | 0.41016 | 4.9 | 1012 | 21159 |
| por-ben | flores101-devtest | 0.42730 | 10.0 | 1012 | 21155 |
| por-npi | flores101-devtest | 2.084 | 0.2 | 1012 | 19762 |
| spa-hin | flores101-devtest | 0.43371 | 16.0 | 1012 | 27743 |
| deu-ben | flores200-devtest | 0.44005 | 10.6 | 1012 | 21155 |
| deu-hin | flores200-devtest | 0.48448 | 22.3 | 1012 | 27743 |
| deu-hne | flores200-devtest | 0.42659 | 13.8 | 1012 | 26582 |
| deu-mag | flores200-devtest | 0.42477 | 14.0 | 1012 | 26516 |
| deu-npi | flores200-devtest | 5.870 | 0.1 | 1012 | 19762 |
| deu-pes | flores200-devtest | 0.42726 | 14.9 | 1012 | 24986 |
| deu-tgk | flores200-devtest | 0.40932 | 12.9 | 1012 | 25530 |
| deu-urd | flores200-devtest | 0.41250 | 14.4 | 1012 | 28098 |
| eng-ben | flores200-devtest | 0.51361 | 17.1 | 1012 | 21155 |
| eng-ckb | flores200-devtest | 0.45750 | 7.7 | 1012 | 21152 |
| eng-guj | flores200-devtest | 0.54231 | 22.4 | 1012 | 23840 |
| eng-hin | flores200-devtest | 0.58371 | 33.7 | 1012 | 27743 |
| eng-hne | flores200-devtest | 0.47591 | 19.9 | 1012 | 26582 |
| eng-mag | flores200-devtest | 0.51070 | 22.2 | 1012 | 26516 |
| eng-mar | flores200-devtest | 0.48733 | 14.8 | 1012 | 21810 |
| eng-pan | flores200-devtest | 0.45015 | 18.1 | 1012 | 27451 |
| eng-pes | flores200-devtest | 0.48588 | 21.1 | 1012 | 24986 |
| eng-prs | flores200-devtest | 0.51879 | 24.5 | 1012 | 25885 |
| eng-sin | flores200-devtest | 0.43823 | 10.6 | 1012 | 23278 |
| eng-tgk | flores200-devtest | 0.47323 | 17.8 | 1012 | 25530 |
| eng-urd | flores200-devtest | 0.48212 | 20.4 | 1012 | 28098 |
| fra-ben | flores200-devtest | 0.44029 | 11.0 | 1012 | 21155 |
| fra-ckb | flores200-devtest | 0.41353 | 5.3 | 1012 | 21152 |
| fra-hin | flores200-devtest | 0.48406 | 22.6 | 1012 | 27743 |
| fra-hne | flores200-devtest | 0.42353 | 13.9 | 1012 | 26582 |
| fra-mag | flores200-devtest | 0.42678 | 14.3 | 1012 | 26516 |
| fra-npi | flores200-devtest | 6.525 | 0.1 | 1012 | 19762 |
| fra-pes | flores200-devtest | 0.43526 | 15.5 | 1012 | 24986 |
| fra-tgk | flores200-devtest | 0.42982 | 13.7 | 1012 | 25530 |
| fra-urd | flores200-devtest | 0.41438 | 14.2 | 1012 | 28098 |
| por-ben | flores200-devtest | 0.43390 | 10.4 | 1012 | 21155 |
| por-ckb | flores200-devtest | 0.42303 | 5.6 | 1012 | 21152 |
| por-hin | flores200-devtest | 0.49524 | 23.6 | 1012 | 27743 |
| por-hne | flores200-devtest | 0.42269 | 13.9 | 1012 | 26582 |
| por-mag | flores200-devtest | 0.42753 | 15.0 | 1012 | 26516 |
| por-npi | flores200-devtest | 6.737 | 0.1 | 1012 | 19762 |
| por-pes | flores200-devtest | 0.43194 | 15.4 | 1012 | 24986 |
| por-tgk | flores200-devtest | 0.41860 | 13.2 | 1012 | 25530 |
| por-urd | flores200-devtest | 0.41799 | 14.8 | 1012 | 28098 |
| spa-ben | flores200-devtest | 0.41893 | 8.3 | 1012 | 21155 |
| spa-hin | flores200-devtest | 0.43777 | 16.4 | 1012 | 27743 |
| spa-kas_Arab | flores200-devtest | 9.380 | 0.1 | 1012 | 23514 |
| spa-npi | flores200-devtest | 7.518 | 0.2 | 1012 | 19762 |
| spa-pes | flores200-devtest | 0.40856 | 12.2 | 1012 | 24986 |
| spa-prs | flores200-devtest | 0.40361 | 12.8 | 1012 | 25885 |
| spa-tgk | flores200-devtest | 0.40100 | 10.8 | 1012 | 25530 |
| eng-hin | newstest2014 | 0.51249 | 23.6 | 2507 | 60872 |
| eng-guj | newstest2019 | 0.57282 | 25.5 | 998 | 21924 |
| deu-ben | ntrex128 | 0.43971 | 9.6 | 1997 | 40095 |
| deu-fas | ntrex128 | 0.41469 | 13.8 | 1997 | 50525 |
| deu-hin | ntrex128 | 0.42940 | 16.8 | 1997 | 55219 |
| deu-snd_Arab | ntrex128 | 6.129 | 0.1 | 1997 | 49866 |
| deu-urd | ntrex128 | 0.41881 | 14.5 | 1997 | 54259 |
| eng-ben | ntrex128 | 0.51555 | 16.6 | 1997 | 40095 |
| eng-fas | ntrex128 | 0.46895 | 19.7 | 1997 | 50525 |
| eng-guj | ntrex128 | 0.48990 | 17.1 | 1997 | 45335 |
| eng-hin | ntrex128 | 0.52307 | 26.9 | 1997 | 55219 |
| eng-mar | ntrex128 | 0.44580 | 10.4 | 1997 | 42375 |
| eng-nep | ntrex128 | 0.42955 | 8.4 | 1997 | 40570 |
| eng-pan | ntrex128 | 0.46141 | 19.6 | 1997 | 54355 |
| eng-sin | ntrex128 | 0.42236 | 9.7 | 1997 | 44429 |
| eng-snd_Arab | ntrex128 | 1.932 | 0.1 | 1997 | 49866 |
| eng-urd | ntrex128 | 0.49646 | 22.1 | 1997 | 54259 |
| fra-ben | ntrex128 | 0.41716 | 8.9 | 1997 | 40095 |
| fra-fas | ntrex128 | 0.41282 | 13.8 | 1997 | 50525 |
| fra-hin | ntrex128 | 0.42475 | 17.1 | 1997 | 55219 |
| fra-snd_Arab | ntrex128 | 6.047 | 0.0 | 1997 | 49866 |
| fra-urd | ntrex128 | 0.41536 | 14.8 | 1997 | 54259 |
| por-ben | ntrex128 | 0.43855 | 9.9 | 1997 | 40095 |
| por-fas | ntrex128 | 0.42010 | 14.4 | 1997 | 50525 |
| por-hin | ntrex128 | 0.43275 | 17.6 | 1997 | 55219 |
| por-snd_Arab | ntrex128 | 6.336 | 0.1 | 1997 | 49866 |
| por-urd | ntrex128 | 0.42484 | 15.2 | 1997 | 54259 |
| spa-ben | ntrex128 | 0.44905 | 10.3 | 1997 | 40095 |
| spa-fas | ntrex128 | 0.42207 | 14.1 | 1997 | 50525 |
| spa-hin | ntrex128 | 0.43380 | 17.6 | 1997 | 55219 |
| spa-snd_Arab | ntrex128 | 5.551 | 0.0 | 1997 | 49866 |
| spa-urd | ntrex128 | 0.42434 | 15.0 | 1997 | 54259 |
| eng-ben | tico19-test | 0.51563 | 17.9 | 2100 | 51695 |
| eng-ckb | tico19-test | 0.46188 | 8.9 | 2100 | 50500 |
| eng-fas | tico19-test | 0.53182 | 25.8 | 2100 | 59779 |
| eng-hin | tico19-test | 0.63128 | 41.6 | 2100 | 62680 |
| eng-mar | tico19-test | 0.45619 | 12.9 | 2100 | 50872 |
| eng-nep | tico19-test | 0.53413 | 17.6 | 2100 | 48363 |
| eng-prs | tico19-test | 0.44101 | 17.3 | 2100 | 62972 |
| eng-pus | tico19-test | 0.47063 | 20.5 | 2100 | 66213 |
| eng-urd | tico19-test | 0.51054 | 22.0 | 2100 | 65312 |
| fra-fas | tico19-test | 0.43476 | 17.9 | 2100 | 59779 |
| fra-hin | tico19-test | 0.48625 | 25.6 | 2100 | 62680 |
| fra-nep | tico19-test | 0.41153 | 9.7 | 2100 | 48363 |
| fra-urd | tico19-test | 0.40482 | 14.4 | 2100 | 65312 |
| por-ben | tico19-test | 0.45814 | 12.5 | 2100 | 51695 |
| por-ckb | tico19-test | 0.41684 | 5.6 | 2100 | 50500 |
| por-fas | tico19-test | 0.49181 | 21.3 | 2100 | 59779 |
| por-hin | tico19-test | 0.55759 | 31.1 | 2100 | 62680 |
| por-mar | tico19-test | 0.40067 | 9.1 | 2100 | 50872 |
| por-nep | tico19-test | 0.47378 | 12.1 | 2100 | 48363 |
| por-pus | tico19-test | 0.42496 | 15.9 | 2100 | 66213 |
| por-urd | tico19-test | 0.45560 | 16.6 | 2100 | 65312 |
| spa-ben | tico19-test | 0.45751 | 12.7 | 2100 | 51695 |
| spa-ckb | tico19-test | 0.41568 | 5.4 | 2100 | 50500 |
| spa-fas | tico19-test | 0.48974 | 21.0 | 2100 | 59779 |
| spa-hin | tico19-test | 0.55641 | 30.9 | 2100 | 62680 |
| spa-mar | tico19-test | 0.40329 | 9.4 | 2100 | 50872 |
| spa-nep | tico19-test | 0.47164 | 12.1 | 2100 | 48363 |
| spa-prs | tico19-test | 0.41879 | 14.3 | 2100 | 62972 |
| spa-pus | tico19-test | 0.41714 | 15.1 | 2100 | 66213 |
| spa-urd | tico19-test | 0.44931 | 15.3 | 2100 | 65312 |
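
Scores of this kind can be computed with, e.g., the sacreBLEU toolkit. A minimal sketch, assuming the sacrebleu package is installed and using toy stand-ins for real system outputs and reference translations:

import sacrebleu

# Toy stand-ins: one model output and one reference per line,
# aligned for a single language pair and test set.
hypotheses = ["यह एक परीक्षण वाक्य है।"]
references = [["यह एक परीक्षण वाक्य है।"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)

# sacreBLEU reports chrF on a 0-100 scale; the table above uses 0-1.
print(f"BLEU: {bleu.score:.1f}  chr-F: {chrf.score / 100:.5f}")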

Citation Information

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  number={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the HPLT project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the EuroHPC supercomputer LUMI.

Model conversion info

  • transformers version: 4.45.1
  • OPUS-MT git hash: 0882077
  • port time: Tue Oct 8 10:05:20 EEST 2024
  • port machine: LM0-400-22516.local