opus-mt-tc-bible-big-inc-deu_eng_fra_por_spa

Model Details

Neural machine translation model for translating from Indic languages (inc) to German, English, French, Portuguese and Spanish (deu+eng+fra+por+spa).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages in the world. All models are originally trained using the amazing framework of Marian NMT, an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS, and the training pipelines follow the procedures of OPUS-MT-train.

Model Description:

This is a multilingual translation model with multiple target languages. Each input sentence must begin with a language token of the form >>id<<, where id is a valid target language ID, e.g. >>deu<< for German.
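The token requirement above can be handled with a small helper before tokenization. The following is a minimal sketch, not part of the OPUS-MT tooling: the names TARGET_LANGS and add_target_token are hypothetical, and the set of IDs is assumed from the model name (deu, eng, fra, por, spa).

```python
# Hypothetical helper: the names below are illustrative, not an OPUS-MT API.
# Target language IDs are assumed from the model name (deu_eng_fra_por_spa).
TARGET_LANGS = {"deu", "eng", "fra", "por", "spa"}


def add_target_token(sentences, target_lang):
    """Prefix each sentence with the >>id<< token for target_lang."""
    if target_lang not in TARGET_LANGS:
        raise ValueError(f"unsupported target language: {target_lang!r}")
    # Strip surrounding whitespace so the token is always sentence-initial.
    return [f">>{target_lang}<< {s.strip()}" for s in sentences]
```

The returned list can be passed to the tokenizer in place of hand-written strings, which avoids sending a sentence to the model without a target token.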

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing, offensive, and can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

How to Get Started With the Model

A short code example:

from transformers import MarianMTModel, MarianTokenizer

# Each input starts with the >>id<< token of the desired target language.
src_text = [
    ">>deu<< Replace this with text in an accepted source language.",
    ">>spa<< This is the second sentence."
]

# Same Hub ID as in the pipeline example.
model_name = "Helsinki-NLP/opus-mt-tc-bible-big-inc-deu_eng_fra_por_spa"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch with padding, then generate the translations.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

You can also use OPUS-MT models with the transformers pipelines, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-inc-deu_eng_fra_por_spa")
print(pipe(">>deu<< Replace this with text in an accepted source language."))

Training

Evaluation

langpair testset chr-F BLEU #sent #words
awa-eng tatoeba-test-v2021-08-07 0.60390 40.8 279 1335
ben-eng tatoeba-test-v2021-08-07 0.64078 49.4 2500 13978
hin-eng tatoeba-test-v2021-08-07 0.64929 49.1 5000 33943
mar-eng tatoeba-test-v2021-08-07 0.64074 48.0 10396 67527
urd-eng tatoeba-test-v2021-08-07 0.52963 35.0 1663 12029
ben-eng flores101-devtest 0.57906 30.4 1012 24721
ben-fra flores101-devtest 0.50109 21.9 1012 28343
guj-spa flores101-devtest 0.44065 15.2 1012 29199
mar-deu flores101-devtest 0.44067 13.8 1012 25094
mar-por flores101-devtest 0.46685 18.6 1012 26519
mar-spa flores101-devtest 0.41662 14.0 1012 29199
pan-eng flores101-devtest 0.59922 33.0 1012 24721
pan-por flores101-devtest 0.49373 21.9 1012 26519
pan-spa flores101-devtest 0.43910 15.4 1012 29199
asm-eng flores200-devtest 0.48584 21.9 1012 24721
awa-deu flores200-devtest 0.47173 16.5 1012 25094
awa-eng flores200-devtest 0.50582 24.5 1012 24721
awa-fra flores200-devtest 0.49682 21.4 1012 28343
awa-por flores200-devtest 0.49663 21.5 1012 26519
awa-spa flores200-devtest 0.43740 15.1 1012 29199
ben-deu flores200-devtest 0.47330 16.6 1012 25094
ben-eng flores200-devtest 0.58077 30.5 1012 24721
ben-fra flores200-devtest 0.50884 22.6 1012 28343
ben-por flores200-devtest 0.50054 21.4 1012 26519
ben-spa flores200-devtest 0.44159 15.2 1012 29199
bho-deu flores200-devtest 0.42660 12.6 1012 25094
bho-eng flores200-devtest 0.50609 22.7 1012 24721
bho-fra flores200-devtest 0.44889 16.8 1012 28343
bho-por flores200-devtest 0.44582 16.9 1012 26519
bho-spa flores200-devtest 0.40581 13.1 1012 29199
guj-deu flores200-devtest 0.46665 16.8 1012 25094
guj-eng flores200-devtest 0.61383 34.3 1012 24721
guj-fra flores200-devtest 0.50410 22.3 1012 28343
guj-por flores200-devtest 0.49257 21.3 1012 26519
guj-spa flores200-devtest 0.44565 15.6 1012 29199
hin-deu flores200-devtest 0.50226 20.4 1012 25094
hin-eng flores200-devtest 0.63336 37.3 1012 24721
hin-fra flores200-devtest 0.53701 25.9 1012 28343
hin-por flores200-devtest 0.53448 25.5 1012 26519
hin-spa flores200-devtest 0.46171 17.2 1012 29199
hne-deu flores200-devtest 0.49698 19.0 1012 25094
hne-eng flores200-devtest 0.63936 38.5 1012 24721
hne-fra flores200-devtest 0.52835 25.3 1012 28343
hne-por flores200-devtest 0.52788 25.0 1012 26519
hne-spa flores200-devtest 0.45443 16.7 1012 29199
mag-deu flores200-devtest 0.50359 19.7 1012 25094
mag-eng flores200-devtest 0.63906 38.0 1012 24721
mag-fra flores200-devtest 0.53616 25.8 1012 28343
mag-por flores200-devtest 0.53537 25.9 1012 26519
mag-spa flores200-devtest 0.45822 16.9 1012 29199
mai-deu flores200-devtest 0.46791 16.2 1012 25094
mai-eng flores200-devtest 0.57461 30.4 1012 24721
mai-fra flores200-devtest 0.50585 22.1 1012 28343
mai-por flores200-devtest 0.50490 22.0 1012 26519
mai-spa flores200-devtest 0.44366 15.3 1012 29199
mar-deu flores200-devtest 0.44725 14.5 1012 25094
mar-eng flores200-devtest 0.58500 31.4 1012 24721
mar-fra flores200-devtest 0.47027 19.5 1012 28343
mar-por flores200-devtest 0.47216 19.3 1012 26519
mar-spa flores200-devtest 0.42178 14.2 1012 29199
npi-deu flores200-devtest 0.46631 16.4 1012 25094
npi-eng flores200-devtest 0.59776 32.3 1012 24721
npi-fra flores200-devtest 0.50548 22.5 1012 28343
npi-por flores200-devtest 0.50202 21.7 1012 26519
npi-spa flores200-devtest 0.43804 15.3 1012 29199
pan-deu flores200-devtest 0.48421 18.7 1012 25094
pan-eng flores200-devtest 0.60676 33.8 1012 24721
pan-fra flores200-devtest 0.51368 23.5 1012 28343
pan-por flores200-devtest 0.50586 22.7 1012 26519
pan-spa flores200-devtest 0.44653 16.5 1012 29199
sin-deu flores200-devtest 0.44676 14.2 1012 25094
sin-eng flores200-devtest 0.54777 26.8 1012 24721
sin-fra flores200-devtest 0.47283 19.0 1012 28343
sin-por flores200-devtest 0.46935 18.4 1012 26519
sin-spa flores200-devtest 0.42143 13.7 1012 29199
urd-deu flores200-devtest 0.46542 17.1 1012 25094
urd-eng flores200-devtest 0.56935 29.3 1012 24721
urd-fra flores200-devtest 0.50276 22.3 1012 28343
urd-por flores200-devtest 0.48010 20.3 1012 26519
urd-spa flores200-devtest 0.43032 14.7 1012 29199
hin-eng newstest2014 0.59329 30.3 2507 55571
guj-eng newstest2019 0.53383 26.9 1016 17757
ben-deu ntrex128 0.45180 14.6 1997 48761
ben-eng ntrex128 0.57247 29.5 1997 47673
ben-fra ntrex128 0.46475 18.0 1997 53481
ben-por ntrex128 0.45486 16.8 1997 51631
ben-spa ntrex128 0.48738 21.1 1997 54107
guj-deu ntrex128 0.43539 13.9 1997 48761
guj-eng ntrex128 0.58894 31.6 1997 47673
guj-fra ntrex128 0.45075 16.9 1997 53481
guj-por ntrex128 0.43567 15.2 1997 51631
guj-spa ntrex128 0.47525 20.2 1997 54107
hin-deu ntrex128 0.46336 15.0 1997 48761
hin-eng ntrex128 0.59842 31.5 1997 47673
hin-fra ntrex128 0.48208 19.2 1997 53481
hin-por ntrex128 0.46509 17.6 1997 51631
hin-spa ntrex128 0.49436 21.8 1997 54107
mar-deu ntrex128 0.43119 12.8 1997 48761
mar-eng ntrex128 0.55151 27.3 1997 47673
mar-fra ntrex128 0.43957 16.2 1997 53481
mar-por ntrex128 0.43555 15.4 1997 51631
mar-spa ntrex128 0.46271 19.1 1997 54107
nep-deu ntrex128 0.42940 13.0 1997 48761
nep-eng ntrex128 0.56277 29.1 1997 47673
nep-fra ntrex128 0.44663 16.5 1997 53481
nep-por ntrex128 0.43686 15.4 1997 51631
nep-spa ntrex128 0.46553 19.3 1997 54107
pan-deu ntrex128 0.44036 14.1 1997 48761
pan-eng ntrex128 0.58427 31.6 1997 47673
pan-fra ntrex128 0.45593 17.3 1997 53481
pan-por ntrex128 0.44264 15.9 1997 51631
pan-spa ntrex128 0.47199 20.0 1997 54107
sin-deu ntrex128 0.42280 12.4 1997 48761
sin-eng ntrex128 0.52576 24.6 1997 47673
sin-fra ntrex128 0.43594 15.6 1997 53481
sin-por ntrex128 0.42751 14.4 1997 51631
sin-spa ntrex128 0.45890 18.3 1997 54107
urd-deu ntrex128 0.45737 15.6 1997 48761
urd-eng ntrex128 0.56781 28.6 1997 47673
urd-fra ntrex128 0.47298 18.9 1997 53481
urd-por ntrex128 0.45273 16.2 1997 51631
urd-spa ntrex128 0.48644 21.0 1997 54107
ben-eng tico19-test 0.64568 38.2 2100 56824
ben-fra tico19-test 0.49799 22.0 2100 64661
ben-por tico19-test 0.55115 27.2 2100 62729
ben-spa tico19-test 0.56847 29.9 2100 66563
hin-eng tico19-test 0.70694 46.6 2100 56323
hin-fra tico19-test 0.53932 26.7 2100 64661
hin-por tico19-test 0.60581 33.4 2100 62729
hin-spa tico19-test 0.61585 35.7 2100 66563
mar-eng tico19-test 0.59329 31.8 2100 56315
mar-fra tico19-test 0.46574 19.3 2100 64661
mar-por tico19-test 0.51463 23.6 2100 62729
mar-spa tico19-test 0.52551 25.7 2100 66563
nep-eng tico19-test 0.66283 40.7 2100 56824
nep-fra tico19-test 0.50397 22.8 2100 64661
nep-por tico19-test 0.55951 28.1 2100 62729
nep-spa tico19-test 0.57272 30.3 2100 66563
urd-eng tico19-test 0.57473 30.5 2100 56315
urd-fra tico19-test 0.46725 19.6 2100 64661
urd-por tico19-test 0.50913 23.5 2100 62729
urd-spa tico19-test 0.52387 25.8 2100 66563
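The rows above are whitespace-separated, so they can be parsed directly for quick comparisons. The following is a minimal sketch under that assumption: parse_row and mean_bleu_by_testset are hypothetical helper names, and the sample rows are copied from the table.

```python
from collections import defaultdict

# Three sample rows copied verbatim from the table above.
SAMPLE_ROWS = [
    "awa-eng tatoeba-test-v2021-08-07 0.60390 40.8 279 1335",
    "ben-eng tatoeba-test-v2021-08-07 0.64078 49.4 2500 13978",
    "hin-eng newstest2014 0.59329 30.3 2507 55571",
]


def parse_row(line):
    """Split one table row into a dict with typed fields."""
    langpair, testset, chrf, bleu, n_sent, n_words = line.split()
    return {
        "langpair": langpair,
        "testset": testset,
        "chrf": float(chrf),
        "bleu": float(bleu),
        "sentences": int(n_sent),
        "words": int(n_words),
    }


def mean_bleu_by_testset(lines):
    """Return the unweighted mean BLEU score per test set."""
    scores = defaultdict(list)
    for line in lines:
        row = parse_row(line)
        scores[row["testset"]].append(row["bleu"])
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}


if __name__ == "__main__":
    for name, bleu in mean_bleu_by_testset(SAMPLE_ROWS).items():
        print(f"{name}: {bleu:.1f}")
```

The mean here is unweighted across language pairs; weighting by the #sent column would be a different design choice and gives more influence to the larger test sets.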

Citation Information

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  number={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the HPLT project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the EuroHPC supercomputer LUMI.

Model conversion info

  • transformers version: 4.45.1
  • OPUS-MT git hash: 0882077
  • port time: Tue Oct 8 11:39:25 EEST 2024
  • port machine: LM0-400-22516.local
  • model size: 240M params (Safetensors, tensor type F32)