opus-mt-tc-bible-big-gmq-deu_eng_fra_por_spa

Model Details

Neural machine translation model for translating from North Germanic languages (gmq) to German, English, French, Portuguese, and Spanish (deu+eng+fra+por+spa).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages in the world. All models are originally trained using the amazing framework of Marian NMT, an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS, and training pipelines use the procedures of OPUS-MT-train.

Model Description:

This is a multilingual translation model with multiple target languages. A sentence-initial language token is required in the form of >>id<< (id = valid target language ID), e.g. >>deu<<.
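
The token requirement can be handled with a small helper that runs before tokenization. The following is a minimal sketch, not part of transformers or OPUS-MT: the name `add_target_token` and the `TARGET_LANGS` set are hypothetical, and the five IDs are assumed from the model name (deu+eng+fra+por+spa).

```python
# Hypothetical helper for illustration; not part of transformers or OPUS-MT.
# The target IDs are assumed from the model name (deu+eng+fra+por+spa).
TARGET_LANGS = {"deu", "eng", "fra", "por", "spa"}


def add_target_token(text: str, target_lang: str) -> str:
    """Prefix `text` with the sentence-initial >>id<< target-language token."""
    if target_lang not in TARGET_LANGS:
        raise ValueError(f"unsupported target language ID: {target_lang!r}")
    return f">>{target_lang}<< {text.strip()}"


# Build one input per target language from the same source sentence.
source = "Replace this with text in an accepted source language."
batch = [add_target_token(source, lang) for lang in sorted(TARGET_LANGS)]
for line in batch:
    print(line)
```

Rejecting unknown IDs up front is a design choice in this sketch; it keeps a mistyped target language from reaching the model as ordinary input text.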

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing, offensive, and can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

How to Get Started With the Model

A short code example:

from transformers import MarianMTModel, MarianTokenizer

# Each input starts with a >>id<< token that selects the target language.
src_text = [
    ">>deu<< Replace this with text in an accepted source language.",
    ">>spa<< This is the second sentence."
]

model_name = "Helsinki-NLP/opus-mt-tc-bible-big-gmq-deu_eng_fra_por_spa"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch with padding, then generate the translations.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

# Decode each output sequence back to text, dropping special tokens.
for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

You can also use OPUS-MT models with the transformers pipelines, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-gmq-deu_eng_fra_por_spa")
print(pipe(">>deu<< Replace this with text in an accepted source language."))

Training

Evaluation

langpair testset chr-F BLEU #sent #words
dan-deu tatoeba-test-v2021-08-07 0.74460 56.7 9998 76055
dan-eng tatoeba-test-v2021-08-07 0.77233 64.3 10795 79684
dan-fra tatoeba-test-v2021-08-07 0.76425 60.8 1731 11882
dan-por tatoeba-test-v2021-08-07 0.77248 60.0 873 5360
dan-spa tatoeba-test-v2021-08-07 0.72567 54.9 5000 35528
fao-eng tatoeba-test-v2021-08-07 0.54571 39.6 294 1984
isl-deu tatoeba-test-v2021-08-07 0.68535 51.4 969 6279
isl-eng tatoeba-test-v2021-08-07 0.67066 51.7 2503 19788
isl-spa tatoeba-test-v2021-08-07 0.65659 48.5 238 1229
nno-eng tatoeba-test-v2021-08-07 0.69415 55.5 460 3524
nob-deu tatoeba-test-v2021-08-07 0.69862 50.5 3525 33592
nob-eng tatoeba-test-v2021-08-07 0.72912 59.2 4539 36823
nob-fra tatoeba-test-v2021-08-07 0.71392 52.5 323 2269
nob-spa tatoeba-test-v2021-08-07 0.73300 55.1 885 6866
nor-deu tatoeba-test-v2021-08-07 0.69923 50.7 3651 34575
nor-eng tatoeba-test-v2021-08-07 0.72587 58.8 5000 40355
nor-fra tatoeba-test-v2021-08-07 0.73052 55.1 477 3213
nor-por tatoeba-test-v2021-08-07 0.67948 45.4 481 4182
nor-spa tatoeba-test-v2021-08-07 0.73320 55.3 960 7311
swe-deu tatoeba-test-v2021-08-07 0.71816 55.4 3410 23494
swe-eng tatoeba-test-v2021-08-07 0.76648 64.8 10362 68513
swe-fra tatoeba-test-v2021-08-07 0.72847 57.4 1407 9580
swe-por tatoeba-test-v2021-08-07 0.70554 50.3 320 2032
swe-spa tatoeba-test-v2021-08-07 0.70926 54.3 1351 8235
dan-eng flores101-devtest 0.71193 47.6 1012 24721
dan-fra flores101-devtest 0.63349 38.1 1012 28343
dan-por flores101-devtest 0.62063 36.2 1012 26519
dan-spa flores101-devtest 0.52557 24.2 1012 29199
isl-deu flores101-devtest 0.50581 22.2 1012 25094
isl-eng flores101-devtest 0.57294 31.6 1012 24721
isl-por flores101-devtest 0.52192 25.8 1012 26519
isl-spa flores101-devtest 0.46364 18.5 1012 29199
nob-eng flores101-devtest 0.67120 42.6 1012 24721
nob-fra flores101-devtest 0.60289 33.9 1012 28343
nob-spa flores101-devtest 0.50848 21.9 1012 29199
swe-deu flores101-devtest 0.60306 32.2 1012 25094
swe-eng flores101-devtest 0.70404 47.9 1012 24721
swe-por flores101-devtest 0.61418 35.7 1012 26519
dan-deu flores200-devtest 0.60897 32.3 1012 25094
dan-eng flores200-devtest 0.71641 48.2 1012 24721
dan-fra flores200-devtest 0.63777 38.9 1012 28343
dan-por flores200-devtest 0.62302 36.7 1012 26519
dan-spa flores200-devtest 0.52803 24.4 1012 29199
fao-deu flores200-devtest 0.41184 16.0 1012 25094
fao-eng flores200-devtest 0.43308 21.2 1012 24721
fao-por flores200-devtest 0.42649 19.0 1012 26519
isl-deu flores200-devtest 0.51165 22.7 1012 25094
isl-eng flores200-devtest 0.57745 32.2 1012 24721
isl-fra flores200-devtest 0.54210 27.6 1012 28343
isl-por flores200-devtest 0.52479 26.1 1012 26519
isl-spa flores200-devtest 0.46837 19.2 1012 29199
nno-deu flores200-devtest 0.58054 29.2 1012 25094
nno-eng flores200-devtest 0.69114 45.0 1012 24721
nno-fra flores200-devtest 0.61334 36.0 1012 28343
nno-por flores200-devtest 0.60055 34.1 1012 26519
nno-spa flores200-devtest 0.51190 22.8 1012 29199
nob-deu flores200-devtest 0.57023 27.6 1012 25094
nob-eng flores200-devtest 0.67540 43.1 1012 24721
nob-fra flores200-devtest 0.60568 34.2 1012 28343
nob-por flores200-devtest 0.59466 32.8 1012 26519
nob-spa flores200-devtest 0.51138 22.4 1012 29199
swe-deu flores200-devtest 0.60630 32.6 1012 25094
swe-eng flores200-devtest 0.70584 48.1 1012 24721
swe-fra flores200-devtest 0.63608 39.1 1012 28343
swe-por flores200-devtest 0.62046 36.4 1012 26519
swe-spa flores200-devtest 0.52328 23.9 1012 29199
isl-eng newstest2021 0.56364 32.4 1000 22529
dan-deu ntrex128 0.54229 25.3 1997 48761
dan-eng ntrex128 0.63083 38.7 1997 47673
dan-fra ntrex128 0.54088 26.2 1997 53481
dan-por ntrex128 0.53626 27.0 1997 51631
dan-spa ntrex128 0.56217 30.8 1997 54107
fao-deu ntrex128 0.41701 16.4 1997 48761
fao-eng ntrex128 0.47105 25.3 1997 47673
fao-fra ntrex128 0.40070 16.3 1997 53481
fao-por ntrex128 0.42005 18.0 1997 51631
fao-spa ntrex128 0.44085 20.5 1997 54107
isl-deu ntrex128 0.49932 20.5 1997 48761
isl-eng ntrex128 0.56856 29.7 1997 47673
isl-fra ntrex128 0.51998 24.6 1997 53481
isl-por ntrex128 0.49903 21.7 1997 51631
isl-spa ntrex128 0.53171 27.1 1997 54107
nno-deu ntrex128 0.53000 24.4 1997 48761
nno-eng ntrex128 0.65866 42.9 1997 47673
nno-fra ntrex128 0.54339 27.5 1997 53481
nno-por ntrex128 0.53242 26.3 1997 51631
nno-spa ntrex128 0.55889 30.4 1997 54107
nob-deu ntrex128 0.55549 26.8 1997 48761
nob-eng ntrex128 0.65580 40.9 1997 47673
nob-fra ntrex128 0.56187 29.2 1997 53481
nob-por ntrex128 0.54392 26.6 1997 51631
nob-spa ntrex128 0.57998 32.6 1997 54107
swe-deu ntrex128 0.55549 26.7 1997 48761
swe-eng ntrex128 0.66348 42.2 1997 47673
swe-fra ntrex128 0.56310 29.0 1997 53481
swe-por ntrex128 0.54965 27.8 1997 51631
swe-spa ntrex128 0.58035 32.8 1997 54107
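
To work with these scores programmatically, the rows can be split on whitespace, since every row follows the column order of the header. The following is a minimal sketch under that assumption; the `Score` class and `parse_rows` function are hypothetical names, and the three sample rows are copied from the table above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List


@dataclass
class Score:
    langpair: str
    testset: str
    chrf: float
    bleu: float
    n_sent: int
    n_words: int


def parse_rows(text: str) -> List[Score]:
    """Parse rows laid out as: langpair testset chr-F BLEU #sent #words."""
    scores = []
    for line in text.strip().splitlines():
        pair, testset, chrf, bleu, n_sent, n_words = line.split()
        scores.append(
            Score(pair, testset, float(chrf), float(bleu), int(n_sent), int(n_words))
        )
    return scores


# Three dan-eng rows copied from the table above.
SAMPLE = """
dan-eng tatoeba-test-v2021-08-07 0.77233 64.3 10795 79684
dan-eng flores200-devtest 0.71641 48.2 1012 24721
dan-eng ntrex128 0.63083 38.7 1997 47673
"""

# Group BLEU scores by language pair to compare benchmarks side by side.
by_pair = defaultdict(dict)
for s in parse_rows(SAMPLE):
    by_pair[s.langpair][s.testset] = s.bleu

for pair, results in by_pair.items():
    print(pair, results)
```

Grouping by language pair makes the spread across benchmarks visible: for dan-eng, the BLEU score is noticeably higher on the Tatoeba test set than on flores200-devtest or ntrex128.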

Citation Information

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  number={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the HPLT project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the EuroHPC supercomputer LUMI.

Model conversion info

  • transformers version: 4.45.1
  • OPUS-MT git hash: 0882077
  • port time: Tue Oct 8 11:11:37 EEST 2024
  • port machine: LM0-400-22516.local

Model size: 234M params (Safetensors, tensor type F32)