opus-mt-tc-bible-big-gmq-deu_eng_fra_por_spa

Model Details
Uses
Risks, Limitations and Biases
How to Get Started With the Model
Training
Evaluation
Citation Information
Acknowledgements

Model Details

Neural machine translation model for translating from North Germanic languages (gmq) to German, English, French, Portuguese, and Spanish (deu+eng+fra+por+spa).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages in the world. All models are originally trained using the Marian NMT framework, an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS, and training pipelines use the procedures of OPUS-MT-train.

Model Description:

Developed by: Language Technology Research Group at the University of Helsinki
Model Type: Translation (transformer-big)
Release: 2024-05-30
License: Apache-2.0
Language(s):
- Source Language(s): dan fao isl nno nob non nor swe
- Target Language(s): deu eng fra por spa
- Valid Target Language Labels: >>deu<< >>eng<< >>fra<< >>por<< >>spa<< >>xxx<<
Original Model: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip
Resources for more information:

This is a multilingual translation model with multiple target languages. A sentence-initial language token is required in the form of >>id<< (id = valid target language ID), e.g. >>deu<<.
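
As a minimal sketch of this labelling step, the helper below prepends the target-language token to a source sentence and rejects labels it does not know. The names `add_target_token` and `VALID_TARGETS` are illustrative assumptions and not part of OPUS-MT or the transformers library. The `>>xxx<<` label listed above is left out because this card does not describe its meaning.

```python
# Hypothetical helper (not part of OPUS-MT or transformers): prepend the
# sentence-initial target-language token that this model expects.
VALID_TARGETS = ("deu", "eng", "fra", "por", "spa")  # labels listed in this card


def add_target_token(text: str, target: str) -> str:
    """Return `text` prefixed with the `>>target<<` token."""
    if target not in VALID_TARGETS:
        raise ValueError(f"unsupported target language label: {target!r}")
    return f">>{target}<< {text.strip()}"


# Danish example sentence, chosen only for illustration.
print(add_target_token("Jeg elsker kaffe.", "deu"))
```

Validating the label before translation surfaces a typo such as `>>ger<<` as an explicit error instead of passing an unknown token to the model.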

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing, offensive, and can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

How to Get Started With the Model

A short code example:

from transformers import MarianMTModel, MarianTokenizer

# Each input must start with the >>id<< token of the desired target language.
src_text = [
    ">>deu<< Replace this with text in an accepted source language.",
    ">>spa<< This is the second sentence."
]

model_name = "Helsinki-NLP/opus-mt-tc-bible-big-gmq-deu_eng_fra_por_spa"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch (padded to equal length) and generate translations.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

# Decode the generated token IDs back to text.
for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

You can also use OPUS-MT models with the transformers pipelines, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-gmq-deu_eng_fra_por_spa")
print(pipe(">>deu<< Replace this with text in an accepted source language."))
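
Because the target language is selected per input sentence, one source sentence can be sent to all five target languages in a single batch. The following is a sketch under that assumption: `fan_out` and `translate_to_all` are hypothetical names, and the Danish sentence is only an illustration. The pipeline call mirrors the one above and downloads the model when first used.

```python
TARGETS = ("deu", "eng", "fra", "por", "spa")  # target labels from this card
MODEL = "Helsinki-NLP/opus-mt-tc-bible-big-gmq-deu_eng_fra_por_spa"


def fan_out(text: str, targets=TARGETS) -> list[str]:
    """Build one labelled input per target language (hypothetical helper)."""
    return [f">>{t}<< {text}" for t in targets]


def translate_to_all(text: str) -> dict[str, str]:
    """Translate `text` into every target language in one batch."""
    from transformers import pipeline  # imported lazily: heavy dependency

    pipe = pipeline("translation", model=MODEL)
    outputs = pipe(fan_out(text))  # one result dict per input sentence
    return {t: out["translation_text"] for t, out in zip(TARGETS, outputs)}


# Show the labelled inputs without loading the model.
print(fan_out("Hvor er biblioteket?"))

# Usage (downloads the model on first call):
# translate_to_all("Hvor er biblioteket?")
```

Keeping the input construction separate from the model call makes it possible to inspect the labelled sentences before any weights are loaded.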

Training

Data: opusTCv20230926max50+bt+jhubc (source)
Pre-processing: SentencePiece (spm32k,spm32k)
Model Type: transformer-big
Original MarianNMT Model: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip
Training Scripts: GitHub Repo

Evaluation

Model scores at the OPUS-MT dashboard
test set translations: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.test.txt
test set scores: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.eval.txt
benchmark results: benchmark_results.txt
benchmark output: benchmark_translations.zip

langpair	testset	chr-F	BLEU	#sent	#words
dan-deu	tatoeba-test-v2021-08-07	0.74460	56.7	9998	76055
dan-eng	tatoeba-test-v2021-08-07	0.77233	64.3	10795	79684
dan-fra	tatoeba-test-v2021-08-07	0.76425	60.8	1731	11882
dan-por	tatoeba-test-v2021-08-07	0.77248	60.0	873	5360
dan-spa	tatoeba-test-v2021-08-07	0.72567	54.9	5000	35528
fao-eng	tatoeba-test-v2021-08-07	0.54571	39.6	294	1984
isl-deu	tatoeba-test-v2021-08-07	0.68535	51.4	969	6279
isl-eng	tatoeba-test-v2021-08-07	0.67066	51.7	2503	19788
isl-spa	tatoeba-test-v2021-08-07	0.65659	48.5	238	1229
nno-eng	tatoeba-test-v2021-08-07	0.69415	55.5	460	3524
nob-deu	tatoeba-test-v2021-08-07	0.69862	50.5	3525	33592
nob-eng	tatoeba-test-v2021-08-07	0.72912	59.2	4539	36823
nob-fra	tatoeba-test-v2021-08-07	0.71392	52.5	323	2269
nob-spa	tatoeba-test-v2021-08-07	0.73300	55.1	885	6866
nor-deu	tatoeba-test-v2021-08-07	0.69923	50.7	3651	34575
nor-eng	tatoeba-test-v2021-08-07	0.72587	58.8	5000	40355
nor-fra	tatoeba-test-v2021-08-07	0.73052	55.1	477	3213
nor-por	tatoeba-test-v2021-08-07	0.67948	45.4	481	4182
nor-spa	tatoeba-test-v2021-08-07	0.73320	55.3	960	7311
swe-deu	tatoeba-test-v2021-08-07	0.71816	55.4	3410	23494
swe-eng	tatoeba-test-v2021-08-07	0.76648	64.8	10362	68513
swe-fra	tatoeba-test-v2021-08-07	0.72847	57.4	1407	9580
swe-por	tatoeba-test-v2021-08-07	0.70554	50.3	320	2032
swe-spa	tatoeba-test-v2021-08-07	0.70926	54.3	1351	8235
dan-eng	flores101-devtest	0.71193	47.6	1012	24721
dan-fra	flores101-devtest	0.63349	38.1	1012	28343
dan-por	flores101-devtest	0.62063	36.2	1012	26519
dan-spa	flores101-devtest	0.52557	24.2	1012	29199
isl-deu	flores101-devtest	0.50581	22.2	1012	25094
isl-eng	flores101-devtest	0.57294	31.6	1012	24721
isl-por	flores101-devtest	0.52192	25.8	1012	26519
isl-spa	flores101-devtest	0.46364	18.5	1012	29199
nob-eng	flores101-devtest	0.67120	42.6	1012	24721
nob-fra	flores101-devtest	0.60289	33.9	1012	28343
nob-spa	flores101-devtest	0.50848	21.9	1012	29199
swe-deu	flores101-devtest	0.60306	32.2	1012	25094
swe-eng	flores101-devtest	0.70404	47.9	1012	24721
swe-por	flores101-devtest	0.61418	35.7	1012	26519
dan-deu	flores200-devtest	0.60897	32.3	1012	25094
dan-eng	flores200-devtest	0.71641	48.2	1012	24721
dan-fra	flores200-devtest	0.63777	38.9	1012	28343
dan-por	flores200-devtest	0.62302	36.7	1012	26519
dan-spa	flores200-devtest	0.52803	24.4	1012	29199
fao-deu	flores200-devtest	0.41184	16.0	1012	25094
fao-eng	flores200-devtest	0.43308	21.2	1012	24721
fao-por	flores200-devtest	0.42649	19.0	1012	26519
isl-deu	flores200-devtest	0.51165	22.7	1012	25094
isl-eng	flores200-devtest	0.57745	32.2	1012	24721
isl-fra	flores200-devtest	0.54210	27.6	1012	28343
isl-por	flores200-devtest	0.52479	26.1	1012	26519
isl-spa	flores200-devtest	0.46837	19.2	1012	29199
nno-deu	flores200-devtest	0.58054	29.2	1012	25094
nno-eng	flores200-devtest	0.69114	45.0	1012	24721
nno-fra	flores200-devtest	0.61334	36.0	1012	28343
nno-por	flores200-devtest	0.60055	34.1	1012	26519
nno-spa	flores200-devtest	0.51190	22.8	1012	29199
nob-deu	flores200-devtest	0.57023	27.6	1012	25094
nob-eng	flores200-devtest	0.67540	43.1	1012	24721
nob-fra	flores200-devtest	0.60568	34.2	1012	28343
nob-por	flores200-devtest	0.59466	32.8	1012	26519
nob-spa	flores200-devtest	0.51138	22.4	1012	29199
swe-deu	flores200-devtest	0.60630	32.6	1012	25094
swe-eng	flores200-devtest	0.70584	48.1	1012	24721
swe-fra	flores200-devtest	0.63608	39.1	1012	28343
swe-por	flores200-devtest	0.62046	36.4	1012	26519
swe-spa	flores200-devtest	0.52328	23.9	1012	29199
isl-eng	newstest2021	0.56364	32.4	1000	22529
dan-deu	ntrex128	0.54229	25.3	1997	48761
dan-eng	ntrex128	0.63083	38.7	1997	47673
dan-fra	ntrex128	0.54088	26.2	1997	53481
dan-por	ntrex128	0.53626	27.0	1997	51631
dan-spa	ntrex128	0.56217	30.8	1997	54107
fao-deu	ntrex128	0.41701	16.4	1997	48761
fao-eng	ntrex128	0.47105	25.3	1997	47673
fao-fra	ntrex128	0.40070	16.3	1997	53481
fao-por	ntrex128	0.42005	18.0	1997	51631
fao-spa	ntrex128	0.44085	20.5	1997	54107
isl-deu	ntrex128	0.49932	20.5	1997	48761
isl-eng	ntrex128	0.56856	29.7	1997	47673
isl-fra	ntrex128	0.51998	24.6	1997	53481
isl-por	ntrex128	0.49903	21.7	1997	51631
isl-spa	ntrex128	0.53171	27.1	1997	54107
nno-deu	ntrex128	0.53000	24.4	1997	48761
nno-eng	ntrex128	0.65866	42.9	1997	47673
nno-fra	ntrex128	0.54339	27.5	1997	53481
nno-por	ntrex128	0.53242	26.3	1997	51631
nno-spa	ntrex128	0.55889	30.4	1997	54107
nob-deu	ntrex128	0.55549	26.8	1997	48761
nob-eng	ntrex128	0.65580	40.9	1997	47673
nob-fra	ntrex128	0.56187	29.2	1997	53481
nob-por	ntrex128	0.54392	26.6	1997	51631
nob-spa	ntrex128	0.57998	32.6	1997	54107
swe-deu	ntrex128	0.55549	26.7	1997	48761
swe-eng	ntrex128	0.66348	42.2	1997	47673
swe-fra	ntrex128	0.56310	29.0	1997	53481
swe-por	ntrex128	0.54965	27.8	1997	51631
swe-spa	ntrex128	0.58035	32.8	1997	54107
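
To compare test sets at a glance, the rows above can be aggregated. The sketch below assumes the scores are available as tab-separated text with the header shown above. It embeds only six rows copied from the table and averages BLEU per test set. `mean_bleu_by_testset` is an illustrative name, not part of the OPUS-MT tooling.

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the table above (tab-separated, same header).
SAMPLE = (
    "langpair\ttestset\tchr-F\tBLEU\t#sent\t#words\n"
    "dan-eng\ttatoeba-test-v2021-08-07\t0.77233\t64.3\t10795\t79684\n"
    "swe-eng\ttatoeba-test-v2021-08-07\t0.76648\t64.8\t10362\t68513\n"
    "dan-eng\tflores200-devtest\t0.71641\t48.2\t1012\t24721\n"
    "swe-eng\tflores200-devtest\t0.70584\t48.1\t1012\t24721\n"
    "dan-eng\tntrex128\t0.63083\t38.7\t1997\t47673\n"
    "swe-eng\tntrex128\t0.66348\t42.2\t1997\t47673\n"
)


def mean_bleu_by_testset(tsv_text: str) -> dict[str, float]:
    """Average the BLEU column for each test set in a tab-separated table."""
    scores = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        scores[row["testset"]].append(float(row["BLEU"]))
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}


for testset, bleu in mean_bleu_by_testset(SAMPLE).items():
    print(f"{testset}: {bleu:.2f}")
```

Note that this is an unweighted mean of per-pair scores, which is useful for a rough comparison but is not the same as a corpus-level BLEU computed over the pooled test data.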

Citation Information

Publications: "Democratizing neural machine translation with OPUS-MT", "OPUS-MT – Building open translation services for the World", and "The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT". Please cite them if you use this model.

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  number={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the HPLT project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the EuroHPC supercomputer LUMI.

Model conversion info

transformers version: 4.45.1
OPUS-MT git hash: 0882077
port time: Tue Oct 8 11:11:37 EEST 2024
port machine: LM0-400-22516.local
