---
language:
- cs
- en
- pl
- sk
- sl
library_name: transformers
license: cc-by-4.0
tags:
- translation
- mt
- marian
- pytorch
- sentence-piece
- many2one
- multilingual
- pivot
- allegro
- laniqo
---
# MultiSlav P5-many2ces
This model is described in the paper [MultiSlav: Massively Multilingual Machine Translation for Slavic Languages](https://hf.co/papers/2502.14509).
## Multilingual Many-to-Czech MT Model
___P5-many2ces___ is a vanilla Encoder-Decoder transformer model trained on a sentence-level Machine Translation task.
The model supports translation from 4 languages (English, Polish, Slovak, and Slovene) to Czech.
This model is part of the [___MultiSlav___ collection](https://huggingface.co/collections/allegro/multislav-6793d6b6419e5963e759a683).
More information is available in the MultiSlav paper linked above.
Experiments were conducted as part of a research project by the [Machine Learning Research](https://ml.allegro.tech/) lab for [Allegro.com](https://ml.allegro.tech/).
Many thanks to [laniqo.com](https://laniqo.com) for cooperating on the research.
___P5-many2ces___ is a _5_-language _Many-to-Czech_ model translating from all applicable languages to Czech.
Together with [_P5-ces2many_](https://huggingface.co/allegro/P5-ces2many), it forms the ___P5-ces___ pivot system translating between _5_ languages.
_P5-ces_ first translates from any supported language to a Czech bridge sentence using the Many2One model,
and then translates from the Czech bridge sentence to the target language using the One2Many model.
### Model description
* **Model name:** P5-many2ces
* **Source Languages:** English, Polish, Slovak, Slovene
* **Target Language:** Czech
* **Model Collection:** [MultiSlav](https://huggingface.co/collections/allegro/multislav-6793d6b6419e5963e759a683)
* **Model type:** MarianMTModel Encoder-Decoder
* **License:** CC BY 4.0 (commercial use allowed)
* **Developed by:** [MLR @ Allegro](https://ml.allegro.tech/) & [Laniqo.com](https://laniqo.com/)
### Supported languages
To use the model, you must specify the source language of the text to translate.
Source language tokens are 3-letter ISO 639-3 language codes embedded in the format `>>xxx<<`.
All accepted directions and their respective tokens are listed below.
Each of them was added as a special token to the SentencePiece tokenizer.
| **Source Language** | **First token** |
|---------------------|-----------------|
| English | `>>eng<<` |
| Polish | `>>pol<<` |
| Slovak | `>>slk<<` |
| Slovene | `>>slv<<` |
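Since the source token is just a string prefix, preparing inputs can be sketched with a small helper. The `SOURCE_TOKENS` mapping and `prepare_input` function below are illustrative, not part of the released package:

```python
# Mapping from supported source language to its ISO 639-3 prefix token
SOURCE_TOKENS = {
    "English": ">>eng<<",
    "Polish": ">>pol<<",
    "Slovak": ">>slk<<",
    "Slovene": ">>slv<<",
}

def prepare_input(text: str, source_language: str) -> str:
    """Prepend the source-language token expected by P5-many2ces."""
    return f"{SOURCE_TOKENS[source_language]} {text}"

print(prepare_input("Hello, world!", "English"))  # >>eng<< Hello, world!
```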
## Use case quickstart
An example code snippet for using the model is shown below. Due to a bug, the `MarianMTModel` class must be used explicitly.
```python
from transformers import AutoTokenizer, MarianMTModel

m2o_model_name = "Allegro/P5-many2ces"
m2o_tokenizer = AutoTokenizer.from_pretrained(m2o_model_name)
m2o_model = MarianMTModel.from_pretrained(m2o_model_name)

# Prepend the source-language token (here Polish) to the input sentence
text = ">>pol<<" + " " + "Allegro to internetowa platforma e-commerce, na której swoje produkty sprzedają średnie i małe firmy, jak również duże marki."

translations = m2o_model.generate(**m2o_tokenizer.batch_encode_plus([text], return_tensors="pt"))
bridge_translation = m2o_tokenizer.batch_decode(translations, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(bridge_translation[0])
```
Generated _bridge_ Czech output:
> Allegro je online e-commerce platforma, na které své produkty prodávají střední a malé firmy, stejně jako velké značky.
To pivot-translate to other languages via the Czech _bridge_ sentence, we need the One2Many model.
The One2Many model likewise requires an explicit target-language token:
```python
o2m_model_name = "Allegro/P5-ces2many"
o2m_tokenizer = AutoTokenizer.from_pretrained(o2m_model_name)
o2m_model = MarianMTModel.from_pretrained(o2m_model_name)

# Prepend the target-language token to the Czech bridge sentence
texts_to_translate = [
    ">>eng<<" + " " + bridge_translation[0],
    ">>slk<<" + " " + bridge_translation[0],
    ">>slv<<" + " " + bridge_translation[0],
]
translation = o2m_model.generate(**o2m_tokenizer.batch_encode_plus(texts_to_translate, return_tensors="pt"))
decoded_translations = o2m_tokenizer.batch_decode(translation, skip_special_tokens=True, clean_up_tokenization_spaces=True)
for trans in decoded_translations:
    print(trans)
```
Generated Polish-to-English pivot translation via Czech:
> Allegro is an online e-commerce platform on which medium and small businesses as well as large brands sell their products.
Generated Polish-to-Slovak pivot translation via Czech:
> Allegro je online e-commerce platforma, na ktorej svoje produkty predávajú stredné a malé firmy, rovnako ako veľké značky.
Generated Polish-to-Slovene pivot translation via Czech:
> Allegro je spletna e-poslovanje platforma, na kateri prodajajo svoje izdelke srednje velika in mala podjetja ter velike blagovne znamke.
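The two-step pivot above can be wrapped in a small helper. The sketch below is illustrative: `many2one` and `one2many` stand for any callables that map a list of token-prefixed sentences to a list of translated strings (e.g. thin wrappers around the tokenizer and `generate` calls shown earlier):

```python
def pivot_translate(src_text, src_lang, tgt_langs, many2one, one2many):
    """Translate src_text into each target language via a Czech bridge.

    many2one / one2many: callables taking a list of '>>xxx<< text' strings
    and returning a list of translated strings.
    """
    # Step 1: source language -> Czech bridge sentence
    bridge = many2one([f">>{src_lang}<< {src_text}"])[0]
    # Step 2: Czech bridge sentence -> each target language
    return one2many([f">>{tgt}<< {bridge}" for tgt in tgt_langs])
```

For example, `pivot_translate(text, "pol", ["eng", "slk"], m2o_fn, o2m_fn)` would return English and Slovak translations pivoted through Czech, given wrappers `m2o_fn` and `o2m_fn` around the two models.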
## Training
The [SentencePiece](https://github.com/google/sentencepiece) tokenizer has a vocabulary size of 80k in total (16k per language). The tokenizer was trained on a randomly sampled part of the training corpus.
During training we used the [MarianNMT](https://marian-nmt.github.io/) framework.
The base Marian configuration used was [transformer-big](https://github.com/marian-nmt/marian-dev/blob/master/src/common/aliases.cpp#L113).
All training parameters are listed in the table below.
### Training hyperparameters: