nielsr (HF staff) committed
Commit e86b5f4 · verified · 1 Parent(s): 2c2bae1

Add link to paper


This PR adds a link to the paper at the top of the model card.
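For anyone wanting to propose a similar model-card update, a pull request like this can also be opened programmatically with the `huggingface_hub` client. The snippet below is a minimal sketch, not the exact workflow used for this PR; the repo id and local file path are illustrative and assume you have already edited `README.md` locally.

```python
# Minimal sketch: open a pull request that updates a model card via huggingface_hub.
# Assumptions: you are logged in (`huggingface-cli login`), the edited README.md
# sits in the current directory, and the repo id below is replaced with the real one.
from huggingface_hub import HfApi, CommitOperationAdd

api = HfApi()
api.create_commit(
    repo_id="allegro/p5-many2ces",  # illustrative repo id, adjust to the actual model repo
    repo_type="model",
    operations=[
        # Replace the existing README.md with the locally edited copy
        CommitOperationAdd(path_in_repo="README.md", path_or_fileobj="README.md"),
    ],
    commit_message="Add link to paper",
    create_pr=True,  # open a PR instead of committing directly to main
)
```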

Files changed (1)
  1. README.md +5 -158
README.md CHANGED
@@ -1,5 +1,4 @@
---
- license: cc-by-4.0
language:
- cs
- en
@@ -7,6 +6,7 @@ language:
- sk
- sl
library_name: transformers
+ license: cc-by-4.0
tags:
- translation
- mt
@@ -22,6 +22,8 @@ tags:

# MultiSlav P5-many2ces

+ This model is described in the paper [MultiSlav: Massively Multilingual Machine Translation for Slavic Languages](https://hf.co/papers/2502.14509).
+
<p align="center">
<a href="https://ml.allegro.tech/"><img src="allegro-title.svg" alt="MLR @ Allegro.com"></a>
</p>
@@ -34,7 +36,7 @@ This model is part of the [___MultiSlav___ collection](https://huggingface.co/co
More information will be available soon in our upcoming MultiSlav paper.

Experiments were conducted under research project by [Machine Learning Research](https://ml.allegro.tech/) lab for [Allegro.com](https://ml.allegro.tech/).
- Big thanks to [laniqo.com](laniqo.com) for cooperation in the research.
+ Big thanks to [laniqo.com](https://laniqo.com) for cooperation in the research.

<p align="center">
<img src="p5-ces.svg">
@@ -130,159 +132,4 @@ During the training we used the [MarianNMT](https://marian-nmt.github.io/) frame
Base marian configuration used: [transfromer-big](https://github.com/marian-nmt/marian-dev/blob/master/src/common/aliases.cpp#L113).
All training parameters are listed in table below.

- ### Training hyperparameters:
-
- | **Hyperparameter** | **Value** |
- |-----------------------------|------------------------------------------------------------------------------------------------------------|
- | Total Parameter Size | 258M |
- | Training Examples | 269M |
- | Vocab Size | 80k |
- | Base Parameters | [Marian transfromer-big](https://github.com/marian-nmt/marian-dev/blob/master/src/common/aliases.cpp#L113) |
- | Number of Encoding Layers | 6 |
- | Number of Decoding Layers | 6 |
- | Model Dimension | 1024 |
- | FF Dimension | 4096 |
- | Heads | 16 |
- | Dropout | 0.1 |
- | Batch Size | mini batch fit to VRAM |
- | Training Accelerators | 4x A100 40GB |
- | Max Length | 100 tokens |
- | Optimizer | Adam |
- | Warmup steps | 8000 |
- | Context | Sentence-level MT |
- | Source Languages Supported | English, Polish, Slovak, Slovene |
- | Target Language Supported | Czech |
- | Precision | float16 |
- | Validation Freq | 3000 steps |
- | Stop Metric | ChrF |
- | Stop Criterion | 20 Validation steps |
-
-
- ## Training corpora
-
- <p align="center">
- <img src="pivot-data-many2ces.svg">
- </p>
-
- The main research question was: "How does adding additional, related languages impact the quality of the model?" - we explored it in the Slavic language family.
- In this model we experimented with expanding data-regime by using data from multiple source language and expanding language-pool by adding English.
- We found that additional fluency data clearly improved performance compared to the bi-directional baseline models.
- For example in translation from Polish to Czech, this allowed us to expand training data-size from 63M to 269M examples, and from 25M to 269M for Slovene to Czech translation.
- We only used explicitly open-source data to ensure open-source license of our model.
-
- Datasets were downloaded via [MT-Data](https://pypi.org/project/mtdata/0.2.10/) library. Number of total examples post filtering and deduplication: __269M__.
-
- The datasets used:
-
- | **Corpus** |
- |----------------------|
- | paracrawl |
- | opensubtitles |
- | multiparacrawl |
- | dgt |
- | elrc |
- | xlent |
- | wikititles |
- | wmt |
- | wikimatrix |
- | dcep |
- | ELRC |
- | tildemodel |
- | europarl |
- | eesc |
- | eubookshop |
- | emea |
- | jrc_acquis |
- | ema |
- | qed |
- | elitr_eca |
- | EU-dcep |
- | rapid |
- | ecb |
- | kde4 |
- | news_commentary |
- | kde |
- | bible_uedin |
- | europat |
- | elra |
- | wikipedia |
- | wikimedia |
- | tatoeba |
- | globalvoices |
- | euconst |
- | ubuntu |
- | php |
- | ecdc |
- | eac |
- | eac_reference |
- | gnome |
- | EU-eac |
- | books |
- | EU-ecdc |
- | newsdev |
- | khresmoi_summary |
- | czechtourism |
- | khresmoi_summary_dev |
- | worldbank |
-
- ## Evaluation
-
- Evaluation of the models was performed on [Flores200](https://huggingface.co/datasets/facebook/flores) dataset.
- The table below compares performance of the open-source models and all applicable models from our collection.
- Metrics BLEU, ChrF2, and Unbabel/wmt22-comet-da.
-
- Translation results on translation from Polish to Czech (Slavic direction with the __highest__ data-regime):
-
- | **Model** | **Comet22** | **BLEU** | **ChrF** | **Model Size** |
- |-------------------------------------------------------|:-----------:|:--------:|:--------:|---------------:|
- | M2M−100 | 89.6 | 19.8 | 47.7 | 1.2B |
- | NLLB−200 | 89.4 | 19.2 | 46.7 | 1.3B |
- | Opus Sla-Sla | 82.9 | 14.6 | 42.6 | 64M |
- | BiDi-ces-pol (baseline) | 90.0 | 20.3 | 48.5 | 209M |
- | P4-pol <span style="color:red;">◊</span> | 90.2 | 20.2 | 48.5 | 2x 242M |
- | P5-eng <span style="color:red;">◊</span> | 89.0 | 19.9 | 48.3 | 2x 258M |
- | ___P5-many2ces___ <span style="color:green;">*</span> | 90.3 | 20.2 | 48.6 | 258M |
- | MultiSlav-4slav | 90.2 | 20.6 | 48.7 | 242M |
- | MultiSlav-5lang | __90.4__ | __20.7__ | __48.9__ | 258M |
-
- Translation results on translation from Slovene to Czech (direction to Czech with the __lowest__ data-regime):
-
- | **Model** | **Comet22** | **BLEU** | **ChrF** | **Model Size** |
- |-------------------------------------------------------|:-----------:|:--------:|:--------:|---------------:|
- | M2M−100 | 90.3 | 24.3 | 51.6 | 1.2B |
- | NLLB−200 | 90.0 | 22.5 | 49.9 | 1.3B |
- | Opus Sla-Sla | 83.5 | 17.4 | 46.0 | 1.3B |
- | BiDi-ces-slv (baseline) | 90.0 | 24.4 | 52.0 | 209M |
- | P4-pol <span style="color:red;">◊</span> | 89.3 | 22.7 | 50.4 | 2x 242M |
- | P5-eng <span style="color:red;">◊</span> | 89.6 | 24.7 | 52.4 | 2x 258M |
- | ___P5-many2ces___ <span style="color:green;">*</span> | 90.3 | 24.9 | 52.4 | 258M |
- | MultiSlav-4slav | __90.6__ | __25.3__ | __52.7__ | 242M |
- | MultiSlav-5lang | __90.6__ | 25.2 | 52.5 | 258M |
-
-
- <span style="color:green;">*</span> this model is Many2One part of P5-ces pivot system.
-
- <span style="color:red;">◊</span> system of 2 models *Many2XXX* and *XXX2Many*.
-
- ## Limitations and Biases
-
- We did not evaluate inherent bias contained in training datasets. It is advised to validate bias of our models in perspective domain. This might be especially problematic in translation from English to Slavic languages, which require explicitly indicated gender and might hallucinate based on bias present in training data.
-
- ## License
-
- The model is licensed under CC BY 4.0, which allows for commercial use.
-
- ## Citation
- TO BE UPDATED SOON 🤗
-
-
-
- ## Contact Options
-
- Authors:
- - MLR @ Allegro: [Artur Kot](https://linkedin.com/in/arturkot), [Mikołaj Koszowski](https://linkedin.com/in/mkoszowski), [Wojciech Chojnowski](https://linkedin.com/in/wojciech-chojnowski-744702348), [Mieszko Rutkowski](https://linkedin.com/in/mieszko-rutkowski)
- - Laniqo.com: [Artur Nowakowski](https://linkedin.com/in/artur-nowakowski-mt), [Kamil Guttmann](https://linkedin.com/in/kamil-guttmann), [Mikołaj Pokrywka](https://linkedin.com/in/mikolaj-pokrywka)
-
- Please don't hesitate to contact authors if you have any questions or suggestions:
- - e-mail: artur.kot@allegro.com or mikolaj.koszowski@allegro.com
- - LinkedIn: [Artur Kot](https://linkedin.com/in/arturkot) or [Mikołaj Koszowski](https://linkedin.com/in/mkoszowski)
+ ### Training hyperparameters: