nielsr (HF staff) committed
Commit e86b5f4 · verified · 1 Parent(s): 2c2bae1

Add link to paper


This PR adds a link to the paper at the top of the model card.
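For anyone wanting to propose a similar model-card update, a pull request like this can also be opened programmatically with the `huggingface_hub` client. The snippet below is a minimal sketch, not the exact workflow used for this PR; the repo id and local file path are illustrative and assume you have already edited `README.md` locally.

```python
# Minimal sketch: open a pull request that updates a model card via huggingface_hub.
# Assumptions: you are logged in (`huggingface-cli login`), the edited README.md
# sits in the current directory, and the repo id below is replaced with the real one.
from huggingface_hub import HfApi, CommitOperationAdd

api = HfApi()
api.create_commit(
    repo_id="allegro/p5-many2ces",  # illustrative repo id, adjust to the actual model repo
    repo_type="model",
    operations=[
        # Replace the existing README.md with the locally edited copy
        CommitOperationAdd(path_in_repo="README.md", path_or_fileobj="README.md"),
    ],
    commit_message="Add link to paper",
    create_pr=True,  # open a PR instead of committing directly to main
)
```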

Files changed (1)
  1. README.md +5 -158
README.md CHANGED
@@ -1,5 +1,4 @@
---
- license: cc-by-4.0
language:
- cs
- en
@@ -7,6 +6,7 @@ language:
- sk
- sl
library_name: transformers
+ license: cc-by-4.0
tags:
- translation
- mt
@@ -22,6 +22,8 @@ tags:

# MultiSlav P5-many2ces

+ This model is described in the paper [MultiSlav: Massively Multilingual Machine Translation for Slavic Languages](https://hf.co/papers/2502.14509).
+
<p align="center">
<a href="https://ml.allegro.tech/"><img src="allegro-title.svg" alt="MLR @ Allegro.com"></a>
</p>
@@ -34,7 +36,7 @@ This model is part of the [___MultiSlav___ collection](https://huggingface.co/co
More information will be available soon in our upcoming MultiSlav paper.

Experiments were conducted under research project by [Machine Learning Research](https://ml.allegro.tech/) lab for [Allegro.com](https://ml.allegro.tech/).
- Big thanks to [laniqo.com](laniqo.com) for cooperation in the research.
+ Big thanks to [laniqo.com](https://laniqo.com) for cooperation in the research.

<p align="center">
<img src="p5-ces.svg">
@@ -130,159 +132,4 @@ During the training we used the [MarianNMT](https://marian-nmt.github.io/) frame
Base marian configuration used: [transfromer-big](https://github.com/marian-nmt/marian-dev/blob/master/src/common/aliases.cpp#L113).
All training parameters are listed in table below.

- ### Training hyperparameters:
-
- | **Hyperparameter** | **Value** |
- |-----------------------------|------------------------------------------------------------------------------------------------------------|
- | Total Parameter Size | 258M |
- | Training Examples | 269M |
- | Vocab Size | 80k |
- | Base Parameters | [Marian transfromer-big](https://github.com/marian-nmt/marian-dev/blob/master/src/common/aliases.cpp#L113) |
- | Number of Encoding Layers | 6 |
- | Number of Decoding Layers | 6 |
- | Model Dimension | 1024 |
- | FF Dimension | 4096 |
- | Heads | 16 |
- | Dropout | 0.1 |
- | Batch Size | mini batch fit to VRAM |
- | Training Accelerators | 4x A100 40GB |
- | Max Length | 100 tokens |
- | Optimizer | Adam |
- | Warmup steps | 8000 |
- | Context | Sentence-level MT |
- | Source Languages Supported | English, Polish, Slovak, Slovene |
- | Target Language Supported | Czech |
- | Precision | float16 |
- | Validation Freq | 3000 steps |
- | Stop Metric | ChrF |
- | Stop Criterion | 20 Validation steps |
-
-
- ## Training corpora
-
- <p align="center">
- <img src="pivot-data-many2ces.svg">
- </p>
-
- The main research question was: "How does adding additional, related languages impact the quality of the model?" - we explored it in the Slavic language family.
- In this model we experimented with expanding data-regime by using data from multiple source language and expanding language-pool by adding English.
- We found that additional fluency data clearly improved performance compared to the bi-directional baseline models.
- For example in translation from Polish to Czech, this allowed us to expand training data-size from 63M to 269M examples, and from 25M to 269M for Slovene to Czech translation.
- We only used explicitly open-source data to ensure open-source license of our model.
-
- Datasets were downloaded via [MT-Data](https://pypi.org/project/mtdata/0.2.10/) library. Number of total examples post filtering and deduplication: __269M__.
-
- The datasets used:
-
- | **Corpus** |
- |----------------------|
- | paracrawl |
- | opensubtitles |
- | multiparacrawl |
- | dgt |
- | elrc |
- | xlent |
- | wikititles |
- | wmt |
- | wikimatrix |
- | dcep |
- | ELRC |
- | tildemodel |
- | europarl |
- | eesc |
- | eubookshop |
- | emea |
- | jrc_acquis |
- | ema |
- | qed |
- | elitr_eca |
- | EU-dcep |
- | rapid |
- | ecb |
- | kde4 |
- | news_commentary |
- | kde |
- | bible_uedin |
- | europat |
- | elra |
- | wikipedia |
- | wikimedia |
- | tatoeba |
- | globalvoices |
- | euconst |
- | ubuntu |
- | php |
- | ecdc |
- | eac |
- | eac_reference |
- | gnome |
- | EU-eac |
- | books |
- | EU-ecdc |
- | newsdev |
- | khresmoi_summary |
- | czechtourism |
- | khresmoi_summary_dev |
- | worldbank |
-
- ## Evaluation
-
- Evaluation of the models was performed on [Flores200](https://huggingface.co/datasets/facebook/flores) dataset.
- The table below compares performance of the open-source models and all applicable models from our collection.
- Metrics BLEU, ChrF2, and Unbabel/wmt22-comet-da.
-
- Translation results on translation from Polish to Czech (Slavic direction with the __highest__ data-regime):
-
- | **Model** | **Comet22** | **BLEU** | **ChrF** | **Model Size** |
- |-------------------------------------------------------|:-----------:|:--------:|:--------:|---------------:|
- | M2M−100 | 89.6 | 19.8 | 47.7 | 1.2B |
- | NLLB−200 | 89.4 | 19.2 | 46.7 | 1.3B |
- | Opus Sla-Sla | 82.9 | 14.6 | 42.6 | 64M |
- | BiDi-ces-pol (baseline) | 90.0 | 20.3 | 48.5 | 209M |
- | P4-pol <span style="color:red;">◊</span> | 90.2 | 20.2 | 48.5 | 2x 242M |
- | P5-eng <span style="color:red;">◊</span> | 89.0 | 19.9 | 48.3 | 2x 258M |
- | ___P5-many2ces___ <span style="color:green;">*</span> | 90.3 | 20.2 | 48.6 | 258M |
- | MultiSlav-4slav | 90.2 | 20.6 | 48.7 | 242M |
- | MultiSlav-5lang | __90.4__ | __20.7__ | __48.9__ | 258M |
-
- Translation results on translation from Slovene to Czech (direction to Czech with the __lowest__ data-regime):
-
- | **Model** | **Comet22** | **BLEU** | **ChrF** | **Model Size** |
- |-------------------------------------------------------|:-----------:|:--------:|:--------:|---------------:|
- | M2M−100 | 90.3 | 24.3 | 51.6 | 1.2B |
- | NLLB−200 | 90.0 | 22.5 | 49.9 | 1.3B |
- | Opus Sla-Sla | 83.5 | 17.4 | 46.0 | 1.3B |
- | BiDi-ces-slv (baseline) | 90.0 | 24.4 | 52.0 | 209M |
- | P4-pol <span style="color:red;">◊</span> | 89.3 | 22.7 | 50.4 | 2x 242M |
- | P5-eng <span style="color:red;">◊</span> | 89.6 | 24.7 | 52.4 | 2x 258M |
- | ___P5-many2ces___ <span style="color:green;">*</span> | 90.3 | 24.9 | 52.4 | 258M |
- | MultiSlav-4slav | __90.6__ | __25.3__ | __52.7__ | 242M |
- | MultiSlav-5lang | __90.6__ | 25.2 | 52.5 | 258M |
-
-
- <span style="color:green;">*</span> this model is Many2One part of P5-ces pivot system.
-
- <span style="color:red;">◊</span> system of 2 models *Many2XXX* and *XXX2Many*.
-
- ## Limitations and Biases
-
- We did not evaluate inherent bias contained in training datasets. It is advised to validate bias of our models in perspective domain. This might be especially problematic in translation from English to Slavic languages, which require explicitly indicated gender and might hallucinate based on bias present in training data.
-
- ## License
-
- The model is licensed under CC BY 4.0, which allows for commercial use.
-
- ## Citation
- TO BE UPDATED SOON 🤗
-
-
-
- ## Contact Options
-
- Authors:
- - MLR @ Allegro: [Artur Kot](https://linkedin.com/in/arturkot), [Mikołaj Koszowski](https://linkedin.com/in/mkoszowski), [Wojciech Chojnowski](https://linkedin.com/in/wojciech-chojnowski-744702348), [Mieszko Rutkowski](https://linkedin.com/in/mieszko-rutkowski)
- - Laniqo.com: [Artur Nowakowski](https://linkedin.com/in/artur-nowakowski-mt), [Kamil Guttmann](https://linkedin.com/in/kamil-guttmann), [Mikołaj Pokrywka](https://linkedin.com/in/mikolaj-pokrywka)
-
- Please don't hesitate to contact authors if you have any questions or suggestions:
- - e-mail: artur.kot@allegro.com or mikolaj.koszowski@allegro.com
- - LinkedIn: [Artur Kot](https://linkedin.com/in/arturkot) or [Mikołaj Koszowski](https://linkedin.com/in/mkoszowski)
+ ### Training hyperparameters: