Fairseq
Italian
Catalan
AudreyVM committed on
Commit d7c3da8 · verified · 1 Parent(s): 5250893

Update README.md

Files changed (1)
  1. README.md +26 -19
README.md CHANGED
@@ -13,8 +13,7 @@ library_name: fairseq
13
 
14
  ## Model description
15
 
16
- This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-Italian datasets,
17
- which after filtering and cleaning comprised 9.482.927 sentence pairs. The model was evaluated on the Flores and NTREX evaluation datasets.
18
 
19
  ## Intended uses and limitations
20
 
@@ -49,26 +48,33 @@ However, we are well aware that our models may be biased. We intend to conduct r
49
  ## Training
50
 
51
  ### Training data
52
-
53
  The model was trained on a combination of the following datasets:
54
 
55
- | Dataset | Sentences | Sentences after Cleaning|
56
- |-------------------|----------------|-------------------|
57
- | CCMatrix v1 | 11.444.720 | 7.757.357|
58
- | MultiCCAligned v1 | 1.379.251 | 1.010.921|
59
- | WikiMatrix | 316.208 | 271.587 |
60
- | GNOME | 8.571 | 1.198|
61
- | KDE4 | 163.907 | 115.027 |
62
- | OpenSubtitles | 391.293 | 225.732 |
63
- | GlobalVoices| 6.318 | 5.209|
 
 
 
 
 
 
 
 
64
 
65
  ### Training procedure
66
 
67
  ### Data preparation
68
 
69
- All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
70
  This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
71
- The filtered datasets are then concatenated to form a final corpus of 9.482.927 and before training the punctuation is normalized using
72
  a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
73
 
74
 
@@ -109,7 +115,7 @@ The model was trained for a total of 19.000 updates. Weights were saved every 10
109
 
110
  ### Variable and metrics
111
 
112
- We use the BLEU score for evaluation on the [Flores-101](https://github.com/facebookresearch/flores) and [NTREX](https://github.com/MicrosoftTranslator/NTREX) test sets.
113
 
114
  ### Evaluation results
115
 
@@ -118,10 +124,11 @@ Below are the evaluation results on the machine translation from Italian to Cata
118
 
119
  | Test set | SoftCatalà | Google Translate | aina-translator-it-ca |
120
  |----------------------|------------|------------------|---------------|
121
- | Flores 101 dev | 25,4 | **30,4** | 27,5 |
122
- | Flores 101 devtest |26,6 | **31,2** | 27,7 |
123
- | NTREX | 29,3 | **33,5** | 30,7 |
124
- | Average | 27,1 | **31,7** | 28,6 |
 
125
 
126
  ## Additional information
127
 
 
13
 
14
  ## Model description
15
 
16
+ This model was trained from scratch using the Fairseq toolkit on a combination of Catalan-Italian data sourced from OPUS and additional data in which synthetic Catalan was generated from the Spanish side of Spanish-Italian corpora using Projecte Aina’s Spanish-Catalan model. The resulting corpus totals approximately 100 million sentence pairs. The model is evaluated on the Flores, NTEU and NTREX evaluation sets.
 
17
 
18
  ## Intended uses and limitations
19
 
 
48
  ## Training
49
 
50
  ### Training data
 
51
  The model was trained on a combination of the following datasets:
52
 
53
+ | Datasets |
54
+ |----------------------|
55
+ | EU Bookshop |
56
+ | Global Voices |
57
+ | GNOME |
58
+ | KDE 4 |
59
+ | Multi CCAligned |
60
+ | Multi Paracrawl |
61
+ | Multi UN |
62
+ | NLLB |
63
+ | NTEU |
64
+ | Open Subtitles |
65
+ | WikiMatrix |
66
+
67
+ All data was sourced from [OPUS](https://opus.nlpl.eu/) and [ELRC](https://www.elrc-share.eu/).
68
+ After all Catalan-Italian data had been collected, Spanish-Italian data was collected and the Spanish side was
69
+ translated into Catalan using [Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca).
70
 
71
  ### Training procedure
72
 
73
  ### Data preparation
74
 
75
+ All datasets are deduplicated, filtered by language identification, and filtered to remove any sentence pair with a cosine similarity below 0.75.
76
  This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
77
+ The filtered datasets are then concatenated to form the final corpus, and before training the punctuation is normalized using
78
  a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
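
The deduplication and similarity-filtering steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes the sentence embeddings have already been computed (in the README they come from LaBSE), it omits the language-identification filter and the punctuation normalization, and the names `cosine_similarity` and `filter_pairs` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def filter_pairs(pairs, src_embeddings, tgt_embeddings, threshold=0.75):
    """Drop exact duplicate pairs, then drop pairs scoring below the threshold.

    pairs: list of (source_sentence, target_sentence) tuples.
    src_embeddings / tgt_embeddings: one vector per sentence, in the same
    order as pairs (assumed to be precomputed, e.g. with LaBSE).
    """
    seen = set()
    kept = []
    for pair, src_vec, tgt_vec in zip(pairs, src_embeddings, tgt_embeddings):
        if pair in seen:
            continue  # exact duplicate of an earlier pair
        seen.add(pair)
        if cosine_similarity(src_vec, tgt_vec) >= threshold:
            kept.append(pair)
    return kept


# Toy two-dimensional vectors stand in for real sentence embeddings.
pairs = [
    ("Buongiorno", "Bon dia"),
    ("Buongiorno", "Bon dia"),   # duplicate, removed
    ("Arrivederci", "Gràcies"),  # mismatched pair, low similarity
]
src = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
tgt = [[0.9, 0.1], [0.9, 0.1], [0.0, 1.0]]
clean = filter_pairs(pairs, src, tgt)
```

With these toy inputs only the first pair survives: the second is a duplicate and the third has a similarity of 0, below the 0.75 threshold.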
79
 
80
 
 
115
 
116
  ### Variable and metrics
117
 
118
+ We use the BLEU score for evaluation on the [Flores-101](https://github.com/facebookresearch/flores), NTEU (unpublished) and [NTREX](https://github.com/MicrosoftTranslator/NTREX) test sets.
119
 
120
  ### Evaluation results
121
 
 
124
 
125
  | Test set | SoftCatalà | Google Translate | aina-translator-it-ca |
126
  |----------------------|------------|------------------|---------------|
127
+ | Flores 101 dev | 26,3 | **30,4** | 28,8 |
128
+ | Flores 101 devtest | 27 | **30,9** | 29,1 |
129
+ | NTEU | 40,4 | 43,4 | **47,2** |
130
+ | NTREX | 30,3 | **33,5** | 32,4 |
131
+ | Average | 31 | **34,55** | 34,4 |
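
The Average row appears to be the unweighted mean of the four test-set scores for each system. The short check below, a sketch with hypothetical helper names `parse_score` and `column_average`, parses the comma-decimal cells from the updated table and recomputes those means.

```python
def parse_score(cell):
    """Parse a table cell such as '**30,4**' or '27' into a float."""
    return float(cell.strip().strip("*").replace(",", "."))


def column_average(cells):
    """Unweighted mean of the scores in one table column."""
    scores = [parse_score(c) for c in cells]
    return sum(scores) / len(scores)


# Scores copied from the updated table: Flores dev, Flores devtest, NTEU, NTREX.
softcatala = ["26,3", "27", "40,4", "30,3"]
google = ["**30,4**", "**30,9**", "43,4", "**33,5**"]
aina = ["28,8", "29,1", "**47,2**", "32,4"]

averages = {
    "SoftCatalà": column_average(softcatala),
    "Google Translate": column_average(google),
    "aina-translator-it-ca": column_average(aina),
}
```

The recomputed means are approximately 31.0, 34.55 and 34.375, which match the reported 31, 34,55 and 34,4 up to rounding.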
132
 
133
  ## Additional information
134