projecte-aina/aina-translator-it-ca

Fairseq

Italian

Catalan

AudreyVM committed on Nov 6, 2024

Commit d7c3da8 (verified) · 1 parent: 5250893

Update README.md

README.md +26 -19

@@ -13,8 +13,7 @@ library_name: fairseq
 ## Model description
-This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-Italian datasets,
-which after filtering and cleaning comprised 9.482.927 sentence pairs. The model was evaluated on the Flores and NTREX evaluation datasets.
 ## Intended uses and limitations
@@ -49,26 +48,33 @@ However, we are well aware that our models may be biased. We intend to conduct r
 ## Training
 ### Training data
 The model was trained on a combination of the following datasets:
-| Dataset       	| Sentences  	| Sentences after Cleaning|
-|-------------------|----------------|-------------------|
-| CCMatrix  v1  	| 11.444.720  	| 	7.757.357|
-|  MultiCCAligned v1	| 1.379.251	|   1.010.921|
-| WikiMatrix  	| 316.208 	| 271.587 	|
-| GNOME	| 8.571	|	1.198|
-| KDE4    	| 163.907   	|  115.027 	|
-| OpenSubtitles	| 391.293	| 225.732	|
-| GlobalVoices| 6.318 	|	5.209|
 ### Training procedure
 ### Data preparation
- All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
  This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
- The filtered datasets are then concatenated to form a final corpus of 9.482.927 and before training the punctuation is normalized using
  a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
@@ -109,7 +115,7 @@ The model was trained for a total of 19.000 updates. Weights were saved every 10
 ### Variable and metrics
-We use the BLEU score for evaluation on the [Flores-101](https://github.com/facebookresearch/flores) and [NTREX](https://github.com/MicrosoftTranslator/NTREX) test sets.
 ### Evaluation results
@@ -118,10 +124,11 @@ Below are the evaluation results on the machine translation from Italian to Cata
 | Test set         	| SoftCatalà | Google Translate | aina-translator-it-ca |
 |----------------------|------------|------------------|---------------|
-| Flores 101 dev   	| 25,4     	| **30,4**     	| 27,5     	|
-| Flores 101 devtest   |26,6   	| **31,2**     	| 27,7     	|
-| NTREX | 29,3 | **33,5** | 30,7 |
-| Average          	| 27,1  	| **31,7**     	| 28,6      	|
 ## Additional information

 ## Model description
+This model was trained from scratch using the Fairseq toolkit on a combination of datasets comprising both Catalan-Italian data sourced from Opus, and additional datasets where synthetic Catalan was generated from the Spanish side of Spanish-Italian corpora using Projecte Aina’s Spanish-Catalan model. This gave a total of approximately 100 million sentence pairs. The model is evaluated on the Flores, NTEU and NTREX evaluation sets.
 ## Intended uses and limitations
 ## Training
 ### Training data
 The model was trained on a combination of the following datasets:
+| Datasets       |
+|----------------------|
+|EU Bookshop |
+|Global Voices |
+| GNOME |
+|KDE 4 |
+| Multi CCAligned |
+| Multi Paracrawl |
+| Multi UN |
+| NLLB    |
+| NTEU |
+| Open Subtitles |
+| WikiMatrix |
+All data was sourced from [OPUS](https://opus.nlpl.eu/) and [ELRC](https://www.elrc-share.eu/).
+After all Catalan-Italian data had been collected, Spanish-Italian data was collected and the Spanish data
+translated to Catalan using [Projecte Aina’s Spanish-Catalan model.](https://huggingface.co/projecte-aina/aina-translator-es-ca)
 ### Training procedure
 ### Data preparation
+ All datasets are deduplicated, filtered for language identification, and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
  This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
+ The filtered datasets are then concatenated to form the final corpus and before training the punctuation is normalized using
  a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
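
The deduplication and similarity filtering described in the data preparation step can be illustrated with a minimal sketch. It is not the project's actual pipeline: `filter_pairs` and the `embed` callback are hypothetical names, and the sketch assumes that `embed` returns a sentence embedding (in the README's setup these would come from LaBSE). Only the 0.75 threshold is taken from the README. Language identification and punctuation normalization are left out.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[str, str]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pairs(
    pairs: Iterable[Pair],
    embed: Callable[[str], Vector],  # hypothetical: stands in for a LaBSE encoder
    threshold: float = 0.75,         # threshold stated in the README
) -> List[Pair]:
    """Drop exact duplicate pairs, then drop pairs whose similarity is below threshold."""
    seen = set()
    kept: List[Pair] = []
    for src, tgt in pairs:
        if (src, tgt) in seen:
            continue  # exact duplicate of an earlier pair
        seen.add((src, tgt))
        if cosine_similarity(embed(src), embed(tgt)) >= threshold:
            kept.append((src, tgt))
    return kept
```

Passing the encoder as a callback keeps the sketch free of model downloads; any function that maps a sentence to a vector can be plugged in for experimentation.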
 ### Variable and metrics
+We use the BLEU score for evaluation on the [Flores-101](https://github.com/facebookresearch/flores), NTEU (unpublished) and [NTREX](https://github.com/MicrosoftTranslator/NTREX) test sets.
 ### Evaluation results
 | Test set         	| SoftCatalà | Google Translate | aina-translator-it-ca |
 |----------------------|------------|------------------|---------------|
+| Flores 101 dev   	| 26,3     	| **30,4**     	| 28,8     	|
+| Flores 101 devtest   |27   	| **30,9**     	| 29,1     	|
+| NTEU | 40,4 | 43,4 | **47,2** |
+| NTREX | 30,3 | **33,5** | 32,4 |
+| Average          	| 31  	| **34,55**     	| 34,4      	|
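
The Average row in the updated table can be checked with a few lines of Python. This is only a sanity-check sketch: the scores are copied from the added rows above (decimal commas written as points), and the variable names are illustrative rather than part of the repository.

```python
# BLEU scores from the updated table, in the order:
# Flores 101 dev, Flores 101 devtest, NTEU, NTREX.
scores = {
    "SoftCatalà": [26.3, 27.0, 40.4, 30.3],
    "Google Translate": [30.4, 30.9, 43.4, 33.5],
    "aina-translator-it-ca": [28.8, 29.1, 47.2, 32.4],
}

# Unweighted mean over the four test sets for each system.
averages = {system: sum(vals) / len(vals) for system, vals in scores.items()}

# System with the highest average BLEU.
best_system = max(averages, key=averages.get)

for system, avg in averages.items():
    print(f"{system}: {avg:.2f}")
```

The computed means agree with the reported values (31, 34,55 and 34,4 after rounding), and Google Translate keeps the highest average even though aina-translator-it-ca leads on NTEU.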
 ## Additional information