Update README.md
README.md
CHANGED
@@ -59,6 +59,10 @@ The Catalan-German data collected from the web was a combination of the followin
 | GNOME |
 | KDE4 |
 | OpenSubtitles |
+| MultiParaCrawl |
+| DGT |
+| EUBookshop |
+| NLLB |
 | GlobalVoices|
 | Tatoeba |
 | Books |
@@ -67,17 +71,10 @@ The Catalan-German data collected from the web was a combination of the followin

 All corpora except Europarl and Tilde were collected from [Opus](https://opus.nlpl.eu/).
 The Europarl and Tilde corpora are synthetic parallel corpora created from the original Spanish-German corpora by [SoftCatalà](https://github.com/Softcatala).
+Once all available Catalan-German data had been collected, additional synthetic Catalan data was created from the Spanish side of Spanish-German corpora using [Projecte Aina’s Spanish-Catalan model.](https://huggingface.co/projecte-aina/aina-translator-es-ca)

 The synthetic parallel data was created from the following Spanish-German datasets:

-| Datasets |
-|-------------------|
-|globalvoices_es-de_20230901 |
-|multiparacrawl_es-de_20230901 |
-|dgt_es-de_20240129 |
-|eubookshop_es-de_20240129 |
-|nllb_es-de_20240129 |
-|opensubtitles_es-de_20240129 |


 ### Training procedure
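
The synthetic-data step that this change describes (translating the Spanish side of a Spanish-German corpus into Catalan and pairing the result with the untouched German side) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the pipeline actually used for the model: `make_synthetic_ca_de`, `toy_translator`, and the `batch_size` parameter are hypothetical names, and the translator is a stand-in lookup table rather than Projecte Aina's Spanish-Catalan model.

```python
from typing import Callable, Iterable, List, Tuple


def make_synthetic_ca_de(
    es_de_pairs: Iterable[Tuple[str, str]],
    translate_es_to_ca: Callable[[List[str]], List[str]],
    batch_size: int = 32,
) -> List[Tuple[str, str]]:
    """Build (Catalan, German) pairs from (Spanish, German) pairs.

    `translate_es_to_ca` is any callable that maps a batch of Spanish
    sentences to Catalan sentences in the same order. It is an assumption
    of this sketch; plug in a real model wrapper here.
    """
    pairs = list(es_de_pairs)
    synthetic: List[Tuple[str, str]] = []
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        spanish_side = [es for es, _ in batch]
        catalan_side = translate_es_to_ca(spanish_side)
        # Guard against a translator that drops or merges sentences,
        # which would silently misalign the parallel corpus.
        if len(catalan_side) != len(batch):
            raise ValueError("translator returned a different number of sentences")
        # Keep the original German side; only the Spanish side is replaced.
        synthetic.extend((ca, de) for ca, (_, de) in zip(catalan_side, batch))
    return synthetic


def toy_translator(sentences: List[str]) -> List[str]:
    """Placeholder translator (hypothetical): a tiny lookup table, not a model."""
    lookup = {"Buenos días": "Bon dia", "Gracias": "Gràcies"}
    return [lookup.get(sentence, sentence) for sentence in sentences]


if __name__ == "__main__":
    es_de = [("Buenos días", "Guten Morgen"), ("Gracias", "Danke")]
    for ca, de in make_synthetic_ca_de(es_de, toy_translator):
        print(f"{ca}\t{de}")
```

Passing the translator in as a callable keeps the pairing logic independent of whichever inference library serves the translation model, and the length check catches misaligned batches early.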