Update README.md
README.md
@@ -51,9 +51,9 @@ language:
 ### Pretraining Data

 The training corpus consists of 70+XX billion tokens of Catalan-, Spanish-, and English-centric parallel data, including all of the official European languages plus Catalan, Basque,
-Galician, Asturian, Aragonese and Aranese. It amounts to

-This highly multilingual corpus is predominantly composed of data sourced from [OPUS](https://opus.nlpl.eu/), with additional data taken from the [NTEU project](https://nteu.eu/)
 Where little parallel Catalan <-> xx data could be found, synthetic Catalan data was generated from the Spanish side of the collected Spanish <-> xx corpora using
 [Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca). The final distribution of languages was as below:

@@ -66,20 +66,26 @@ Click the expand button below to see the full list of corpora included in the tr

 | Dataset | Ca-xx Languages | Es-xx Languages | En-xx Languages |
 |-----------------------------------------------|----------------------------------------------------------------|-----------------------------------------------|----------------------------------------------------------------|
-|[AINA
-|
-|
 |[DGT](https://opus.nlpl.eu/DGT/corpus/version/DGT) | |bg,cs,da,de,el ,et,fi,fr,ga,hr,hu,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv | da,et,ga,hr,hu,lt,lv,mt,sh,sl|
 |[ELRC-EMEA](https://opus.nlpl.eu/ELRC-EMEA/corpus/version/ELRC-EMEA) | |bg,cs,da,hu,lt,lv,mt,pl,ro,sk,sl | et,hr,lv,ro,sk,sl |
 |[EMEA](https://opus.nlpl.eu/EMEA/corpus/version/EMEA) | |bg,cs,da,el,fi,hu,lt,mt,nl,pl,ro,sk,sl,sv | et,mt |
 |[EUBookshop](https://opus.nlpl.eu/EUbookshop/corpus/version/EUbookshop) |lt,pl,pt |cs,da,de,el,fi,fr,ga,it,lv,mt,nl,pl,pt,ro,sk,sl,sv |cy,ga|
-|[Europarl](https://opus.nlpl.eu/Europarl/corpus/version/Europarl) | |bg,cs,da,el,en,fi,fr,hu,lt,lv,nl,pl,pt ,ro,sk,sl,sv |
 |[Europat](https://opus.nlpl.eu/EuroPat/corpus/version/EuroPat) | |en,hr | no |
 |[GAITU](https://gaitu.eus/) | | | eu|
 |[KDE4](https://opus.nlpl.eu/KDE4/corpus/version/KDE4) |bg,cs,da,de,el ,et,eu,fi,fr,ga,gl,hr,it,lt,lv,nl,pl,pt,ro,sk,sl,sv |bg,ga,hr |cy,ga,nn,oc |
 |[GlobalVoices](https://opus.nlpl.eu/GlobalVoices/corpus/version/GlobalVoices) | bg,de,fr,it,nl,pl,pt |bg,de,fr,pt | |
 |[GNOME](https://opus.nlpl.eu/GNOME/corpus/version/GNOME) |eu,fr,ga,gl,pt |ga |cy,ga,nn|
 |[JRC-Acquis](https://opus.nlpl.eu/JRC-Acquis/corpus/version/JRC-Acquis) | |cs,da,et,fr,lt,lv,mt,nl,pl ,ro,sv| et |
 |[MaCoCu](https://opus.nlpl.eu/MaCoCu/corpus/version/MaCoCu) | en | | hr,mt,uk |
 |[MultiCCAligned](https://opus.nlpl.eu/JRC-Acquis/corpus/version/JRC-Acquis) |bg,cs,de,el,et,fi,fr,hr,hu,it,lt,lv,nl,pl,ro,sk,sv |bg,fi,fr,hr,it,lv,nl,pt |bg,cy,da,et,fi,hr,hu,lt,lv,no,sl,sr,uk|
 |[MultiHPLT](https://opus.nlpl.eu/MultiHPLT/corpus/version/MultiHPLT) |en, et,fi,ga,hr,mt | |fi,ga,gl,hr,mt,nn,sr |
@@ -87,14 +93,18 @@ Click the expand button below to see the full list of corpora included in the tr
 |[MultiUN](https://opus.nlpl.eu/MultiUN/corpus/version/MultiUN) | |fr | |
 |[News-Commentary](https://opus.nlpl.eu/News-Commentary/corpus/version/News-Commentary) | |fr | |
 |[NLLB](https://opus.nlpl.eu/NLLB/corpus/version/NLLB) |bg,da,el,en,et,fi,fr,gl,hu,it ,lt,lv,pt,ro,sk,sl |bg,cs,da,de,el ,et,fi,fr,hu,it,lt,lv,nl,pl,pt ,ro,sk,sl,sv| bg,cs,cy,da,de,el,et,fi,fr,ga,hr,hu,it,lt,lv,mt,nl,no,oc,pl,pt,ro,ru,sk,sl,sr,sv,uk|
 |[NTEU](https://www.elrc-share.eu/repository/search/?q=NTEU) | |bg,cs,da,de,el,en,et,fi,fr,ga,hr,hu,it,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv | da,et,ga,hr,lt,lv,mt,ro,sk,sl,sv |
-|[OpenSubtitles](https://opus.nlpl.eu/OpenSubtitles/corpus/version/OpenSubtitles) |bg,cs,da,de,el ,et,eu,fi,gl,hr,hu,lt,lv,nl,pl,pt,ro,sk,sl,sv |da,de,fi,fr,hr,hu,it,lv,nl | bg,cs,de,el,et,hr,fi,fr,
 |[OPUS-100](https://opus.nlpl.eu/opus-100.php) | en | | gl |
 |[StanfordNLP-NMT](https://opus.nlpl.eu/StanfordNLP-NMT/corpus/version/StanfordNLP-NMT) | | |cs |
 |[Tatoeba](https://opus.nlpl.eu/Tatoeba/corpus/version/Tatoeba) |de,pt |pt | |
 |[TildeModel](https://opus.nlpl.eu/TildeMODEL/corpus/version/TildeMODEL) | |bg | et,hr,lt,lv,mt |
 |[UNPC](https://opus.nlpl.eu/UNPC/corpus/version/UNPC) | |en,fr | ru |
-|[
 |[Wikimedia](https://opus.nlpl.eu/wikimedia/corpus/version/wikimedia) | | |cy,nn |
 |[XLENT](https://opus.nlpl.eu/XLEnt/corpus/version/XLEnt) |eu,ga,gl |ga |cy,et,ga,gl,hr,oc,sh|

@@ -51,9 +51,9 @@ language:
 ### Pretraining Data

 The training corpus consists of 70+XX billion tokens of Catalan-, Spanish-, and English-centric parallel data, including all of the official European languages plus Catalan, Basque,
+Galician, Asturian, Aragonese and Aranese. It amounts to 6,574,251,526 parallel sentence pairs.

+This highly multilingual corpus is predominantly composed of data sourced from [OPUS](https://opus.nlpl.eu/), with additional data taken from the [NTEU project](https://nteu.eu/), Project Aina’s existing corpora, and our own unpublished datasets.
 Where little parallel Catalan <-> xx data could be found, synthetic Catalan data was generated from the Spanish side of the collected Spanish <-> xx corpora using
 [Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca). The final distribution of languages was as below:

@@ -66,20 +66,26 @@ Click the expand button below to see the full list of corpora included in the tr

 | Dataset | Ca-xx Languages | Es-xx Languages | En-xx Languages |
 |-----------------------------------------------|----------------------------------------------------------------|-----------------------------------------------|----------------------------------------------------------------|
+|[AINA](https://huggingface.co/projecte-aina) | en | | |
+|ARANESE-SYNTH-CORPUS-BSC | arn | | |
+|BOUA-BSC | | val | |
+|[BOUMH](https://github.com/transducens/PILAR/tree/main/valencian/BOUMH) | | val | |
+|[BOUA-PILAR](https://github.com/transducens/PILAR/tree/main/valencian/BOUA) | | val | |
+|[CCMatrix](https://opus.nlpl.eu/CCMatrix/corpus/version/CCMatrix) |eu | | ga |
 |[DGT](https://opus.nlpl.eu/DGT/corpus/version/DGT) | |bg,cs,da,de,el ,et,fi,fr,ga,hr,hu,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv | da,et,ga,hr,hu,lt,lv,mt,sh,sl|
+|DOGV-BSC | | val | |
+|[DOGV-PILAR](https://github.com/transducens/PILAR/tree/main/valencian/DOGV-html) | | val | |
 |[ELRC-EMEA](https://opus.nlpl.eu/ELRC-EMEA/corpus/version/ELRC-EMEA) | |bg,cs,da,hu,lt,lv,mt,pl,ro,sk,sl | et,hr,lv,ro,sk,sl |
 |[EMEA](https://opus.nlpl.eu/EMEA/corpus/version/EMEA) | |bg,cs,da,el,fi,hu,lt,mt,nl,pl,ro,sk,sl,sv | et,mt |
 |[EUBookshop](https://opus.nlpl.eu/EUbookshop/corpus/version/EUbookshop) |lt,pl,pt |cs,da,de,el,fi,fr,ga,it,lv,mt,nl,pl,pt,ro,sk,sl,sv |cy,ga|
+|[Europarl](https://opus.nlpl.eu/Europarl/corpus/version/Europarl) | |bg,cs,da,el,en,fi,fr,hu,lt,lv,nl,pl,pt ,ro,sk,sl,sv | |
 |[Europat](https://opus.nlpl.eu/EuroPat/corpus/version/EuroPat) | |en,hr | no |
 |[GAITU](https://gaitu.eus/) | | | eu|
 |[KDE4](https://opus.nlpl.eu/KDE4/corpus/version/KDE4) |bg,cs,da,de,el ,et,eu,fi,fr,ga,gl,hr,it,lt,lv,nl,pl,pt,ro,sk,sl,sv |bg,ga,hr |cy,ga,nn,oc |
 |[GlobalVoices](https://opus.nlpl.eu/GlobalVoices/corpus/version/GlobalVoices) | bg,de,fr,it,nl,pl,pt |bg,de,fr,pt | |
 |[GNOME](https://opus.nlpl.eu/GNOME/corpus/version/GNOME) |eu,fr,ga,gl,pt |ga |cy,ga,nn|
 |[JRC-Acquis](https://opus.nlpl.eu/JRC-Acquis/corpus/version/JRC-Acquis) | |cs,da,et,fr,lt,lv,mt,nl,pl ,ro,sv| et |
+|LES-CORTS-VALENCIANES-BSC | | val | |
 |[MaCoCu](https://opus.nlpl.eu/MaCoCu/corpus/version/MaCoCu) | en | | hr,mt,uk |
 |[MultiCCAligned](https://opus.nlpl.eu/JRC-Acquis/corpus/version/JRC-Acquis) |bg,cs,de,el,et,fi,fr,hr,hu,it,lt,lv,nl,pl,ro,sk,sv |bg,fi,fr,hr,it,lv,nl,pt |bg,cy,da,et,fi,hr,hu,lt,lv,no,sl,sr,uk|
 |[MultiHPLT](https://opus.nlpl.eu/MultiHPLT/corpus/version/MultiHPLT) |en, et,fi,ga,hr,mt | |fi,ga,gl,hr,mt,nn,sr |
@@ -87,14 +93,18 @@ Click the expand button below to see the full list of corpora included in the tr
 |[MultiUN](https://opus.nlpl.eu/MultiUN/corpus/version/MultiUN) | |fr | |
 |[News-Commentary](https://opus.nlpl.eu/News-Commentary/corpus/version/News-Commentary) | |fr | |
 |[NLLB](https://opus.nlpl.eu/NLLB/corpus/version/NLLB) |bg,da,el,en,et,fi,fr,gl,hu,it ,lt,lv,pt,ro,sk,sl |bg,cs,da,de,el ,et,fi,fr,hu,it,lt,lv,nl,pl,pt ,ro,sk,sl,sv| bg,cs,cy,da,de,el,et,fi,fr,ga,hr,hu,it,lt,lv,mt,nl,no,oc,pl,pt,ro,ru,sk,sl,sr,sv,uk|
+|[NÓS](https://zenodo.org/records/7675110) | | | gl |
+|[NÓS-SYN](https://zenodo.org/records/7685180) | | | gl |
 |[NTEU](https://www.elrc-share.eu/repository/search/?q=NTEU) | |bg,cs,da,de,el,en,et,fi,fr,ga,hr,hu,it,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv | da,et,ga,hr,lt,lv,mt,ro,sk,sl,sv |
+|[OpenSubtitles](https://opus.nlpl.eu/OpenSubtitles/corpus/version/OpenSubtitles) |bg,cs,da,de,el ,et,eu,fi,gl,hr,hu,lt,lv,nl,pl,pt,ro,sk,sl,sv |da,de,fi,fr,hr,hu,it,lv,nl | bg,cs,de,el,et,hr,fi,fr,hr,hu,no,sl,sr|
 |[OPUS-100](https://opus.nlpl.eu/opus-100.php) | en | | gl |
 |[StanfordNLP-NMT](https://opus.nlpl.eu/StanfordNLP-NMT/corpus/version/StanfordNLP-NMT) | | |cs |
 |[Tatoeba](https://opus.nlpl.eu/Tatoeba/corpus/version/Tatoeba) |de,pt |pt | |
 |[TildeModel](https://opus.nlpl.eu/TildeMODEL/corpus/version/TildeMODEL) | |bg | et,hr,lt,lv,mt |
 |[UNPC](https://opus.nlpl.eu/UNPC/corpus/version/UNPC) | |en,fr | ru |
+|[VALENCIAN-AUTH](https://github.com/transducens/PILAR/tree/main/valencian/Generalitat) | | val | |
+|[VALENCIAN-SYNTH](https://github.com/transducens/PILAR/tree/main/valencian/Generalitat) | | val | |
+|[WikiMatrix](https://opus.nlpl.eu/WikiMatrix/corpus/version/WikiMatrix) |bg,cs,da,de,el ,et,eu,fi,fr,gl,hr,hu,it,lt,nl,pl,pt,ro,sk,sl,sv |bg,en,fr,hr,it,pt | oc,sh |
 |[Wikimedia](https://opus.nlpl.eu/wikimedia/corpus/version/wikimedia) | | |cy,nn |
 |[XLENT](https://opus.nlpl.eu/XLEnt/corpus/version/XLEnt) |eu,ga,gl |ga |cy,et,ga,gl,hr,oc,sh|
