xixianliao commited on
Commit
f487813
·
verified ·
1 Parent(s): 41ed92e

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +18 -8
README.md CHANGED
@@ -51,9 +51,9 @@ language:
51
  ### Pretraining Data
52
 
53
  The training corpus consists of 70+XX billion tokens of Catalan-, Spanish-centric, and English-centric parallel data, including all of the official European languages plus Catalan, Basque,
54
- Galician, Asturian, Aragonese and Aranese. It amounts to 3,157,965,012+XX parallel sentence pairs.
55
 
56
- This highly multilingual corpus is predominantly composed of data sourced from [OPUS](https://opus.nlpl.eu/), with additional data taken from the [NTEU project](https://nteu.eu/) and Project Aina’s existing corpora.
57
  Where little parallel Catalan <-> xx data could be found, synthetic Catalan data was generated from the Spanish side of the collected Spanish <-> xx corpora using
58
  [Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca). The final distribution of languages was as below:
59
 
@@ -66,20 +66,26 @@ Click the expand button below to see the full list of corpora included in the tr
66
 
67
  | Dataset | Ca-xx Languages | Es-xx Langugages | En-xx Languages |
68
  |-----------------------------------------------|----------------------------------------------------------------|-----------------------------------------------|----------------------------------------------------------------|
69
- |[AINA CA-EN](https://huggingface.co/datasets/projecte-aina/CA-EN_Parallel_Corpus) | en | | |
70
- |[CCMatrix](https://opus.nlpl.eu/CCMatrix/corpus/version/CCMatrix) |eu | | ga,gl |
71
- |[CLUVI](https://dataverse.csuc.cat/dataset.xhtml?persistentId=doi:10.34810/data267) | | | gl |
 
 
 
72
  |[DGT](https://opus.nlpl.eu/DGT/corpus/version/DGT) | |bg,cs,da,de,el ,et,fi,fr,ga,hr,hu,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv | da,et,ga,hr,hu,lt,lv,mt,sh,sl|
 
 
73
  |[ELRC-EMEA](https://opus.nlpl.eu/ELRC-EMEA/corpus/version/ELRC-EMEA) | |bg,cs,da,hu,lt,lv,mt,pl,ro,sk,sl | et,hr,lv,ro,sk,sl |
74
  |[EMEA](https://opus.nlpl.eu/EMEA/corpus/version/EMEA) | |bg,cs,da,el,fi,hu,lt,mt,nl,pl,ro,sk,sl,sv | et,mt |
75
  |[EUBookshop](https://opus.nlpl.eu/EUbookshop/corpus/version/EUbookshop) |lt,pl,pt |cs,da,de,el,fi,fr,ga,it,lv,mt,nl,pl,pt,ro,sk,sl,sv |cy,ga|
76
- |[Europarl](https://opus.nlpl.eu/Europarl/corpus/version/Europarl) | |bg,cs,da,el,en,fi,fr,hu,lt,lv,nl,pl,pt ,ro,sk,sl,sv |gl_pt |
77
  |[Europat](https://opus.nlpl.eu/EuroPat/corpus/version/EuroPat) | |en,hr | no |
78
  |[GAITU](https://gaitu.eus/) | | | eu|
79
  |[KDE4](https://opus.nlpl.eu/KDE4/corpus/version/KDE4) |bg,cs,da,de,el ,et,eu,fi,fr,ga,gl,hr,it,lt,lv,nl,pl,pt,ro,sk,sl,sv |bg,ga,hr |cy,ga,nn,oc |
80
  |[GlobalVoices](https://opus.nlpl.eu/GlobalVoices/corpus/version/GlobalVoices) | bg,de,fr,it,nl,pl,pt |bg,de,fr,pt | |
81
  |[GNOME](https://opus.nlpl.eu/GNOME/corpus/version/GNOME) |eu,fr,ga,gl,pt |ga |cy,ga,nn|
82
  |[JRC-Arquis](https://opus.nlpl.eu/JRC-Acquis/corpus/version/JRC-Acquis) | |cs,da,et,fr,lt,lv,mt,nl,pl ,ro,sv| et |
 
83
  |[MaCoCu](https://opus.nlpl.eu/MaCoCu/corpus/version/MaCoCu) | en | | hr,mt,uk |
84
  |[MultiCCAligned](https://opus.nlpl.eu/JRC-Acquis/corpus/version/JRC-Acquis) |bg,cs,de,el,et,fi,fr,hr,hu,it,lt,lv,nl,pl,ro,sk,sv |bg,fi,fr,hr,it,lv,nl,pt |bg,cy,da,et,fi,hr,hu,lt,lv,no,sl,sr,uk|
85
  |[MultiHPLT](https://opus.nlpl.eu/MultiHPLT/corpus/version/MultiHPLT) |en, et,fi,ga,hr,mt | |fi,ga,gl,hr,mt,nn,sr |
@@ -87,14 +93,18 @@ Click the expand button below to see the full list of corpora included in the tr
87
  |[MultiUN](https://opus.nlpl.eu/MultiUN/corpus/version/MultiUN) | |fr | |
88
  |[News-Commentary](https://opus.nlpl.eu/News-Commentary/corpus/version/News-Commentary) | |fr | |
89
  |[NLLB](https://opus.nlpl.eu/NLLB/corpus/version/NLLB) |bg,da,el,en,et,fi,fr,gl,hu,it ,lt,lv,pt,ro,sk,sl |bg,cs,da,de,el ,et,fi,fr,hu,it,lt,lv,nl,pl,pt ,ro,sk,sl,sv| bg,cs,cy,da,de,el,et,fi,fr,ga,hr,hu,it,lt,lv,mt,nl,no,oc,pl,pt,ro,ru,sk,sl,sr,sv,uk|
 
 
90
  |[NTEU](https://www.elrc-share.eu/repository/search/?q=NTEU) | |bg,cs,da,de,el,en,et,fi,fr,ga,hr,hu,it,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv | da,et,ga,hr,lt,lv,mt,ro,sk,sl,sv |
91
- |[OpenSubtitles](https://opus.nlpl.eu/OpenSubtitles/corpus/version/OpenSubtitles) |bg,cs,da,de,el ,et,eu,fi,gl,hr,hu,lt,lv,nl,pl,pt,ro,sk,sl,sv |da,de,fi,fr,hr,hu,it,lv,nl | bg,cs,de,el,et,hr,fi,fr,gl_pt,hr,hu,no,sl,sr|
92
  |[OPUS-100](https://opus.nlpl.eu/opus-100.php) | en | | gl |
93
  |[StanfordNLP-NMT](https://opus.nlpl.eu/StanfordNLP-NMT/corpus/version/StanfordNLP-NMT) | | |cs |
94
  |[Tatoeba](https://opus.nlpl.eu/Tatoeba/corpus/version/Tatoeba) |de,pt |pt | |
95
  |[TildeModel](https://opus.nlpl.eu/TildeMODEL/corpus/version/TildeMODEL) | |bg | et,hr,lt,lv,mt |
96
  |[UNPC](https://opus.nlpl.eu/UNPC/corpus/version/UNPC) | |en,fr | ru |
97
- |[WikiMatrix](https://opus.nlpl.eu/WikiMatrix/corpus/version/WikiMatrix) |bg,cs,da,de,el ,et,eu,fi,fr,gl,hr,hu,it,lt,nl,pl,pt,ro,sk,sl,sv |bg,en,fr,hr,it,pt | gl,oc,sh |
 
 
98
  |[Wikimedia](https://opus.nlpl.eu/wikimedia/corpus/version/wikimedia) | | |cy,nn |
99
  |[XLENT](https://opus.nlpl.eu/XLEnt/corpus/version/XLEnt) |eu,ga,gl |ga |cy,et,ga,gl,hr,oc,sh|
100
 
 
51
  ### Pretraining Data
52
 
53
  The training corpus consists of 70+XX billion tokens of Catalan-, Spanish-centric, and English-centric parallel data, including all of the official European languages plus Catalan, Basque,
54
+ Galician, Asturian, Aragonese and Aranese. It amounts to 6,574,251,526 parallel sentence pairs.
55
 
56
+ This highly multilingual corpus is predominantly composed of data sourced from [OPUS](https://opus.nlpl.eu/), with additional data taken from the [NTEU project](https://nteu.eu/), Project Aina’s existing corpora, and our own unpublished datasets.
57
  Where little parallel Catalan <-> xx data could be found, synthetic Catalan data was generated from the Spanish side of the collected Spanish <-> xx corpora using
58
  [Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca). The final distribution of languages was as below:
59
 
 
66
 
67
  | Dataset | Ca-xx Languages | Es-xx Langugages | En-xx Languages |
68
  |-----------------------------------------------|----------------------------------------------------------------|-----------------------------------------------|----------------------------------------------------------------|
69
+ |[AINA](https://huggingface.co/projecte-aina) | en | | |
70
+ |ARANESE-SYNTH-CORPUS-BSC | arn | | |
71
+ |BOUA-BSC | | val | |
72
+ |[BOUMH](https://github.com/transducens/PILAR/tree/main/valencian/BOUMH) | | val | |
73
+ |[BOUA-PILAR](https://github.com/transducens/PILAR/tree/main/valencian/BOUA) | | val | |
74
+ |[CCMatrix](https://opus.nlpl.eu/CCMatrix/corpus/version/CCMatrix) |eu | | ga |
75
  |[DGT](https://opus.nlpl.eu/DGT/corpus/version/DGT) | |bg,cs,da,de,el ,et,fi,fr,ga,hr,hu,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv | da,et,ga,hr,hu,lt,lv,mt,sh,sl|
76
+ |DOGV-BSC | | val | |
77
+ |[DOGV-PILAR](https://github.com/transducens/PILAR/tree/main/valencian/DOGV-html) | | val | |
78
  |[ELRC-EMEA](https://opus.nlpl.eu/ELRC-EMEA/corpus/version/ELRC-EMEA) | |bg,cs,da,hu,lt,lv,mt,pl,ro,sk,sl | et,hr,lv,ro,sk,sl |
79
  |[EMEA](https://opus.nlpl.eu/EMEA/corpus/version/EMEA) | |bg,cs,da,el,fi,hu,lt,mt,nl,pl,ro,sk,sl,sv | et,mt |
80
  |[EUBookshop](https://opus.nlpl.eu/EUbookshop/corpus/version/EUbookshop) |lt,pl,pt |cs,da,de,el,fi,fr,ga,it,lv,mt,nl,pl,pt,ro,sk,sl,sv |cy,ga|
81
+ |[Europarl](https://opus.nlpl.eu/Europarl/corpus/version/Europarl) | |bg,cs,da,el,en,fi,fr,hu,lt,lv,nl,pl,pt ,ro,sk,sl,sv | |
82
  |[Europat](https://opus.nlpl.eu/EuroPat/corpus/version/EuroPat) | |en,hr | no |
83
  |[GAITU](https://gaitu.eus/) | | | eu|
84
  |[KDE4](https://opus.nlpl.eu/KDE4/corpus/version/KDE4) |bg,cs,da,de,el ,et,eu,fi,fr,ga,gl,hr,it,lt,lv,nl,pl,pt,ro,sk,sl,sv |bg,ga,hr |cy,ga,nn,oc |
85
  |[GlobalVoices](https://opus.nlpl.eu/GlobalVoices/corpus/version/GlobalVoices) | bg,de,fr,it,nl,pl,pt |bg,de,fr,pt | |
86
  |[GNOME](https://opus.nlpl.eu/GNOME/corpus/version/GNOME) |eu,fr,ga,gl,pt |ga |cy,ga,nn|
87
  |[JRC-Arquis](https://opus.nlpl.eu/JRC-Acquis/corpus/version/JRC-Acquis) | |cs,da,et,fr,lt,lv,mt,nl,pl ,ro,sv| et |
88
+ |LES-CORTS-VALENCIANES-BSC | | val | |
89
  |[MaCoCu](https://opus.nlpl.eu/MaCoCu/corpus/version/MaCoCu) | en | | hr,mt,uk |
90
  |[MultiCCAligned](https://opus.nlpl.eu/JRC-Acquis/corpus/version/JRC-Acquis) |bg,cs,de,el,et,fi,fr,hr,hu,it,lt,lv,nl,pl,ro,sk,sv |bg,fi,fr,hr,it,lv,nl,pt |bg,cy,da,et,fi,hr,hu,lt,lv,no,sl,sr,uk|
91
  |[MultiHPLT](https://opus.nlpl.eu/MultiHPLT/corpus/version/MultiHPLT) |en, et,fi,ga,hr,mt | |fi,ga,gl,hr,mt,nn,sr |
 
93
  |[MultiUN](https://opus.nlpl.eu/MultiUN/corpus/version/MultiUN) | |fr | |
94
  |[News-Commentary](https://opus.nlpl.eu/News-Commentary/corpus/version/News-Commentary) | |fr | |
95
  |[NLLB](https://opus.nlpl.eu/NLLB/corpus/version/NLLB) |bg,da,el,en,et,fi,fr,gl,hu,it ,lt,lv,pt,ro,sk,sl |bg,cs,da,de,el ,et,fi,fr,hu,it,lt,lv,nl,pl,pt ,ro,sk,sl,sv| bg,cs,cy,da,de,el,et,fi,fr,ga,hr,hu,it,lt,lv,mt,nl,no,oc,pl,pt,ro,ru,sk,sl,sr,sv,uk|
96
+ |[NÓS](https://zenodo.org/records/7675110) | | | gl |
97
+ |[NÓS-SYN](https://zenodo.org/records/7685180) | | | gl |
98
  |[NTEU](https://www.elrc-share.eu/repository/search/?q=NTEU) | |bg,cs,da,de,el,en,et,fi,fr,ga,hr,hu,it,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv | da,et,ga,hr,lt,lv,mt,ro,sk,sl,sv |
99
+ |[OpenSubtitles](https://opus.nlpl.eu/OpenSubtitles/corpus/version/OpenSubtitles) |bg,cs,da,de,el ,et,eu,fi,gl,hr,hu,lt,lv,nl,pl,pt,ro,sk,sl,sv |da,de,fi,fr,hr,hu,it,lv,nl | bg,cs,de,el,et,hr,fi,fr,hr,hu,no,sl,sr|
100
  |[OPUS-100](https://opus.nlpl.eu/opus-100.php) | en | | gl |
101
  |[StanfordNLP-NMT](https://opus.nlpl.eu/StanfordNLP-NMT/corpus/version/StanfordNLP-NMT) | | |cs |
102
  |[Tatoeba](https://opus.nlpl.eu/Tatoeba/corpus/version/Tatoeba) |de,pt |pt | |
103
  |[TildeModel](https://opus.nlpl.eu/TildeMODEL/corpus/version/TildeMODEL) | |bg | et,hr,lt,lv,mt |
104
  |[UNPC](https://opus.nlpl.eu/UNPC/corpus/version/UNPC) | |en,fr | ru |
105
+ |[VALENCIAN-AUTH](https://github.com/transducens/PILAR/tree/main/valencian/Generalitat) | | val | |
106
+ |[VALENCIAN-SYNTH](https://github.com/transducens/PILAR/tree/main/valencian/Generalitat) | | val | |
107
+ |[WikiMatrix](https://opus.nlpl.eu/WikiMatrix/corpus/version/WikiMatrix) |bg,cs,da,de,el ,et,eu,fi,fr,gl,hr,hu,it,lt,nl,pl,pt,ro,sk,sl,sv |bg,en,fr,hr,it,pt | oc,sh |
108
  |[Wikimedia](https://opus.nlpl.eu/wikimedia/corpus/version/wikimedia) | | |cy,nn |
109
  |[XLENT](https://opus.nlpl.eu/XLEnt/corpus/version/XLEnt) |eu,ga,gl |ga |cy,et,ga,gl,hr,oc,sh|
110