Update README.md
README.md
CHANGED
@@ -322,7 +322,7 @@ including all of the official European languages plus Catalan, Basque, Galician,
 It amounts to 6,574,251,526 parallel sentence pairs.
 
 This highly multilingual corpus is predominantly composed of data sourced from [OPUS](https://opus.nlpl.eu/),
-with additional data taken from the [NTEU project](https://nteu.eu/),
+with additional data taken from the [NTEU project](https://nteu.eu/), [Aina Project](https://projecteaina.cat/), and other sources (see: [Data Sources#](#pre-data-sources) and [References below](#pre-references)).
 Where little parallel Catalan <-> xx data could be found, synthetic Catalan data was generated from the Spanish side of the collected Spanish <-> xx corpora using
 [Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca). The final distribution of languages was as below:
 
@@ -330,7 +330,7 @@ Where little parallel Catalan <-> xx data could be found, synthetic Catalan data
 
 Click the expand button below to see the full list of corpora included in the training data.
 
-<details>
+<details id="pre-data-sources">
 <summary>Data Sources</summary>
 
 | Dataset | Ca-xx Languages | Es-xx Languages | En-xx Languages |
@@ -389,7 +389,7 @@ To consult the data summary document with the respective licences, please send a
 
 
 
-<details>
+<details id="pre-references">
 <summary>References</summary>
 
 - Aulamo, M., Sulubacak, U., Virpioja, S., & Tiedemann, J. (2020). OpusTools and Parallel Corpus Diagnostics. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 3782–3789). European Language Resources Association. https://aclanthology.org/2020.lrec-1.467
@@ -428,7 +428,7 @@ While tasks relating to machine translation are included, it’s important to no
 
 Click the expand button below to see the full list of tasks included in the finetuning data.
 
-<details>
+<details id="instr-data-sources">
 <summary>Data Sources</summary>
 
 
@@ -476,7 +476,7 @@ please contact <langtech@bsc.es>.
 
 </details>
 
-<details>
+<details id="instr-references">
 <summary>References</summary>
 
 - Alves, D. M., Pombal, J., Guerreiro, N. M., Martins, P. H., Alves, J., Farajian, A., Peters, B., Rei, R., Fernandes, P., Agrawal, S., Colombo, P., de Souza, J. G. C., & Martins, A. F. T. (2024). Tower: An open multilingual large language model for translation-related tasks (No. arXiv: 2402.17733). arXiv. https://arxiv.org/abs/2402.17733