fdelucaf committed on
Commit 87c6b12 · verified · 1 Parent(s): 290f442

Update README.md

Files changed (1)
  1. README.md +5 -5
README.md CHANGED
@@ -322,7 +322,7 @@ including all of the official European languages plus Catalan, Basque, Galician,
322
  It amounts to 6,574,251,526 parallel sentence pairs.
323
 
324
  This highly multilingual corpus is predominantly composed of data sourced from [OPUS](https://opus.nlpl.eu/),
325
- with additional data taken from the [NTEU project](https://nteu.eu/), Project Aina’s corpora, and other sources (see: Data Sources and References below).
326
  Where little parallel Catalan <-> xx data could be found, synthetic Catalan data was generated from the Spanish side of the collected Spanish <-> xx corpora using
327
  [Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca). The final distribution of languages was as below:
328
 
@@ -330,7 +330,7 @@ Where little parallel Catalan <-> xx data could be found, synthetic Catalan data
330
 
331
  Click the expand button below to see the full list of corpora included in the training data.
332
 
333
- <details>
334
  <summary>Data Sources</summary>
335
 
336
  | Dataset | Ca-xx Languages | Es-xx Langugages | En-xx Languages |
@@ -389,7 +389,7 @@ To consult the data summary document with the respective licences, please send a
389
 
390
 
391
 
392
- <details>
393
  <summary>References</summary>
394
 
395
  - Aulamo, M., Sulubacak, U., Virpioja, S., & Tiedemann, J. (2020). OpusTools and Parallel Corpus Diagnostics. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 3782–3789). European Language Resources Association. https://aclanthology.org/2020.lrec-1.467
@@ -428,7 +428,7 @@ While tasks relating to machine translation are included, it’s important to no
428
 
429
  Click the expand button below to see the full list of tasks included in the finetuning data.
430
 
431
- <details>
432
  <summary>Data Sources</summary>
433
 
434
 
@@ -476,7 +476,7 @@ please contact <langtech@bsc.es>.
476
 
477
  </details>
478
 
479
- <details>
480
  <summary>References</summary>
481
 
482
  - Alves, D. M., Pombal, J., Guerreiro, N. M., Martins, P. H., Alves, J., Farajian, A., Peters, B., Rei, R., Fernandes, P., Agrawal, S., Colombo, P., de Souza, J. G. C., & Martins, A. F. T. (2024). Tower: An open multilingual large language model for translation-related tasks (No. arXiv: 2402.17733). arXiv. https://arxiv.org/abs/2402.17733
 
322
  It amounts to 6,574,251,526 parallel sentence pairs.
323
 
324
  This highly multilingual corpus is predominantly composed of data sourced from [OPUS](https://opus.nlpl.eu/),
325
+ with additional data taken from the [NTEU project](https://nteu.eu/), [Aina Project](https://projecteaina.cat/), and other sources (see: [Data Sources#](#pre-data-sources) and [References below](#pre-references)).
326
  Where little parallel Catalan <-> xx data could be found, synthetic Catalan data was generated from the Spanish side of the collected Spanish <-> xx corpora using
327
  [Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca). The final distribution of languages was as below:
328
 
 
330
 
331
  Click the expand button below to see the full list of corpora included in the training data.
332
 
333
+ <details id="pre-data-sources">
334
  <summary>Data Sources</summary>
335
 
336
  | Dataset | Ca-xx Languages | Es-xx Langugages | En-xx Languages |
 
389
 
390
 
391
 
392
+ <details id="pre-references">
393
  <summary>References</summary>
394
 
395
  - Aulamo, M., Sulubacak, U., Virpioja, S., & Tiedemann, J. (2020). OpusTools and Parallel Corpus Diagnostics. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 3782–3789). European Language Resources Association. https://aclanthology.org/2020.lrec-1.467
 
428
 
429
  Click the expand button below to see the full list of tasks included in the finetuning data.
430
 
431
+ <details id="instr-data-sources">
432
  <summary>Data Sources</summary>
433
 
434
 
 
476
 
477
  </details>
478
 
479
+ <details id="instr-references">
480
  <summary>References</summary>
481
 
482
  - Alves, D. M., Pombal, J., Guerreiro, N. M., Martins, P. H., Alves, J., Farajian, A., Peters, B., Rei, R., Fernandes, P., Agrawal, S., Colombo, P., de Souza, J. G. C., & Martins, A. F. T. (2024). Tower: An open multilingual large language model for translation-related tasks (No. arXiv: 2402.17733). arXiv. https://arxiv.org/abs/2402.17733