fdelucaf committed
Commit 14b8abd (verified)
Parent: 3ae0570

Update README.md

Files changed (1): README.md (+16 -14)
README.md CHANGED
@@ -51,7 +51,7 @@ base_model:

SalamandraTA-7b-instruct is a translation LLM that has been instruction-tuned from SalamandraTA-7b-base.
The base model results from continually pre-training [Salamandra-7b](https://huggingface.co/BSC-LT/salamandra-7b) on parallel data and has not been published, but is reserved for internal use.
- SalamandraTA-7b-instruct is proficent in 37 european languages and supportS translation-related tasks, namely: sentence-level-translation, paragraph-level-translation, document-level-translation, automatic post-editing, machine translation evaluation, multi-reference-translation, named-entity-recognition and context-aware translation.
+ SalamandraTA-7b-instruct is proficient in 37 European languages and supports translation-related tasks, namely: sentence-level translation, paragraph-level translation, document-level translation, automatic post-editing, machine translation evaluation, multi-reference translation, named-entity recognition and context-aware translation.

> [!WARNING]
> **DISCLAIMER:** This version of Salamandra is tailored exclusively for translation tasks. It lacks chat capabilities and has not been trained with any chat instructions.
@@ -135,7 +135,7 @@ Aragonese, Aranese, Asturian, Basque, Bulgarian, Croatian, Czech, Danish, Dutch,
Irish, Italian, Latvian, Lithuanian, Maltese, Norwegian Bokmål, Norwegian Nynorsk, Occitan, Polish, Portuguese, Romanian, Russian, Serbian, Slovak,
Slovenian, Spanish, Swedish, Ukrainian, Valencian, Welsh.

- The instruction-following model use the commonly adopted ChatML template:
+ The instruction-following model uses the commonly adopted ChatML template:

```
<|im_start|>system
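For illustration, a minimal sketch of building such a ChatML prompt with the tokenizer's chat template; the Hub repo id and the instruction wording are assumptions for the example, not taken from this diff:

```python
# Minimal sketch, assuming the tokenizer on the Hub ships the ChatML template
# described above; the repo id and prompt wording are illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BSC-LT/salamandraTA-7b-instruct")

messages = [
    {
        "role": "user",
        "content": "Translate the following text from Spanish into Catalan.\n"
                   "Spanish: Hola, ¿cómo estás?\nCatalan:",
    },
]

# Renders the messages with the <|im_start|>/<|im_end|> markers stored in the
# tokenizer config and appends the assistant header so generation can begin.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```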
@@ -322,7 +322,7 @@ including all of the official European languages plus Catalan, Basque, Galician,
It amounts to 6,574,251,526 parallel sentence pairs.

This highly multilingual corpus is predominantly composed of data sourced from [OPUS](https://opus.nlpl.eu/),
- with additional data taken from the [NTEU project](https://nteu.eu/), [Aina Project](https://projecteaina.cat/), and other sources
+ with additional data taken from the [NTEU Project](https://nteu.eu/), [Aina Project](https://projecteaina.cat/), and other sources
(see: [Data Sources](#pre-data-sources) and [References](#pre-references)).
Where little parallel Catalan <-> xx data could be found, synthetic Catalan data was generated from the Spanish side of the collected Spanish <-> xx corpora using
[Projecte Aina's Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca). The final distribution of languages was as below:
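The pivoting step described above can be sketched as follows. This is an illustrative reconstruction, assuming the es->ca checkpoint loads as a standard transformers seq2seq model; if it does not, any Spanish->Catalan MT model can be substituted:

```python
# Illustrative sketch of generating synthetic Catalan <-> xx pairs by pivoting
# through Spanish; assumes the checkpoint works with the transformers
# translation pipeline (an assumption, not verified in this README).
from transformers import pipeline

es_to_ca = pipeline("translation", model="projecte-aina/aina-translator-es-ca")

# Spanish side of a collected Spanish <-> xx corpus (toy data).
es_xx_pairs = [
    ("El tiempo es agradable hoy.", "The weather is nice today."),
]

# Translate the Spanish side into Catalan to obtain Catalan <-> xx pairs.
ca_xx_pairs = [
    (es_to_ca(es)[0]["translation_text"], xx) for es, xx in es_xx_pairs
]
print(ca_xx_pairs)
```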
@@ -475,7 +475,7 @@ Click the expand button below to see the full list of tasks included in the fine
| Context-Aware Translation | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2): [MT-GenEval](https://github.com/amazon-science/machine-translation-gender-eval) | en-de | 558 |
|**Total** | | | **135,404** |

- The non-public portion of this dataset was jointly created by the [ILENIA](https://proyectoilenia.es/) partners BSC, [HiTZ](http://hitz.ehu.eus/es),
+ The non-public portion of this dataset was jointly created by the [ILENIA](https://proyectoilenia.es/) partners: BSC-LT, [HiTZ](http://hitz.ehu.eus/es),
and [CiTIUS](https://citius.gal/es/). For further information regarding the instruction-tuning data,
please contact <langtech@bsc.es>.

@@ -506,9 +506,9 @@ please contact <langtech@bsc.es>.

Below are the evaluation results on the [Flores+200 devtest set](https://huggingface.co/datasets/openlanguagedata/flores_plus),
compared against the state-of-the-art MADLAD400-7B model ([Kudugunta, S., et al.](https://arxiv.org/abs/2309.04662)) and SalamandraTA-7b-base model.
- These results cover translation directions between CA-XX, ES-XX, EN-XX, as well as XX-CA, XX-ES, and XX-EN.
+ These results cover the translation directions CA-XX, ES-XX, EN-XX, as well as XX-CA, XX-ES, and XX-EN.
The metrics have been computed excluding Asturian, Aranese, and Aragonese, as we report them separately.
- The evaluation was conducted using [MT Lens](https://github.com/langtech-bsc/mt-evaluation) following the standard setting (beam search with beam size 5, limiting the translation length to 500 tokens). We report the following metrics:
+ The evaluation was conducted using [MT Lens](https://github.com/langtech-bsc/mt-evaluation), following the standard setting (beam search with beam size 5, limiting the translation length to 500 tokens). We report the following metrics:

<details>
<summary>Click to show metrics details</summary>
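For reference, the decoding setting stated above (beam search, beam size 5, up to 500 tokens) corresponds roughly to the following transformers call; the actual evaluation is driven through MT Lens, and the Hub repo id below is an assumption:

```python
# Sketch of the stated decoding setting (beam search with beam size 5 and a
# 500-token limit); not the MT Lens harness itself.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BSC-LT/salamandraTA-7b-instruct"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": "Translate the following text from English into Catalan.\n"
                   "English: The weather is nice today.\nCatalan:",
    },
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Beam search with beam size 5, capped at 500 generated tokens.
output = model.generate(
    input_ids, num_beams=5, max_new_tokens=500, early_stopping=True
)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```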
@@ -651,7 +651,7 @@ This section presents the evaluation metrics for Basque translation tasks.

The tables below summarize the performance metrics for English, Spanish, and Catalan to Asturian, Aranese and Aragonese compared
against [Transducens/IbRo-nllb](https://huggingface.co/Transducens/IbRo-nllb) [(Galiano Jimenez, et al.)](https://aclanthology.org/2024.wmt-1.85/),
- NLLB-3.3 ([Costa-jussà et al., 2022](https://arxiv.org/abs/2207.04672)) and [SalamandraTA-2B](https://huggingface.co/BSC-LT/salamandraTA-2B).
+ [NLLB-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) ([Costa-jussà et al., 2022](https://arxiv.org/abs/2207.04672)) and [SalamandraTA-2B](https://huggingface.co/BSC-LT/salamandraTA-2B).

<details>
<summary>English evaluation</summary>
@@ -662,7 +662,7 @@ NLLB-3.3 ([Costa-jussà et al., 2022](https://arxiv.org/abs/2207.04672)) and [Sa
|:---------------------------------|:---------|:---------|-------:|-------:|-------:|
| SalamandraTA-7b-instruct | en | ast | **31.49** | **54.01** | **60.65** |
| SalamandraTA-7b-base | en | ast | 26.4 | 64.02 | 57.35 |
- | nllb-3.3B | en | ast | 22.02 | 77.26 | 51.4 |
+ | nllb-200-3.3B | en | ast | 22.02 | 77.26 | 51.4 |
| Transducens/IbRo-nllb | en | ast | 20.56 | 63.92 | 53.32 |
| | | | | | |
| SalamandraTA-7b-instruct | en | arn | **13.04** | **87.13** | **37.56** |
@@ -687,7 +687,7 @@ NLLB-3.3 ([Costa-jussà et al., 2022](https://arxiv.org/abs/2207.04672)) and [Sa
| SalamandraTA-7b-base | es | ast | 17.65 | 75.78 | 51.05 |
| Transducens/IbRo-nllb | es | ast | 16.79 | 76.36 | 50.89 |
| SalamandraTA-2B | es | ast | 16.68 | 77.29 | 49.46 |
- | nllb-3.3B | es | ast | 11.85 | 100.86 | 40.27 |
+ | nllb-200-3.3B | es | ast | 11.85 | 100.86 | 40.27 |
| | | | | | |
| SalamandraTA-7b-base | es | arn | **29.19** | **71.85** | **49.42** |
| Transducens/IbRo-nllb | es | arn | 28.45 | 72.56 | 49.28 |
@@ -715,7 +715,7 @@ NLLB-3.3 ([Costa-jussà et al., 2022](https://arxiv.org/abs/2207.04672)) and [Sa
| SalamandraTA-7b-base | ca | ast | 26.11 | 63.63 | **58.08** |
| SalamandraTA-2B | ca | ast | 25.32 | 62.59 | 55.98 |
| Transducens/IbRo-nllb | ca | ast | 24.77 | 61.60 | 57.49 |
- | nllb-3.3B | ca | ast | 17.17 | 91.47 | 45.83 |
+ | nllb-200-3.3B | ca | ast | 17.17 | 91.47 | 45.83 |
| | | | | | |
| SalamandraTA-7b-base | ca | arn | **17.77** | **80.88** | **42.12** |
| Transducens/IbRo-nllb | ca | arn | 17.51 | 81.18 | 41.91 |
@@ -805,13 +805,12 @@ within the framework of [ILENIA Project](https://proyectoilenia.es/) with refere

### Acknowledgements

- The success of this project has been made possible thanks to the invaluable contributions of numerous research centers, teams, and projects that provided access to their data.
- Their efforts have been instrumental in advancing our work, and we sincerely appreciate their support.
- We would like to thank, among others:
- [CENID](https://cenid.es/), [CiTIUS](https://citius.gal/es/), [Gaitu proiektua](https://gaitu.eus/), [Helsinki NLP](https://github.com/Helsinki-NLP), [HiTZ](http://hitz.ehu.eus/es), [Institut d'Estudis Aranesi](http://www.institutestudisaranesi.cat/), [MaCoCu Project](https://macocu.eu/), [Machine Translate Foundation](https://machinetranslate.org/about), [NTEU Project](https://nteu.eu/), [Orai NLP technologies](https://huggingface.co/orai-nlp), [Proxecto Nós](https://nos.gal/es/proxecto-nos), [Softcatalà](https://www.softcatala.org/), [Tatoeba Project](https://tatoeba.org/), [TILDE Project](https://tilde.ai/tildelm/), [Transducens - Departament de Llenguatges i Sistemes Informàtics Universitat d'Alacant](https://transducens.dlsi.ua.es/), [Unbabel](https://huggingface.co/Unbabel).
+ The success of this project has been made possible thanks to the invaluable contributions of our partners in the [ILENIA Project](https://proyectoilenia.es/):
+ [HiTZ](http://hitz.ehu.eus/es) and [CiTIUS](https://citius.gal/es/).
+ Their efforts have been instrumental in advancing our work, and we sincerely appreciate their help and support.



### Disclaimer
Be aware that the model may contain biases or other unintended distortions.
When third parties deploy systems or provide services based on this model, or use the model themselves,
 