Update README.md
README.md
@@ -51,7 +51,7 @@ base_model:
 
 SalamandraTA-7b-instruct is a translation LLM that has been instruction-tuned from SalamandraTA-7b-base.
 The base model results from continually pre-training [Salamandra-7b](https://huggingface.co/BSC-LT/salamandra-7b) on parallel data and has not been published, but is reserved for internal use.
-SalamandraTA-7b-instruct is proficent in 37 european languages and
+SalamandraTA-7b-instruct is proficient in 37 European languages and supports translation-related tasks, namely: sentence-level translation, paragraph-level translation, document-level translation, automatic post-editing, machine translation evaluation, multi-reference translation, named-entity recognition, and context-aware translation.
 
 > [!WARNING]
 > **DISCLAIMER:** This version of Salamandra is tailored exclusively for translation tasks. It lacks chat capabilities and has not been trained with any chat instructions.
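The tasks listed in this hunk are all driven through plain-text prompts. As a minimal, hypothetical sketch (the empty system turn and the wording of the user message are assumptions for illustration, not the card's prescribed prompt), a sentence-level translation request in ChatML format could be assembled like this:

```python
# Hypothetical sketch: assembling a sentence-level translation request in
# ChatML format. The empty system turn and the user-message wording are
# illustrative assumptions, not taken from the model card.
def chatml_translation_prompt(src_lang: str, tgt_lang: str, text: str) -> str:
    user_msg = (
        f"Translate the following text from {src_lang} into {tgt_lang}.\n"
        f"{src_lang}: {text}\n"
        f"{tgt_lang}:"
    )
    return (
        "<|im_start|>system\n<|im_end|>\n"
        f"<|im_start|>user\n{user_msg}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = chatml_translation_prompt("Spanish", "English", "Hola, mundo.")
```

The generated string ends at the opening of the assistant turn, so the model's continuation is the translation itself.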
@@ -135,7 +135,7 @@ Aragonese, Aranese, Asturian, Basque, Bulgarian, Croatian, Czech, Danish, Dutch,
 Irish, Italian, Latvian, Lithuanian, Maltese, Norwegian Bokmål, Norwegian Nynorsk, Occitan, Polish, Portuguese, Romanian, Russian, Serbian, Slovak,
 Slovenian, Spanish, Swedish, Ukrainian, Valencian, Welsh.
 
-The instruction-following model
+The instruction-following model uses the commonly adopted ChatML template:
 
 ```
 <|im_start|>system
@@ -322,7 +322,7 @@ including all of the official European languages plus Catalan, Basque, Galician,
 It amounts to 6,574,251,526 parallel sentence pairs.
 
 This highly multilingual corpus is predominantly composed of data sourced from [OPUS](https://opus.nlpl.eu/),
-with additional data taken from the [NTEU
+with additional data taken from the [NTEU Project](https://nteu.eu/), [Aina Project](https://projecteaina.cat/), and other sources
 (see: [Data Sources](#pre-data-sources) and [References](#pre-references)).
 Where little parallel Catalan <-> xx data could be found, synthetic Catalan data was generated from the Spanish side of the collected Spanish <-> xx corpora using
 [Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca). The final distribution of languages was as below:
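The pivoting step this hunk describes can be sketched as follows; `make_synthetic_ca_pairs` and `translate_es_to_ca` are hypothetical names, with the translator stood in by a toy function rather than the actual es-ca model:

```python
# Sketch of the synthetic-data pivot described above: where Catalan<->xx
# parallel data is scarce, the Spanish side of collected es<->xx pairs is
# machine-translated into Catalan, yielding synthetic ca<->xx pairs.
# `translate_es_to_ca` is a placeholder for Projecte Aina's es-ca model.
def make_synthetic_ca_pairs(es_xx_pairs, translate_es_to_ca):
    return [(translate_es_to_ca(es), xx) for es, xx in es_xx_pairs]

# Toy usage with str.upper standing in for the real translator:
pairs = make_synthetic_ca_pairs([("hola", "hello")], str.upper)
```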
@@ -475,7 +475,7 @@ Click the expand button below to see the full list of tasks included in the fine
 | Context-Aware Translation | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2): [MT-GenEval](https://github.com/amazon-science/machine-translation-gender-eval) | en-de | 558 |
 |**Total** | | | **135,404** |
 
-The non-public portion of this dataset was jointly created by the [ILENIA](https://proyectoilenia.es/) partners BSC, [HiTZ](http://hitz.ehu.eus/es),
+The non-public portion of this dataset was jointly created by the [ILENIA](https://proyectoilenia.es/) partners: BSC-LT, [HiTZ](http://hitz.ehu.eus/es),
 and [CiTIUS](https://citius.gal/es/). For further information regarding the instruction-tuning data,
 please contact <langtech@bsc.es>.
 
@@ -506,9 +506,9 @@ please contact <langtech@bsc.es>.
 
 Below are the evaluation results on the [Flores+200 devtest set](https://huggingface.co/datasets/openlanguagedata/flores_plus),
 compared against the state-of-the-art MADLAD400-7B model ([Kudugunta, S., et al.](https://arxiv.org/abs/2309.04662)) and SalamandraTA-7b-base model.
-These results cover translation directions
+These results cover the translation directions CA-XX, ES-XX, EN-XX, as well as XX-CA, XX-ES, and XX-EN.
 The metrics have been computed excluding Asturian, Aranese, and Aragonese, as we report them separately.
-The evaluation was conducted using [MT Lens](https://github.com/langtech-bsc/mt-evaluation) following the standard setting (beam search with beam size 5, limiting the translation length to 500 tokens). We report the following metrics:
+The evaluation was conducted using [MT Lens](https://github.com/langtech-bsc/mt-evaluation), following the standard setting (beam search with beam size 5, limiting the translation length to 500 tokens). We report the following metrics:
 
 <details>
 <summary>Click to show metrics details</summary>
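The decoding setting named in this hunk (beam search with beam size 5, translations capped at 500 tokens) maps onto Hugging Face `generate()` keyword arguments roughly as below. The repo id is assumed from the card's title, and the loading calls are left commented out because they pull a 7B checkpoint:

```python
# The evaluation decoding setting described above, expressed as kwargs for
# transformers' generate(): beam search with 5 beams, no sampling, and a
# 500-token cap on the generated translation.
GEN_KWARGS = {
    "num_beams": 5,
    "do_sample": False,
    "max_new_tokens": 500,
}

# Heavyweight part, commented out; the repo id is an assumption:
# from transformers import AutoModelForCausalLM, AutoTokenizer
# tok = AutoTokenizer.from_pretrained("BSC-LT/salamandraTA-7b-instruct")
# model = AutoModelForCausalLM.from_pretrained("BSC-LT/salamandraTA-7b-instruct")
# inputs = tok(prompt, return_tensors="pt")
# output_ids = model.generate(**inputs, **GEN_KWARGS)
```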
@@ -651,7 +651,7 @@ This section presents the evaluation metrics for Basque translation tasks.
 
 The tables below summarize the performance metrics for English, Spanish, and Catalan to Asturian, Aranese and Aragonese compared
 against [Transducens/IbRo-nllb](https://huggingface.co/Transducens/IbRo-nllb) [(Galiano Jimenez, et al.)](https://aclanthology.org/2024.wmt-1.85/),
-NLLB-3.3 ([Costa-jussà et al., 2022](https://arxiv.org/abs/2207.04672)) and [SalamandraTA-2B](https://huggingface.co/BSC-LT/salamandraTA-2B).
+[NLLB-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) ([Costa-jussà et al., 2022](https://arxiv.org/abs/2207.04672)) and [SalamandraTA-2B](https://huggingface.co/BSC-LT/salamandraTA-2B).
 
 <details>
 <summary>English evaluation</summary>
@@ -662,7 +662,7 @@ NLLB-3.3 ([Costa-jussà et al., 2022](https://arxiv.org/abs/2207.04672)) and [Sa
 |:---------------------------------|:---------|:---------|-------:|-------:|-------:|
 | SalamandraTA-7b-instruct | en | ast | **31.49** | **54.01** | **60.65** |
 | SalamandraTA-7b-base | en | ast | 26.4 | 64.02 | 57.35 |
-| nllb-3.3B | en | ast | 22.02 | 77.26 | 51.4 |
+| nllb-200-3.3B | en | ast | 22.02 | 77.26 | 51.4 |
 | Transducens/IbRo-nllb | en | ast | 20.56 | 63.92 | 53.32 |
 | | | | | | |
 | SalamandraTA-7b-instruct | en | arn | **13.04** | **87.13** | **37.56** |
@@ -687,7 +687,7 @@ NLLB-3.3 ([Costa-jussà et al., 2022](https://arxiv.org/abs/2207.04672)) and [Sa
 | SalamandraTA-7b-base | es | ast | 17.65 | 75.78 | 51.05 |
 | Transducens/IbRo-nllb | es | ast | 16.79 | 76.36 | 50.89 |
 | SalamandraTA-2B | es | ast | 16.68 | 77.29 | 49.46 |
-| nllb-3.3B | es | ast | 11.85 | 100.86 | 40.27 |
+| nllb-200-3.3B | es | ast | 11.85 | 100.86 | 40.27 |
 | | | | | | |
 | SalamandraTA-7b-base | es | arn | **29.19** | **71.85** | **49.42** |
 | Transducens/IbRo-nllb | es | arn | 28.45 | 72.56 | 49.28 |
@@ -715,7 +715,7 @@ NLLB-3.3 ([Costa-jussà et al., 2022](https://arxiv.org/abs/2207.04672)) and [Sa
 | SalamandraTA-7b-base | ca | ast | 26.11 | 63.63 | **58.08** |
 | SalamandraTA-2B | ca | ast | 25.32 | 62.59 | 55.98 |
 | Transducens/IbRo-nllb | ca | ast | 24.77 | 61.60 | 57.49 |
-| nllb-3.3B | ca | ast | 17.17 | 91.47 | 45.83 |
+| nllb-200-3.3B | ca | ast | 17.17 | 91.47 | 45.83 |
 | | | | | | |
 | SalamandraTA-7b-base | ca | arn | **17.77** | **80.88** | **42.12** |
 | Transducens/IbRo-nllb | ca | arn | 17.51 | 81.18 | 41.91 |
@@ -805,13 +805,12 @@ within the framework of [ILENIA Project](https://proyectoilenia.es/) with refere
 
 ### Acknowledgements
 
-The success of this project has been made possible thanks to the invaluable contributions of
-
-
-[CENID](https://cenid.es/), [CiTIUS](https://citius.gal/es/), [Gaitu proiektua](https://gaitu.eus/), [Helsinki NLP](https://github.com/Helsinki-NLP), [HiTZ](http://hitz.ehu.eus/es), [Institut d’Estudis Aranesi](http://www.institutestudisaranesi.cat/), [MaCoCu Project](https://macocu.eu/), [Machine Translate Foundation](https://machinetranslate.org/about), [NTEU Project](https://nteu.eu/), [Orai NLP technologies](https://huggingface.co/orai-nlp), [Proxecto Nós](https://nos.gal/es/proxecto-nos), [Softcatalà](https://www.softcatala.org/), [Tatoeba Project](https://tatoeba.org/), [TILDE Project](https://tilde.ai/tildelm/), [Transducens - Departament de Llenguatges i Sistemes Informàtics Universitat d’Alacant](https://transducens.dlsi.ua.es/), [Unbabel](https://huggingface.co/Unbabel).
+The success of this project has been made possible thanks to the invaluable contributions of our partners in the [ILENIA Project](https://proyectoilenia.es/):
+[HiTZ](http://hitz.ehu.eus/es) and [CiTIUS](https://citius.gal/es/).
+Their efforts have been instrumental in advancing our work, and we sincerely appreciate their help and support.
 
 
 
 ### Disclaimer
 Be aware that the model may contain biases or other unintended distortions.
 When third parties deploy systems or provide services based on this model, or use the model themselves,