BSC-LT/salamandraTA-7b-instruct

@@ -49,7 +49,9 @@ base_model:
 # Salamandra Model Card
-SalamandraTA-7b-instruct is a translation LLM that has been instruction-tuned from SalamandraTA-7b-base. The base model results from continually pre-training [Salamandra-7b](https://huggingface.co/BSC-LT/salamandra-7b) on parallel data. The model is proficent in 37 european languages and support translation-related tasks, namely: sentence-level-translation, paragraph-level-translation, document-level-translation, automatic post-editing, machine translation evaluation, multi-reference-translation, named-entity-recognition and context-aware translation.
 > [!WARNING]
 > **DISCLAIMER:** This version of Salamandra is tailored exclusively for translation tasks. It lacks chat capabilities and has not been trained with any chat instructions.
@@ -129,7 +131,9 @@ The accelerated partition is composed of 1,120 nodes with the following specific
 You can translate between the following 37 languages:
-Aragonese, Aranese, Asturian, Basque, Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Norwegian Bokmål, Norwegian Nynorsk, Occitan, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Ukrainian, Valencian, Welsh.
 The instruction-following model uses the commonly adopted ChatML template:
@@ -194,7 +198,7 @@ Using this template, each turn is preceded by a `<|im_start|>` delimiter and the
 #### General translation
-For machine translation tasks you can use the following prompt template:
 ```
 Translate the following text from {source} into {target}.
@@ -217,7 +221,7 @@ text = f"Translate the following text from {source} into {target}.\n{source}: {s
 ### Post-editing
-For post-editing tasks you can use the following prompt template:
 ```
 Please fix any mistakes in the following {source}-{target} machine translation or keep it unedited if it's correct.
@@ -244,7 +248,7 @@ text = f"Please fix any mistakes in the following {source}-{target} machine tran
 ### Document-level translation
-For document-level translation tasks you can use the following prompt template:
 ```
 Please translate this text from {source} into {target}.
@@ -274,7 +278,7 @@ The Farm Workforce Modernization Act of 2023, which could grant legal status to
 ### Named-entity recognition
-For named-entity recognition tasks you can use the following prompt template:
 ```
 Analyse the following tokenized text and mark the tokens containing named entities.
@@ -313,10 +317,12 @@ Marked: """
 ### Pretraining Data
-The training corpus consists of 424 billion tokens of Catalan-, Spanish-centric, and English-centric parallel data, including all of the official European languages plus Catalan, Basque,
-Galician, Asturian, Aragonese and Aranese. It amounts to 6,574,251,526 parallel sentence pairs.
-This highly multilingual corpus is predominantly composed of data sourced from [OPUS](https://opus.nlpl.eu/), with additional data taken from the [NTEU project](https://nteu.eu/), Project Aina’s existing corpora, and our own curated datasets..
 Where little parallel Catalan <-> xx data could be found, synthetic Catalan data was generated from the Spanish side of the collected Spanish <-> xx corpora using
 [Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca). The final distribution of languages was as below:
@@ -331,24 +337,24 @@ Click the expand button below to see the full list of corpora included in the tr
 |-----------------------------------------------|----------------------------------------------------------------|-----------------------------------------------|----------------------------------------------------------------|
 |[AINA](https://huggingface.co/projecte-aina) | en     |       |       |
 |ARANESE-SYNTH-CORPUS-BSC                         | arn   |      |         |
-|BOUA-BSC             |     | val |        |
 |[BOUMH](https://github.com/transducens/PILAR/tree/main/valencian/BOUMH) |          | val   |           |
 |[BOUA-PILAR](https://github.com/transducens/PILAR/tree/main/valencian/BOUA)  |          | val |       |
 |[CCMatrix](https://opus.nlpl.eu/CCMatrix/corpus/version/CCMatrix)		|eu			|		| ga |
 |[DGT](https://opus.nlpl.eu/DGT/corpus/version/DGT)			|			|bg,cs,da,de,el	,et,fi,fr,ga,hr,hu,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv	|    da,et,ga,hr,hu,lt,lv,mt,sh,sl|
-|DOGV-BSC                 |      |  val    |         |
 |[DOGV-PILAR](https://github.com/transducens/PILAR/tree/main/valencian/DOGV-html) |           | val |            |
 |[ELRC-EMEA](https://opus.nlpl.eu/ELRC-EMEA/corpus/version/ELRC-EMEA)		|			|bg,cs,da,hu,lt,lv,mt,pl,ro,sk,sl		| et,hr,lv,ro,sk,sl |
 |[EMEA](https://opus.nlpl.eu/EMEA/corpus/version/EMEA)			|			|bg,cs,da,el,fi,hu,lt,mt,nl,pl,ro,sk,sl,sv		|    et,mt  |
 |[EUBookshop](https://opus.nlpl.eu/EUbookshop/corpus/version/EUbookshop)		|lt,pl,pt			|cs,da,de,el,fi,fr,ga,it,lv,mt,nl,pl,pt,ro,sk,sl,sv		|cy,ga|
 |[Europarl](https://opus.nlpl.eu/Europarl/corpus/version/Europarl)		|			|bg,cs,da,el,en,fi,fr,hu,lt,lv,nl,pl,pt	,ro,sk,sl,sv	| |
 |[Europat](https://opus.nlpl.eu/EuroPat/corpus/version/EuroPat)		|			|en,hr		| no  |
-|[GAITU](https://gaitu.eus/) | | | eu|
 |[KDE4](https://opus.nlpl.eu/KDE4/corpus/version/KDE4)			|bg,cs,da,de,el	,et,eu,fi,fr,ga,gl,hr,it,lt,lv,nl,pl,pt,ro,sk,sl,sv	|bg,ga,hr	|cy,ga,nn,oc |
 |[GlobalVoices](https://opus.nlpl.eu/GlobalVoices/corpus/version/GlobalVoices)		| bg,de,fr,it,nl,pl,pt	|bg,de,fr,pt		|  |
 |[GNOME](https://opus.nlpl.eu/GNOME/corpus/version/GNOME)		|eu,fr,ga,gl,pt		|ga		|cy,ga,nn|
 |[JRC-Acquis](https://opus.nlpl.eu/JRC-Acquis/corpus/version/JRC-Acquis)		|			|cs,da,et,fr,lt,lv,mt,nl,pl	,ro,sv|	 et  |
-|LES-CORTS-VALENCIANES-BSC  |            | val            |           |
 |[MaCoCu](https://opus.nlpl.eu/MaCoCu/corpus/version/MaCoCu)                    | en     |     | hr,mt,uk   |
 |[MultiCCAligned](https://opus.nlpl.eu/JRC-Acquis/corpus/version/JRC-Acquis)	|bg,cs,de,el,et,fi,fr,hr,hu,it,lt,lv,nl,pl,ro,sk,sv	|bg,fi,fr,hr,it,lv,nl,pt		|bg,cy,da,et,fi,hr,hu,lt,lv,no,sl,sr,uk|
 |[MultiHPLT](https://opus.nlpl.eu/MultiHPLT/corpus/version/MultiHPLT)		|en, et,fi,ga,hr,mt		|		|fi,ga,gl,hr,mt,nn,sr |
@@ -356,7 +362,7 @@ Click the expand button below to see the full list of corpora included in the tr
 |[MultiUN](https://opus.nlpl.eu/MultiUN/corpus/version/MultiUN)		|			|fr	|	|
 |[News-Commentary](https://opus.nlpl.eu/News-Commentary/corpus/version/News-Commentary) 	|		|fr		|  |
 |[NLLB](https://opus.nlpl.eu/NLLB/corpus/version/NLLB)			|bg,da,el,en,et,fi,fr,gl,hu,it	,lt,lv,pt,ro,sk,sl	|bg,cs,da,de,el	,et,fi,fr,hu,it,lt,lv,nl,pl,pt	,ro,sk,sl,sv| bg,cs,cy,da,de,el,et,fi,fr,ga,hr,hu,it,lt,lv,mt,nl,no,oc,pl,pt,ro,ru,sk,sl,sr,sv,uk|
-|[NÓS](https://zenodo.org/records/7675110)                 |               |               |    gl      |
 |[NÓS-SYN](https://zenodo.org/records/7685180)            |                |               |   gl       |
 |[NTEU](https://www.elrc-share.eu/repository/search/?q=NTEU)			|			|bg,cs,da,de,el,en,et,fi,fr,ga,hr,hu,it,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv	|        da,et,ga,hr,lt,lv,mt,ro,sk,sl,sv     |
 |[OpenSubtitles](https://opus.nlpl.eu/OpenSubtitles/corpus/version/OpenSubtitles) 	|bg,cs,da,de,el	,et,eu,fi,gl,hr,hu,lt,lv,nl,pl,pt,ro,sk,sl,sv	|da,de,fi,fr,hr,hu,it,lv,nl		| bg,cs,de,el,et,hr,fi,fr,hr,hu,no,sl,sr|
@@ -365,14 +371,15 @@ Click the expand button below to see the full list of corpora included in the tr
 |[Tatoeba](https://opus.nlpl.eu/Tatoeba/corpus/version/Tatoeba)		|de,pt			|pt		|   |
 |[TildeModel](https://opus.nlpl.eu/TildeMODEL/corpus/version/TildeMODEL)		|			|bg		| et,hr,lt,lv,mt |
 |[UNPC](https://opus.nlpl.eu/UNPC/corpus/version/UNPC)			|			|en,fr		| ru  |
-|[VALENCIAN-AUTH](https://github.com/transducens/PILAR/tree/main/valencian/Generalitat)  |          |    val     |          |
-|[VALENCIAN-SYNTH](https://github.com/transducens/PILAR/tree/main/valencian/Generalitat)  |      | val |     |
 |[WikiMatrix](https://opus.nlpl.eu/WikiMatrix/corpus/version/WikiMatrix)		|bg,cs,da,de,el	,et,eu,fi,fr,gl,hr,hu,it,lt,nl,pl,pt,ro,sk,sl,sv	|bg,en,fr,hr,it,pt		| oc,sh |
 |[Wikimedia](https://opus.nlpl.eu/wikimedia/corpus/version/wikimedia) | | |cy,nn |
 |[XLENT](https://opus.nlpl.eu/XLEnt/corpus/version/XLEnt)		|eu,ga,gl			|ga		|cy,et,ga,gl,hr,oc,sh|
-Datasets marked with "BSC" (e.g., BOUA-BSC, DOGV-BSC) are synthetic data generated using our own seq-to-seq models and are for internal use only.
 To consult the data summary document with the respective licences, please send an e-mail to ipr@bsc.es.
@@ -411,9 +418,13 @@ To consult the data summary document with the respective licences, please send a
 ### Instruction Tuning Data
-This model has been fine-tuned on ~135k instructions, primarily targeting machine translation performance for Catalan, English, and Spanish. Additional instruction data for other European and closely related Iberian languages was also included, as it yielded a positive impact on the languages of interest. That said, the performance in these additional languages is not guaranteed due to the limited amount of available data and the lack of resources for thorough testing.
-A portion of our fine-tuning data comes directly from, or is sampled from [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2). We also created additional datasets for our main languages of interest. While tasks relating to machine translation are included, it’s important to note that no chat data was used in the fine-tuning process.
 Click the expand button below to see the full list of tasks included in the finetuning data.
@@ -459,7 +470,8 @@ Click the expand button below to see the full list of tasks included in the fine
 | Context-Aware Translation   | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2): [MT-GenEval](https://github.com/amazon-science/machine-translation-gender-eval)                     | en-de                                                          | 558    |
 |**Total**                  |                    |          |     **135,404**       |
-The non-public portion of this dataset was jointly created by BSC, HiTZ, and CiTIUS. For further information regarding the instruction-tuning data, please contact <langtech@bsc.es>.
 </details>

 # Salamandra Model Card
+SalamandraTA-7b-instruct is a translation LLM that has been instruction-tuned from SalamandraTA-7b-base.
+The base model results from continually pre-training [Salamandra-7b](https://huggingface.co/BSC-LT/salamandra-7b) on parallel data; it has not been published and is reserved for internal use.
+The model is proficient in 37 European languages and supports translation-related tasks, namely: sentence-level translation, paragraph-level translation, document-level translation, automatic post-editing, machine translation evaluation, multi-reference translation, named-entity recognition, and context-aware translation.
 > [!WARNING]
 > **DISCLAIMER:** This version of Salamandra is tailored exclusively for translation tasks. It lacks chat capabilities and has not been trained with any chat instructions.
 You can translate between the following 37 languages:
+Aragonese, Aranese, Asturian, Basque, Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hungarian,
+Irish, Italian, Latvian, Lithuanian, Maltese, Norwegian Bokmål, Norwegian Nynorsk, Occitan, Polish, Portuguese, Romanian, Russian, Serbian, Slovak,
+Slovenian, Spanish, Swedish, Ukrainian, Valencian, Welsh.
 The instruction-following model uses the commonly adopted ChatML template:
 #### General translation
+For machine translation tasks, you can use the following prompt template:
 ```
 Translate the following text from {source} into {target}.
 ### Post-editing
+For post-editing tasks, you can use the following prompt template:
 ```
 Please fix any mistakes in the following {source}-{target} machine translation or keep it unedited if it's correct.
 ### Document-level translation
+For document-level translation tasks, you can use the following prompt template:
 ```
 Please translate this text from {source} into {target}.
 ### Named-entity recognition
+For named-entity recognition tasks, you can use the following prompt template:
 ```
 Analyse the following tokenized text and mark the tokens containing named entities.
 ### Pretraining Data
+The pretraining corpus consists of 424 billion tokens of Catalan-centric, Spanish-centric, and English-centric parallel data,
+including all of the official European languages plus Catalan, Basque, Galician, Asturian, Aragonese and Aranese.
+It amounts to 6,574,251,526 parallel sentence pairs.
+This highly multilingual corpus is predominantly composed of data sourced from [OPUS](https://opus.nlpl.eu/),
+with additional data taken from the [NTEU project](https://nteu.eu/), Projecte Aina’s corpora, and other sources (see Data Sources and References below).
 Where little parallel Catalan <-> xx data could be found, synthetic Catalan data was generated from the Spanish side of the collected Spanish <-> xx corpora using
 [Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca). The final distribution of languages was as below:
 |-----------------------------------------------|----------------------------------------------------------------|-----------------------------------------------|----------------------------------------------------------------|
 |[AINA](https://huggingface.co/projecte-aina) | en     |       |       |
 |ARANESE-SYNTH-CORPUS-BSC                         | arn   |      |         |
+|BOUA-SYNTH-BSC             |     | val |        |
 |[BOUMH](https://github.com/transducens/PILAR/tree/main/valencian/BOUMH) |          | val   |           |
 |[BOUA-PILAR](https://github.com/transducens/PILAR/tree/main/valencian/BOUA)  |          | val |       |
 |[CCMatrix](https://opus.nlpl.eu/CCMatrix/corpus/version/CCMatrix)		|eu			|		| ga |
 |[DGT](https://opus.nlpl.eu/DGT/corpus/version/DGT)			|			|bg,cs,da,de,el	,et,fi,fr,ga,hr,hu,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv	|    da,et,ga,hr,hu,lt,lv,mt,sh,sl|
+|DOGV-SYNTH-BSC                 |      |  val    |         |
 |[DOGV-PILAR](https://github.com/transducens/PILAR/tree/main/valencian/DOGV-html) |           | val |            |
 |[ELRC-EMEA](https://opus.nlpl.eu/ELRC-EMEA/corpus/version/ELRC-EMEA)		|			|bg,cs,da,hu,lt,lv,mt,pl,ro,sk,sl		| et,hr,lv,ro,sk,sl |
 |[EMEA](https://opus.nlpl.eu/EMEA/corpus/version/EMEA)			|			|bg,cs,da,el,fi,hu,lt,mt,nl,pl,ro,sk,sl,sv		|    et,mt  |
 |[EUBookshop](https://opus.nlpl.eu/EUbookshop/corpus/version/EUbookshop)		|lt,pl,pt			|cs,da,de,el,fi,fr,ga,it,lv,mt,nl,pl,pt,ro,sk,sl,sv		|cy,ga|
 |[Europarl](https://opus.nlpl.eu/Europarl/corpus/version/Europarl)		|			|bg,cs,da,el,en,fi,fr,hu,lt,lv,nl,pl,pt	,ro,sk,sl,sv	| |
 |[Europat](https://opus.nlpl.eu/EuroPat/corpus/version/EuroPat)		|			|en,hr		| no  |
+|[GAITU Corpus](https://gaitu.eus/) | | | eu|
 |[KDE4](https://opus.nlpl.eu/KDE4/corpus/version/KDE4)			|bg,cs,da,de,el	,et,eu,fi,fr,ga,gl,hr,it,lt,lv,nl,pl,pt,ro,sk,sl,sv	|bg,ga,hr	|cy,ga,nn,oc |
 |[GlobalVoices](https://opus.nlpl.eu/GlobalVoices/corpus/version/GlobalVoices)		| bg,de,fr,it,nl,pl,pt	|bg,de,fr,pt		|  |
 |[GNOME](https://opus.nlpl.eu/GNOME/corpus/version/GNOME)		|eu,fr,ga,gl,pt		|ga		|cy,ga,nn|
 |[JRC-Acquis](https://opus.nlpl.eu/JRC-Acquis/corpus/version/JRC-Acquis)		|			|cs,da,et,fr,lt,lv,mt,nl,pl	,ro,sv|	 et  |
+|LES-CORTS-VALENCIANES-SYNTH-BSC  |            | val            |           |
 |[MaCoCu](https://opus.nlpl.eu/MaCoCu/corpus/version/MaCoCu)                    | en     |     | hr,mt,uk   |
 |[MultiCCAligned](https://opus.nlpl.eu/JRC-Acquis/corpus/version/JRC-Acquis)	|bg,cs,de,el,et,fi,fr,hr,hu,it,lt,lv,nl,pl,ro,sk,sv	|bg,fi,fr,hr,it,lv,nl,pt		|bg,cy,da,et,fi,hr,hu,lt,lv,no,sl,sr,uk|
 |[MultiHPLT](https://opus.nlpl.eu/MultiHPLT/corpus/version/MultiHPLT)		|en, et,fi,ga,hr,mt		|		|fi,ga,gl,hr,mt,nn,sr |
 |[MultiUN](https://opus.nlpl.eu/MultiUN/corpus/version/MultiUN)		|			|fr	|	|
 |[News-Commentary](https://opus.nlpl.eu/News-Commentary/corpus/version/News-Commentary) 	|		|fr		|  |
 |[NLLB](https://opus.nlpl.eu/NLLB/corpus/version/NLLB)			|bg,da,el,en,et,fi,fr,gl,hu,it	,lt,lv,pt,ro,sk,sl	|bg,cs,da,de,el	,et,fi,fr,hu,it,lt,lv,nl,pl,pt	,ro,sk,sl,sv| bg,cs,cy,da,de,el,et,fi,fr,ga,hr,hu,it,lt,lv,mt,nl,no,oc,pl,pt,ro,ru,sk,sl,sr,sv,uk|
+|[NÓS Corpus](https://zenodo.org/records/7675110)                 |               |               |    gl      |
 |[NÓS-SYN](https://zenodo.org/records/7685180)            |                |               |   gl       |
 |[NTEU](https://www.elrc-share.eu/repository/search/?q=NTEU)			|			|bg,cs,da,de,el,en,et,fi,fr,ga,hr,hu,it,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv	|        da,et,ga,hr,lt,lv,mt,ro,sk,sl,sv     |
 |[OpenSubtitles](https://opus.nlpl.eu/OpenSubtitles/corpus/version/OpenSubtitles) 	|bg,cs,da,de,el	,et,eu,fi,gl,hr,hu,lt,lv,nl,pl,pt,ro,sk,sl,sv	|da,de,fi,fr,hr,hu,it,lv,nl		| bg,cs,de,el,et,hr,fi,fr,hr,hu,no,sl,sr|
 |[Tatoeba](https://opus.nlpl.eu/Tatoeba/corpus/version/Tatoeba)		|de,pt			|pt		|   |
 |[TildeModel](https://opus.nlpl.eu/TildeMODEL/corpus/version/TildeMODEL)		|			|bg		| et,hr,lt,lv,mt |
 |[UNPC](https://opus.nlpl.eu/UNPC/corpus/version/UNPC)			|			|en,fr		| ru  |
+|[PILAR-VALENCIAN-AUTH](https://github.com/transducens/PILAR/tree/main/valencian/Generalitat)  |          |    val     |          |
+|[PILAR-VALENCIAN-SYNTH](https://github.com/transducens/PILAR/tree/main/valencian/Generalitat)  |      | val |     |
 |[WikiMatrix](https://opus.nlpl.eu/WikiMatrix/corpus/version/WikiMatrix)		|bg,cs,da,de,el	,et,eu,fi,fr,gl,hr,hu,it,lt,nl,pl,pt,ro,sk,sl,sv	|bg,en,fr,hr,it,pt		| oc,sh |
 |[Wikimedia](https://opus.nlpl.eu/wikimedia/corpus/version/wikimedia) | | |cy,nn |
 |[XLENT](https://opus.nlpl.eu/XLEnt/corpus/version/XLEnt)		|eu,ga,gl			|ga		|cy,et,ga,gl,hr,oc,sh|
+Datasets with "-BSC" in their names (e.g., BOUA-SYNTH-BSC, DOGV-SYNTH-BSC) are synthetic datasets obtained by machine translating
+pre-existing monolingual corpora with our own seq-to-seq models. These datasets were generated internally for model training and are not published.
 To consult the data summary document with the respective licences, please send an e-mail to ipr@bsc.es.
 ### Instruction Tuning Data
+This model has been fine-tuned on ~135k instructions, primarily targeting machine translation performance for Catalan, English, and Spanish.
+Additional instruction data for other European and closely related Iberian languages was also included, as it yielded a positive impact on the languages of interest.
+That said, the performance in these additional languages is not guaranteed due to the limited amount of available data and the lack of resources for thorough testing.
+A portion of our fine-tuning data comes directly from, or is sampled from, [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2).
+We also created additional datasets for our main languages of interest.
+While tasks relating to machine translation are included, it’s important to note that no chat data was used in the fine-tuning process.
 Click the expand button below to see the full list of tasks included in the finetuning data.
 | Context-Aware Translation   | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2): [MT-GenEval](https://github.com/amazon-science/machine-translation-gender-eval)                     | en-de                                                          | 558    |
 |**Total**                  |                    |          |     **135,404**       |
+The non-public portion of this dataset was jointly created by the [ILENIA](https://proyectoilenia.es/) partners BSC, HiTZ, and CiTIUS. For further information regarding the instruction-tuning data,
+please contact <langtech@bsc.es>.
 </details>
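
As a rough illustration of how the general translation template quoted earlier in this diff might be combined with the ChatML format, the sketch below fills in the instruction line and wraps it in a single user turn. Only the instruction sentence and the `<|im_start|>` delimiter come from the model card text shown here. The helper name `build_translation_prompt`, the `user`/`assistant` role names, the `<|im_end|>` delimiter, and the layout of the lines after the instruction are assumptions based on the common ChatML convention, so the tokenizer's own chat template should be treated as the authoritative source.

```python
def build_translation_prompt(source: str, target: str, text: str) -> str:
    """Fill the general translation template and wrap it in one ChatML user turn.

    Hypothetical helper for illustration; not part of the model card.
    """
    # Instruction sentence taken from the "General translation" template.
    instruction = f"Translate the following text from {source} into {target}."

    # Assumption: the source text and an empty target slot follow the
    # instruction on separate lines. The exact layout is truncated in the diff.
    user_message = f"{instruction}\n{source}: {text}\n{target}:"

    # Assumption: standard ChatML framing with <|im_end|> closing the turn and
    # an open assistant turn for the model to complete.
    return (
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


if __name__ == "__main__":
    print(build_translation_prompt("Spanish", "English", "Hola, ¿cómo estás?"))
```

Building the string by hand like this is mainly useful for inspecting what the model would see; when loading the model with a tokenizer that ships a chat template, letting the tokenizer format the turn is likely the safer choice.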