Update README.md
README.md
CHANGED
@@ -49,7 +49,9 @@ base_model:
|
|
49 |
|
50 |
# Salamandra Model Card
|
51 |
|
52 |
-
SalamandraTA-7b-instruct is a translation LLM that has been instruction-tuned from SalamandraTA-7b-base.
|
|
|
|
|
53 |
|
54 |
> [!WARNING]
|
55 |
> **DISCLAIMER:** This version of Salamandra is tailored exclusively for translation tasks. It lacks chat capabilities and has not been trained with any chat instructions.
|
@@ -129,7 +131,9 @@ The accelerated partition is composed of 1,120 nodes with the following specific
|
|
129 |
|
130 |
You can translate between the following 37 languages:
|
131 |
|
132 |
-
Aragonese, Aranese, Asturian, Basque, Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hungarian,
|
|
|
|
|
133 |
|
134 |
The instruction-following model uses the commonly adopted ChatML template:
|
135 |
|
@@ -194,7 +198,7 @@ Using this template, each turn is preceded by a `<|im_start|>` delimiter and the
|
|
194 |
|
195 |
#### General translation
|
196 |
|
197 |
-
For machine translation tasks you can use the following prompt template:
|
198 |
|
199 |
```
|
200 |
Translate the following text from {source} into {target}.
|
@@ -217,7 +221,7 @@ text = f"Translate the following text from {source} into {target}.\n{source}: {s
|
|
217 |
|
218 |
### Post-editing
|
219 |
|
220 |
-
For post-editing tasks you can use the following prompt template:
|
221 |
|
222 |
```
|
223 |
Please fix any mistakes in the following {source}-{target} machine translation or keep it unedited if it's correct.
|
@@ -244,7 +248,7 @@ text = f"Please fix any mistakes in the following {source}-{target} machine tran
|
|
244 |
|
245 |
### Document-level translation
|
246 |
|
247 |
-
For document-level translation tasks you can use the following prompt template:
|
248 |
|
249 |
```
|
250 |
Please translate this text from {source} into {target}.
|
@@ -274,7 +278,7 @@ The Farm Workforce Modernization Act of 2023, which could grant legal status to
|
|
274 |
|
275 |
### Named-entity recognition
|
276 |
|
277 |
-
For named-entity recognition tasks you can use the following prompt template:
|
278 |
|
279 |
```
|
280 |
Analyse the following tokenized text and mark the tokens containing named entities.
|
@@ -313,10 +317,12 @@ Marked: """
|
|
313 |
|
314 |
### Pretraining Data
|
315 |
|
316 |
-
The
|
317 |
-
|
|
|
318 |
|
319 |
-
This highly multilingual corpus is predominantly composed of data sourced from [OPUS](https://opus.nlpl.eu/),
|
|
|
320 |
Where little parallel Catalan <-> xx data could be found, synthetic Catalan data was generated from the Spanish side of the collected Spanish <-> xx corpora using
|
321 |
[Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca). The final distribution of languages was as below:
|
322 |
|
@@ -331,24 +337,24 @@ Click the expand button below to see the full list of corpora included in the tr
|
|
331 |
|-----------------------------------------------|----------------------------------------------------------------|-----------------------------------------------|----------------------------------------------------------------|
|
332 |
|[AINA](https://huggingface.co/projecte-aina) | en | | |
|
333 |
|ARANESE-SYNTH-CORPUS-BSC | arn | | |
|
334 |
-
|BOUA-BSC | | val | |
|
335 |
|[BOUMH](https://github.com/transducens/PILAR/tree/main/valencian/BOUMH) | | val | |
|
336 |
|[BOUA-PILAR](https://github.com/transducens/PILAR/tree/main/valencian/BOUA) | | val | |
|
337 |
|[CCMatrix](https://opus.nlpl.eu/CCMatrix/corpus/version/CCMatrix) |eu | | ga |
|
338 |
|[DGT](https://opus.nlpl.eu/DGT/corpus/version/DGT) | |bg,cs,da,de,el ,et,fi,fr,ga,hr,hu,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv | da,et,ga,hr,hu,lt,lv,mt,sh,sl|
|
339 |
-
|DOGV-BSC | | val | |
|
340 |
|[DOGV-PILAR](https://github.com/transducens/PILAR/tree/main/valencian/DOGV-html) | | val | |
|
341 |
|[ELRC-EMEA](https://opus.nlpl.eu/ELRC-EMEA/corpus/version/ELRC-EMEA) | |bg,cs,da,hu,lt,lv,mt,pl,ro,sk,sl | et,hr,lv,ro,sk,sl |
|
342 |
|[EMEA](https://opus.nlpl.eu/EMEA/corpus/version/EMEA) | |bg,cs,da,el,fi,hu,lt,mt,nl,pl,ro,sk,sl,sv | et,mt |
|
343 |
|[EUBookshop](https://opus.nlpl.eu/EUbookshop/corpus/version/EUbookshop) |lt,pl,pt |cs,da,de,el,fi,fr,ga,it,lv,mt,nl,pl,pt,ro,sk,sl,sv |cy,ga|
|
344 |
|[Europarl](https://opus.nlpl.eu/Europarl/corpus/version/Europarl) | |bg,cs,da,el,en,fi,fr,hu,lt,lv,nl,pl,pt ,ro,sk,sl,sv | |
|
345 |
|[Europat](https://opus.nlpl.eu/EuroPat/corpus/version/EuroPat) | |en,hr | no |
|
346 |
-
|[GAITU](https://gaitu.eus/) | | | eu|
|
347 |
|[KDE4](https://opus.nlpl.eu/KDE4/corpus/version/KDE4) |bg,cs,da,de,el ,et,eu,fi,fr,ga,gl,hr,it,lt,lv,nl,pl,pt,ro,sk,sl,sv |bg,ga,hr |cy,ga,nn,oc |
|
348 |
|[GlobalVoices](https://opus.nlpl.eu/GlobalVoices/corpus/version/GlobalVoices) | bg,de,fr,it,nl,pl,pt |bg,de,fr,pt | |
|
349 |
|[GNOME](https://opus.nlpl.eu/GNOME/corpus/version/GNOME) |eu,fr,ga,gl,pt |ga |cy,ga,nn|
|
350 |
|[JRC-Acquis](https://opus.nlpl.eu/JRC-Acquis/corpus/version/JRC-Acquis) | |cs,da,et,fr,lt,lv,mt,nl,pl ,ro,sv| et |
|
351 |
-
|LES-CORTS-VALENCIANES-BSC | | val | |
|
352 |
|[MaCoCu](https://opus.nlpl.eu/MaCoCu/corpus/version/MaCoCu) | en | | hr,mt,uk |
|
353 |
|[MultiCCAligned](https://opus.nlpl.eu/MultiCCAligned/corpus/version/MultiCCAligned) |bg,cs,de,el,et,fi,fr,hr,hu,it,lt,lv,nl,pl,ro,sk,sv |bg,fi,fr,hr,it,lv,nl,pt |bg,cy,da,et,fi,hr,hu,lt,lv,no,sl,sr,uk|
|
354 |
|[MultiHPLT](https://opus.nlpl.eu/MultiHPLT/corpus/version/MultiHPLT) |en, et,fi,ga,hr,mt | |fi,ga,gl,hr,mt,nn,sr |
|
@@ -356,7 +362,7 @@ Click the expand button below to see the full list of corpora included in the tr
|
|
356 |
|[MultiUN](https://opus.nlpl.eu/MultiUN/corpus/version/MultiUN) | |fr | |
|
357 |
|[News-Commentary](https://opus.nlpl.eu/News-Commentary/corpus/version/News-Commentary) | |fr | |
|
358 |
|[NLLB](https://opus.nlpl.eu/NLLB/corpus/version/NLLB) |bg,da,el,en,et,fi,fr,gl,hu,it ,lt,lv,pt,ro,sk,sl |bg,cs,da,de,el ,et,fi,fr,hu,it,lt,lv,nl,pl,pt ,ro,sk,sl,sv| bg,cs,cy,da,de,el,et,fi,fr,ga,hr,hu,it,lt,lv,mt,nl,no,oc,pl,pt,ro,ru,sk,sl,sr,sv,uk|
|
359 |
-
|[NÓS](https://zenodo.org/records/7675110) | | | gl |
|
360 |
|[NÓS-SYN](https://zenodo.org/records/7685180) | | | gl |
|
361 |
|[NTEU](https://www.elrc-share.eu/repository/search/?q=NTEU) | |bg,cs,da,de,el,en,et,fi,fr,ga,hr,hu,it,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv | da,et,ga,hr,lt,lv,mt,ro,sk,sl,sv |
|
362 |
|[OpenSubtitles](https://opus.nlpl.eu/OpenSubtitles/corpus/version/OpenSubtitles) |bg,cs,da,de,el ,et,eu,fi,gl,hr,hu,lt,lv,nl,pl,pt,ro,sk,sl,sv |da,de,fi,fr,hr,hu,it,lv,nl | bg,cs,de,el,et,hr,fi,fr,hr,hu,no,sl,sr|
|
@@ -365,14 +371,15 @@ Click the expand button below to see the full list of corpora included in the tr
|
|
365 |
|[Tatoeba](https://opus.nlpl.eu/Tatoeba/corpus/version/Tatoeba) |de,pt |pt | |
|
366 |
|[TildeModel](https://opus.nlpl.eu/TildeMODEL/corpus/version/TildeMODEL) | |bg | et,hr,lt,lv,mt |
|
367 |
|[UNPC](https://opus.nlpl.eu/UNPC/corpus/version/UNPC) | |en,fr | ru |
|
368 |
-
|[VALENCIAN-AUTH](https://github.com/transducens/PILAR/tree/main/valencian/Generalitat) | | val | |
|
369 |
-
|[VALENCIAN-SYNTH](https://github.com/transducens/PILAR/tree/main/valencian/Generalitat) | | val | |
|
370 |
|[WikiMatrix](https://opus.nlpl.eu/WikiMatrix/corpus/version/WikiMatrix) |bg,cs,da,de,el ,et,eu,fi,fr,gl,hr,hu,it,lt,nl,pl,pt,ro,sk,sl,sv |bg,en,fr,hr,it,pt | oc,sh |
|
371 |
|[Wikimedia](https://opus.nlpl.eu/wikimedia/corpus/version/wikimedia) | | |cy,nn |
|
372 |
|[XLENT](https://opus.nlpl.eu/XLEnt/corpus/version/XLEnt) |eu,ga,gl |ga |cy,et,ga,gl,hr,oc,sh|
|
373 |
|
374 |
|
375 |
-
Datasets
|
|
|
376 |
|
377 |
To consult the data summary document with the respective licences, please send an e-mail to ipr@bsc.es.
|
378 |
|
@@ -411,9 +418,13 @@ To consult the data summary document with the respective licences, please send a
|
|
411 |
|
412 |
### Instruction Tuning Data
|
413 |
|
414 |
-
This model has been fine-tuned on ~135k instructions, primarily targeting machine translation performance for Catalan, English, and Spanish.
|
|
|
|
|
415 |
|
416 |
-
A portion of our fine-tuning data comes directly from, or is sampled from [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2).
|
|
|
|
|
417 |
|
418 |
Click the expand button below to see the full list of tasks included in the finetuning data.
|
419 |
|
@@ -459,7 +470,8 @@ Click the expand button below to see the full list of tasks included in the fine
|
|
459 |
| Context-Aware Translation | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2): [MT-GenEval](https://github.com/amazon-science/machine-translation-gender-eval) | en-de | 558 |
|
460 |
|**Total** | | | **135,404** |
|
461 |
|
462 |
-
The non-public portion of this dataset was jointly created by BSC, HiTZ, and CiTIUS. For further information regarding the instruction-tuning data,
|
|
|
463 |
|
464 |
</details>
|
465 |
|
|
|
49 |
|
50 |
# Salamandra Model Card
|
51 |
|
52 |
+
SalamandraTA-7b-instruct is a translation LLM that has been instruction-tuned from SalamandraTA-7b-base.
|
53 |
+
The base model results from continually pre-training [Salamandra-7b](https://huggingface.co/BSC-LT/salamandra-7b) on parallel data and has not been published, but is reserved for internal use.
|
54 |
+
The model is proficient in 37 European languages and supports translation-related tasks, namely: sentence-level translation, paragraph-level translation, document-level translation, automatic post-editing, machine translation evaluation, multi-reference translation, named-entity recognition, and context-aware translation.
|
55 |
|
56 |
> [!WARNING]
|
57 |
> **DISCLAIMER:** This version of Salamandra is tailored exclusively for translation tasks. It lacks chat capabilities and has not been trained with any chat instructions.
|
|
|
131 |
|
132 |
You can translate between the following 37 languages:
|
133 |
|
134 |
+
Aragonese, Aranese, Asturian, Basque, Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hungarian,
|
135 |
+
Irish, Italian, Latvian, Lithuanian, Maltese, Norwegian Bokmål, Norwegian Nynorsk, Occitan, Polish, Portuguese, Romanian, Russian, Serbian, Slovak,
|
136 |
+
Slovenian, Spanish, Swedish, Ukrainian, Valencian, Welsh.
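
As a minimal sketch of how the language list above might be used in code, the following Python snippet checks a requested language pair against the 37 names listed and fills in the instruction line of the general translation prompt. `SUPPORTED_LANGUAGES` and `instruction_line` are hypothetical helpers written for illustration, not part of the model's API, and only the first line of the prompt template is reproduced.

```python
# Hypothetical helper: the tuple mirrors the language list in this README.
SUPPORTED_LANGUAGES = (
    "Aragonese", "Aranese", "Asturian", "Basque", "Bulgarian", "Croatian",
    "Czech", "Danish", "Dutch", "English", "Estonian", "Finnish", "French",
    "Galician", "German", "Greek", "Hungarian", "Irish", "Italian", "Latvian",
    "Lithuanian", "Maltese", "Norwegian Bokmål", "Norwegian Nynorsk",
    "Occitan", "Polish", "Portuguese", "Romanian", "Russian", "Serbian",
    "Slovak", "Slovenian", "Spanish", "Swedish", "Ukrainian", "Valencian",
    "Welsh",
)


def instruction_line(source: str, target: str) -> str:
    """Return the first line of the general translation prompt.

    Raises ValueError if either language name is not in the list above.
    """
    for name in (source, target):
        if name not in SUPPORTED_LANGUAGES:
            raise ValueError(f"Unsupported language: {name}")
    return f"Translate the following text from {source} into {target}."


if __name__ == "__main__":
    print(instruction_line("Spanish", "Galician"))
```

Validating names up front is a design choice for this sketch: the prompt uses full language names rather than ISO codes, so a misspelled name would otherwise reach the model unnoticed.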
|
137 |
|
138 |
The instruction-following model uses the commonly adopted ChatML template:
|
139 |
|
|
|
198 |
|
199 |
#### General translation
|
200 |
|
201 |
+
For machine translation tasks, you can use the following prompt template:
|
202 |
|
203 |
```
|
204 |
Translate the following text from {source} into {target}.
|
|
|
221 |
|
222 |
### Post-editing
|
223 |
|
224 |
+
For post-editing tasks, you can use the following prompt template:
|
225 |
|
226 |
```
|
227 |
Please fix any mistakes in the following {source}-{target} machine translation or keep it unedited if it's correct.
|
|
|
248 |
|
249 |
### Document-level translation
|
250 |
|
251 |
+
For document-level translation tasks, you can use the following prompt template:
|
252 |
|
253 |
```
|
254 |
Please translate this text from {source} into {target}.
|
|
|
278 |
|
279 |
### Named-entity recognition
|
280 |
|
281 |
+
For named-entity recognition tasks, you can use the following prompt template:
|
282 |
|
283 |
```
|
284 |
Analyse the following tokenized text and mark the tokens containing named entities.
|
|
|
317 |
|
318 |
### Pretraining Data
|
319 |
|
320 |
+
The pretraining corpus consists of 424 billion tokens of Catalan-centric, Spanish-centric, and English-centric parallel data,
|
321 |
+
including all of the official European languages plus Catalan, Basque, Galician, Asturian, Aragonese and Aranese.
|
322 |
+
It amounts to 6,574,251,526 parallel sentence pairs.
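
For a rough sense of scale, the two figures above can be combined into an average number of tokens per sentence pair. This is a back-of-the-envelope sketch that assumes both figures describe the same corpus and that the token count covers both sides of each pair; the variable names are illustrative only.

```python
# Figures quoted in the paragraph above.
total_tokens = 424_000_000_000   # 424 billion tokens
sentence_pairs = 6_574_251_526   # parallel sentence pairs

# Assumption: the token count includes source and target sides together.
tokens_per_pair = total_tokens / sentence_pairs
tokens_per_side = tokens_per_pair / 2

print(f"{tokens_per_pair:.1f} tokens per pair, {tokens_per_side:.1f} per side")
```

Under these assumptions the average works out to roughly 64 tokens per pair, or about 32 per side, which is consistent with a corpus aligned mostly at sentence level.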
|
323 |
|
324 |
+
This highly multilingual corpus is predominantly composed of data sourced from [OPUS](https://opus.nlpl.eu/),
|
325 |
+
with additional data taken from the [NTEU project](https://nteu.eu/), Projecte Aina’s corpora, and other sources (see: Data Sources and References below).
|
326 |
Where little parallel Catalan <-> xx data could be found, synthetic Catalan data was generated from the Spanish side of the collected Spanish <-> xx corpora using
|
327 |
[Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca). The final distribution of languages was as below:
|
328 |
|
|
|
337 |
|-----------------------------------------------|----------------------------------------------------------------|-----------------------------------------------|----------------------------------------------------------------|
|
338 |
|[AINA](https://huggingface.co/projecte-aina) | en | | |
|
339 |
|ARANESE-SYNTH-CORPUS-BSC | arn | | |
|
340 |
+
|BOUA-SYNTH-BSC | | val | |
|
341 |
|[BOUMH](https://github.com/transducens/PILAR/tree/main/valencian/BOUMH) | | val | |
|
342 |
|[BOUA-PILAR](https://github.com/transducens/PILAR/tree/main/valencian/BOUA) | | val | |
|
343 |
|[CCMatrix](https://opus.nlpl.eu/CCMatrix/corpus/version/CCMatrix) |eu | | ga |
|
344 |
|[DGT](https://opus.nlpl.eu/DGT/corpus/version/DGT) | |bg,cs,da,de,el ,et,fi,fr,ga,hr,hu,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv | da,et,ga,hr,hu,lt,lv,mt,sh,sl|
|
345 |
+
|DOGV-SYNTH-BSC | | val | |
|
346 |
|[DOGV-PILAR](https://github.com/transducens/PILAR/tree/main/valencian/DOGV-html) | | val | |
|
347 |
|[ELRC-EMEA](https://opus.nlpl.eu/ELRC-EMEA/corpus/version/ELRC-EMEA) | |bg,cs,da,hu,lt,lv,mt,pl,ro,sk,sl | et,hr,lv,ro,sk,sl |
|
348 |
|[EMEA](https://opus.nlpl.eu/EMEA/corpus/version/EMEA) | |bg,cs,da,el,fi,hu,lt,mt,nl,pl,ro,sk,sl,sv | et,mt |
|
349 |
|[EUBookshop](https://opus.nlpl.eu/EUbookshop/corpus/version/EUbookshop) |lt,pl,pt |cs,da,de,el,fi,fr,ga,it,lv,mt,nl,pl,pt,ro,sk,sl,sv |cy,ga|
|
350 |
|[Europarl](https://opus.nlpl.eu/Europarl/corpus/version/Europarl) | |bg,cs,da,el,en,fi,fr,hu,lt,lv,nl,pl,pt ,ro,sk,sl,sv | |
|
351 |
|[Europat](https://opus.nlpl.eu/EuroPat/corpus/version/EuroPat) | |en,hr | no |
|
352 |
+
|[GAITU Corpus](https://gaitu.eus/) | | | eu|
|
353 |
|[KDE4](https://opus.nlpl.eu/KDE4/corpus/version/KDE4) |bg,cs,da,de,el ,et,eu,fi,fr,ga,gl,hr,it,lt,lv,nl,pl,pt,ro,sk,sl,sv |bg,ga,hr |cy,ga,nn,oc |
|
354 |
|[GlobalVoices](https://opus.nlpl.eu/GlobalVoices/corpus/version/GlobalVoices) | bg,de,fr,it,nl,pl,pt |bg,de,fr,pt | |
|
355 |
|[GNOME](https://opus.nlpl.eu/GNOME/corpus/version/GNOME) |eu,fr,ga,gl,pt |ga |cy,ga,nn|
|
356 |
|[JRC-Acquis](https://opus.nlpl.eu/JRC-Acquis/corpus/version/JRC-Acquis) | |cs,da,et,fr,lt,lv,mt,nl,pl ,ro,sv| et |
|
357 |
+
|LES-CORTS-VALENCIANES-SYNTH-BSC | | val | |
|
358 |
|[MaCoCu](https://opus.nlpl.eu/MaCoCu/corpus/version/MaCoCu) | en | | hr,mt,uk |
|
359 |
|[MultiCCAligned](https://opus.nlpl.eu/MultiCCAligned/corpus/version/MultiCCAligned) |bg,cs,de,el,et,fi,fr,hr,hu,it,lt,lv,nl,pl,ro,sk,sv |bg,fi,fr,hr,it,lv,nl,pt |bg,cy,da,et,fi,hr,hu,lt,lv,no,sl,sr,uk|
|
360 |
|[MultiHPLT](https://opus.nlpl.eu/MultiHPLT/corpus/version/MultiHPLT) |en, et,fi,ga,hr,mt | |fi,ga,gl,hr,mt,nn,sr |
|
|
|
362 |
|[MultiUN](https://opus.nlpl.eu/MultiUN/corpus/version/MultiUN) | |fr | |
|
363 |
|[News-Commentary](https://opus.nlpl.eu/News-Commentary/corpus/version/News-Commentary) | |fr | |
|
364 |
|[NLLB](https://opus.nlpl.eu/NLLB/corpus/version/NLLB) |bg,da,el,en,et,fi,fr,gl,hu,it ,lt,lv,pt,ro,sk,sl |bg,cs,da,de,el ,et,fi,fr,hu,it,lt,lv,nl,pl,pt ,ro,sk,sl,sv| bg,cs,cy,da,de,el,et,fi,fr,ga,hr,hu,it,lt,lv,mt,nl,no,oc,pl,pt,ro,ru,sk,sl,sr,sv,uk|
|
365 |
+
|[NÓS Corpus](https://zenodo.org/records/7675110) | | | gl |
|
366 |
|[NÓS-SYN](https://zenodo.org/records/7685180) | | | gl |
|
367 |
|[NTEU](https://www.elrc-share.eu/repository/search/?q=NTEU) | |bg,cs,da,de,el,en,et,fi,fr,ga,hr,hu,it,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv | da,et,ga,hr,lt,lv,mt,ro,sk,sl,sv |
|
368 |
|[OpenSubtitles](https://opus.nlpl.eu/OpenSubtitles/corpus/version/OpenSubtitles) |bg,cs,da,de,el ,et,eu,fi,gl,hr,hu,lt,lv,nl,pl,pt,ro,sk,sl,sv |da,de,fi,fr,hr,hu,it,lv,nl | bg,cs,de,el,et,hr,fi,fr,hr,hu,no,sl,sr|
|
|
|
371 |
|[Tatoeba](https://opus.nlpl.eu/Tatoeba/corpus/version/Tatoeba) |de,pt |pt | |
|
372 |
|[TildeModel](https://opus.nlpl.eu/TildeMODEL/corpus/version/TildeMODEL) | |bg | et,hr,lt,lv,mt |
|
373 |
|[UNPC](https://opus.nlpl.eu/UNPC/corpus/version/UNPC) | |en,fr | ru |
|
374 |
+
|[PILAR-VALENCIAN-AUTH](https://github.com/transducens/PILAR/tree/main/valencian/Generalitat) | | val | |
|
375 |
+
|[PILAR-VALENCIAN-SYNTH](https://github.com/transducens/PILAR/tree/main/valencian/Generalitat) | | val | |
|
376 |
|[WikiMatrix](https://opus.nlpl.eu/WikiMatrix/corpus/version/WikiMatrix) |bg,cs,da,de,el ,et,eu,fi,fr,gl,hr,hu,it,lt,nl,pl,pt,ro,sk,sl,sv |bg,en,fr,hr,it,pt | oc,sh |
|
377 |
|[Wikimedia](https://opus.nlpl.eu/wikimedia/corpus/version/wikimedia) | | |cy,nn |
|
378 |
|[XLENT](https://opus.nlpl.eu/XLEnt/corpus/version/XLEnt) |eu,ga,gl |ga |cy,et,ga,gl,hr,oc,sh|
|
379 |
|
380 |
|
381 |
+
Datasets with "-BSC" in their names (e.g., BOUA-SYNTH-BSC, DOGV-SYNTH-BSC) are synthetic datasets obtained by machine translating
|
382 |
+
pre-existing monolingual corpora with our own seq-to-seq models. These datasets were generated internally for model training and are not published.
|
383 |
|
384 |
To consult the data summary document with the respective licences, please send an e-mail to ipr@bsc.es.
|
385 |
|
|
|
418 |
|
419 |
### Instruction Tuning Data
|
420 |
|
421 |
+
This model has been fine-tuned on ~135k instructions, primarily targeting machine translation performance for Catalan, English, and Spanish.
|
422 |
+
Additional instruction data for other European and closely related Iberian languages was also included, as it yielded a positive impact on the languages of interest.
|
423 |
+
That said, the performance in these additional languages is not guaranteed due to the limited amount of available data and the lack of resources for thorough testing.
|
424 |
|
425 |
+
A portion of our fine-tuning data comes directly from, or is sampled from, [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2).
|
426 |
+
We also created additional datasets for our main languages of interest.
|
427 |
+
While tasks relating to machine translation are included, it’s important to note that no chat data was used in the fine-tuning process.
|
428 |
|
429 |
Click the expand button below to see the full list of tasks included in the finetuning data.
|
430 |
|
|
|
470 |
| Context-Aware Translation | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2): [MT-GenEval](https://github.com/amazon-science/machine-translation-gender-eval) | en-de | 558 |
|
471 |
|**Total** | | | **135,404** |
|
472 |
|
473 |
+
The non-public portion of this dataset was jointly created by the [ILENIA](https://proyectoilenia.es/) partners BSC, HiTZ, and CiTIUS. For further information regarding the instruction-tuning data,
|
474 |
+
please contact <langtech@bsc.es>.
|
475 |
|
476 |
</details>
|
477 |
|