---
license: mit
language:
- en
- ar
- ca
- cy
- de
- et
- fa
- id
- ja
- lv
- mn
- sl
- sv
- ta
- tr
- zh
metrics:
- bleu
pipeline_tag: translation
datasets:
- facebook/covost2
---
# nllb-200-distilled-1.3B_covost2_en-to-15

This is [nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B), a distilled [NLLB](https://arxiv.org/abs/2207.04672) model, multilingually fine-tuned on the text data of CoVoST2 to translate from English into 15 target languages (En -> 15).

It is part of the paper [Pushing the Limits of Zero-shot End-to-End Speech Translation](https://arxiv.org/abs/2402.10422). Details of the fine-tuning process are given in Appendix D of the paper.

## Usage

```python
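import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "johntsi/nllb-200-distilled-1.3B_covost2_en-to-15"
# Fall back to CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

# The source language is always English for this model
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

model.eval()
model.to(device)

text = "Translate this text to German."
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(
    **inputs,
    num_beams=5,
    # Force the target language with its FLORES-200 code. `convert_tokens_to_ids`
    # also works on transformers versions that no longer ship `lang_code_to_id`.
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"),
)
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translated_text)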
```
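
To translate into the other target languages, only the forced target-language token changes. The sketch below is a hypothetical helper, not part of the released code: the names `COVOST2_TO_NLLB` and `translate` are made up for illustration, and the mapping from CoVoST2 language codes to FLORES-200 codes is an assumption based on the standard NLLB code list (for example, Simplified Chinese for `zh`), so verify it against your setup before relying on it.

```python
# Assumed mapping from CoVoST2 target-language codes to NLLB (FLORES-200) codes.
# These are standard NLLB codes, but the exact variants used for fine-tuning
# (e.g. zho_Hans rather than zho_Hant) are an assumption.
COVOST2_TO_NLLB = {
    "ar": "arb_Arab", "ca": "cat_Latn", "cy": "cym_Latn", "de": "deu_Latn",
    "et": "est_Latn", "fa": "pes_Arab", "id": "ind_Latn", "ja": "jpn_Jpan",
    "lv": "lvs_Latn", "mn": "khk_Cyrl", "sl": "slv_Latn", "sv": "swe_Latn",
    "ta": "tam_Taml", "tr": "tur_Latn", "zh": "zho_Hans",
}


def translate(texts, tgt_lang, tokenizer, model, num_beams=5):
    """Translate a list of English sentences into one CoVoST2 target language.

    `tokenizer` and `model` are expected to be the objects loaded in the
    usage example above; `tgt_lang` is a two-letter CoVoST2 code such as "de".
    """
    if tgt_lang not in COVOST2_TO_NLLB:
        raise ValueError(f"Unsupported target language: {tgt_lang!r}")
    nllb_code = COVOST2_TO_NLLB[tgt_lang]

    # Pad the batch so sentences of different lengths fit in one tensor
    inputs = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)
    outputs = model.generate(
        **inputs,
        num_beams=num_beams,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(nllb_code),
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

With `tokenizer` and `model` loaded as in the example above, `translate(["How are you?", "See you tomorrow."], "ca", tokenizer, model)` returns a list with one Catalan translation per input sentence.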

## Results

### BLEU scores on the CoVoST2 test set

| Model                    |   Ar   |   Ca   |   Cy   |   De   |   Et   |   Fa   |   Id   |   Ja   |   Lv   |   Mn   |   Sl   |   Sv   |   Ta   |   Tr   |   Zh   | Average |
|:------------------------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:-------:|
| [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) (original)        |  20.0  |  39.0  |  26.3  |  35.5  |  23.4  |  15.7  |  39.6  |  21.8  |  14.8  |  10.4  |  30.3  |  41.1  |  20.2  |  21.1  |  34.8  |  26.3   |
| [nllb-200-distilled-600M_covost2_en-to-15](https://huggingface.co/johntsi/nllb-200-distilled-600M_covost2_en-to-15)     |  28.5  |  46.3  |  35.5  |  37.1  |  31.5  |  29.2  |  45.2  |  38.4  |  29.1  |  22.0  |  37.7  |  45.4  |  29.9  |  23.0  |  46.7  |  35.0   |
| [nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B) (original)        |  23.3  |  43.5  |  33.5  |  37.9  |  27.9  |  16.6  |  41.9  |  23.0  |  20.0  |  13.1  |  35.1  |  43.8  |  21.7  |  23.8  |  37.5  |  29.5   |
| [nllb-200-distilled-1.3B_covost2_en-to-15](https://huggingface.co/johntsi/nllb-200-distilled-1.3B_covost2_en-to-15)     |  29.9  |  47.8  |  35.6  |  38.8  |  32.7  |  29.9  |  46.4  |  39.5  |  29.9  |  21.7  |  39.3  |  46.8  |  31.0  |  24.4  |  48.2  |  36.1   |

## Citation

If you find these models useful for your research, please cite our paper :)

```bibtex
@misc{tsiamas2024pushing,
      title={{Pushing the Limits of Zero-shot End-to-End Speech Translation}}, 
      author={Ioannis Tsiamas and Gerard I. Gállego and José A. R. Fonollosa and Marta R. Costa-jussà},
      year={2024},
      eprint={2402.10422},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```