---
language:
- en
- ar
- ca
- cy
- de
- et
- fa
- id
- ja
- lv
- mn
- sl
- sv
- ta
- tr
- zh
license: mit
metrics:
- bleu
datasets:
- mozilla-foundation/common_voice_8_0
pipeline_tag: automatic-speech-recognition
tags:
- zeroswot
- speech translation
- zero-shot
- end-to-end
- nllb
- wav2vec2
---

# ZeroSwot ✨🤖✨

<!-- <div style='display:flex; gap: 0.25rem; '>
<a href='https://arxiv.org/abs/2402.10422'><img src='https://img.shields.io/badge/paper-PDF-green'></a>
<a href='https://github.com/mt-upc/ZeroSwot/blob/main/LICENSE'><img src='https://img.shields.io/badge/License-MIT-blue.svg'></a>
<a href='https://github.com/mt-upc/ZeroSwot'><img src='https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white'></a>
</div> -->

ZeroSwot is a state-of-the-art zero-shot end-to-end Speech Translation system.

<div align=center><img src="resources/intro.png" height="65%" width="65%"/></div>

The model is created by adapting a wav2vec2.0-based encoder to the embedding space of NLLB, using a novel subword compression module and Optimal Transport, while only utilizing ASR data. It thus enables **zero-shot E2E Speech Translation into all 200 languages supported by NLLB**.

For more details, please refer to our [paper](https://arxiv.org/abs/2402.10422) and the [original repo](https://github.com/mt-upc/ZeroSwot) built on fairseq.

## Architecture

The compression module is a lightweight transformer that takes as input the hidden states of wav2vec2.0 and the corresponding CTC predictions, and compresses them into subword-like embeddings similar to those expected by NLLB. These embeddings are aligned with NLLB's embedding space using Optimal Transport. For inference, we simply pass the output of the speech encoder to the NLLB encoder.
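As a loose illustration of how CTC predictions can shorten a frame sequence, the sketch below mean-pools consecutive frames that share the same CTC prediction and drops blank frames. It is a simplified stand-in, not the ZeroSwot module, which is a learned transformer. The function name `ctc_compress`, the blank id `0`, and the toy inputs are all assumptions made for this example.

```python
from itertools import groupby

BLANK = 0  # assumed id of the CTC blank token (hypothetical)


def ctc_compress(frames, ctc_ids, blank_id=BLANK):
    """Mean-pool runs of frames with the same CTC prediction; drop blank runs."""
    compressed = []
    start = 0
    for token_id, group in groupby(ctc_ids):
        length = len(list(group))
        segment = frames[start:start + length]
        start += length
        if token_id == blank_id:
            continue  # blank frames carry no token, so they are skipped
        dim = len(segment[0])
        compressed.append([sum(f[d] for f in segment) / length for d in range(dim)])
    return compressed


# Six toy 2-dimensional frames and their (made-up) frame-level CTC predictions
frames = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0], [4.0, 4.0], [6.0, 8.0], [5.0, 5.0]]
ctc_ids = [7, 7, 0, 3, 3, 3]

pooled = ctc_compress(frames, ctc_ids)
print(len(frames), "frames ->", len(pooled), "vectors")
```

Six frames collapse into two vectors here, one per predicted token, which is what brings the speech sequence length closer to that of a text sequence.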

<div align=center><img src="resources/methodology.png" height="120%" width="120%"/></div>
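For readers unfamiliar with Optimal Transport, the following sketch computes an entropy-regularised transport plan between two small sets of vectors using Sinkhorn iterations, a common way to approximate OT. It only illustrates the general idea of an OT-based alignment cost; the cost function, regularisation strength, and weighting used to train ZeroSwot are described in the paper and are not reproduced here. The names `sinkhorn_plan` and `squared_distance`, the values of `epsilon` and `n_iters`, and the toy embeddings are assumptions.

```python
import math


def sinkhorn_plan(cost, epsilon=0.1, n_iters=200):
    """Entropy-regularised OT plan between two uniform distributions."""
    n, m = len(cost), len(cost[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m  # uniform mass on both sides
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def squared_distance(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


speech = [[0.0, 0.1], [1.0, 0.9], [2.0, 2.1]]  # toy "speech" embeddings
text = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]    # toy "text" embeddings

cost = [[squared_distance(s, t) for t in text] for s in speech]
plan = sinkhorn_plan(cost)
ot_loss = sum(plan[i][j] * cost[i][j] for i in range(3) for j in range(3))
print(f"OT cost: {ot_loss:.4f}")
```

Because each toy speech vector sits next to exactly one text vector, almost all mass stays on the diagonal of the plan and the resulting cost is small. A larger cost would indicate that the two sets of embeddings are further apart.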

## Version

This version of ZeroSwot is trained with ASR data from CommonVoice. It adapts [wav2vec2.0-large](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self) to the embedding space of the [nllb-200-distilled-1.3B_covost2](https://huggingface.co/johntsi/nllb-200-distilled-1.3B_covost2_en-to-15) model, which is an NLLB model multilingually finetuned on CoVoST2 MT data.

We have more versions available:

| Models | ASR data | NLLB version |
|:------:|:--------:|:------------:|
| [ZeroSwot-Medium_asr-mustc](https://huggingface.co/johntsi/ZeroSwot-Medium_asr-mustc_en-to-200) | MuST-C v1.0 | [distilled-600M original](https://huggingface.co/facebook/nllb-200-distilled-600M)| 
| [ZeroSwot-Medium_asr-mustc_mt-mustc](https://huggingface.co/johntsi/ZeroSwot-Medium_asr-mustc_mt-mustc_en-to-8)  | MuST-C v1.0 | [distilled-600M finetuned w/ MuST-C](https://huggingface.co/johntsi/nllb-200-distilled-600M_mustc_en-to-8) |
| [ZeroSwot-Large_asr-mustc](https://huggingface.co/johntsi/ZeroSwot-Large_asr-mustc_en-to-200)  | MuST-C v1.0 | [distilled-1.3B original](https://huggingface.co/facebook/nllb-200-distilled-1.3B) | 
| [ZeroSwot-Large_asr-mustc_mt-mustc](https://huggingface.co/johntsi/ZeroSwot-Large_asr-mustc_mt-mustc_en-to-8) | MuST-C v1.0 | [distilled-1.3B finetuned w/ MuST-C](https://huggingface.co/johntsi/nllb-200-distilled-1.3B_mustc_en-to-8) |
| [ZeroSwot-Medium_asr-cv](https://huggingface.co/johntsi/ZeroSwot-Medium_asr-cv_en-to-200) | CommonVoice | [distilled-600M original](https://huggingface.co/facebook/nllb-200-distilled-600M)| 
| [ZeroSwot-Medium_asr-cv_mt-covost2](https://huggingface.co/johntsi/ZeroSwot-Medium_asr-cv_mt-covost2_en-to-15) | CommonVoice  | [distilled-600M finetuned w/ CoVoST2](https://huggingface.co/johntsi/nllb-200-distilled-600M_covost2_en-to-15) |
| [ZeroSwot-Large_asr-cv](https://huggingface.co/johntsi/ZeroSwot-Large_asr-cv_en-to-200) | CommonVoice  | [distilled-1.3B original](https://huggingface.co/facebook/nllb-200-distilled-1.3B) | 
| [ZeroSwot-Large_asr-cv_mt-covost2](https://huggingface.co/johntsi/ZeroSwot-Large_asr-cv_mt-covost2_en-to-15) | CommonVoice  | [distilled-1.3B finetuned w/ CoVoST2](https://huggingface.co/johntsi/nllb-200-distilled-1.3B_covost2_en-to-15) | 

## Usage

The model is tested with Python 3.9.16 and Transformers v4.41.2. Also install torchaudio and sentencepiece for audio loading and tokenization.

```bash
pip install transformers torchaudio sentencepiece
```


```python
import torch
import torchaudio
from transformers import AutoModel, AutoModelForSeq2SeqLM, NllbTokenizer, Wav2Vec2Processor

# Fall back to CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

def load_and_resample_audio(audio_path, target_sr=16000):
    """Load an audio file as a mono waveform sampled at `target_sr` Hz."""
    audio, orig_freq = torchaudio.load(audio_path)  # shape: (channels, samples)
    if orig_freq != target_sr:
        audio = torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=target_sr)
    # Average the channels so that stereo files also yield a 1-D signal
    audio = audio.mean(dim=0).numpy()
    return audio

# Load processors and tokenizers
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
tokenizer = NllbTokenizer.from_pretrained("johntsi/nllb-200-distilled-1.3B_covost2_en-to-15")

# Load ZeroSwot Encoder (pinned to a specific commit, since it runs remote code)
commit_hash = "762878c55bf91406318983c724db22590a828e96"
zeroswot_encoder = AutoModel.from_pretrained(
    "johntsi/ZeroSwot-Large_asr-cv_mt-covost2_en-to-15", trust_remote_code=True, revision=commit_hash,
)
zeroswot_encoder.eval()
zeroswot_encoder.to(device)

# Load NLLB Model
nllb_model = AutoModelForSeq2SeqLM.from_pretrained("johntsi/nllb-200-distilled-1.3B_covost2_en-to-15")
nllb_model.eval()
nllb_model.to(device)

# Load audio file
path_to_audio_file = "resources/sample.wav"  # sample file for testing; replace with your own audio
audio = load_and_resample_audio(path_to_audio_file)
input_values = processor(audio, sampling_rate=16000, return_tensors="pt").to(device)

# Translation to German
with torch.no_grad():
    # The encoder returns subword-like embeddings and their attention mask
    compressed_embeds, attention_mask = zeroswot_encoder(**input_values)
    predicted_ids = nllb_model.generate(
        inputs_embeds=compressed_embeds,
        attention_mask=attention_mask,
        # Look up the language token id directly in the vocabulary
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"),
        num_beams=5,
    )
translation = tokenizer.decode(predicted_ids[0], skip_special_tokens=True)
print(translation)
```
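To translate into a language other than German, only the language code behind `forced_bos_token_id` needs to change. The helper below is a hypothetical convenience, not part of ZeroSwot or Transformers: it maps the two-letter codes of the 15 CoVoST-2 target languages to what should be the corresponding NLLB (FLORES-200 style) codes. The mapping was compiled by hand for this example, so please verify each code against the tokenizer vocabulary before relying on it.

```python
# Hypothetical helper: two-letter language codes -> NLLB language codes.
# Hand-compiled for illustration; verify against the NLLB tokenizer vocabulary.
NLLB_CODES = {
    "ar": "arb_Arab",
    "ca": "cat_Latn",
    "cy": "cym_Latn",
    "de": "deu_Latn",
    "et": "est_Latn",
    "fa": "pes_Arab",
    "id": "ind_Latn",
    "ja": "jpn_Jpan",
    "lv": "lvs_Latn",
    "mn": "khk_Cyrl",
    "sl": "slv_Latn",
    "sv": "swe_Latn",
    "ta": "tam_Taml",
    "tr": "tur_Latn",
    "zh": "zho_Hans",
}


def nllb_code(lang):
    """Return the NLLB code for a two-letter language code (case-insensitive)."""
    try:
        return NLLB_CODES[lang.lower()]
    except KeyError:
        raise ValueError(f"No NLLB code registered for '{lang}'") from None


print(nllb_code("ja"))
```

With this in place, a Japanese translation could be requested by passing `tokenizer.convert_tokens_to_ids(nllb_code("ja"))` as `forced_bos_token_id` in the `generate` call.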

## Results

BLEU scores on the CoVoST-2 test set, compared to the supervised SOTA models XLS-R-2B and SeamlessM4T-Large. The ZS column marks zero-shot models, and sizes are given in billions of parameters. You can refer to Table 5 of the Results section in the paper for more details.

|     Models     |  ZS  |  Size (B)  |  Ar  |  Ca  |  Cy  |  De  |  Et  |  Fa  |  Id  |  Ja  |  Lv  |  Mn  |  Sl  |  Sv  |  Ta  |  Tr  |  Zh  | Average |
|:--------------:|:----:|:----------:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:-------:|
|    [XLS-R-2B](https://huggingface.co/facebook/wav2vec2-xls-r-2b-en-to-15)    |  ✗   |    2.0     | 20.7 | 34.2 | 33.8 | 28.3 | 24.1 | 22.9 | 32.5 | 41.5 | 23.5 | 16.2 | 27.6 | 34.5 | 19.8 | 18.6 | 38.5 |   27.8  |
| [SeamlessM4T-L-v1](https://huggingface.co/facebook/seamless-m4t-large)  |  ✗   |    2.3     | 24.5 | 41.6 | 33.6 | 35.9 | 28.5 | 19.3 | 39.0 | 39.4 | 23.8 | 15.7 | 35.0 | 42.5 | 22.7 | 23.9 | 33.1 |   30.6  |
|   [SeamlessM4T-L-v2](https://huggingface.co/facebook/seamless-m4t-v2-large)      |  ✗   |    2.3     | 25.4 | **43.6** | **35.5** | **37.0** | **29.3** | 19.2 | **40.2** | 39.7 | 24.8 | 16.4 | **36.2** | **43.7** | 23.4 | **24.7** | 35.9 |   **31.7**  |
| [ZeroSwot-Large_asr-cv](https://huggingface.co/johntsi/ZeroSwot-Large_asr-cv_en-to-200) |  ✓   | 0.35/1.65  | 19.8 | 36.1 | 22.6 | 31.8 | 23.6 | 16.8 | 34.2 | 33.6 | 17.5 | 11.8 | 28.9 | 36.8 | 19.1 | 17.5 | 32.2 |   25.5  |
| [ZeroSwot-Large_asr-cv_mt-covost2](https://huggingface.co/johntsi/ZeroSwot-Large_asr-cv_mt-covost2_en-to-15) |  ✓   | 0.35/1.65  | **25.7** | 40.0 | 29.0 | 32.8 | 27.2 | **26.6** | 37.1 | **47.1** | **25.7** | **18.9** | 33.2 | 39.3 | **25.3** | 19.8 | **40.5** |  31.2  |
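The Average column appears to be the unweighted mean of the 15 per-language BLEU scores. The short check below recomputes it for two rows copied from the table above (SeamlessM4T-L-v2 and ZeroSwot-Large_asr-cv_mt-covost2) and counts the languages in which the zero-shot model scores higher. The variable names are made up for this example.

```python
langs = ["Ar", "Ca", "Cy", "De", "Et", "Fa", "Id", "Ja", "Lv", "Mn", "Sl", "Sv", "Ta", "Tr", "Zh"]

# BLEU scores copied from the results table
seamless_v2 = [25.4, 43.6, 35.5, 37.0, 29.3, 19.2, 40.2, 39.7, 24.8, 16.4, 36.2, 43.7, 23.4, 24.7, 35.9]
zeroswot = [25.7, 40.0, 29.0, 32.8, 27.2, 26.6, 37.1, 47.1, 25.7, 18.9, 33.2, 39.3, 25.3, 19.8, 40.5]

avg_seamless = sum(seamless_v2) / len(seamless_v2)
avg_zeroswot = sum(zeroswot) / len(zeroswot)

# Languages where the zero-shot model outperforms the supervised one
wins = [lang for lang, s, z in zip(langs, seamless_v2, zeroswot) if z > s]

print(f"SeamlessM4T-L-v2 average: {avg_seamless:.1f}")
print(f"ZeroSwot average:         {avg_zeroswot:.1f}")
print(f"ZeroSwot ahead in {len(wins)} of {len(langs)} languages: {', '.join(wins)}")
```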

## Citation

If you find ZeroSwot useful for your research, please cite our paper :)

```
@misc{tsiamas2024pushing,
      title={{Pushing the Limits of Zero-shot End-to-End Speech Translation}}, 
      author={Ioannis Tsiamas and Gerard I. Gállego and José A. R. Fonollosa and Marta R. Costa-jussà},
      year={2024},
      eprint={2402.10422},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```