---
language:
- en
- ar
- ca
- de
- et
- fa
- id
- ja
- lv
- mn
- sl
- sv
- ta
- tr
- zh
license: mit
metrics:
- bleu
datasets:
- mozilla-foundation/common_voice_8_0
pipeline_tag: automatic-speech-recognition
tags:
- zeroswot
- speech translation
- zero-shot
- end-to-end
- nllb
- wav2vec2
---

# ZeroSwot ✨🤖✨

<!-- <div style='display:flex; gap: 0.25rem; '>
<a href='https://arxiv.org/abs/2402.10422'><img src='https://img.shields.io/badge/paper-PDF-green'></a>
<a href='https://github.com/mt-upc/ZeroSwot/blob/main/LICENSE'><img src='https://img.shields.io/badge/License-MIT-blue.svg'></a>
<a href='https://github.com/mt-upc/ZeroSwot'><img src='https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white'></a>
</div> -->

ZeroSwot is a state-of-the-art zero-shot end-to-end Speech Translation system.

<div align=center><img src="resources/intro.png" height="65%" width="65%"/></div>

The model is created by adapting a wav2vec2.0-based encoder to the embedding space of NLLB, using a novel subword compression module and Optimal Transport, while using only ASR data. It thus enables **zero-shot E2E Speech Translation into all 200 languages supported by NLLB**.

For more details, please refer to our [paper](https://arxiv.org/abs/2402.10422) and the [original repo](https://github.com/mt-upc/ZeroSwot), which is built on fairseq.

## Architecture

The compression module is a lightweight transformer that takes as input the hidden states of wav2vec2.0 and the corresponding CTC predictions, and compresses them into subword-like embeddings similar to those NLLB expects. These embeddings are aligned to the NLLB embedding space using Optimal Transport. At inference, we simply pass the output of the speech encoder to the NLLB encoder.
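
To give a rough feel for how CTC predictions can shorten a frame sequence, here is a minimal sketch. It is not the ZeroSwot compression module, which is a learned transformer; it only illustrates one simple assumption: consecutive frames with the same greedy CTC prediction are averaged, and blank frames are dropped. The function name `ctc_compress`, the blank index `0`, and the toy inputs are all hypothetical.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank token (hypothetical)


def ctc_compress(frames, ctc_ids, blank=BLANK):
    """Average runs of frames that share a CTC prediction and drop blank runs.

    frames:  list of equal-length feature vectors, one per acoustic frame
    ctc_ids: greedy CTC prediction (token id) for each frame
    """
    compressed = []
    start = 0
    for token_id, run in groupby(ctc_ids):
        length = len(list(run))
        segment = frames[start:start + length]
        start += length
        if token_id == blank:
            continue  # blank frames do not correspond to any token
        dim = len(segment[0])
        mean = [sum(vec[d] for vec in segment) / length for d in range(dim)]
        compressed.append((token_id, mean))
    return compressed


# Six toy frames: two for token 5, one blank, three for token 7.
frames = [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0]]
ctc_ids = [5, 5, 0, 7, 7, 7]
print(ctc_compress(frames, ctc_ids))
# [(5, [2.0, 0.0]), (7, [0.0, 4.0])]
```

In this toy case six frames shrink to two vectors, one per predicted token, which is the kind of length reduction that brings a speech sequence closer to a text-like sequence.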

<div align=center><img src="resources/methodology.png" height="120%" width="120%"/></div>
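
Optimal Transport can be read as the cheapest way to move one set of points onto another, which makes it a natural distance between two sequences of different lengths. The sketch below computes an entropy-regularised OT cost with Sinkhorn iterations in plain Python. It is an illustration only: the squared Euclidean cost, the uniform masses, the regularisation strength `reg`, and the iteration count are assumptions, not the settings used to train ZeroSwot.

```python
import math


def sinkhorn_cost(xs, ys, reg=0.1, n_iters=200):
    """Entropy-regularised OT cost between two sequences of vectors.

    Every vector gets uniform mass; the ground cost is squared Euclidean distance.
    """
    n, m = len(xs), len(ys)
    cost = [[sum((a - b) ** 2 for a, b in zip(x, y)) for y in ys] for x in xs]
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the uniform marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    return sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))


speech = [[0.0, 0.0], [1.0, 0.0]]    # toy "speech" embeddings
text_same = [[0.0, 0.0], [1.0, 0.0]]  # identical toy "text" embeddings
text_far = [[0.0, 1.0], [1.0, 1.0]]   # the same points shifted by one unit

print(sinkhorn_cost(speech, text_same))  # close to 0
print(sinkhorn_cost(speech, text_far))   # close to 1
```

A cost of this kind shrinks as the two sets of embeddings move closer together, which is why it can serve as an alignment objective between speech and text representations.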

## Version

This version of ZeroSwot is trained with ASR data from CommonVoice. It adapts [wav2vec2.0-large](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self) to the embedding space of the [nllb-200-distilled-1.3B_covost2](https://huggingface.co/johntsi/nllb-200-distilled-1.3B_covost2_en-to-15) model, which is an NLLB model multilingually finetuned on CoVoST2 MT data.

More versions are available:

| Models | ASR data | NLLB version |
|:------:|:--------:|:------------:|
| [ZeroSwot-Medium_asr-mustc](https://huggingface.co/johntsi/ZeroSwot-Medium_asr-mustc_en-to-200) | MuST-C v1.0 | [distilled-600M original](https://huggingface.co/facebook/nllb-200-distilled-600M) |
| [ZeroSwot-Medium_asr-mustc_mt-mustc](https://huggingface.co/johntsi/ZeroSwot-Medium_asr-mustc_mt-mustc_en-to-8) | MuST-C v1.0 | [distilled-600M finetuned w/ MuST-C](https://huggingface.co/johntsi/nllb-200-distilled-600M_mustc_en-to-8) |
| [ZeroSwot-Large_asr-mustc](https://huggingface.co/johntsi/ZeroSwot-Large_asr-mustc_en-to-200) | MuST-C v1.0 | [distilled-1.3B original](https://huggingface.co/facebook/nllb-200-distilled-1.3B) |
| [ZeroSwot-Large_asr-mustc_mt-mustc](https://huggingface.co/johntsi/ZeroSwot-Large_asr-mustc_mt-mustc_en-to-8) | MuST-C v1.0 | [distilled-1.3B finetuned w/ MuST-C](https://huggingface.co/johntsi/nllb-200-distilled-1.3B_mustc_en-to-8) |
| [ZeroSwot-Medium_asr-cv](https://huggingface.co/johntsi/ZeroSwot-Medium_asr-cv_en-to-200) | CommonVoice | [distilled-600M original](https://huggingface.co/facebook/nllb-200-distilled-600M) |
| [ZeroSwot-Medium_asr-cv_mt-covost2](https://huggingface.co/johntsi/ZeroSwot-Medium_asr-cv_mt-covost2_en-to-15) | CommonVoice | [distilled-600M finetuned w/ CoVoST2](https://huggingface.co/johntsi/nllb-200-distilled-600M_covost2_en-to-15) |
| [ZeroSwot-Large_asr-cv](https://huggingface.co/johntsi/ZeroSwot-Large_asr-cv_en-to-200) | CommonVoice | [distilled-1.3B original](https://huggingface.co/facebook/nllb-200-distilled-1.3B) |
| [ZeroSwot-Large_asr-cv_mt-covost2](https://huggingface.co/johntsi/ZeroSwot-Large_asr-cv_mt-covost2_en-to-15) | CommonVoice | [distilled-1.3B finetuned w/ CoVoST2](https://huggingface.co/johntsi/nllb-200-distilled-1.3B_covost2_en-to-15) |
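
The checkpoint names in the table appear to follow a pattern: model size, ASR training data, an optional MT finetuning set, and (in the repository ids) the number of target languages. The helper below is a hypothetical convenience for picking a checkpoint programmatically. The pattern is inferred from the names listed above and is not a documented convention.

```python
def parse_model_name(name):
    """Split a ZeroSwot checkpoint name into its parts (hypothetical helper).

    Example name: "ZeroSwot-Large_asr-cv_mt-covost2_en-to-15"
    """
    parts = name.split("_")
    info = {
        "size": parts[0].split("-", 1)[1],  # text after "ZeroSwot-"
        "asr_data": None,
        "mt_data": None,   # stays None when the original NLLB is used
        "targets": None,   # number of target languages, if present in the name
    }
    for part in parts[1:]:
        if part.startswith("asr-"):
            info["asr_data"] = part[len("asr-"):]
        elif part.startswith("mt-"):
            info["mt_data"] = part[len("mt-"):]
        elif part.startswith("en-to-"):
            info["targets"] = int(part[len("en-to-"):])
    return info


print(parse_model_name("ZeroSwot-Large_asr-cv_mt-covost2_en-to-15"))
# {'size': 'Large', 'asr_data': 'cv', 'mt_data': 'covost2', 'targets': 15}
print(parse_model_name("ZeroSwot-Medium_asr-mustc_en-to-200"))
# {'size': 'Medium', 'asr_data': 'mustc', 'mt_data': None, 'targets': 200}
```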

## Usage

The model is tested with Python 3.9.16 and Transformers v4.41.2. Also install torchaudio and sentencepiece for processing.

```bash
pip install transformers torchaudio sentencepiece
```

```python
import torch
import torchaudio
from transformers import AutoModel, AutoModelForSeq2SeqLM, NllbTokenizer, Wav2Vec2Processor

# Use the GPU when available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"


def load_and_resample_audio(audio_path, target_sr=16000):
    audio, orig_freq = torchaudio.load(audio_path)
    if orig_freq != target_sr:
        audio = torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=target_sr)
    # Drop the channel dimension (assumes mono audio)
    audio = audio.squeeze(0).numpy()
    return audio


# Load processors and tokenizers
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
tokenizer = NllbTokenizer.from_pretrained("johntsi/nllb-200-distilled-1.3B_covost2_en-to-15")

# Load ZeroSwot Encoder (pinned to a fixed revision)
commit_hash = "762878c55bf91406318983c724db22590a828e96"
zeroswot_encoder = AutoModel.from_pretrained(
    "johntsi/ZeroSwot-Large_asr-cv_mt-covost2_en-to-15", trust_remote_code=True, revision=commit_hash,
)
zeroswot_encoder.eval()
zeroswot_encoder.to(device)

# Load NLLB Model
nllb_model = AutoModelForSeq2SeqLM.from_pretrained("johntsi/nllb-200-distilled-1.3B_covost2_en-to-15")
nllb_model.eval()
nllb_model.to(device)

# Load audio file
path_to_audio_file = "resources/sample.wav"  # sample file for testing; replace with your own
audio = load_and_resample_audio(path_to_audio_file)
input_values = processor(audio, sampling_rate=16000, return_tensors="pt").to(device)

# Translation to German
with torch.no_grad():
    compressed_embeds, attention_mask = zeroswot_encoder(**input_values)
    predicted_ids = nllb_model.generate(
        inputs_embeds=compressed_embeds,
        attention_mask=attention_mask,
        # Look up the id of the target-language token
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"),
        num_beams=5,
    )
translation = tokenizer.decode(predicted_ids[0], skip_special_tokens=True)
print(translation)
```
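
The target language is selected through its NLLB language code (`deu_Latn` for German in the example above). For convenience, the sketch below maps the two-letter codes of the 15 CoVoST2 target languages to NLLB-style codes. The mapping is an assumption based on the FLORES-200 code list (for example, Simplified Chinese for `zh`), and `nllb_code` is a hypothetical helper, so verify each code against the tokenizer before relying on it.

```python
# Two-letter language code -> NLLB (FLORES-200) code.
# Assumed mapping, not taken from the ZeroSwot repository.
COVOST2_TO_NLLB = {
    "ar": "arb_Arab",
    "ca": "cat_Latn",
    "cy": "cym_Latn",
    "de": "deu_Latn",
    "et": "est_Latn",
    "fa": "pes_Arab",
    "id": "ind_Latn",
    "ja": "jpn_Jpan",
    "lv": "lvs_Latn",
    "mn": "khk_Cyrl",
    "sl": "slv_Latn",
    "sv": "swe_Latn",
    "ta": "tam_Taml",
    "tr": "tur_Latn",
    "zh": "zho_Hans",
}


def nllb_code(lang):
    """Return the NLLB code for a two-letter language code (hypothetical helper)."""
    try:
        return COVOST2_TO_NLLB[lang.lower()]
    except KeyError:
        raise ValueError(f"unsupported target language: {lang!r}") from None


print(nllb_code("de"))  # deu_Latn
print(nllb_code("JA"))  # jpn_Jpan
```

With the tokenizer loaded as in the usage example, `tokenizer.convert_tokens_to_ids(nllb_code("ja"))` would give the `forced_bos_token_id` for Japanese. The compressed embeddings do not depend on the target language, so they can be computed once and reused across several `generate` calls.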

## Results

BLEU scores on the CoVoST-2 test set, compared with the supervised SOTA models XLS-R-2B and SeamlessM4T-Large. The ZS column marks zero-shot systems, and the best score per language is in bold. See Table 5 of the Results section in the paper for more details.

| Models | ZS | Size (B) | Ar | Ca | Cy | De | Et | Fa | Id | Ja | Lv | Mn | Sl | Sv | Ta | Tr | Zh | Average |
|:--------------:|:----:|:----------:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:-------:|
| [XLS-R-2B](https://huggingface.co/facebook/wav2vec2-xls-r-2b-en-to-15) | ✗ | 2.0 | 20.7 | 34.2 | 33.8 | 28.3 | 24.1 | 22.9 | 32.5 | 41.5 | 23.5 | 16.2 | 27.6 | 34.5 | 19.8 | 18.6 | 38.5 | 27.8 |
| [SeamlessM4T-L-v1](https://huggingface.co/facebook/seamless-m4t-large) | ✗ | 2.3 | 24.5 | 41.6 | 33.6 | 35.9 | 28.5 | 19.3 | 39.0 | 39.4 | 23.8 | 15.7 | 35.0 | 42.5 | 22.7 | 23.9 | 33.1 | 30.6 |
| [SeamlessM4T-L-v2](https://huggingface.co/facebook/seamless-m4t-v2-large) | ✗ | 2.3 | 25.4 | **43.6** | **35.5** | **37.0** | **29.3** | 19.2 | **40.2** | 39.7 | 24.8 | 16.4 | **36.2** | **43.7** | 23.4 | **24.7** | 35.9 | **31.7** |
| [ZeroSwot-Large_asr-cv](https://huggingface.co/johntsi/ZeroSwot-Large_asr-cv_en-to-200) | ✓ | 0.35/1.65 | 19.8 | 36.1 | 22.6 | 31.8 | 23.6 | 16.8 | 34.2 | 33.6 | 17.5 | 11.8 | 28.9 | 36.8 | 19.1 | 17.5 | 32.2 | 25.5 |
| [ZeroSwot-Large_asr-cv_mt-covost2](https://huggingface.co/johntsi/ZeroSwot-Large_asr-cv_mt-covost2_en-to-15) | ✓ | 0.35/1.65 | **25.7** | 40.0 | 29.0 | 32.8 | 27.2 | **26.6** | 37.1 | **47.1** | **25.7** | **18.9** | 33.2 | 39.3 | **25.3** | 19.8 | **40.5** | 31.2 |
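
As a quick sanity check on how the Average column relates to the per-language scores, the snippet below recomputes it for the two ZeroSwot rows as an unweighted mean over the 15 directions. That the column is a simple mean is an assumption, although it reproduces the reported values for these rows.

```python
# BLEU scores copied from the two ZeroSwot rows of the table above (Ar ... Zh).
scores = {
    "ZeroSwot-Large_asr-cv": [
        19.8, 36.1, 22.6, 31.8, 23.6, 16.8, 34.2, 33.6,
        17.5, 11.8, 28.9, 36.8, 19.1, 17.5, 32.2,
    ],
    "ZeroSwot-Large_asr-cv_mt-covost2": [
        25.7, 40.0, 29.0, 32.8, 27.2, 26.6, 37.1, 47.1,
        25.7, 18.9, 33.2, 39.3, 25.3, 19.8, 40.5,
    ],
}


def macro_average(values):
    """Unweighted mean over language directions, rounded to one decimal."""
    return round(sum(values) / len(values), 1)


for name, values in scores.items():
    print(name, macro_average(values))
# ZeroSwot-Large_asr-cv 25.5
# ZeroSwot-Large_asr-cv_mt-covost2 31.2
```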

## Citation

If you find ZeroSwot useful for your research, please cite our paper :)

```
@misc{tsiamas2024pushing,
      title={{Pushing the Limits of Zero-shot End-to-End Speech Translation}},
      author={Ioannis Tsiamas and Gerard I. Gállego and José A. R. Fonollosa and Marta R. Costa-jussà},
      year={2024},
      eprint={2402.10422},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```