Speech2S

Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation

(Updating) Nov. 2022: release the code and models
Nov. 2022: release preprint on arXiv

Pre-Trained and Fine-tuned Models

Model	Pre-training Dataset	Fine-tuning Dataset	Download
Speech2S_enes	Voxpopuli_en_v2	-	Google Drive
Speech2S_enes	Voxpopuli_en_v2	Voxpopuli_s2s	Google Drive
Speech2S_esen	Voxpopuli_es_v2	-	Google Drive
Speech2S_esen	Voxpopuli_es_v2	Voxpopuli_s2s	Google Drive

Setup

cd Speech2S/speech2s
pip install --editable fairseq/
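After the editable install, it can help to confirm that Python actually resolves the package before launching any training script. The following is a minimal sketch, assuming a `python3` interpreter on the PATH (override it with the `PYTHON` variable); `check_python_module` is a hypothetical helper and not part of this repository.

```shell
# Hypothetical helper (not part of Speech2S): returns 0 only if the named
# Python module can be imported by the interpreter in $PYTHON (default: python3).
check_python_module() {
  "${PYTHON:-python3}" -c 'import importlib, sys; importlib.import_module(sys.argv[1])' "$1" 2>/dev/null
}

# Example: report whether the editable fairseq install is visible.
# check_python_module fairseq && echo "fairseq is importable"
```

If the check fails, the usual cause is that `pip` and the interpreter belong to different environments.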

Data Preparation

Please follow the data preparation steps for S2ST here.

Pre-Training

cd speech2s/stpretrain_scripts
bash base_sc2c_enes.sh

Fine-Tuning

cd speech2s/stpretrain_scripts
bash finetune_enes.sh

Inference

cd speech2s/stpretrain_scripts
bash inference_ed.sh
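The pre-training, fine-tuning, and inference steps above each run one script from the same directory. The sketch below wraps that pattern in a hypothetical `run_stage` helper (not part of this repository) that fails early when a script is missing. It assumes the scripts take no command-line arguments, which this README does not state, so check each script before relying on it.

```shell
# Hypothetical helper (not part of Speech2S): run one stage script from a
# given directory, and fail with a message if the script is not there.
run_stage() {
  script_dir="$1"
  script_name="$2"
  if [ ! -f "$script_dir/$script_name" ]; then
    echo "missing script: $script_dir/$script_name" >&2
    return 1
  fi
  # Run in a subshell so the caller's working directory is unchanged.
  (cd "$script_dir" && bash "$script_name")
}

# Example: run the three stages in order, stopping at the first failure.
# run_stage speech2s/stpretrain_scripts base_sc2c_enes.sh &&
#   run_stage speech2s/stpretrain_scripts finetune_enes.sh &&
#   run_stage speech2s/stpretrain_scripts inference_ed.sh
```

Chaining with `&&` keeps a failed pre-training run from silently feeding into fine-tuning or inference.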

Results on Voxpopuli and Covst

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the FAIRSEQ project.

Microsoft Open Source Code of Conduct

Reference

If you find our work useful in your research, please cite the following paper:

@article{wei2022joint,
  title={Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation},
  author={Wei, Kun and Zhou, Long and Zhang, Ziqiang and Chen, Liping and Liu, Shujie and He, Lei and Li, Jinyu and Wei, Furu},
  journal={arXiv preprint arXiv:2210.17027},
  year={2022}
}