---
library_name: transformers
license: mit
datasets:
- h-j-han/SpeechQE-CoVoST2
language:
- de
- en
base_model:
- Unbabel/TowerInstruct-7B-v0.2
- openai/whisper-large-v2
---

# [SpeechQE: Estimating the Quality of Direct Speech Translation](https://aclanthology.org/2024.emnlp-main.1218)

This is an end-to-end model for the task of quality estimation for speech translation (SpeechQE).

|Task | E2E Model | Trained Domain |
|---|---|---|
|SpeechQE for English-to-German Speech Translation |[h-j-han/SpeechQE-TowerInstruct-7B-en2de](https://huggingface.co/h-j-han/SpeechQE-TowerInstruct-7B-en2de)|CoVoST2|
|SpeechQE for Spanish-to-English Speech Translation |[h-j-han/SpeechQE-TowerInstruct-7B-es2en](https://huggingface.co/h-j-han/SpeechQE-TowerInstruct-7B-es2en)|CoVoST2|

## Architecture and Training

Our design incorporates a pretrained speech encoder (whisper-large-v2) and a large language model (TowerInstruct-7B-v0.2) to leverage their existing capabilities in extracting high-quality audio features and handling translation-related tasks.

The model is trained with a two-phase approach. In the first phase, we train only an adapter on ASR and ST tasks while freezing the text LLM, so that training focuses solely on the mapping between the speech and text modalities. In the second phase, we continue training on the SpeechQE task so that the LLM learns the unseen task of QE: the adapter pre-trained in the first phase is frozen, while the text LLM is trained with LoRA.

## Setup

We provide code in the GitHub repository: https://github.com/h-j-han/SpeechQE

```bash
$ git clone https://github.com/h-j-han/SpeechQE.git
$ cd SpeechQE
```

```bash
$ conda create -n speechqe python=3.11 pytorch=2.0.1 pytorch-cuda=11.7 torchvision torchaudio -c pytorch -c nvidia
$ conda activate speechqe
$ pip install -r requirements.txt
```

## Download Audio Data

Download the audio data from Common Voice. Here, we use mozilla-foundation/common_voice_4_0.
```python
import datasets

cv4en = datasets.load_dataset(
    "mozilla-foundation/common_voice_4_0",
    "en",
    cache_dir='path/to/cv4/download',
)
```

## Evaluation

We provide the SpeechQE benchmark: [h-j-han/SpeechQE-CoVoST2](https://huggingface.co/datasets/h-j-han/SpeechQE-CoVoST2).
`BASE_AUDIO_PATH` is the path to the downloaded Common Voice dataset.

```bash
$ python speechqe/score_speechqe.py \
    --speechqe_model=h-j-han/SpeechQE-TowerInstruct-7B-en2de \
    --dataset_name=h-j-han/SpeechQE-CoVoST2 \
    --base_audio_path=$BASE_AUDIO_PATH \
    --dataset_config_name=en2de \
    --test_split_name=test
```

## Reference

Please find details in [this EMNLP24 paper](https://aclanthology.org/2024.emnlp-main.1218):

```
@misc{han2024speechqe,
    title={SpeechQE: Estimating the Quality of Direct Speech Translation},
    author={HyoJung Han and Kevin Duh and Marine Carpuat},
    year={2024},
    eprint={2410.21485},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
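
## Example: Scripting the Evaluation

The evaluation command from the Evaluation section can also be assembled programmatically, which may be convenient when scoring several models or splits. The following is a minimal sketch, not part of the repository: `build_score_command` is a hypothetical helper introduced here for illustration, it uses only the flags shown in the Evaluation section, and the fallback path simply reuses the placeholder `cache_dir` from the download example (an assumption about where the audio lives).

```python
import os
import shlex


def build_score_command(model_id, config_name, base_audio_path, split="test"):
    """Hypothetical helper: build the argument list for speechqe/score_speechqe.py.

    Only the flags that appear in the Evaluation section are used.
    """
    return [
        "python",
        "speechqe/score_speechqe.py",
        f"--speechqe_model={model_id}",
        "--dataset_name=h-j-han/SpeechQE-CoVoST2",
        f"--base_audio_path={base_audio_path}",
        f"--dataset_config_name={config_name}",
        f"--test_split_name={split}",
    ]


# Assumption: fall back to the placeholder download path if BASE_AUDIO_PATH is unset.
base_audio_path = os.environ.get("BASE_AUDIO_PATH", "path/to/cv4/download")

command = build_score_command(
    "h-j-han/SpeechQE-TowerInstruct-7B-en2de", "en2de", base_audio_path
)

# Print a shell-quoted version of the command for inspection.
print(shlex.join(command))
```

To actually launch the run, the list could be passed to `subprocess.run(command, check=True)` from the root of the cloned repository. Building the command as a list rather than a single string avoids shell-quoting problems when the audio path contains spaces.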