Investigating Decoder-only Large Language Models for Speech-to-text Translation
Abstract
Large language models (LLMs), known for their exceptional reasoning capabilities, generalizability, and fluency across diverse domains, present a promising avenue for enhancing speech-related tasks. In this paper, we focus on integrating decoder-only LLMs into the task of speech-to-text translation (S2TT). We propose a decoder-only architecture that enables the LLM to directly consume the encoded speech representation and generate the text translation. Additionally, we investigate the effects of different parameter-efficient fine-tuning techniques and task formulations. Our model achieves state-of-the-art performance on CoVoST 2 and FLEURS among models trained without proprietary data. We also conduct analyses to validate the design choices of our proposed model and provide insights into the integration of LLMs into S2TT.
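To make the decoder-only idea concrete, here is a minimal sketch of how a single input sequence could be assembled from encoded speech frames and text embeddings. The function name `build_decoder_input`, the ordering (speech, then prompt, then target), and the loss mask that scores only target positions are illustrative assumptions, not details confirmed by the abstract.

```python
# Minimal sketch, assuming speech frames have already been encoded and
# projected to the LLM embedding size. All names here are hypothetical.

def build_decoder_input(speech_frames, prompt_embeddings, target_embeddings):
    """Concatenate speech and text embeddings into one decoder sequence.

    Each argument is a list of embedding vectors (lists of floats).
    Returns the combined sequence and a 0/1 mask marking target positions.
    """
    # Assumed ordering: speech prefix, then text prompt, then translation.
    sequence = speech_frames + prompt_embeddings + target_embeddings
    prefix_len = len(speech_frames) + len(prompt_embeddings)
    # Assumption: the training loss is computed on target positions only.
    loss_mask = [0] * prefix_len + [1] * len(target_embeddings)
    return sequence, loss_mask


# Toy example with 2-dimensional embeddings.
speech = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 encoded speech frames
prompt = [[1.0, 0.0]]                           # 1 prompt token
target = [[0.0, 1.0], [0.5, 0.5]]               # 2 translation tokens

sequence, loss_mask = build_decoder_input(speech, prompt, target)
```

Because everything lives in one sequence, the LLM's existing self-attention layers attend over speech and text jointly, and no new cross-attention layers are needed.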
Community
In this paper, we investigate design choices for LLM-based speech-to-text translation (S2TT). Our architecture achieves state-of-the-art performance on CoVoST 2 among models trained with only public S2TT datasets. The key findings are:
- We show that the decoder-only architecture outperforms an encoder-decoder architecture when using a decoder-only LLM (LLaMA-2). Our hypothesis is that the newly initialized cross-attention layers in the encoder-decoder architecture make training harder.
- We demonstrate that LNA fine-tuning, in which we fine-tune the attention and layer normalization layers, significantly outperforms LoRA.
- Fine-tuning the parameters of the speech encoder along with the text LLM is crucial for good performance. This indicates that discrete token-based speech LLMs might be harder to train due to the need to also update the speech encoder.
- Incorporating different training formulations and instructions could boost performance.
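As a rough illustration of the LNA finding above, the sketch below selects which parameters would be trained by matching on their names. The name patterns assume Hugging Face style LLaMA parameter names (`self_attn`, `input_layernorm`, and so on), and treating the final `model.norm` as trainable is also an assumption; the actual implementation may differ.

```python
# Minimal sketch of LNA-style parameter selection (attention + layer norm).
# The substrings below are assumptions based on Hugging Face style LLaMA
# parameter names; adjust them for a different model implementation.
TRAINABLE_SUBSTRINGS = ("self_attn", "norm")


def is_trainable(name: str) -> bool:
    """Return True if a parameter name belongs to attention or a norm layer."""
    return any(s in name for s in TRAINABLE_SUBSTRINGS)


def split_parameters(names):
    """Partition parameter names into (trainable, frozen) lists."""
    trainable = [n for n in names if is_trainable(n)]
    frozen = [n for n in names if not is_trainable(n)]
    return trainable, frozen


# Hypothetical parameter names for one transformer layer plus embeddings.
example_names = [
    "model.embed_tokens.weight",
    "model.layers.0.input_layernorm.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.o_proj.weight",
    "model.layers.0.post_attention_layernorm.weight",
    "model.layers.0.mlp.gate_proj.weight",
    "model.norm.weight",
    "lm_head.weight",
]

trainable, frozen = split_parameters(example_names)
```

With a PyTorch model, the same predicate could be applied by looping over `model.named_parameters()` and setting `param.requires_grad = is_trainable(name)`, so that feed-forward and embedding weights stay frozen.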
This work was my internship project at Meta AI (FAIR).