arxiv:2412.09501

Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

Published on Dec 12

Submitted by wcy1122 on Dec 13

Authors:

Chengyao Wang ,

Abstract

As Multi-modal Large Language Models (MLLMs) evolve, expanding beyond single-domain capabilities is essential to meet the demands for more versatile and efficient AI. However, previous omni-models have insufficiently explored speech, neglecting its integration with multi-modality. We introduce Lyra, an efficient MLLM that enhances multimodal abilities, including advanced long-speech comprehension, sound understanding, cross-modality efficiency, and seamless speech interaction. To achieve efficiency and speech-centric capabilities, Lyra employs three strategies: (1) leveraging existing open-source large models and a proposed multi-modality LoRA to reduce training costs and data requirements; (2) using a latent multi-modality regularizer and extractor to strengthen the relationship between speech and other modalities, thereby enhancing model performance; and (3) constructing a high-quality, extensive dataset that includes 1.5M multi-modal (language, vision, audio) data samples and 12K long speech samples, enabling Lyra to handle complex long speech inputs and achieve more robust omni-cognition. Compared to other omni-methods, Lyra achieves state-of-the-art performance on various vision-language, vision-speech, and speech-language benchmarks, while also using fewer computational resources and less training data.

Community

wcy1122 (paper author, paper submitter), 2 days ago

We introduce Lyra, an efficient MLLM that enhances multimodal abilities, including advanced long-speech comprehension, sound understanding, cross-modality efficiency, and seamless speech interaction.

Project: https://lyra-omni.github.io/
Demo: https://103.170.5.190:17860/
Code: https://github.com/dvlab-research/Lyra
Model: https://huggingface.co/collections/zszhong/lyra-model-674ea5bb3b39ff8f15de75fc