---
license: other
language:
- en
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
pipeline_tag: audio-to-audio
tags:
- large language models
- speech-language models
- speech interaction
- speech-to-speech
library_name: llama-omni
---
# 🦙🎧 LLaMA-Omni: Seamless Speech Interaction with Large Language Models
> **Authors: [Qingkai Fang](https://fangqingkai.github.io/), [Shoutao Guo](https://scholar.google.com/citations?hl=en&user=XwHtPyAAAAAJ), [Yan Zhou](https://zhouyan19.github.io/zhouyan/), [Zhengrui Ma](https://scholar.google.com.hk/citations?user=dUgq6tEAAAAJ), [Shaolei Zhang](https://zhangshaolei1998.github.io/), [Yang Feng*](https://people.ucas.edu.cn/~yangfeng?language=en)**
[[Paper]](https://arxiv.org/abs/2409.06666) [[Model]](https://huggingface.co/ICTNLP/Llama-3.1-8B-Omni) [[Code]](https://github.com/ictnlp/LLaMA-Omni)
LLaMA-Omni is a speech-language model built upon Llama-3.1-8B-Instruct. It supports low-latency and high-quality speech interactions, simultaneously generating both text and speech responses based on speech instructions.
![](images/model.png)
## 💡 Highlights
- 💪 **Built on Llama-3.1-8B-Instruct, ensuring high-quality responses.**
- 🚀 **Low-latency speech interaction with a latency as low as 226ms.**
- 🎧 **Simultaneous generation of both text and speech responses.**
- ♻️ **Trained in less than 3 days using just 4 GPUs.**
<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/65b7573482d384513443875e/dr4XWUxzuVQ52lBuzNBTt.mp4"></video>
## Install
1. Clone this repository.
```shell
git clone https://github.com/ictnlp/LLaMA-Omni
cd LLaMA-Omni
```
2. Install packages.
```shell
conda create -n llama-omni python=3.10
conda activate llama-omni
pip install pip==24.0
pip install -e .
```
3. Install `fairseq`.
```shell
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install -e . --no-build-isolation
```
4. Install `flash-attention`.
```shell
pip install flash-attn --no-build-isolation
```
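After the four install steps, it can help to confirm that the packages are importable from the `llama-omni` environment. The following is a minimal sketch, not part of the repository: the import names `omni_speech` (taken from the `python -m omni_speech...` commands below), `fairseq`, `flash_attn`, and `whisper` are assumptions about how the installed packages expose themselves.

```python
import importlib.util

# Assumed import names for the packages installed above.
REQUIRED_MODULES = ["omni_speech", "fairseq", "flash_attn", "whisper"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the subset of `names` that cannot be found in this environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for broken or partially installed packages.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules()
    if absent:
        print("Missing: " + ", ".join(absent))
    else:
        print("All packages found.")
```

An empty result only shows that the modules can be located; it does not verify that `flash-attn` was built against the right CUDA version.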
## Quick Start
1. Download the `Llama-3.1-8B-Omni` model from 🤗[Hugging Face](https://huggingface.co/ICTNLP/Llama-3.1-8B-Omni).
2. Download the `Whisper-large-v3` model.
```python
import whisper

# The checkpoint is downloaded into models/speech_encoder/ on the first call.
model = whisper.load_model("large-v3", download_root="models/speech_encoder/")
```
3. Download the unit-based HiFi-GAN vocoder.
```shell
wget https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/g_00500000 -P vocoder/
wget https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/config.json -P vocoder/
```
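Before launching the demo, a quick pre-flight check can confirm that the three downloads above landed where the later commands expect them. This is a hedged sketch rather than a repository script: the directory and vocoder paths come from the commands in this README, while the checkpoint filename `large-v3.pt` inside `models/speech_encoder/` is an assumption about how Whisper names its download.

```python
from pathlib import Path

# Paths relative to the LLaMA-Omni checkout. "large-v3.pt" is an assumed
# filename; adjust it if your Whisper version stores the checkpoint differently.
EXPECTED_ASSETS = [
    "Llama-3.1-8B-Omni",
    "models/speech_encoder/large-v3.pt",
    "vocoder/g_00500000",
    "vocoder/config.json",
]


def missing_assets(root=".", expected=EXPECTED_ASSETS):
    """Return the expected paths that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in expected if not (base / rel).exists()]


if __name__ == "__main__":
    absent = missing_assets()
    if absent:
        print("Missing assets:")
        for rel in absent:
            print("  " + rel)
    else:
        print("All assets found.")
```

Run it from the repository root; any path it prints points back to the download step that still needs to be completed.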
## Gradio Demo
1. Launch a controller.
```shell
python -m omni_speech.serve.controller --host 0.0.0.0 --port 10000
```
2. Launch a gradio web server.
```shell
python -m omni_speech.serve.gradio_web_server --controller http://localhost:10000 --port 8000 --model-list-mode reload --vocoder vocoder/g_00500000 --vocoder-cfg vocoder/config.json
```
3. Launch a model worker.
```shell
python -m omni_speech.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path Llama-3.1-8B-Omni --model-name Llama-3.1-8B-Omni --s2s
```
4. Visit [http://localhost:8000/](http://localhost:8000/) and interact with LLaMA-3.1-8B-Omni!
**Note: Due to the instability of streaming audio playback in Gradio, we have only implemented streaming audio synthesis without enabling autoplay. If you have a good solution, feel free to submit a PR. Thanks!**
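The three commands above share the controller URL and ports, so changing one port means editing several places. The sketch below assembles the same commands from a single set of parameters and prints them; the function name `demo_commands` and its parameters are hypothetical helpers, and every flag is copied from the commands in this section.

```python
import shlex


def demo_commands(controller_port=10000, web_port=8000, worker_port=40000,
                  model_path="Llama-3.1-8B-Omni", model_name="Llama-3.1-8B-Omni",
                  vocoder_dir="vocoder"):
    """Build the controller, web server, and model worker commands as argument lists."""
    controller_url = f"http://localhost:{controller_port}"
    worker_url = f"http://localhost:{worker_port}"
    controller = ["python", "-m", "omni_speech.serve.controller",
                  "--host", "0.0.0.0", "--port", str(controller_port)]
    web_server = ["python", "-m", "omni_speech.serve.gradio_web_server",
                  "--controller", controller_url, "--port", str(web_port),
                  "--model-list-mode", "reload",
                  "--vocoder", f"{vocoder_dir}/g_00500000",
                  "--vocoder-cfg", f"{vocoder_dir}/config.json"]
    worker = ["python", "-m", "omni_speech.serve.model_worker",
              "--host", "0.0.0.0", "--controller", controller_url,
              "--port", str(worker_port), "--worker", worker_url,
              "--model-path", model_path, "--model-name", model_name, "--s2s"]
    return [controller, web_server, worker]


if __name__ == "__main__":
    # Print one command per line; run each in its own terminal, in this order.
    for cmd in demo_commands():
        print(shlex.join(cmd))
```

With the default arguments the printed lines match the three commands above. The sketch deliberately only prints them, since each process runs until stopped and is easier to monitor in a separate terminal.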
## Local Inference
To run inference locally, organize the speech instruction files according to the format in the `omni_speech/infer/examples` directory, then run the following script.
```shell
bash omni_speech/infer/run.sh omni_speech/infer/examples
```
## LICENSE
Our code is released under the Apache-2.0 License. Our model is intended for academic research purposes only and may **NOT** be used for commercial purposes.
You are free to use, modify, and distribute this model in academic settings, provided that the following conditions are met:
- **Non-commercial use**: The model may not be used for any commercial purposes.
- **Citation**: If you use this model in your research, please cite the original work.
### Commercial Use Restriction
For any commercial use inquiries or to obtain a commercial license, please contact `fengyang@ict.ac.cn`.
## Acknowledgements
- [LLaVA](https://github.com/haotian-liu/LLaVA): The codebase we built upon.
- [SLAM-LLM](https://github.com/X-LANCE/SLAM-LLM): We borrowed some code for the speech encoder and speech adaptor.
## Citation
If you have any questions, please feel free to submit an issue or contact `fangqingkai21b@ict.ac.cn`.
If our work is useful for you, please cite as:
```
@article{fang-etal-2024-llama-omni,
title={LLaMA-Omni: Seamless Speech Interaction with Large Language Models},
author={Fang, Qingkai and Guo, Shoutao and Zhou, Yan and Ma, Zhengrui and Zhang, Shaolei and Feng, Yang},
journal={arXiv preprint arXiv:2409.06666},
year={2024}
}
```