<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/d7Rzpm0cgCToXjtE7_U2u.png" alt="Example" style="width:200px;"/>
# OmniAudio-2.6B

OmniAudio is the world's fastest and most efficient audio-language model for on-device deployment - a 2.6B-parameter multimodal model that processes both text and audio inputs. It integrates three components: Gemma-2-2b, Whisper turbo, and a custom projector module, enabling secure, responsive audio-text processing directly on edge devices.
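The projector's role - mapping audio encoder states into the language model's embedding space - can be illustrated with a minimal sketch. The dimensions (1280 for Whisper turbo encoder states, 2304 for Gemma-2-2b embeddings) and the random weights are assumptions for illustration, not the released checkpoint:

```python
import numpy as np

# Assumed dimensions for illustration: Whisper turbo encoder states are
# 1280-dim; Gemma-2-2b token embeddings are 2304-dim.
AUDIO_DIM, TEXT_DIM = 1280, 2304

rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, (AUDIO_DIM, TEXT_DIM))  # stand-in for learned projection weights
b = np.zeros(TEXT_DIM)

def project(audio_states: np.ndarray) -> np.ndarray:
    """Map encoder frames (T, AUDIO_DIM) into the LLM embedding space (T, TEXT_DIM)."""
    return audio_states @ W + b

frames = rng.normal(size=(50, AUDIO_DIM))  # 50 encoder frames for a short clip
tokens = project(frames)
print(tokens.shape)  # (50, 2304)
```

The projected frames can then be interleaved with text token embeddings and fed to the language model as one sequence, which is what lets a single architecture handle both transcription and completion.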
Unlike traditional approaches that chain ASR and LLM models together, OmniAudio-2.6B unifies both capabilities in a single efficient architecture for minimal latency and resource overhead.

On a 2024 Mac Mini M4 Pro, **Qwen2-Audio-7B-Instruct** running on 🤗 Transformers achieves an average decoding speed of 6.38 tokens/second, while **OmniAudio-2.6B** through Nexa SDK reaches 35.23 tokens/second in the FP16 GGUF version and 66 tokens/second in the Q4_K_M quantized GGUF version - delivering **5.5x to 10.3x faster performance** on consumer hardware.
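The speedup range follows directly from the reported throughputs:

```python
# Reported average decoding speeds on a 2024 Mac Mini M4 Pro (tokens/second):
baseline = 6.38    # Qwen2-Audio-7B-Instruct on Transformers
fp16_gguf = 35.23  # OmniAudio-2.6B, FP16 GGUF via Nexa SDK
q4_gguf = 66.0     # OmniAudio-2.6B, Q4_K_M GGUF via Nexa SDK

print(round(fp16_gguf / baseline, 1))  # 5.5
print(round(q4_gguf / baseline, 1))    # 10.3
```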
## Quick Links
1. Interactive Demo in our [HuggingFace Space]().
2. [Quickstart for local setup]()
3. Learn more in our [Blogs]()
4. **Feedback**: Send questions or suggestions about the model in our [Discord](https://discord.gg/nexa-ai)
## Use Cases
* **Voice QA without Internet**: Process offline voice queries like "I am at camping, how do I start a fire without fire starter?" OmniAudio provides practical guidance even without network connectivity.
* **Voice-in Conversation**: Have conversations about personal experiences. When you say "I am having a rough day at work," OmniAudio engages in supportive talk and active listening.
* **Creative Content Generation**: Transform voice prompts into creative pieces. Ask "Write a haiku about autumn leaves" and receive poetic responses inspired by your voice input.
* **Recording Summary**: Simply ask "Can you summarize this meeting note?" to convert lengthy recordings into concise, actionable summaries.
* **Voice Tone Modification**: Transform casual voice memos into professional communications. When you request "Can you make this voice memo more professional?" OmniAudio adjusts the tone while preserving the core message.
## Run OmniAudio-2.6B on Your Device
**Step 1: Install Nexa-SDK (local on-device inference framework)**
[Install Nexa-SDK](https://github.com/NexaAI/nexa-sdk?tab=readme-ov-file#install-option-1-executable-installer)

```
nexa run omniaudio -st
```
💻 OmniAudio-2.6B q4_K_M version requires 1.30GB RAM and 1.60GB storage space.
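As a rough sanity check on the footprint, Q4_K_M averages about 4.5 bits per weight (an approximation; the format mixes 4- and 6-bit blocks, and the file also carries embeddings and encoder tensors at higher precision), which puts 2.6B parameters in the right neighborhood:

```python
params = 2.6e9           # parameter count
bits_per_weight = 4.5    # rough Q4_K_M average (assumption)
gb = params * bits_per_weight / 8 / 1e9
print(round(gb, 2))  # 1.46 - consistent with the ~1.3-1.6GB figures above
```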
## Training
We developed OmniAudio through a three-stage training pipeline:

**Pretraining:** The initial stage focuses on core audio-text alignment using the MLS English 10k transcription dataset. We introduced a special `<|transcribe|>` token to enable the model to distinguish between transcription and completion tasks, ensuring consistent performance across use cases.

**Supervised Fine-tuning (SFT):** We enhance the model's conversation capabilities using synthetic datasets derived from MLS English 10k transcription. This stage leverages a proprietary model to generate contextually appropriate responses, creating rich audio-text pairs for effective dialogue understanding.

**Direct Preference Optimization (DPO):** The final stage refines model quality using the GPT-4o API as a reference. The process identifies and corrects inaccurate responses while maintaining semantic alignment. We additionally leverage Gemma2's text responses as a gold standard to ensure consistent quality across both audio and text inputs.
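The standard DPO objective behind this final stage can be sketched in a few lines; the log-probability values below are made-up numbers for illustration, not outputs of the actual training run:

```python
import math

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective on sequence log-probs: push the policy to prefer
    the chosen response relative to a frozen reference model."""
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# When the policy already prefers the chosen response, the loss is small;
# when it prefers the rejected one, the loss grows:
low = dpo_loss(pol_chosen=-5.0, pol_rejected=-9.0, ref_chosen=-6.0, ref_rejected=-6.0)
high = dpo_loss(pol_chosen=-9.0, pol_rejected=-5.0, ref_chosen=-6.0, ref_rejected=-6.0)
print(low < high)  # True
```

In this setup the GPT-4o-referenced corrections supply the "chosen" responses and the model's own inaccurate outputs the "rejected" ones, while the frozen reference keeps the policy from drifting.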
## What's Next for OmniAudio?
OmniAudio is in active development and we are working to advance its capabilities:
* Building direct audio generation for two-way voice communication