Update README.md
README.md
CHANGED
@@ -3,11 +3,16 @@ language:
 - en
 license: mit
 library_name: transformers
+datasets:
+- fnlp/AnyInstruct
+- fixie-ai/boolq-audio
+- fixie-ai/soda-audio
+- speechcolab/gigaspeech
 ---
 
 # Model Card for Ultravox
 
-Ultravox is a multimodal Speech LLM built around a pretrained Whisper and Llama backbone.
+Ultravox is a multimodal Speech LLM built around a pretrained [Llama3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B) and [Whisper-small](https://huggingface.co/openai/whisper-small) backbone. See https://ultravox.ai for the GitHub repo and more information.
 
 
 ## Model Details
 
@@ -15,7 +20,7 @@ Ultravox is a multimodal Speech LLM built around a pretrained Whisper and Llama
 ### Model Description
 
 Ultravox is a multimodal model that can consume both speech and text as input (e.g., a text system prompt and voice user message).
-The input to the model is given as a text prompt with a special <|audio|> pseudo-token, and the model processor will replace this magic token with embeddings derived from the input audio.
+The input to the model is given as a text prompt with a special `<|audio|>` pseudo-token, and the model processor will replace this magic token with embeddings derived from the input audio.
 Using the merged embeddings as input, the model will then generate output text as usual.
 
 In a future revision of Ultravox, we plan to expand the token vocabulary to support generation of semantic and acoustic audio tokens, which can then be fed to a vocoder to produce voice output.
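To make the `<|audio|>` flow above concrete, here is a minimal usage sketch. It assumes the checkpoint is published as `fixie-ai/ultravox-v0_2` with a custom `transformers` pipeline enabled via `trust_remote_code`, and that the pipeline accepts a dict of audio, turns, and sampling rate; none of these specifics are confirmed by this diff.

```python
# Hedged sketch: the repo id and the dict-shaped pipeline input are assumptions.
import librosa
import transformers

# The custom pipeline inserts a user turn containing the <|audio|> pseudo-token;
# the processor then swaps that token for embeddings computed from the waveform.
pipe = transformers.pipeline(model="fixie-ai/ultravox-v0_2", trust_remote_code=True)

audio, sr = librosa.load("question.wav", sr=16000)  # mono, resampled to 16 kHz
turns = [
    {"role": "system", "content": "You are a friendly and helpful assistant."},
]
print(pipe({"audio": audio, "turns": turns, "sampling_rate": sr}, max_new_tokens=30))
```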
@@ -36,30 +41,31 @@ Voice agents, speech-to-speech translation, analysis of spoken audio
 
 ## Training Details
 
+The model uses a pre-trained [Llama3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B) backbone as well as the encoder part of [Whisper-small](https://huggingface.co/openai/whisper-small).
+
+The multi-modal projector is trained first (stage 1), with both backbones kept frozen; in stage 2, Llama3 is also fine-tuned using LoRA.
+
 ### Training Data
 
-
+The training dataset is a mix of ASR data (Gigaspeech), instruction-following and QA data (AnyInstruct and an extended version of BoolQ), and conversational data (SODA, with alternative generations for the last two turns).
 
 [More Information Needed]
 
 ### Training Procedure
 
-
-
-#### Preprocessing [optional]
-
-[More Information Needed]
+Supervised speech-to-text finetuning. For more info, see the [training code in the Ultravox repo](https://github.com/fixie-ai/ultravox/blob/main/ultravox/training/train.py).
 
 
 #### Training Hyperparameters
 
-- **Training regime:**
+- **Training regime:** BF16 mixed precision training
+- **LLM LoRA Rank:** 64
 
 #### Speeds, Sizes, Times [optional]
 
-
+The current version of Ultravox, when invoked with audio content, has a time-to-first-token (TTFT) of approximately 200 ms and a decode rate of roughly 50-100 tokens per second on an A100-40GB GPU, using a Llama 3 8B backbone.
 
-[More Information Needed]
+Check out the audio tab on [thefastest.ai](https://thefastest.ai/?m=audio) for daily benchmarks and a comparison with other existing models.
 
 ## Evaluation
 
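The two-stage recipe above (projector-only training with frozen backbones, then LoRA finetuning of the LLM at rank 64 in BF16) can be sketched with `peft`. The toy modules and attribute names below are illustrative stand-ins so the snippet runs standalone; none of them come from the Ultravox codebase.

```python
# Illustrative sketch of the two-stage recipe; all class/attribute names are hypothetical.
import torch
from torch import nn
from peft import LoraConfig, get_peft_model

class ToyLlama(nn.Module):
    """Stand-in for the Llama3-8B-Instruct backbone."""
    def __init__(self, d: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(d, d)
        self.v_proj = nn.Linear(d, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.v_proj(torch.relu(self.q_proj(x)))

class ToySpeechLLM(nn.Module):
    """Stand-in with the three pieces the card describes."""
    def __init__(self, d: int = 64):
        super().__init__()
        self.audio_tower = nn.Linear(80, d)          # placeholder Whisper-small encoder
        self.multi_modal_projector = nn.Linear(d, d) # maps audio features to LLM space
        self.language_model = ToyLlama(d)

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

model = ToySpeechLLM()

# Stage 1: train only the multi-modal projector; both backbones stay frozen.
set_trainable(model.audio_tower, False)
set_trainable(model.language_model, False)
set_trainable(model.multi_modal_projector, True)

# Stage 2: additionally fine-tune the LLM with LoRA at rank 64,
# in BF16 to match the stated training regime.
lora_cfg = LoraConfig(r=64, target_modules=["q_proj", "v_proj"])
model.language_model = get_peft_model(model.language_model, lora_cfg)
model = model.to(torch.bfloat16)
```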
@@ -89,5 +95,4 @@ Voice agents, speech-to-speech translation, analysis of spoken audio
 
 [More Information Needed]
 
-#### Summary
-
+#### Summary