Update README.md

# Model Card for Ultravox

Ultravox is a multimodal Speech LLM built around a pretrained [Llama3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B) and [Whisper-small](https://huggingface.co/openai/whisper-small) backbone.\
See https://ultravox.ai for the GitHub repo and more information.

## Model Details

No preference tuning has been applied to this revision of the model.

- **Developed by:** Fixie.ai
- **License:** MIT

### Model Sources

- **Repository:** https://ultravox.ai
- **Demo:** See repo

## Uses
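
As a rough sketch of how the model can be invoked for speech-in, text-out inference, assuming Ultravox exposes a custom `transformers` pipeline via `trust_remote_code` (the model id, input keys, and prompt below are illustrative, not authoritative):

```python
# Illustrative only: assumes a custom Ultravox pipeline that accepts raw
# 16 kHz audio plus chat turns. Check the repo for the authoritative example.
import librosa
import transformers

pipe = transformers.pipeline(
    model="fixie-ai/ultravox",  # hypothetical model id
    trust_remote_code=True,
)

# Load mono audio at 16 kHz (the sampling rate Whisper expects).
audio, sr = librosa.load("question.wav", sr=16000)

turns = [
    {"role": "system", "content": "You are a friendly, helpful voice assistant."},
]

# The audio clip stands in for the user's turn; the LLM replies in text.
print(pipe({"audio": audio, "turns": turns, "sampling_rate": sr}, max_new_tokens=64))
```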

## Training Details

The multi-modal projector is first trained (while keeping backbones frozen).

The training dataset is a mix of ASR datasets (Gigaspeech), instruction-following and QA data (AnyInstruct and an extended version of BoolQ), and conversational data (SODA with alternative generations for the last two turns).

### Training Procedure

Supervised speech-to-audio finetuning. For more info, see the training code in the Ultravox repository.

#### Training Hyperparameters

- **Training regime:** BF16 mixed precision training
- **Hardware used:** 8x A100-40GB GPUs
- **LLM LoRA Rank:** 64

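For concreteness, here is a sketch of how the regime and rank above could be expressed with the PEFT and Transformers libraries. This is not the actual Ultravox training code (see the repository for that), and the values marked as assumed are not stated in this card:

```python
# Illustrative only: roughly how "BF16 mixed precision" and "LLM LoRA rank 64"
# could be configured with PEFT + Transformers. Target modules are a common
# choice for Llama-style models, not taken from the Ultravox code.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

llm = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=64,                      # LoRA rank from the list above
    lora_alpha=16,             # assumed; not stated in this card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_config)

training_args = TrainingArguments(
    output_dir="out",
    bf16=True,                 # BF16 mixed precision training
    per_device_train_batch_size=4,  # assumed; not stated in this card
)
```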
#### Speeds, Sizes, Times

The current version of Ultravox, when invoked with audio content, has a time-to-first-token (TTFT) of approximately 200 ms and a tokens-per-second rate of ~50-100 on an A100-40GB GPU, using a Llama 3 8B backbone.

Check out the audio tab on [TheFastest.ai](https://thefastest.ai/?m=audio) for daily benchmarks and a comparison with other existing models.

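As an illustration of how TTFT and throughput numbers like these can be measured, here is a generic timing sketch for any Transformers causal LM (not Ultravox-specific; the backbone id and prompt are placeholders):

```python
# Generic timing sketch: measure time-to-first-token and approximate
# tokens/sec by streaming generation from a background thread.
import time
from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "meta-llama/Meta-Llama-3-8B"  # placeholder backbone
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tok("Describe the sound of rain.", return_tensors="pt").to(model.device)
streamer = TextIteratorStreamer(tok, skip_prompt=True)

start = time.perf_counter()
gen = Thread(target=model.generate,
             kwargs=dict(**inputs, streamer=streamer, max_new_tokens=128))
gen.start()

ttft = None
chunks = 0
for _ in streamer:  # yields decoded text pieces, roughly one per token
    if ttft is None:
        ttft = time.perf_counter() - start  # time-to-first-token
    chunks += 1
gen.join()
elapsed = time.perf_counter() - start
print(f"TTFT ~{ttft * 1000:.0f} ms, ~{chunks / elapsed:.0f} tokens/s")
```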
## Evaluation