Update README.md #1
by farzadab - opened

README.md CHANGED
@@ -15,7 +15,7 @@ Ultravox is a multimodal Speech LLM built around a pretrained Whisper and Llama
 ### Model Description
 
 Ultravox is a multimodal model that can consume both speech and text as input (e.g., a text system prompt and voice user message).
-The input to the model is given as a text prompt with a special <|audio|> token, and the model processor will replace this magic token with embeddings derived from the input audio.
+The input to the model is given as a text prompt with a special <|audio|> pseudo-token, and the model processor will replace this magic token with embeddings derived from the input audio.
 Using the merged embeddings as input, the model will then generate output text as usual.
 
 In a future revision of Ultravox, we plan to expand the token vocabulary to support generation of semantic and acoustic audio tokens, which can then be fed to a vocoder to produce voice output.
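
For readers skimming the diff, the <|audio|> pseudo-token replacement described above amounts to splicing audio-derived embeddings into the text embedding sequence before the LLM runs. Below is a minimal illustrative sketch, not the actual Ultravox processor API; names such as AUDIO_TOKEN_ID, merge_audio_into_text, and the tensor shapes are assumptions for demonstration.

```python
# Illustrative sketch only: AUDIO_TOKEN_ID, merge_audio_into_text, and the
# shapes below are hypothetical, not the real Ultravox processor internals.
import torch

AUDIO_TOKEN_ID = 128002   # hypothetical id reserved for the <|audio|> pseudo-token
HIDDEN_SIZE = 4096        # hypothetical LLM embedding width

def merge_audio_into_text(input_ids: torch.Tensor,
                          text_embeddings: torch.Tensor,
                          audio_embeddings: torch.Tensor) -> torch.Tensor:
    """Replace the single <|audio|> position with audio embeddings.

    input_ids:        (seq_len,) token ids of the text prompt
    text_embeddings:  (seq_len, HIDDEN_SIZE) embeddings of those tokens
    audio_embeddings: (n_audio, HIDDEN_SIZE) embeddings derived from the audio
                      encoder and projected into the LLM embedding space
    """
    audio_pos = (input_ids == AUDIO_TOKEN_ID).nonzero(as_tuple=True)[0].item()
    # Splice: text before <|audio|> + audio embeddings + text after <|audio|>
    merged = torch.cat(
        [text_embeddings[:audio_pos],
         audio_embeddings,
         text_embeddings[audio_pos + 1:]],
        dim=0,
    )
    return merged  # the LLM then consumes this as its input embeddings

# Toy usage with random tensors, just to show the shapes involved
input_ids = torch.tensor([1, 2, AUDIO_TOKEN_ID, 3, 4])
text_emb = torch.randn(5, HIDDEN_SIZE)
audio_emb = torch.randn(60, HIDDEN_SIZE)  # e.g. ~60 audio frames
print(merge_audio_into_text(input_ids, text_emb, audio_emb).shape)  # (64, 4096)
```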