Updated README
README.md

pipeline_tag: text-to-speech
---

# Model Card for indri-0.1-125m-tts

Indri is a series of audio models that can do TTS, ASR, and audio continuation. This is the smallest model (125M) in our series and supports TTS tasks in 2 languages:

1. English
2. Hindi

We have open-sourced our training scripts, inference, and other details.

- **Repository:** [GitHub](https://github.com/cmeraki/indri)
- **Demo:** [Website](https://www.indrivoice.ai/)
- **Implementation details:** [Release Blog](#TODO)
## Model Details

### Model Description

`indri-0.1-125m-tts` is a novel, ultra-small, and lightweight TTS model based on the transformer architecture.
It models audio as tokens and can generate high-quality audio with consistent style cloning of the speaker.

### Key features

1. Based on the GPT-2 architecture. The methodology can be extended to any transformer-based architecture.
2. Supports voice cloning with small prompts (<5s).
3. Code-mixed text input in 2 languages - English and Hindi (see the snippet below).
4. Ultra-fast. Can generate 5 seconds of audio per second on Ampere-generation NVIDIA GPUs, and up to 10 seconds of audio per second on Ada-generation NVIDIA GPUs.
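
To illustrate feature 3, the hedged snippet below passes a code-mixed English-Hindi sentence to the model. It reuses the `pipe` object constructed in the Getting Started section further down, and the sentence itself is only an example.

```python
# Illustrative code-mixed (English + Hindi) input. `pipe` is the pipeline
# constructed in the Getting Started example below; the sentence is an example.
import torchaudio

output = pipe(['Hello दोस्तों, my name is Indri and I can speak Hindi too.'])
torchaudio.save('mixed.wav', output[0]['audio'][0], sample_rate=24000)
```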
### Details

1. Model Type: GPT-2 based language model
2. Size: 125M parameters
3. Language Support: English, Hindi
4. License: CC BY 4.0
## Technical details

Here's a brief overview of how the model works:

1. Converts the input text into tokens
2. Runs autoregressive decoding on a GPT-2 based transformer model and generates audio tokens
3. Decodes the audio tokens back to audio, using the [Kyutai/mimi](https://huggingface.co/kyutai/mimi) codec (see the sketch below)
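
To make step 3 concrete, here is a minimal, illustrative sketch of what the audio tokens look like and how the Mimi codec turns them back into a waveform. This is not Indri's actual inference code: the tokens below come from round-tripping a dummy sine wave through the codec (a stand-in for what the GPT-2 stage would generate), and it assumes the `MimiModel` / `AutoFeatureExtractor` support for `kyutai/mimi` that ships with recent `transformers` releases.

```python
# Illustrative only: shows the audio-token <-> waveform step, not Indri's inference.
import numpy as np
import torch
from transformers import AutoFeatureExtractor, MimiModel

codec = MimiModel.from_pretrained('kyutai/mimi')
feature_extractor = AutoFeatureExtractor.from_pretrained('kyutai/mimi')

# One second of a 440 Hz tone as dummy input audio (stand-in for generated speech)
sr = feature_extractor.sampling_rate  # 24 kHz for this codec
t = np.arange(sr) / sr
dummy_audio = np.sin(2 * np.pi * 440 * t).astype(np.float32)

inputs = feature_extractor(raw_audio=dummy_audio, sampling_rate=sr, return_tensors='pt')

with torch.no_grad():
    # Discrete audio tokens, shaped (batch, num_quantizers, frames)
    codes = codec.encode(inputs['input_values']).audio_codes
    # Step 3 above: decode audio tokens back into a waveform
    audio = codec.decode(codes).audio_values

print(codes.shape, audio.shape)
```

In the real model, the GPT-2 stage produces these token streams directly from text, and the codec decode step is what turns them into the 24 kHz waveform saved in the example below.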
Please read our blog [here](#TODO) for more technical details on how it was built.
## How to Get Started with the Model

Use the code below to get started. Pipelines are the easiest way to run the model.

```python
import torch
import torchaudio
from transformers import pipeline

task = 'indri-tts'
model_id = '11mlabs/indri-0.1-125m-tts'

pipe = pipeline(
    task,
    model=model_id,
    device=torch.device('cuda:0'),  # update this based on your hardware
    trust_remote_code=True
)

output = pipe(['Hi, my name is Indri and I like to talk.'])

torchaudio.save('output.wav', output[0]['audio'][0], sample_rate=24000)
```
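
The Key features section quotes generation speed as seconds of audio produced per wall-clock second. As a rough, unofficial way to check that ratio on your own hardware, the sketch below times the pipeline call from the example above; it assumes the output layout shown there (audio samples along the last dimension, at 24 kHz).

```python
# Rough real-time-factor check for the pipeline above (not an official benchmark).
import time
import torch

text = ['Hi, my name is Indri and I like to talk.']

pipe(text)  # warm-up run so model weights and CUDA kernels are ready

if torch.cuda.is_available():
    torch.cuda.synchronize()
start = time.perf_counter()
output = pipe(text)
if torch.cuda.is_available():
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start

audio_seconds = output[0]['audio'][0].shape[-1] / 24000
print(f'{audio_seconds:.2f}s of audio in {elapsed:.2f}s '
      f'({audio_seconds / elapsed:.1f}x faster than real time)')
```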
## Credits

1. [Kyutai/mimi](https://huggingface.co/kyutai/mimi)
2. [nanoGPT](https://github.com/karpathy/nanoGPT)
## Citation

To cite our work:

```bibtex
@misc{indri-0.1-125m-tts,
  author       = {11mlabs},
  title        = {indri-0.1-125m-tts},
  year         = 2024,
  publisher    = {Hugging Face},
  journal      = {GitHub Repository},
  howpublished = {\url{https://github.com/cmeraki/indri}},
}
```