---
base_model:
- openai-community/gpt2
datasets:
- speechcolab/gigaspeech
- parler-tts/mls_eng_10k
- reach-vb/jenny_tts_dataset
- MikhailT/hifi-tts
- ylacombe/expresso
- keithito/lj_speech
- collabora/ai4bharat-shrutilipi
language:
- en
- hi
library_name: transformers
license: cc-by-sa-4.0
pipeline_tag: text-to-speech
---
| Platform | Link |
|----------|------|
| 🌎 Live Demo | [indrivoice.ai](https://indrivoice.ai/) |
| 𝕏 Twitter | [@11mlabs](https://x.com/11mlabs) |
| 🐱 GitHub | [Indri Repository](https://github.com/cmeraki/indri) |
| 🤗 Hugging Face (Collection) | [Indri collection](https://huggingface.co/collections/11mlabs/indri-673dd4210b4369037c736bfe) |
| 🤗 Hugging Face (Spaces) | [Live Server](https://huggingface.co/spaces/11mlabs/IndriVoice) |
| 📝 Release Blog | [Release Blog](https://www.indrivoice.ai/blog/2024-11-21-building-indri-tts) |
# Model Card for indri-0.1-124m-tts
Indri is a series of audio models that can do TTS, ASR, and audio continuation. This is the smallest model (124M) in our series and supports TTS tasks in 2 languages:
1. English
2. Hindi
## Model Details
### Model Description
`indri-0.1-124m-tts` is a novel, ultra-small, lightweight TTS model based on the transformer architecture.
It models audio as discrete tokens and can generate high-quality audio with consistent voice cloning of the speaker.
### Samples
| Text | Sample |
| --- | --- |
|मित्रों, हम आज एक नया छोटा और शक्तिशाली मॉडल रिलीज कर रहे हैं।| <audio controls src="https://huggingface.co/11mlabs/indri-0.1-124m-tts/resolve/main/data/cebed668-62cb-4188-a2e1-3af8e017d3ba.wav" title="Title"></audio> |
|भाइयों और बहनों, ये हमारा सौभाग्य है कि हम सब मिलकर इस महान देश को नई ऊंचाइयों पर ले जाने का सपना देख रहे हैं।| <audio controls src="https://huggingface.co/11mlabs/indri-0.1-124m-tts/resolve/main/data/6e0a4879-0379-4166-a52c-03220a3f2922.wav" title="Title"></audio> |
|Hello दोस्तों, future of speech technology mein अपका स्वागत है | <audio controls src="https://huggingface.co/11mlabs/indri-0.1-124m-tts/resolve/main/data/5848b722-efe3-4e1f-a15e-5e7d431cd475.wav" title="Title"></audio> |
|In this model zoo, a new model called Indri has appeared.| <audio controls src="https://huggingface.co/11mlabs/indri-0.1-124m-tts/resolve/main/data/7ac0df93-edbd-47b2-b850-fb88e329998c.wav" title="Title"></audio> |
### Key features
1. Extremely small: based on the GPT-2 (small) architecture. The methodology can be extended to any autoregressive transformer-based architecture.
2. Ultra-fast: using our [self-hosted service option](#self-hosted-service) on an NVIDIA RTX 6000 Ada GPU, the model can reach speeds of up to 400 tokens/s (4 s of audio generated per second of wall-clock time) with a time to first token under 20 ms.
3. On an RTX 6000 Ada, it can serve a batch of ~1000 sequences at the full context length of 1024 tokens.
4. Supports voice cloning from short prompts (<5 s).
5. Accepts code-mixed text input in two languages: English and Hindi.
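As a quick sanity check on the throughput figures above (our own back-of-the-envelope arithmetic, not from the card): 400 tokens/s of generation at 4 s of audio per second implies the codec consumes 100 tokens per second of audio. The 12.5 Hz frame rate and 8-codebook figures below are assumptions about the Mimi configuration, not stated in this card.

```python
# Sanity-check the throughput claim (numbers from this model card).
generation_speed_toks_per_s = 400  # ~400 audio tokens generated per second
audio_seconds_per_second = 4.0     # ~4 s of audio per wall-clock second

# Implied token rate of the audio codec:
tokens_per_audio_second = generation_speed_toks_per_s / audio_seconds_per_second
print(tokens_per_audio_second)  # 100.0 tokens per second of audio

# Consistent with a Mimi-style codec at a 12.5 Hz frame rate with 8 codebooks
# (12.5 * 8 = 100); this codec configuration is our assumption.
assert 12.5 * 8 == tokens_per_audio_second
```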
### Details
1. Model Type: GPT-2 based language model
2. Size: 124M parameters
3. Language Support: English, Hindi
4. License: Not for commercial use; this model is a research showcase only.
## Technical details
Here is a brief overview of how the model works:
1. Converts input text into tokens.
2. Runs autoregressive decoding on a GPT-2-based transformer and generates audio tokens.
3. Decodes the audio tokens into audio using [Kyutai/mimi](https://huggingface.co/kyutai/mimi).
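The three steps above can be sketched as a toy autoregressive loop. This is *not* the real Indri implementation (the real model samples from a learned GPT-2 distribution and decodes with Mimi); all function names here are hypothetical stand-ins that only illustrate the control flow.

```python
# Toy sketch of the tokenize -> autoregressive decode -> audio decode flow.
# None of these functions reflect the real Indri/Mimi internals.

def toy_tokenize(text):
    # Step 1: map input text to integer token IDs (toy byte-level scheme).
    return [ord(c) % 256 for c in text]

def toy_next_token(context):
    # Step 2 stand-in: the real model runs a GPT-2 forward pass and samples;
    # here we use a fixed deterministic rule over the context.
    return (sum(context) * 31 + len(context)) % 1024

def toy_decode(audio_tokens):
    # Step 3 stand-in: the real model decodes tokens to a waveform with Mimi;
    # here we just rescale token IDs into [-1, 1].
    return [t / 512.0 - 1.0 for t in audio_tokens]

def generate(text, n_audio_tokens=8):
    context = toy_tokenize(text)
    audio_tokens = []
    for _ in range(n_audio_tokens):
        tok = toy_next_token(context)
        context.append(tok)  # autoregressive: each new token is fed back in
        audio_tokens.append(tok)
    return toy_decode(audio_tokens)

samples = generate("Hi, my name is Indri.")
print(len(samples))  # 8 output values, one per generated audio token
```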
Please read our blog [here](https://www.indrivoice.ai/blog/2024-11-21-building-indri-tts) for more technical details on how it was built.
## How to Get Started with the Model
### 🤗 pipelines
Pipelines are the easiest way to get started with the model; use the code below.
```python
import torch
import torchaudio
from transformers import pipeline

model_id = '11mlabs/indri-0.1-124m-tts'
task = 'indri-tts'

pipe = pipeline(
    task,
    model=model_id,
    device=torch.device('cuda:0'),  # Update this based on your hardware
    trust_remote_code=True
)

output = pipe(['Hi, my name is Indri and I like to talk.'], speaker='[spkr_63]')

torchaudio.save('output.wav', output[0]['audio'][0], sample_rate=24000)
```
**Available speakers**
|Speaker ID|Speaker name|
|---|---|
|`[spkr_63]`|🇬🇧 👨 book reader|
|`[spkr_67]`|🇺🇸 👨 influencer|
|`[spkr_68]`|🇮🇳 👨 book reader|
|`[spkr_69]`|🇮🇳 👨 book reader|
|`[spkr_70]`|🇮🇳 👨 motivational speaker|
|`[spkr_62]`|🇮🇳 👨 book reader heavy|
|`[spkr_53]`|🇮🇳 👩 recipe reciter|
|`[spkr_60]`|🇮🇳 👩 book reader|
|`[spkr_74]`|🇺🇸 👨 book reader|
|`[spkr_75]`|🇮🇳 👨 entrepreneur|
|`[spkr_76]`|🇬🇧 👨 nature lover|
|`[spkr_77]`|🇮🇳 👨 influencer|
|`[spkr_66]`|🇮🇳 👨 politician|
### Self hosted service
```bash
git clone https://github.com/cmeraki/indri.git
cd indri
pip install -r requirements.txt

# Install ffmpeg (Debian/Ubuntu shown; for Mac/Windows, refer to https://www.ffmpeg.org/download.html)
sudo apt update -y
sudo apt upgrade -y
sudo apt install ffmpeg -y

python -m inference --model_path 11mlabs/indri-0.1-124m-tts --device cuda:0 --port 8000
```
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{indri-multimodal-alm,
  author = {11mlabs},
  title = {Indri: Multimodal audio language model},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub Repository},
  howpublished = {\url{https://github.com/cmeraki/indri}},
  email = {compute@merakilabs.com}
}
```
## References
1. [nanoGPT](https://github.com/karpathy/nanoGPT)
2. [Kyutai/mimi](https://huggingface.co/kyutai/mimi)
```bibtex
@techreport{kyutai2024moshi,
  title = {Moshi: a speech-text foundation model for real-time dialogue},
  author = {Alexandre D\'efossez and Laurent Mazar\'e and Manu Orsini and
            Am\'elie Royer and Patrick P\'erez and Herv\'e J\'egou and
            Edouard Grave and Neil Zeghidour},
  year = {2024},
  eprint = {2410.00037},
  archivePrefix = {arXiv},
  primaryClass = {eess.AS},
  url = {https://arxiv.org/abs/2410.00037},
}
```
3. [Whisper](https://github.com/openai/whisper)
```bibtex
@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```
4. [silero-vad](https://github.com/snakers4/silero-vad)
```bibtex
@misc{SileroVAD,
  author = {Silero Team},
  title = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/snakers4/silero-vad}},
  commit = {insert_some_commit_here},
  email = {hello@silero.ai}
}
```