---
license: cc-by-sa-4.0
datasets:
- speechcolab/gigaspeech
- parler-tts/mls_eng_10k
- reach-vb/jenny_tts_dataset
- MikhailT/hifi-tts
- ylacombe/expresso
- keithito/lj_speech
- collabora/ai4bharat-shrutilipi
language:
- en
- hi
base_model:
- openai-community/gpt2
pipeline_tag: text-to-speech
library_name: transformers
---

| Platform | Link |
|----------|------|
| 🌎 Live Demo | [indrivoice.ai](https://indrivoice.ai/) |
| 𝕏 Twitter | [@11mlabs](https://x.com/11mlabs) |
| 🐱 GitHub | [Indri Repository](https://github.com/cmeraki/indri) |
| 🤗 Hugging Face (Collection) | [Indri collection](https://huggingface.co/collections/11mlabs/indri-673dd4210b4369037c736bfe) |
| 📝 Release Blog | [Release Blog](https://www.indrivoice.ai/blog/2024-11-21-building-indri-tts) |

# Model Card for indri-0.1-124m-tts

Indri is a series of audio models that can perform text-to-speech (TTS), automatic speech recognition (ASR), and audio continuation. This is the smallest model in the series (124M parameters) and supports TTS in two languages:

1. English
2. Hindi

## Model Details

### Model Description

`indri-0.1-124m-tts` is a small, lightweight TTS model based on the transformer architecture.
It models audio as tokens and can generate high-quality audio while consistently cloning the speaker's style.

### Samples

| Text | Sample |
| --- | --- |
|अतीत गौरवशाली, वर्तमान आशावादी, भविष्य उज्जवल| <audio controls src="https://huggingface.co/11mlabs/indri-0.1-124m-tts/resolve/main/data/417f5f1b-d641-4393-b922-9da9644dcd1b.wav" title="Title"></audio> |
|भाइयों और बहनों, ये हमारा सौभाग्य है कि हम सब मिलकर इस महान देश को नई ऊंचाइयों पर ले जाने का सपना देख रहे हैं।| <audio controls src="https://huggingface.co/11mlabs/indri-0.1-124m-tts/resolve/main/data/6e0a4879-0379-4166-a52c-03220a3f2922.wav" title="Title"></audio> |
|Hello दोस्तों, future of speech technology mein अपका स्वागत है | <audio controls src="https://huggingface.co/11mlabs/indri-0.1-124m-tts/resolve/main/data/5848b722-efe3-4e1f-a15e-5e7d431cd475.wav" title="Title"></audio> |
|Artificial Intelligence's collaborative hub: Transforming Machine Learning together| <audio controls src="https://huggingface.co/11mlabs/indri-0.1-124m-tts/resolve/main/data/12e5a00e-834b-4c3c-a8b8-7f545ba7088c.wav" title="Title"></audio> |
|Intelligent machines processing data at lightning-fast electronic speeds| <audio controls src="https://huggingface.co/11mlabs/indri-0.1-124m-tts/resolve/main/data/e21efa09-e179-42b7-982a-b686038a8f60.wav" title="Title"></audio> |


### Key features

1. Extremely small: based on the GPT-2 small architecture. The methodology can be extended to any autoregressive transformer-based architecture.
2. Ultra-fast: using our [self-hosted service option](#self-hosted-service), the model reaches speeds of up to 400 tokens/s (4 s of audio generated per second) with a time to first token under 20 ms on an NVIDIA RTX 6000 Ada GPU.
   1. On the same GPU, it supports a batch size of 1k at the full context length of 1024 tokens.
3. Supports voice cloning from short prompts (<5 s).
4. Supports code-mixed text input in two languages: English and Hindi.
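
The throughput figures above imply a few numbers that are useful for capacity planning. The sketch below is plain arithmetic on the values quoted in this card. The helper names are illustrative, and it assumes that the prompt and the generated audio tokens share the same 1024-token context, which the card does not state explicitly.

```python
# Figures quoted in this model card (peak values on an RTX 6000 Ada)
TOKENS_PER_SECOND = 400        # decoding speed
AUDIO_SECONDS_PER_SECOND = 4   # seconds of audio produced per wall-clock second
CONTEXT_LENGTH = 1024          # full context length in tokens

# Derived: how many tokens represent one second of audio
tokens_per_audio_second = TOKENS_PER_SECOND / AUDIO_SECONDS_PER_SECOND


def generation_time(audio_seconds: float) -> float:
    """Wall-clock seconds needed to generate a clip at peak speed."""
    return audio_seconds / AUDIO_SECONDS_PER_SECOND


def max_audio_seconds(prompt_tokens: int = 0) -> float:
    """Upper bound on audio length that fits in one context window.

    Assumption: prompt tokens and audio tokens share the same context.
    """
    if not 0 <= prompt_tokens <= CONTEXT_LENGTH:
        raise ValueError("prompt_tokens must be between 0 and CONTEXT_LENGTH")
    return (CONTEXT_LENGTH - prompt_tokens) / tokens_per_audio_second


if __name__ == "__main__":
    print(tokens_per_audio_second)   # tokens per second of audio
    print(generation_time(10))       # time to generate a 10 s clip
    print(max_audio_seconds(24))     # audio budget after a 24-token prompt
```

Under these assumptions, one second of audio corresponds to 100 tokens, so a single context window holds at most about 10 s of audio, less whatever the prompt consumes.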

### Details

1. Model Type: GPT-2 based language model
2. Size: 124M parameters
3. Language Support: English, Hindi
4. License: CC BY 4.0

## Technical details

Here is a brief overview of how the model works:

1. Converts the input text into tokens
2. Runs autoregressive decoding on a GPT-2 based transformer model to generate audio tokens
3. Decodes the audio tokens to audio (using [Kyutai/mimi](https://huggingface.co/kyutai/mimi))
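
Before a multi-codebook codec such as Mimi can decode the output of step 2, the flat token stream produced by the language model has to be regrouped into per-frame codes. This card does not describe the token layout, so the following is only a sketch under stated assumptions: 8 codebooks of 2048 entries each, interleaved frame by frame, with each codebook shifted into its own ID range. The function names are hypothetical and are not part of the Indri codebase.

```python
from typing import List

N_CODEBOOKS = 8       # assumption: not stated in this card
CODEBOOK_SIZE = 2048  # assumption: entries per codebook


def flatten_codes(frames: List[List[int]]) -> List[int]:
    """Interleave per-frame codes into one stream.

    Codebook k is shifted by k * CODEBOOK_SIZE so every codebook
    occupies a distinct range of token IDs.
    """
    flat = []
    for frame in frames:
        if len(frame) != N_CODEBOOKS:
            raise ValueError("each frame needs one code per codebook")
        for codebook, code in enumerate(frame):
            flat.append(codebook * CODEBOOK_SIZE + code)
    return flat


def unflatten_codes(flat: List[int]) -> List[List[int]]:
    """Invert flatten_codes, dropping any trailing partial frame."""
    usable = len(flat) - len(flat) % N_CODEBOOKS
    frames = []
    for start in range(0, usable, N_CODEBOOKS):
        chunk = flat[start:start + N_CODEBOOKS]
        frame = []
        for codebook, token in enumerate(chunk):
            code = token - codebook * CODEBOOK_SIZE
            if not 0 <= code < CODEBOOK_SIZE:
                raise ValueError("token is outside its codebook's range")
            frame.append(code)
        frames.append(frame)
    return frames


if __name__ == "__main__":
    frames = [[1, 2, 3, 4, 5, 6, 7, 8], [0] * N_CODEBOOKS]
    flat = flatten_codes(frames)
    assert unflatten_codes(flat) == frames
```

Dropping a trailing partial frame is a design choice for this sketch: autoregressive generation can stop mid-frame, and a codec needs complete frames to decode.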

See our [release blog](https://www.indrivoice.ai/blog/2024-11-21-building-indri-tts) for more technical details on how the model was built.

## How to Get Started with the Model

### 🤗 pipelines
Pipelines are the simplest way to get started with the model. Use the code below:

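```python
import torch
import torchaudio
from transformers import pipeline

model_id = '11mlabs/indri-0.1-124m-tts'
task = 'indri-tts'

# Use the first GPU when available, otherwise fall back to CPU.
# Update this based on your hardware.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

pipe = pipeline(
    task,
    model=model_id,
    device=device,
    trust_remote_code=True  # the custom pipeline code is loaded from the model repository
)

# The pipeline takes a list of input texts
output = pipe(['Hi, my name is Indri and I like to talk.'])

# Write the generated audio to disk at a 24 kHz sample rate
torchaudio.save('output.wav', output[0]['audio'][0], sample_rate=24000)
```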

### Self hosted service

```bash
git clone https://github.com/cmeraki/indri.git
cd indri
pip install -r requirements.txt

# Install ffmpeg (for Mac/Windows, refer here: https://www.ffmpeg.org/download.html)
sudo apt update -y
sudo apt upgrade -y
sudo apt install ffmpeg -y

python -m inference --model_path 11mlabs/indri-0.1-124m-tts --device cuda:0 --port 8000
```

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{indri-multimodal-alm,
  author       = {11mlabs},
  title        = {Indri: Multimodal audio language model},
  year         = {2024},
  publisher    = {GitHub},
  journal      = {GitHub Repository},
  howpublished = {\url{https://github.com/cmeraki/indri}},
  email        = {compute@merakilabs.com}
}
```

## BibTex
1. [nanoGPT](https://github.com/karpathy/nanoGPT)
2. [Kyutai/mimi](https://huggingface.co/kyutai/mimi)
```bibtex
@techreport{kyutai2024moshi,
      title={Moshi: a speech-text foundation model for real-time dialogue},
      author={Alexandre D\'efossez and Laurent Mazar\'e and Manu Orsini and
      Am\'elie Royer and Patrick P\'erez and Herv\'e J\'egou and Edouard Grave and Neil Zeghidour},
      year={2024},
      eprint={2410.00037},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2410.00037},
}
```
3. [Whisper](https://github.com/openai/whisper)
```bibtex
@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```
4. [silero-vad](https://github.com/snakers4/silero-vad)
```bibtex
@misc{SileroVAD,
  author = {Silero Team},
  title = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/snakers4/silero-vad}},
  commit = {insert_some_commit_here},
  email = {hello@silero.ai}
}
```