|
--- |
|
license: cc-by-sa-4.0 |
|
datasets: |
|
- speechcolab/gigaspeech |
|
- parler-tts/mls_eng_10k |
|
- reach-vb/jenny_tts_dataset |
|
- MikhailT/hifi-tts |
|
- ylacombe/expresso |
|
- keithito/lj_speech |
|
- collabora/ai4bharat-shrutilipi |
|
language: |
|
- en |
|
- hi |
|
base_model: |
|
- openai-community/gpt2 |
|
pipeline_tag: text-to-speech |
|
library_name: transformers |
|
--- |
|
|
|
| Platform | Link | |
|
|----------|------| |
|
| 🌎 Live Demo | [indrivoice.ai](https://indrivoice.ai/) | |
|
| 𝕏 Twitter | [@11mlabs](https://x.com/11mlabs) | |
|
| 🐱 GitHub | [Indri Repository](https://github.com/cmeraki/indri) | |
|
| 🤗 Hugging Face (Collection) | [Indri collection](https://huggingface.co/collections/11mlabs/indri-673dd4210b4369037c736bfe) | |
|
| 📝 Release Blog | [Release Blog](https://www.indrivoice.ai/blog/2024-11-21-building-indri-tts) | |
|
|
|
# Model Card for indri-0.1-124m-tts |
|
|
|
Indri is a series of audio models that can do TTS, ASR, and audio continuation. This is the smallest model (124M parameters) in the series and supports TTS in two languages:
|
|
|
1. English |
|
2. Hindi |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
`indri-0.1-124m-tts` is a novel, ultra-small, and lightweight TTS model based on the transformer architecture. |
|
It models audio as tokens and can generate high-quality audio while consistently cloning the speaker's style.
|
|
|
### Samples |
|
|
|
| Text | Sample | |
|
| --- | --- | |
|
| अतीत गौरवशाली, वर्तमान आशावादी, भविष्य उज्जवल | <audio controls src="https://huggingface.co/11mlabs/indri-0.1-124m-tts/resolve/main/data/417f5f1b-d641-4393-b922-9da9644dcd1b.wav"></audio> |

| भाइयों और बहनों, ये हमारा सौभाग्य है कि हम सब मिलकर इस महान देश को नई ऊंचाइयों पर ले जाने का सपना देख रहे हैं। | <audio controls src="https://huggingface.co/11mlabs/indri-0.1-124m-tts/resolve/main/data/6e0a4879-0379-4166-a52c-03220a3f2922.wav"></audio> |

| Hello दोस्तों, future of speech technology mein आपका स्वागत है | <audio controls src="https://huggingface.co/11mlabs/indri-0.1-124m-tts/resolve/main/data/5848b722-efe3-4e1f-a15e-5e7d431cd475.wav"></audio> |

| Artificial Intelligence's collaborative hub: Transforming Machine Learning together | <audio controls src="https://huggingface.co/11mlabs/indri-0.1-124m-tts/resolve/main/data/12e5a00e-834b-4c3c-a8b8-7f545ba7088c.wav"></audio> |

| Intelligent machines processing data at lightning-fast electronic speeds | <audio controls src="https://huggingface.co/11mlabs/indri-0.1-124m-tts/resolve/main/data/e21efa09-e179-42b7-982a-b686038a8f60.wav"></audio> |
|
|
|
|
|
### Key features |
|
|
|
1. Extremely small, based on the GPT-2 small architecture. The methodology can be extended to any autoregressive transformer-based architecture.

2. Ultra-fast. Using our [self-hosted service option](#self-hosted-service), the model can reach speeds of up to 400 tokens/s (4 s of audio generated per second) and under 20 ms time to first token on an NVIDIA RTX 6000 Ada GPU.

   - On an RTX 6000 Ada, it can support a batch size of 1,000 with the full context length of 1,024 tokens.

3. Supports voice cloning from short prompts (<5 s).

4. Supports code-mixed text input in 2 languages: English and Hindi.
|
|
|
### Details |
|
|
|
1. Model Type: GPT-2 based language model |
|
2. Size: 124M parameters |
|
3. Language Support: English, Hindi |
|
4. License: CC BY-SA 4.0
|
|
|
## Technical details |
|
|
|
Here's a brief overview of how the model works:
|
|
|
1. Converts input text into tokens

2. Runs autoregressive decoding on a GPT-2 based transformer model to generate audio tokens

3. Decodes the audio tokens back to audio using [Kyutai/mimi](https://huggingface.co/kyutai/mimi), as sketched below
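
A minimal sketch of the decoding step, using the Mimi codec as packaged in `transformers`. The random token tensor and its `(batch, num_quantizers, frames)` shape are stand-in assumptions; in the real pipeline the codes come from the GPT-2 decoder, and random codes only decode to noise:

```python
import torch
from transformers import MimiModel

# Neural audio codec used to turn audio tokens back into a waveform.
mimi = MimiModel.from_pretrained('kyutai/mimi')

# Stand-in audio tokens shaped (batch, num_quantizers, frames).
# Real codes come from the GPT-2 decoder; random ones decode to noise.
codes = torch.randint(
    0, mimi.config.codebook_size,
    (1, mimi.config.num_quantizers, 100),
)

with torch.no_grad():
    waveform = mimi.decode(codes).audio_values  # (batch, 1, samples) at 24 kHz
```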
|
|
|
Please read our blog [here](https://www.indrivoice.ai/blog/2024-11-21-building-indri-tts) for more technical details on how it was built. |
|
|
|
## How to Get Started with the Model |
|
|
|
### 🤗 pipelines |
|
Pipelines are the easiest way to get started with the model. Use the code below:
|
|
|
```python |
|
import torch |
|
import torchaudio |
|
from transformers import pipeline |
|
|
|
model_id = '11mlabs/indri-0.1-124m-tts' |
|
task = 'indri-tts' |
|
|
|
pipe = pipeline( |
|
task, |
|
model=model_id, |
|
    device=torch.device('cuda:0'),  # Update this based on your hardware
|
trust_remote_code=True |
|
) |
|
|
|
output = pipe(['Hi, my name is Indri and I like to talk.']) |
|
|
|
torchaudio.save('output.wav', output[0]['audio'][0], sample_rate=24000) |
|
``` |
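
The pipeline accepts a list of texts, so several utterances can be generated in one call. A short sketch, assuming the output keeps one entry per input text (as the indexing above suggests):

```python
texts = [
    'Hi, my name is Indri and I like to talk.',
    'The weather is nice today.',
]
outputs = pipe(texts)

# Save one wav file per input text.
for i, out in enumerate(outputs):
    torchaudio.save(f'output_{i}.wav', out['audio'][0], sample_rate=24000)
```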
|
|
|
### Self hosted service |
|
|
|
```bash |
|
git clone https://github.com/cmeraki/indri.git |
|
cd indri |
|
pip install -r requirements.txt |
|
|
|
# Install ffmpeg (for Mac/Windows, see https://www.ffmpeg.org/download.html)
|
sudo apt update -y |
|
sudo apt upgrade -y |
|
sudo apt install ffmpeg -y |
|
|
|
python -m inference --model_path 11mlabs/indri-0.1-124m-tts --device cuda:0 --port 8000 |
|
``` |
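
Once the server is running, it can be called over HTTP. The route and payload below are hypothetical placeholders rather than the repository's documented API; check the [indri repository](https://github.com/cmeraki/indri) for the actual endpoints:

```python
import requests

# Hypothetical endpoint and payload -- adjust to match the API
# exposed by the indri inference server.
response = requests.post(
    'http://localhost:8000/tts',
    json={'text': 'Hi, my name is Indri and I like to talk.'},
)

with open('output.wav', 'wb') as f:
    f.write(response.content)
```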
|
|
|
## Citation |
|
|
|
If you use this model in your research, please cite: |
|
|
|
```bibtex |
|
@misc{indri-multimodal-alm, |
|
author = {11mlabs}, |
|
title = {Indri: Multimodal audio language model}, |
|
year = {2024}, |
|
publisher = {GitHub}, |
|
  journal = {GitHub repository},
|
howpublished = {\url{https://github.com/cmeraki/indri}}, |
|
email = {compute@merakilabs.com} |
|
} |
|
``` |
|
|
|
## Acknowledgements
|
1. [nanoGPT](https://github.com/karpathy/nanoGPT) |
|
2. [Kyutai/mimi](https://huggingface.co/kyutai/mimi) |
|
```bibtex |
|
@techreport{kyutai2024moshi, |
|
title={Moshi: a speech-text foundation model for real-time dialogue}, |
|
author={Alexandre D\'efossez and Laurent Mazar\'e and Manu Orsini and |
|
Am\'elie Royer and Patrick P\'erez and Herv\'e J\'egou and Edouard Grave and Neil Zeghidour}, |
|
year={2024}, |
|
eprint={2410.00037}, |
|
archivePrefix={arXiv}, |
|
primaryClass={eess.AS}, |
|
url={https://arxiv.org/abs/2410.00037}, |
|
} |
|
``` |
|
3. [Whisper](https://github.com/openai/whisper) |
|
```bibtex |
|
@misc{radford2022whisper, |
|
doi = {10.48550/ARXIV.2212.04356}, |
|
url = {https://arxiv.org/abs/2212.04356}, |
|
author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya}, |
|
title = {Robust Speech Recognition via Large-Scale Weak Supervision}, |
|
publisher = {arXiv}, |
|
year = {2022}, |
|
copyright = {arXiv.org perpetual, non-exclusive license} |
|
} |
|
``` |
|
4. [silero-vad](https://github.com/snakers4/silero-vad) |
|
```bibtex |
|
@misc{SileroVAD,
|
author = {Silero Team}, |
|
title = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier}, |
|
year = {2024}, |
|
publisher = {GitHub}, |
|
journal = {GitHub repository}, |
|
howpublished = {\url{https://github.com/snakers4/silero-vad}}, |
|
commit = {insert_some_commit_here}, |
|
email = {hello@silero.ai} |
|
} |
|
``` |