metadata

license: cc-by-4.0
datasets:
  - speechcolab/gigaspeech
  - parler-tts/mls_eng_10k
  - reach-vb/jenny_tts_dataset
language:
  - en
  - hi
base_model:
  - openai-community/gpt2
pipeline_tag: text-to-speech

Model Card for indri-0.1-124m-tts

Indri is a series of audio models that can do TTS, ASR, and audio continuation. This is the smallest model (124M) in our series and supports TTS tasks in 2 languages:

English
Hindi

We have open-sourced our training scripts, inference, and other details.

Repository: GitHub
Demo: Website
Implementation details: Release Blog

Model Details

Model Description

indri-0.1-124m-tts is a novel, ultra-small, and lightweight TTS model based on the transformer architecture. It models audio as tokens and can generate high-quality audio with consistent style cloning of the speaker.

Key features

Based on the GPT-2 architecture. The methodology can be extended to any transformer-based architecture.
Supports voice cloning with small prompts (<5s).
Supports code-mixed text input in 2 languages: English and Hindi.
Ultra-fast. Can generate 5 seconds of audio per second on Ampere-generation NVIDIA GPUs, and up to 10 seconds of audio per second on Ada-generation NVIDIA GPUs.
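
To make the throughput figures above concrete, here is a minimal sketch that converts them into wall-clock generation time. The helper name `generation_time_s` and the 30-second clip are hypothetical and chosen only for illustration; the 5x and 10x factors are the ones quoted in the feature list, and real timings will vary with hardware and settings.

```python
def generation_time_s(audio_seconds: float, speed_factor: float) -> float:
    """Wall-clock seconds needed to synthesize `audio_seconds` of audio.

    `speed_factor` is the throughput in seconds of audio per second of compute.
    """
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_seconds / speed_factor


# Throughput figures quoted in the feature list.
AMPERE_SPEED = 5.0  # seconds of audio per second
ADA_SPEED = 10.0    # upper bound quoted for Ada-generation GPUs

clip_seconds = 30.0  # hypothetical clip length
print(generation_time_s(clip_seconds, AMPERE_SPEED))  # 6.0
print(generation_time_s(clip_seconds, ADA_SPEED))     # 3.0
```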

Details

Model Type: GPT-2 based language model
Size: 124M parameters
Language Support: English, Hindi
License: CC BY 4.0

Technical details

Here's a brief overview of how the model works:

Converts input text into tokens
Runs autoregressive decoding on the GPT-2-based transformer model to generate audio tokens
Decodes audio tokens (using Kyutai/mimi) to audio
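
The three steps above can be sketched with toy stand-ins. This is not the model's actual code: the byte-level tokenizer, the `toy_next_token` function, the `STOP_TOKEN` marker, and the 1024-entry vocabulary are all assumptions made for illustration. In the real model the next-token step is the GPT-2-based transformer and the final step is the Kyutai/mimi decoder. The sketch only shows the shape of the loop, in which each new audio token is predicted from the text tokens plus all audio tokens generated so far.

```python
from typing import Callable, List

STOP_TOKEN = -1  # hypothetical end-of-audio marker


def text_to_tokens(text: str) -> List[int]:
    """Step 1 (toy): one token id per UTF-8 byte of the input text."""
    return list(text.encode("utf-8"))


def generate_audio_tokens(
    prompt: List[int],
    next_token: Callable[[List[int]], int],
    max_new_tokens: int,
) -> List[int]:
    """Step 2: autoregressive loop.

    Each new token is predicted from the prompt plus everything generated so far.
    """
    sequence = list(prompt)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        token = next_token(sequence)
        if token == STOP_TOKEN:
            break
        generated.append(token)
        sequence.append(token)
    return generated


def decode_audio_tokens(tokens: List[int], vocab_size: int = 1024) -> List[float]:
    """Step 3 (toy): map each token id to one sample in [-1.0, 1.0)."""
    return [2.0 * (t % vocab_size) / vocab_size - 1.0 for t in tokens]


def toy_next_token(sequence: List[int]) -> int:
    """Deterministic stand-in for the transformer's next-token prediction."""
    if len(sequence) >= 40:
        return STOP_TOKEN
    return (sequence[-1] * 31 + len(sequence)) % 1024


text_tokens = text_to_tokens("Hi, my name is Indri.")
audio_tokens = generate_audio_tokens(text_tokens, toy_next_token, max_new_tokens=100)
samples = decode_audio_tokens(audio_tokens)
print(len(text_tokens), len(audio_tokens), len(samples))
```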

Please read our blog here for more technical details on how it was built.

How to Get Started with the Model

Use the code below to get started with the model. Pipelines are the best way to do so.

import torch
import torchaudio
from transformers import pipeline

model_id = '11mlabs/indri-0.1-124m-tts'
task = 'indri-tts'

pipe = pipeline(
    task,
    model=model_id,
    device=torch.device('cuda:0'),  # Update this based on your hardware
    trust_remote_code=True
)

output = pipe(['Hi, my name is Indri and I like to talk.'])

torchaudio.save('output.wav', output[0]['audio'][0], sample_rate=24000)

Citation

If you use this model in your research, please cite:

@misc{indri-multimodal-alm,
  author       = {11mlabs},
  title        = {Indri: Multimodal audio language model},
  year         = {2024},
  publisher    = {GitHub},
  journal      = {GitHub Repository},
  howpublished = {\url{https://github.com/cmeraki/indri}},
  email        = {compute@merakilabs.com}
}

BibTex

@techreport{kyutai2024moshi,
      title={Moshi: a speech-text foundation model for real-time dialogue},
      author={Alexandre D\'efossez and Laurent Mazar\'e and Manu Orsini and
      Am\'elie Royer and Patrick P\'erez and Herv\'e J\'egou and Edouard Grave and Neil Zeghidour},
      year={2024},
      eprint={2410.00037},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2410.00037},
}

Whisper

@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

silero-vad

@misc{silero_vad,
  author = {Silero Team},
  title = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/snakers4/silero-vad}},
  commit = {insert_some_commit_here},
  email = {hello@silero.ai}
}