11mlabs/indri-0.1-124m-tts

@@ -12,9 +12,9 @@ base_model:
 pipeline_tag: text-to-speech
 ---
-# Model Card for indri-0.1-125m-tts
-Indri is a series of audio models that can do TTS, ASR, and audio continuation. This is the smallest model (125M) in our series and supports TTS tasks in 2 languages:
 1. English
 2. Hindi
@@ -29,7 +29,7 @@ We have open-sourced our training scripts, inference, and other details.
 ### Model Description
-`indri-0.1-125m-tts` is a novel, ultra-small, and lightweight TTS model based on the transformer architecture.
 It models audio as tokens and can generate high-quality audio with consistent style cloning of the speaker.
 ### Key features
@@ -42,7 +42,7 @@ It models audio as tokens and can generate high-quality audio with consistent st
 ### Details
 1. Model Type: GPT-2 based language model
-2. Size: 125M parameters
 3. Language Support: English, Hindi
 4. License: CC BY 4.0
@@ -52,7 +52,7 @@ Here's a brief of how the model works:
 1. Converts input text into tokens
 2. Runs autoregressive decoding on a GPT-2 based transformer model and generates audio tokens
-3. Decodes audio tokens (from [Kyutai/mimi](https://huggingface.co/kyutai/mimi)) to audio
 Please read our blog [here](#TODO) for more technical details on how it was built.
@@ -65,11 +65,11 @@ import torch
 import torchaudio
 from transformers import pipeline
 task = 'indri-tts'
-model_id = '11mlabs/indri-0.1-125m-tts'
 pipe = pipeline(
- task,
     model=model_id,
     device=torch.device('cuda:0'),  # Update this based on your hardware
     trust_remote_code=True
@@ -80,22 +80,59 @@ output = pipe(['Hi, my name is Indri and I like to talk.'])
 torchaudio.save('output.wav', output[0]['audio'][0], sample_rate=24000)
 ```
-## Credits
-1. [Kyutai/mimi](https://huggingface.co/kyutai/mimi)
-2. [nanoGPT](https://github.com/karpathy/nanoGPT)
 ## Citation
-To cite our work
-```
-@misc{indri-0.1-125m-tts,
   author       = {11mlabs},
-  title        = {indri-0.1-125m-tts},
-  year         = 2024,
-  publisher    = {Hugging Face},
   journal      = {GitHub Repository},
   howpublished = {\url{https://github.com/cmeraki/indri}},
 }
 ```

 pipeline_tag: text-to-speech
 ---
+# Model Card for indri-0.1-124m-tts
+Indri is a series of audio models that can do TTS, ASR, and audio continuation. This is the smallest model (124M) in our series and supports TTS tasks in 2 languages:
 1. English
 2. Hindi
 ### Model Description
+`indri-0.1-124m-tts` is a novel, ultra-small, and lightweight TTS model based on the transformer architecture.
 It models audio as tokens and can generate high-quality audio with consistent style cloning of the speaker.
 ### Key features
 ### Details
 1. Model Type: GPT-2 based language model
+2. Size: 124M parameters
 3. Language Support: English, Hindi
 4. License: CC BY 4.0
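
The 124M figure is consistent with the parameter count of the stock GPT-2 small architecture. As a rough sanity check, the sketch below counts parameters for that configuration. The values used (vocabulary of 50257, 1024 positions, 12 layers, width 768, tied input/output embeddings) are the standard GPT-2 small settings and are an assumption here; this card does not state Indri's actual vocabulary size, which also has to cover audio tokens, so its exact count may differ.

```python
def gpt2_param_count(vocab_size=50257, n_ctx=1024, n_layer=12, n_embd=768):
    """Count parameters of a GPT-2 style decoder with tied embeddings.

    Defaults are the stock GPT-2 small settings (an assumption, not
    Indri's published configuration).
    """
    token_emb = vocab_size * n_embd                # token embedding table (shared with the output head)
    pos_emb = n_ctx * n_embd                       # learned position embeddings
    layer_norm = 2 * n_embd                        # scale + bias
    attn_qkv = n_embd * 3 * n_embd + 3 * n_embd    # fused query/key/value projection
    attn_proj = n_embd * n_embd + n_embd           # attention output projection
    mlp_fc = n_embd * 4 * n_embd + 4 * n_embd      # MLP up-projection
    mlp_proj = 4 * n_embd * n_embd + n_embd        # MLP down-projection
    block = 2 * layer_norm + attn_qkv + attn_proj + mlp_fc + mlp_proj
    # n_layer transformer blocks plus the final layer norm
    return token_emb + pos_emb + n_layer * block + layer_norm


total = gpt2_param_count()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")  # 124,439,808 parameters (~124M)
```

Passing a larger `vocab_size` shows how extra audio tokens would grow the embedding table while the transformer blocks stay the same size.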
 1. Converts input text into tokens
 2. Runs autoregressive decoding on a GPT-2 based transformer model and generates audio tokens
+3. Decodes audio tokens (using [Kyutai/mimi](https://huggingface.co/kyutai/mimi)) to audio
 Please read our blog [here](#TODO) for more technical details on how it was built.
 import torchaudio
 from transformers import pipeline
+model_id = '11mlabs/indri-0.1-124m-tts'
 task = 'indri-tts'
 pipe = pipeline(
+    task,
     model=model_id,
     device=torch.device('cuda:0'),  # Update this based on your hardware
     trust_remote_code=True
 torchaudio.save('output.wav', output[0]['audio'][0], sample_rate=24000)
 ```
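
The `sample_rate=24000` in the snippet corresponds to the 24 kHz audio that Mimi decodes to. To get a feel for how many audio tokens the model has to generate, here is a small sketch of the arithmetic. It assumes Mimi's published frame rate of 12.5 frames per second; the number of codebooks per frame (`n_codebooks=8` below) is a hypothetical value chosen for illustration, since this card does not say how many Indri uses.

```python
SAMPLE_RATE = 24_000  # Hz, the value passed to torchaudio.save in the snippet
FRAME_RATE = 12.5     # Mimi frames per second (assumed from the Mimi release)


def samples_per_frame():
    """Number of waveform samples covered by one Mimi frame."""
    return int(SAMPLE_RATE / FRAME_RATE)


def audio_tokens_for(seconds, n_codebooks=8):
    """Audio tokens needed for `seconds` of speech.

    n_codebooks is a hypothetical setting, not a documented Indri value.
    """
    frames = int(seconds * FRAME_RATE)
    return frames * n_codebooks


print(samples_per_frame())   # 1920
print(audio_tokens_for(10))  # 1000
```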
 ## Citation
+If you use this model in your research, please cite:
+```bibtex
+@misc{indri-multimodal-alm,
   author       = {11mlabs},
+  title        = {Indri: Multimodal audio language model},
+  year         = {2024},
+  publisher    = {GitHub},
   journal      = {GitHub Repository},
   howpublished = {\url{https://github.com/cmeraki/indri}},
+  email        = {compute@merakilabs.com}
+}
+```
+## BibTeX
+1. [nanoGPT](https://github.com/karpathy/nanoGPT)
+2. [Kyutai/mimi](https://huggingface.co/kyutai/mimi)
+```bibtex
+@techreport{kyutai2024moshi,
+      title={Moshi: a speech-text foundation model for real-time dialogue},
+      author={Alexandre D\'efossez and Laurent Mazar\'e and Manu Orsini and
+      Am\'elie Royer and Patrick P\'erez and Herv\'e J\'egou and Edouard Grave and Neil Zeghidour},
+      year={2024},
+      eprint={2410.00037},
+      archivePrefix={arXiv},
+      primaryClass={eess.AS},
+      url={https://arxiv.org/abs/2410.00037},
+}
+```
+3. [Whisper](https://github.com/openai/whisper)
+```bibtex
+@misc{radford2022whisper,
+  doi = {10.48550/ARXIV.2212.04356},
+  url = {https://arxiv.org/abs/2212.04356},
+  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
+  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
+  publisher = {arXiv},
+  year = {2022},
+  copyright = {arXiv.org perpetual, non-exclusive license}
+}
+```
+4. [silero-vad](https://github.com/snakers4/silero-vad)
+```bibtex
+@misc{SileroVAD,
+  author = {Silero Team},
+  title = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier},
+  year = {2024},
+  publisher = {GitHub},
+  journal = {GitHub repository},
+  howpublished = {\url{https://github.com/snakers4/silero-vad}},
+  commit = {insert_some_commit_here},
+  email = {hello@silero.ai}
 }
 ```