cmeraki committed on
Commit
6cb714c
1 Parent(s): 9e94911

Updated README

Files changed (1)
  1. README.md +53 -39
README.md CHANGED
@@ -12,76 +12,90 @@ base_model:
  pipeline_tag: text-to-speech
  ---
 
- # Model Card for Model ID
 
- Indri is a series of audio models that can do TTS, ASR, and audio continuation. This is the smallest model in our series and supports TTS tasks in 2 languages:
  1. English
  2. Hindi
 
  ## Model Details
 
  ### Model Description
 
- `indri-0.1-125m-tts` is a novel, extremely small, and lightweight TTS model based on the transformer architecture.
  It models audio as tokens and can generate high-quality audio with consistent style cloning of the speaker.
 
  ### Key features
 
- 1. Based on GPT-2 architecture
- 2. Supports voice cloning with small prompts
- 3. Code mixing text input in 2 languages - English and Hindi
 
- ### Model Sources [optional]
 
- - **Repository:** [https://github.com/cmeraki/indri]
- - **Demo:** [https://www.indrivoice.ai/]
 
  ## Technical details
 
- Please read our blog [here]() for more technical details on how it was built.
-
- Here's a brief of how this model works:
 
  1. Converts input text into tokens
  2. Runs autoregressive decoding on GPT-2 based transformer model and generates audio tokens
- 3. Decodes audio tokens (from [Kyutaui/mimi](https://huggingface.co/kyutai/mimi)) to audio
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
 
- ## Training Details
 
- ### Training Data
-
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
- [More Information Needed]
-
- ### Training Procedure
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- #### Preprocessing [optional]
 
- [More Information Needed]
 
- #### Training Hyperparameters
 
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
 
- #### Speeds, Sizes, Times [optional]
 
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
 
- [More Information Needed]
 
- ## Citation [optional]
 
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 
- **BibTeX:**
 
- [More Information Needed]
  pipeline_tag: text-to-speech
  ---
 
+ # Model Card for indri-0.1-125m-tts
+
+ Indri is a series of audio models that can do TTS, ASR, and audio continuation. This is the smallest model (125M) in our series and supports TTS tasks in 2 languages:
 
  1. English
  2. Hindi
 
+ We have open-sourced our training scripts, inference code, and other details.
+
+ - **Repository:** [GitHub](https://github.com/cmeraki/indri)
+ - **Demo:** [Website](https://www.indrivoice.ai/)
+ - **Implementation details:** [Release Blog](#TODO)
 
  ## Model Details
 
  ### Model Description
 
+ `indri-0.1-125m-tts` is a novel, ultra-small, and lightweight TTS model based on the transformer architecture.
  It models audio as tokens and can generate high-quality audio with consistent style cloning of the speaker.
 
  ### Key features
 
+ 1. Based on the GPT-2 architecture. The methodology can be extended to any transformer-based architecture.
+ 2. Supports voice cloning with small prompts (<5s).
+ 3. Supports code-mixed text input in 2 languages: English and Hindi.
+ 4. Ultra-fast. Can generate 5 seconds of audio per second on Ampere-generation NVIDIA GPUs, and up to 10 seconds of audio per second on Ada-generation NVIDIA GPUs (see the timing sketch below).
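+
+ As a rough illustration of the throughput claim above, the snippet below is a minimal timing sketch built on the `pipeline` call from the How to Get Started section below; the exact real-time factor you measure will depend on your GPU, precision, and batch size.
+
+ ```python
+ import time
+
+ import torch
+ from transformers import pipeline
+
+ # Same pipeline call as in the How to Get Started section below.
+ pipe = pipeline(
+     'indri-tts',
+     model='11mlabs/indri-0.1-125m-tts',
+     device=torch.device('cuda:0'),  # Update this based on your hardware
+     trust_remote_code=True
+ )
+
+ text = ['Hi, my name is Indri and I like to talk.']
+
+ # Warm-up run so model loading and CUDA initialization are not timed.
+ pipe(text)
+
+ torch.cuda.synchronize()
+ start = time.time()
+ output = pipe(text)
+ torch.cuda.synchronize()
+ elapsed = time.time() - start
+
+ # The pipeline returns 24 kHz audio tensors of shape (channels, samples).
+ audio_seconds = output[0]['audio'][0].shape[-1] / 24000
+ print(f'{audio_seconds:.2f}s of audio in {elapsed:.2f}s '
+       f'({audio_seconds / elapsed:.2f}x real time)')
+ ```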
+
+ ### Details
+
+ 1. Model Type: GPT-2 based language model
+ 2. Size: 125M parameters
+ 3. Language Support: English, Hindi
+ 4. License: CC BY 4.0
 
  ## Technical details
 
+ Here's a brief overview of how the model works:
 
  1. Converts input text into tokens
  2. Runs autoregressive decoding on GPT-2 based transformer model and generates audio tokens
+ 3. Decodes audio tokens (from [Kyutai/mimi](https://huggingface.co/kyutai/mimi)) to audio
+
+ Please read our blog [here](#TODO) for more technical details on how it was built.
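+
+ For illustration only, here is a minimal sketch of those three stages. The `text_to_audio_codes` helper is hypothetical (the pipeline in the How to Get Started section below handles stages 1 and 2 through the model's custom code), and the number of quantizers and frames are assumed values; only the stage 3 call reflects the real `MimiModel` decoder API in `transformers`.
+
+ ```python
+ import torch
+ from transformers import MimiModel
+
+ def text_to_audio_codes(text: str) -> torch.Tensor:
+     # Hypothetical stand-in for stages 1 and 2: tokenize the text and run
+     # autoregressive decoding with the GPT-2 based model to predict audio
+     # tokens. Here we just return random Mimi codes of an assumed shape
+     # (batch, num_quantizers, num_frames), which will decode to noise.
+     num_quantizers, num_frames = 8, 100  # assumed; ~8 s at 12.5 frames/s
+     return torch.randint(0, 2048, (1, num_quantizers, num_frames))
+
+ # Stage 3: decode audio tokens back to a waveform with Kyutai's Mimi codec.
+ mimi = MimiModel.from_pretrained('kyutai/mimi')
+
+ codes = text_to_audio_codes('Hi, my name is Indri and I like to talk.')
+ with torch.no_grad():
+     audio = mimi.decode(codes).audio_values  # (batch, 1, samples) at 24 kHz
+ ```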
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started. Pipelines are the easiest way to run the model.
+
+ ```python
+ import torch
+ import torchaudio
+ from transformers import pipeline
+
+ task = 'indri-tts'
+ model_id = '11mlabs/indri-0.1-125m-tts'
+
+ pipe = pipeline(
+     task,
+     model=model_id,
+     device=torch.device('cuda:0'),  # Update this based on your hardware
+     trust_remote_code=True
+ )
+
+ output = pipe(['Hi, my name is Indri and I like to talk.'])
+
+ torchaudio.save('output.wav', output[0]['audio'][0], sample_rate=24000)
+ ```
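+
+ The pipeline accepts a list of sentences, so you can pass several inputs in one call, including code-mixed Hindi-English text. The snippet below continues from the code above and assumes the same per-input output format; the Hindi sentence is only an illustrative example.
+
+ ```python
+ # Continues from the snippet above (reuses `pipe` and `torchaudio`).
+ outputs = pipe([
+     'Hi, my name is Indri and I like to talk.',
+     'नमस्ते, main Indri hoon and I can speak both Hindi and English.'
+ ])
+
+ # Each input gets its own 24 kHz audio tensor in the output.
+ for i, out in enumerate(outputs):
+     torchaudio.save(f'output_{i}.wav', out['audio'][0], sample_rate=24000)
+ ```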
+
+ ## Credits
+
+ 1. [Kyutai/mimi](https://huggingface.co/kyutai/mimi)
+ 2. [nanoGPT](https://github.com/karpathy/nanoGPT)
+
+ ## Citation
+
+ To cite our work:
+
+ ```bibtex
+ @misc{indri-0.1-125m-tts,
+   author       = {11mlabs},
+   title        = {indri-0.1-125m-tts},
+   year         = 2024,
+   publisher    = {Hugging Face},
+   journal      = {GitHub Repository},
+   howpublished = {\url{https://github.com/cmeraki/indri}},
+ }
+ ```