cmeraki committed on
Commit
6cb714c
1 Parent(s): 9e94911

Updated README

Files changed (1)
  1. README.md +53 -39
README.md CHANGED
@@ -12,76 +12,90 @@ base_model:
  pipeline_tag: text-to-speech
  ---
 
- # Model Card for Model ID
 
- Indri is a series of audio models that can do TTS, ASR, and audio continuation. This is the smallest model in our series and supports TTS tasks in 2 languages:
  1. English
  2. Hindi
 
  ## Model Details
 
  ### Model Description
 
- `indri-0.1-125m-tts` is a novel, extremely small, and lightweight TTS model based on the transformer architecture.
  It models audio as tokens and can generate high-quality audio with consistent style cloning of the speaker.
 
  ### Key features
 
- 1. Based on GPT-2 architecture
- 2. Supports voice cloning with small prompts
- 3. Code mixing text input in 2 languages - English and Hindi
 
- ### Model Sources [optional]
 
- - **Repository:** [https://github.com/cmeraki/indri]
- - **Demo:** [https://www.indrivoice.ai/]
 
  ## Technical details
 
- Please read our blog [here]() for more technical details on how it was built.
-
- Here's a brief of how this model works:
 
  1. Converts input text into tokens
  2. Runs autoregressive decoding on GPT-2 based transformer model and generates audio tokens
- 3. Decodes audio tokens (from [Kyutaui/mimi](https://huggingface.co/kyutai/mimi)) to audio
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
 
- ## Training Details
 
- ### Training Data
-
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
- [More Information Needed]
-
- ### Training Procedure
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- #### Preprocessing [optional]
 
- [More Information Needed]
 
- #### Training Hyperparameters
 
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
 
- #### Speeds, Sizes, Times [optional]
 
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
 
- [More Information Needed]
 
- ## Citation [optional]
 
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 
- **BibTeX:**
 
- [More Information Needed]
  pipeline_tag: text-to-speech
  ---
 
+ # Model Card for indri-0.1-125m-tts
+
+ Indri is a series of audio models that can do TTS, ASR, and audio continuation. This is the smallest model (125M) in our series and supports TTS tasks in 2 languages:
 
  1. English
  2. Hindi
 
+ We have open-sourced our training scripts, inference code, and other details.
+
+ - **Repository:** [GitHub](https://github.com/cmeraki/indri)
+ - **Demo:** [Website](https://www.indrivoice.ai/)
+ - **Implementation details:** [Release Blog](#TODO)
 
  ## Model Details
 
  ### Model Description
 
+ `indri-0.1-125m-tts` is a novel, ultra-small, and lightweight TTS model based on the transformer architecture.
  It models audio as tokens and can generate high-quality audio with consistent style cloning of the speaker.
 
  ### Key features
 
+ 1. Based on the GPT-2 architecture. The methodology can be extended to any transformer-based architecture.
+ 2. Supports voice cloning with small prompts (<5s).
+ 3. Supports code-mixed text input in 2 languages: English and Hindi.
+ 4. Ultra-fast. Can generate 5 seconds of audio per second on Ampere-generation NVIDIA GPUs, and up to 10 seconds of audio per second on Ada-generation NVIDIA GPUs (see the timing sketch below).
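+
+ As a rough illustration of the throughput claim above, the snippet below is a minimal timing sketch built on the `pipeline` call from the How to Get Started section below; the exact real-time factor you measure will depend on your GPU, precision, and batch size.
+
+ ```python
+ import time
+
+ import torch
+ from transformers import pipeline
+
+ # Same pipeline call as in the How to Get Started section below.
+ pipe = pipeline(
+     'indri-tts',
+     model='11mlabs/indri-0.1-125m-tts',
+     device=torch.device('cuda:0'),  # Update this based on your hardware
+     trust_remote_code=True
+ )
+
+ text = ['Hi, my name is Indri and I like to talk.']
+
+ # Warm-up run so model loading and CUDA initialization are not timed.
+ pipe(text)
+
+ torch.cuda.synchronize()
+ start = time.time()
+ output = pipe(text)
+ torch.cuda.synchronize()
+ elapsed = time.time() - start
+
+ # The pipeline returns 24 kHz audio tensors of shape (channels, samples).
+ audio_seconds = output[0]['audio'][0].shape[-1] / 24000
+ print(f'{audio_seconds:.2f}s of audio in {elapsed:.2f}s '
+       f'({audio_seconds / elapsed:.2f}x real time)')
+ ```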
+
+ ### Details
+
+ 1. Model Type: GPT-2 based language model
+ 2. Size: 125M parameters
+ 3. Language Support: English, Hindi
+ 4. License: CC BY 4.0
 
  ## Technical details
 
+ Here's a brief overview of how the model works:
 
  1. Converts input text into tokens
  2. Runs autoregressive decoding on GPT-2 based transformer model and generates audio tokens
+ 3. Decodes audio tokens (from [Kyutai/mimi](https://huggingface.co/kyutai/mimi)) to audio
+
+ Please read our blog [here](#TODO) for more technical details on how it was built.
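+
+ For illustration only, here is a minimal sketch of those three stages. The `text_to_audio_codes` helper is hypothetical (the pipeline in the How to Get Started section below handles stages 1 and 2 through the model's custom code), and the number of quantizers and frames are assumed values; only the stage 3 call reflects the real `MimiModel` decoder API in `transformers`.
+
+ ```python
+ import torch
+ from transformers import MimiModel
+
+ def text_to_audio_codes(text: str) -> torch.Tensor:
+     # Hypothetical stand-in for stages 1 and 2: tokenize the text and run
+     # autoregressive decoding with the GPT-2 based model to predict audio
+     # tokens. Here we just return random Mimi codes of an assumed shape
+     # (batch, num_quantizers, num_frames), which will decode to noise.
+     num_quantizers, num_frames = 8, 100  # assumed; ~8 s at 12.5 frames/s
+     return torch.randint(0, 2048, (1, num_quantizers, num_frames))
+
+ # Stage 3: decode audio tokens back to a waveform with Kyutai's Mimi codec.
+ mimi = MimiModel.from_pretrained('kyutai/mimi')
+
+ codes = text_to_audio_codes('Hi, my name is Indri and I like to talk.')
+ with torch.no_grad():
+     audio = mimi.decode(codes).audio_values  # (batch, 1, samples) at 24 kHz
+ ```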
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started. Pipelines are the easiest way to run the model.
+
+ ```python
+ import torch
+ import torchaudio
+ from transformers import pipeline
+
+ task = 'indri-tts'
+ model_id = '11mlabs/indri-0.1-125m-tts'
+
+ pipe = pipeline(
+     task,
+     model=model_id,
+     device=torch.device('cuda:0'),  # Update this based on your hardware
+     trust_remote_code=True
+ )
+
+ output = pipe(['Hi, my name is Indri and I like to talk.'])
+
+ torchaudio.save('output.wav', output[0]['audio'][0], sample_rate=24000)
+ ```
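+
+ The pipeline accepts a list of sentences, so you can pass several inputs in one call, including code-mixed Hindi-English text. The snippet below continues from the code above and assumes the same per-input output format; the Hindi sentence is only an illustrative example.
+
+ ```python
+ # Continues from the snippet above (reuses `pipe` and `torchaudio`).
+ outputs = pipe([
+     'Hi, my name is Indri and I like to talk.',
+     'नमस्ते, main Indri hoon and I can speak both Hindi and English.'
+ ])
+
+ # Each input gets its own 24 kHz audio tensor in the output.
+ for i, out in enumerate(outputs):
+     torchaudio.save(f'output_{i}.wav', out['audio'][0], sample_rate=24000)
+ ```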
+
+ ## Credits
+
+ 1. [Kyutai/mimi](https://huggingface.co/kyutai/mimi)
+ 2. [nanoGPT](https://github.com/karpathy/nanoGPT)
+
+ ## Citation
+
+ To cite our work:
+
+ ```bibtex
+ @misc{indri-0.1-125m-tts,
+   author       = {11mlabs},
+   title        = {indri-0.1-125m-tts},
+   year         = 2024,
+   publisher    = {Hugging Face},
+   journal      = {GitHub Repository},
+   howpublished = {\url{https://github.com/cmeraki/indri}},
+ }
+ ```