Text-to-Speech
GGUF
Inference Endpoints

edwko committed
Commit aec1b5e • 1 parent: 99b4851

Update README.md

Files changed (1):
  1. README.md +115 -33
README.md CHANGED
@@ -65,9 +65,9 @@ th, td {
  <div class="box">
  <div style="margin-bottom: 20px;">
  <h2 style="margin-bottom: 4px; margin-top: 0px;">OuteAI</h2>
- <a href="https://www.outeai.com/" target="_blank" style="margin-right: 10px;">🌎 OuteAI.com</a>
- <a href="https://discord.gg/vyBM87kAmf" target="_blank" style="margin-right: 10px;">🤝 Join our Discord</a>
- <a href="https://x.com/OuteAI" target="_blank">𝕏 @OuteAI</a>
  </div>
  <div class="badges">
  <a href="https://huggingface.co/OuteAI/OuteTTS-0.2-500M" target="_blank" class="badge badge-hf-blue">🤗 Hugging Face - OuteTTS 0.2 500M</a>
@@ -83,7 +83,7 @@ OuteTTS-0.2-500M is our improved successor to the v0.1 release.
  The model maintains the same approach of using audio prompts without architectural changes to the foundation model itself.
  Built upon the Qwen-2.5-0.5B, this version was trained on larger and more diverse datasets, resulting in significant improvements across all aspects of performance.
 
- Special thanks to **Hugging Face** for providing a GPU grant that supported the training of this model.
 
  ## Key Improvements
 
@@ -100,17 +100,21 @@ Special thanks to **Hugging Face** for providing a GPU grant that supported the training of this model.
  Your browser does not support the video tag.
  </video>
 
- ## Usage
-
- ### Installation
 
  [![GitHub](https://img.shields.io/badge/GitHub-OuteTTS-181717?logo=github)](https://github.com/edwko/OuteTTS)
 
  ```bash
- pip install outetts
  ```
 
- ### Interface Usage
 
  ```python
  import outetts
@@ -118,30 +122,21 @@ import outetts
  # Configure the model
  model_config = outetts.HFModelConfig_v1(
      model_path="OuteAI/OuteTTS-0.2-500M",
-     language="en", # Supported languages in v0.2: en, zh, ja, ko
  )
 
  # Initialize the interface
  interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)
 
- # Optional: Create a speaker profile (use a 10-15 second audio clip)
- # speaker = interface.create_speaker(
- #     audio_path="path/to/audio/file",
- #     transcript="Transcription of the audio file."
- # )
-
- # Optional: Save and load speaker profiles
- # interface.save_speaker(speaker, "speaker.json")
- # speaker = interface.load_speaker("speaker.json")
-
- # Optional: Load speaker from default presets
  interface.print_default_speakers()
  speaker = interface.load_default_speaker(name="male_1")
 
  output = interface.generate(
-     text="Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and it can be implemented in software or hardware products.",
-     # Lower temperature values may result in a more stable tone,
-     # while higher values can introduce varied and expressive speech
      temperature=0.1,
      repetition_penalty=1.1,
      max_length=4096,
@@ -151,36 +146,123 @@ output = interface.generate(
      speaker=speaker,
  )
 
- # Save the synthesized speech to a file
  output.save("output.wav")
 
- # Optional: Play the synthesized speech
  # output.play()
  ```
 
- ## Using GGUF Model
 
  ```python
- # Configure the GGUF model
  model_config = outetts.GGUFModelConfig_v1(
      model_path="local/path/to/model.gguf",
      language="en", # Supported languages in v0.2: en, zh, ja, ko
      n_gpu_layers=0,
  )
 
- # Initialize the GGUF interface
  interface = outetts.InterfaceGGUF(model_version="0.2", cfg=model_config)
  ```
 
- ## Configure the model with bfloat16 and flash attention
 
  ```python
  import outetts
- import torch
 
  model_config = outetts.HFModelConfig_v1(
      model_path="OuteAI/OuteTTS-0.2-500M",
-     language="en", # Supported languages in v0.2: en, zh, ja, ko
      dtype=torch.bfloat16,
      additional_model_config={
          'attn_implementation': "flash_attention_2"
@@ -188,7 +270,7 @@ model_config = outetts.HFModelConfig_v1(
  )
  ```
 
- ## Creating a Speaker for Voice Cloning
 
  To achieve the best results when creating a speaker profile, consider the following recommendations:
  <div class="box">
  <div style="margin-bottom: 20px;">
  <h2 style="margin-bottom: 4px; margin-top: 0px;">OuteAI</h2>
+ <a href="https://www.outeai.com/" target="_blank" style="margin-right: 10px;">🌐 OuteAI.com</a>
+ <a href="https://discord.gg/vyBM87kAmf" target="_blank" style="margin-right: 10px;">💬 Join our Discord</a>
+ <a href="https://x.com/OuteAI" target="_blank">✖️ (Twitter) @OuteAI</a>
  </div>
  <div class="badges">
  <a href="https://huggingface.co/OuteAI/OuteTTS-0.2-500M" target="_blank" class="badge badge-hf-blue">🤗 Hugging Face - OuteTTS 0.2 500M</a>
 
  The model maintains the same approach of using audio prompts without architectural changes to the foundation model itself.
  Built upon the Qwen-2.5-0.5B, this version was trained on larger and more diverse datasets, resulting in significant improvements across all aspects of performance.
 
+ Special thanks to **Hugging Face** for providing a GPU grant that supported the training of this model!
 
  ## Key Improvements
 
  Your browser does not support the video tag.
  </video>
 
+ ## Installation
 
  [![GitHub](https://img.shields.io/badge/GitHub-OuteTTS-181717?logo=github)](https://github.com/edwko/OuteTTS)
 
  ```bash
+ pip install outetts --upgrade
  ```
 
+ **Important:**
+ - For GGUF support, install `llama-cpp-python` manually. [Installation Guide](https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#installation)
+ - For EXL2 support, install `exllamav2` manually. [Installation Guide](https://github.com/turboderp/exllamav2?tab=readme-ov-file#installation)
+
+ ## Usage
+
+ ### Quick Start: Basic Full Example
 
  ```python
  import outetts
 
  # Configure the model
  model_config = outetts.HFModelConfig_v1(
      model_path="OuteAI/OuteTTS-0.2-500M",
+     language="en", # Supported languages: en, zh, ja, ko
  )
 
  # Initialize the interface
  interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)
 
+ # Print available default speakers
  interface.print_default_speakers()
+
+ # Load a default speaker
  speaker = interface.load_default_speaker(name="male_1")
 
+ # Generate speech
  output = interface.generate(
+     text="Speech synthesis is the artificial production of human speech.",
      temperature=0.1,
      repetition_penalty=1.1,
      max_length=4096,
 
      speaker=speaker,
  )
 
+ # Save the generated speech to a file
  output.save("output.wav")
 
+ # Optional: Play the generated audio
  # output.play()
  ```
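For inputs much longer than the short example above, `max_length=4096` tokens bounds a single generation. One workaround (a sketch, not part of the outetts API; `chunk_text` is a hypothetical helper) is to split the text into sentence-sized chunks and call `interface.generate` once per chunk:

```python
import re

def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    # Split on sentence-ending punctuation, then greedily pack whole
    # sentences into chunks of at most max_chars characters.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# Each chunk can then be synthesized separately, e.g.:
# for i, chunk in enumerate(chunk_text(long_text)):
#     interface.generate(text=chunk, ...).save(f"part_{i}.wav")
```

Character counts are only a rough proxy for token counts, so the 500-character default here is a conservative guess, not a calibrated limit.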
+ ### Backend-Specific Configuration
+
+ #### Hugging Face Transformers
+
+ ```python
+ import outetts
+
+ model_config = outetts.HFModelConfig_v1(
+     model_path="OuteAI/OuteTTS-0.2-500M",
+     language="en", # Supported languages in v0.2: en, zh, ja, ko
+ )
+
+ interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)
+ ```
+
+ #### GGUF (llama-cpp-python)
 
  ```python
+ import outetts
+
  model_config = outetts.GGUFModelConfig_v1(
      model_path="local/path/to/model.gguf",
      language="en", # Supported languages in v0.2: en, zh, ja, ko
      n_gpu_layers=0,
  )
 
  interface = outetts.InterfaceGGUF(model_version="0.2", cfg=model_config)
  ```
 
+ #### ExLlamaV2
 
  ```python
  import outetts
 
+ model_config = outetts.EXL2ModelConfig_v1(
+     model_path="local/path/to/model",
+     language="en", # Supported languages in v0.2: en, zh, ja, ko
+ )
+
+ interface = outetts.InterfaceEXL2(model_version="0.2", cfg=model_config)
+ ```
+
+ ### Speaker Creation and Management
+
+ #### Creating a Speaker
+
+ You can create a speaker profile for voice cloning, which is compatible across all backends.
+
+ ```python
+ speaker = interface.create_speaker(
+     audio_path="path/to/audio/file.wav",
+
+     # If a transcript is not provided, it will be automatically transcribed using Whisper
+     transcript=None, # Set to None to use Whisper for transcription
+
+     whisper_model="turbo", # Optional: specify the Whisper model (default: "turbo")
+     whisper_device=None, # Optional: specify the device for Whisper (default: None)
+ )
+ ```
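Earlier guidance for this model suggested a roughly 10-15 second reference clip. Before calling `create_speaker` on a WAV file, a quick stdlib sanity check of the clip length can save a failed cloning attempt (`clip_duration_seconds` is a hypothetical helper, not part of outetts):

```python
import wave

def clip_duration_seconds(path: str) -> float:
    # Duration of an uncompressed PCM WAV file: frames / sample rate.
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

# Example gate before speaker creation (10-15 s is the suggested range):
# if not 10.0 <= clip_duration_seconds("path/to/audio/file.wav") <= 15.0:
#     print("Warning: reference clip length is outside the suggested range")
```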
+ #### Saving and Loading Speaker Profiles
+
+ Speaker profiles can be saved and loaded across all supported backends.
+
+ ```python
+ # Save a speaker profile
+ interface.save_speaker(speaker, "speaker.json")
+
+ # Load a speaker profile
+ speaker = interface.load_speaker("speaker.json")
+ ```
+
+ #### Default Speaker Initialization
+
+ OuteTTS includes a set of default speaker profiles. Use them directly:
+
+ ```python
+ # Print available default speakers
+ interface.print_default_speakers()
+ # Load a default speaker
+ speaker = interface.load_default_speaker(name="male_1")
+ ```
+
+ ### Text-to-Speech Generation
+
+ The generation process is consistent across all backends.
+
+ ```python
+ output = interface.generate(
+     text="Speech synthesis is the artificial production of human speech.",
+     temperature=0.1,
+     repetition_penalty=1.1,
+     max_length=4096,
+     speaker=speaker, # Optional: speaker profile
+ )
+
+ output.save("output.wav")
+ # Optional: Play the audio
+ # output.play()
+ ```
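`output.save` writes a standard WAV file, so several generations (for example, one per paragraph of a long text) can be stitched together with Python's stdlib `wave` module. A minimal sketch, assuming every file was produced by the same interface and therefore shares channel count, sample width, and sample rate (`concat_wavs` is a hypothetical helper, not part of outetts):

```python
import wave

def concat_wavs(paths: list[str], out_path: str) -> None:
    # Append the PCM frames of each input WAV to a single output file.
    # All inputs must share channels, sample width, and sample rate.
    with wave.open(out_path, "wb") as out:
        for i, path in enumerate(paths):
            with wave.open(path, "rb") as src:
                if i == 0:
                    out.setparams(src.getparams())
                elif src.getparams()[:3] != out.getparams()[:3]:
                    raise ValueError(f"format mismatch in {path}")
                out.writeframes(src.readframes(src.getnframes()))

# concat_wavs(["part_0.wav", "part_1.wav"], "combined.wav")
```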
+
+ ### Custom Backend Configuration
+
+ You can initialize custom backend configurations for specific needs.
+
+ #### Example with Flash Attention for Hugging Face Transformers
+
+ ```python
  model_config = outetts.HFModelConfig_v1(
      model_path="OuteAI/OuteTTS-0.2-500M",
+     language="en",
      dtype=torch.bfloat16,
      additional_model_config={
          'attn_implementation': "flash_attention_2"
 
  )
  ```
 
+ ## Speaker Profile Recommendations
 
  To achieve the best results when creating a speaker profile, consider the following recommendations: