Update README.md
README.md
CHANGED
@@ -65,9 +65,9 @@ th, td {
<div class="box">
<div style="margin-bottom: 20px;">
<h2 style="margin-bottom: 4px; margin-top: 0px;">OuteAI</h2>
- <a href="https://www.outeai.com/" target="_blank" style="margin-right: 10px;"
- <a href="https://discord.gg/vyBM87kAmf" target="_blank" style="margin-right: 10px;"
- <a href="https://x.com/OuteAI" target="_blank"
</div>
<div class="badges">
<a href="https://huggingface.co/OuteAI/OuteTTS-0.2-500M" target="_blank" class="badge badge-hf-blue">🤗 Hugging Face - OuteTTS 0.2 500M</a>
@@ -83,7 +83,7 @@ OuteTTS-0.2-500M is our improved successor to the v0.1 release.
The model maintains the same approach of using audio prompts without architectural changes to the foundation model itself.
Built upon the Qwen-2.5-0.5B, this version was trained on larger and more diverse datasets, resulting in significant improvements across all aspects of performance.

- Special thanks to **Hugging Face** for providing GPU grant that supported the training of this model

## Key Improvements
@@ -100,17 +100,21 @@ Special thanks to **Hugging Face** for providing GPU grant that supported the tr
Your browser does not support the video tag.
</video>

- ##
-
- ### Installation

[![GitHub](https://img.shields.io/badge/GitHub-OuteTTS-181717?logo=github)](https://github.com/edwko/OuteTTS)

```bash
- pip install outetts
```

-

```python
import outetts
@@ -118,30 +122,21 @@ import outetts
# Configure the model
model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
-     language="en", # Supported languages
)

# Initialize the interface
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)

- #
- # speaker = interface.create_speaker(
- #     audio_path="path/to/audio/file",
- #     transcript="Transcription of the audio file."
- # )
-
- # Optional: Save and load speaker profiles
- # interface.save_speaker(speaker, "speaker.json")
- # speaker = interface.load_speaker("speaker.json")
-
- # Optional: Load speaker from default presets
interface.print_default_speakers()
speaker = interface.load_default_speaker(name="male_1")

output = interface.generate(
-     text="Speech synthesis is the artificial production of human speech.
-     # Lower temperature values may result in a more stable tone,
-     # while higher values can introduce varied and expressive speech
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
@@ -151,36 +146,123 @@ output = interface.generate(
    speaker=speaker,
)

- # Save the
output.save("output.wav")

- # Optional: Play the
# output.play()
```

-

```python
-
model_config = outetts.GGUFModelConfig_v1(
    model_path="local/path/to/model.gguf",
    language="en", # Supported languages in v0.2: en, zh, ja, ko
    n_gpu_layers=0,
)

- # Initialize the GGUF interface
interface = outetts.InterfaceGGUF(model_version="0.2", cfg=model_config)
```

-

```python
import outetts
- import torch

model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
-     language="en",
    dtype=torch.bfloat16,
    additional_model_config={
        'attn_implementation': "flash_attention_2"
@@ -188,7 +270,7 @@ model_config = outetts.HFModelConfig_v1(
)
```

- ##

To achieve the best results when creating a speaker profile, consider the following recommendations:

README.md (updated)

<div class="box">
<div style="margin-bottom: 20px;">
<h2 style="margin-bottom: 4px; margin-top: 0px;">OuteAI</h2>
<a href="https://www.outeai.com/" target="_blank" style="margin-right: 10px;">🌎 OuteAI.com</a>
<a href="https://discord.gg/vyBM87kAmf" target="_blank" style="margin-right: 10px;">💬 Join our Discord</a>
<a href="https://x.com/OuteAI" target="_blank">✖️ (Twitter) @OuteAI</a>
</div>
<div class="badges">
<a href="https://huggingface.co/OuteAI/OuteTTS-0.2-500M" target="_blank" class="badge badge-hf-blue">🤗 Hugging Face - OuteTTS 0.2 500M</a>
The model maintains the same approach of using audio prompts without architectural changes to the foundation model itself.
Built upon the Qwen-2.5-0.5B, this version was trained on larger and more diverse datasets, resulting in significant improvements across all aspects of performance.

Special thanks to **Hugging Face** for providing a GPU grant that supported the training of this model!

## Key Improvements
Your browser does not support the video tag.
</video>

## Installation

[![GitHub](https://img.shields.io/badge/GitHub-OuteTTS-181717?logo=github)](https://github.com/edwko/OuteTTS)

```bash
pip install outetts --upgrade
```

**Important:**
- For GGUF support, install `llama-cpp-python` manually. [Installation Guide](https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#installation)
- For EXL2 support, install `exllamav2` manually. [Installation Guide](https://github.com/turboderp/exllamav2?tab=readme-ov-file#installation)

## Usage

### Quick Start: Basic Full Example

```python
import outetts

# Configure the model
model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en", # Supported languages: en, zh, ja, ko
)

# Initialize the interface
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)

# Print available default speakers
interface.print_default_speakers()

# Load a default speaker
speaker = interface.load_default_speaker(name="male_1")

# Generate speech
output = interface.generate(
    text="Speech synthesis is the artificial production of human speech.",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker,
)

# Save the generated speech to a file
output.save("output.wav")

# Optional: Play the generated audio
# output.play()
```
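
One usage note from me rather than the README: the interface and speaker objects are plain Python objects that can be reused across calls, so batching several lines into separate files is straightforward. The sentences and file names below are placeholders.

```python
# Hypothetical batch of lines to synthesize with the speaker loaded above
lines = [
    "Speech synthesis is the artificial production of human speech.",
    "A text-to-speech system converts normal language text into speech.",
]

for i, line in enumerate(lines):
    # Reuse the same interface and speaker for every call
    output = interface.generate(
        text=line,
        temperature=0.1,
        repetition_penalty=1.1,
        max_length=4096,
        speaker=speaker,
    )
    output.save(f"output_{i}.wav")  # placeholder naming scheme
```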

### Backend-Specific Configuration

#### Hugging Face Transformers

```python
import outetts

model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en", # Supported languages in v0.2: en, zh, ja, ko
)

interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)
```

#### GGUF (llama-cpp-python)

```python
import outetts

model_config = outetts.GGUFModelConfig_v1(
    model_path="local/path/to/model.gguf",
    language="en", # Supported languages in v0.2: en, zh, ja, ko
    n_gpu_layers=0,
)

interface = outetts.InterfaceGGUF(model_version="0.2", cfg=model_config)
```
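
A gloss from me, not the README: `n_gpu_layers` appears to be the standard llama.cpp offloading knob, so `0` keeps inference entirely on the CPU, while larger values offload that many transformer layers to the GPU for faster generation at the cost of VRAM.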

#### ExLlamaV2

```python
import outetts

model_config = outetts.EXL2ModelConfig_v1(
    model_path="local/path/to/model",
    language="en", # Supported languages in v0.2: en, zh, ja, ko
)

interface = outetts.InterfaceEXL2(model_version="0.2", cfg=model_config)
```

### Speaker Creation and Management

#### Creating a Speaker

You can create a speaker profile for voice cloning, which is compatible across all backends.

```python
speaker = interface.create_speaker(
    audio_path="path/to/audio/file.wav",

    # If transcript is not provided, it will be automatically transcribed using Whisper
    transcript=None, # Set to None to use Whisper for transcription

    whisper_model="turbo", # Optional: specify Whisper model (default: "turbo")
    whisper_device=None, # Optional: specify device for Whisper (default: None)
)
```
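
To make the cloning flow concrete, here is a minimal end-to-end sketch of mine that chains the calls documented above; the reference clip and output file names are placeholders, not part of the README.

```python
import outetts

model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en",
)
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)

# Clone a voice from a short reference recording (placeholder path)
speaker = interface.create_speaker(
    audio_path="reference.wav",
    transcript=None,  # let Whisper transcribe the clip
)

# Synthesize with the cloned voice
output = interface.generate(
    text="This sentence should come out in the cloned voice.",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker,
)
output.save("cloned.wav")
```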

#### Saving and Loading Speaker Profiles

Speaker profiles can be saved and loaded across all supported backends.

```python
# Save speaker profile
interface.save_speaker(speaker, "speaker.json")

# Load speaker profile
speaker = interface.load_speaker("speaker.json")
```
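
A common pattern, sketched by me rather than taken from the README: create the profile once and reuse the cached JSON on later runs, so the reference audio only has to be processed the first time. `SPEAKER_PATH` is a hypothetical cache location.

```python
import os

SPEAKER_PATH = "speaker.json"  # hypothetical cache location

if os.path.exists(SPEAKER_PATH):
    # Reuse the previously created profile
    speaker = interface.load_speaker(SPEAKER_PATH)
else:
    # First run: build the profile from a reference clip and cache it
    speaker = interface.create_speaker(audio_path="reference.wav", transcript=None)
    interface.save_speaker(speaker, SPEAKER_PATH)
```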

#### Default Speaker Initialization

OuteTTS includes a set of default speaker profiles. Use them directly:

```python
# Print available default speakers
interface.print_default_speakers()

# Load a default speaker
speaker = interface.load_default_speaker(name="male_1")
```

### Text-to-Speech Generation

The generation process is consistent across all backends.

```python
output = interface.generate(
    text="Speech synthesis is the artificial production of human speech.",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker, # Optional: speaker profile
)

output.save("output.wav")

# Optional: Play the audio
# output.play()
```
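
Worth carrying over from the previous revision of this example: lower temperature values may result in a more stable tone, while higher values can introduce more varied and expressive speech.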

### Custom Backend Configuration

You can initialize custom backend configurations for specific needs.

#### Example with Flash Attention for Hugging Face Transformers

```python
import outetts
import torch

model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en",
    dtype=torch.bfloat16,
    additional_model_config={
        'attn_implementation': "flash_attention_2"
    }
)
```
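
One caveat from me, not the README: `attn_implementation="flash_attention_2"` in Transformers requires the separate `flash-attn` package and a supported CUDA GPU, so install it first or drop that option to fall back to the default attention implementation.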

## Speaker Profile Recommendations

To achieve the best results when creating a speaker profile, consider the following recommendations: