---
license: cc-by-nc-4.0
datasets:
- facebook/multilingual_librispeech
- parler-tts/libritts_r_filtered
- amphion/Emilia-Dataset
- parler-tts/mls_eng
language:
- en
- zh
- ja
- ko
pipeline_tag: text-to-speech
---
<style>
table {
    border-collapse: collapse;
    width: 100%;
    margin-bottom: 20px;
}
th, td {
    border: 1px solid #ddd;
    padding: 8px;
    text-align: center;
}
.best {
    font-weight: bold;
    text-decoration: underline;
}
.box {
  text-align: center;
  margin: 20px auto;
  padding: 30px;
  box-shadow: 0px 0px 20px 10px rgba(0, 0, 0, 0.05), 0px 1px 3px 10px rgba(255, 255, 255, 0.05);
  border-radius: 10px;
}
.badges {
    display: flex;
    justify-content: center;
    gap: 10px;
    flex-wrap: wrap;
    margin-top: 10px;
}
.badge {
    text-decoration: none;
    display: inline-block;
    padding: 4px 8px;
    border-radius: 5px;
    color: #fff;
    font-size: 12px;
    font-weight: bold;
    width: 250px;
}
.badge-hf-blue {
    background-color: #767b81;
}
.badge-hf-pink {
    background-color: #7b768a;
}
.badge-github {
    background-color: #2c2b2b;
}
</style>

<div class="box">
  <div style="margin-bottom: 20px;">
    <h2 style="margin-bottom: 4px; margin-top: 0px;">OuteAI</h2>
    <a href="https://www.outeai.com/" target="_blank" style="margin-right: 10px;">🌐 OuteAI.com</a> 
    <a href="https://discord.gg/vyBM87kAmf" target="_blank" style="margin-right: 10px;">πŸ’¬ Join our Discord</a>
    <a href="https://x.com/OuteAI" target="_blank">𝕏 @OuteAI</a>
  </div>
  <div class="badges">
    <a href="https://huggingface.co/OuteAI/OuteTTS-0.2-500M" target="_blank" class="badge badge-hf-blue">πŸ€— Hugging Face - OuteTTS 0.2 500M</a>
    <a href="https://huggingface.co/OuteAI/OuteTTS-0.2-500M-GGUF" target="_blank" class="badge badge-hf-blue">πŸ€— Hugging Face - OuteTTS 0.2 500M GGUF</a>
    <a href="https://huggingface.co/spaces/OuteAI/OuteTTS-0.2-500M-Demo" target="_blank" class="badge badge-hf-pink">πŸ€— Hugging Face - Demo Space</a>
    <a href="https://github.com/edwko/OuteTTS" target="_blank" class="badge badge-github">GitHub - OuteTTS</a>
  </div>
</div>

## Model Description

OuteTTS-0.2-500M is the improved successor to the v0.1 release.
The model keeps the same approach of using audio prompts, with no architectural changes to the foundation model itself.
Built upon Qwen-2.5-0.5B, this version was trained on larger and more diverse datasets, resulting in significant improvements across all aspects of performance.

Special thanks to **Hugging Face** for providing a GPU grant that supported the training of this model!

## Key Improvements

- **Enhanced Accuracy**: Significantly improved prompt following and output coherence compared to the previous version
- **Natural Speech**: Produces more natural and fluid speech synthesis
- **Expanded Vocabulary**: Trained on over 5 billion audio prompt tokens
- **Voice Cloning**: Improved voice cloning capabilities with greater diversity and accuracy
- **Multilingual Support**: New experimental support for Chinese, Japanese, and Korean languages

## Speech Demo

<video width="1280" height="720" controls>
  <source src="https://huggingface.co/OuteAI/OuteTTS-0.2-500M-GGUF/resolve/main/media/demo.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>

## Installation

[![GitHub](https://img.shields.io/badge/GitHub-OuteTTS-181717?logo=github)](https://github.com/edwko/OuteTTS)

```bash
pip install outetts --upgrade
```

**Important:**
- For GGUF support, install `llama-cpp-python` manually. [Installation Guide](https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#installation)
- For EXL2 support, install `exllamav2` manually. [Installation Guide](https://github.com/turboderp/exllamav2?tab=readme-ov-file#installation)
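
For example, a CUDA-enabled build of `llama-cpp-python` typically looks like the following. This is a sketch; check the linked installation guide for the flags that match your platform and version:

```bash
# Build llama-cpp-python with CUDA offloading enabled (flag names may vary by version)
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
```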

## Usage

### Quick Start: Basic Full Example

```python
import outetts

# Configure the model
model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en",  # Supported languages in v0.2: en, zh, ja, ko
)

# Initialize the interface
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)

# Print available default speakers
interface.print_default_speakers()

# Load a default speaker
speaker = interface.load_default_speaker(name="male_1")

# Generate speech
output = interface.generate(
    text="Speech synthesis is the artificial production of human speech.",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,

    # Optional: Use a speaker profile for consistent voice characteristics
    # Without a speaker profile, the model will generate a voice with random characteristics
    speaker=speaker,
)

# Save the generated speech to a file
output.save("output.wav")

# Optional: Play the generated audio
# output.play()
```

### Backend-Specific Configuration

#### Hugging Face Transformers

```python
import outetts

model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en",  # Supported languages in v0.2: en, zh, ja, ko
)

interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)
```

#### GGUF (llama-cpp-python)

```python
import outetts

model_config = outetts.GGUFModelConfig_v1(
    model_path="local/path/to/model.gguf",
    language="en", # Supported languages in v0.2: en, zh, ja, ko
    n_gpu_layers=0,  # number of layers to offload to the GPU; 0 runs fully on CPU
)

interface = outetts.InterfaceGGUF(model_version="0.2", cfg=model_config)
```
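
If you don't yet have a GGUF file locally, you can download one from this repository. The filename below is a placeholder; pick the actual quantization you want from the repository's file list:

```bash
# Placeholder filename: substitute the GGUF quantization you want from this repo
huggingface-cli download OuteAI/OuteTTS-0.2-500M-GGUF OuteTTS-0.2-500M-Q4_K_M.gguf --local-dir .
```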

#### ExLlamaV2

```python
import outetts

model_config = outetts.EXL2ModelConfig_v1(
    model_path="local/path/to/model",
    language="en", # Supported languages in v0.2: en, zh, ja, ko
)

interface = outetts.InterfaceEXL2(model_version="0.2", cfg=model_config)
```

### Speaker Creation and Management

#### Creating a Speaker

You can create a speaker profile for voice cloning; the resulting profile is compatible across all backends.

```python
speaker = interface.create_speaker(
    audio_path="path/to/audio/file.wav",

    # If transcript is not provided, it will be automatically transcribed using Whisper
    transcript=None,            # Set to None to use Whisper for transcription

    whisper_model="turbo",      # Optional: specify Whisper model (default: "turbo")
    whisper_device=None,        # Optional: specify device for Whisper (default: None)
)
```

#### Saving and Loading Speaker Profiles

Speaker profiles can be saved and loaded across all supported backends.

```python
# Save speaker profile
interface.save_speaker(speaker, "speaker.json")

# Load speaker profile
speaker = interface.load_speaker("speaker.json")
```
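
Putting the pieces above together, an end-to-end voice-cloning workflow might look like this sketch; it uses only the calls shown in this card, and `reference.wav` is a placeholder path:

```python
import outetts

model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en",
)
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)

# Create the profile once; without a transcript it is transcribed with Whisper
speaker = interface.create_speaker(audio_path="reference.wav")
interface.save_speaker(speaker, "speaker.json")

# In later sessions, load the saved profile instead of re-transcribing
speaker = interface.load_speaker("speaker.json")

output = interface.generate(
    text="This voice was cloned from a short reference clip.",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker,
)
output.save("cloned.wav")
```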

#### Default Speaker Initialization

OuteTTS includes a set of default speaker profiles. Use them directly:

```python
# Print available default speakers
interface.print_default_speakers()
# Load a default speaker
speaker = interface.load_default_speaker(name="male_1")
```

### Text-to-Speech Generation

The generation process is consistent across all backends.

```python
output = interface.generate(
    text="Speech synthesis is the artificial production of human speech.",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker, # Optional: speaker profile
)

output.save("output.wav")
# Optional: Play the audio
# output.play()
```

### Custom Backend Configuration

You can pass additional options to a backend configuration for specific needs.

#### Example with Flash Attention for Hugging Face Transformers

```python
import torch
import outetts

model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en",
    dtype=torch.bfloat16,
    additional_model_config={
        'attn_implementation': "flash_attention_2"
    }
)
```
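
Note that `flash_attention_2` requires a compatible CUDA GPU and the separately installed `flash-attn` package; if either is unavailable, omit `additional_model_config` to fall back to the default attention implementation.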

## Speaker Profile Recommendations

To achieve the best results when creating a speaker profile, consider the following recommendations:

1. **Audio Clip Duration:**
   - Use an audio clip of around **10-15 seconds**.
   - This duration provides sufficient data for the model to learn the speaker's characteristics while keeping the input manageable. The model's context length is 4096 tokens, allowing it to generate around 54 seconds of audio in total. However, when a speaker profile is included, this capacity is reduced proportionally to the length of the speaker's audio clip (see the sketch after this list).

2. **Audio Quality:**
   - Ensure the audio is **clear and noise-free**. Background noise or distortions can reduce the model's ability to extract accurate voice features.

3. **Accurate Transcription:**
   - Provide a highly **accurate transcription** of the audio clip. Mismatches between the audio and transcription can lead to suboptimal results.

4. **Speaker Familiarity:**
   - The model performs best with voices that are similar to those seen during training. Using a voice that is **significantly different from typical training samples** (e.g., unique accents, rare vocal characteristics) might result in inaccurate replication.
   - In such cases, you may need to **fine-tune the model** specifically on your target speaker's voice to achieve a better representation.

5. **Parameter Adjustments:**
   - Adjust parameters like `temperature` in the `generate` function to refine the expressive quality and consistency of the synthesized voice.
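
As a rough illustration of the context budget from point 1, the figures above imply about 4096 / 54 ≈ 75 audio tokens per second. The sketch below estimates the remaining generation time for a given reference clip length; it is an approximation derived from this card's numbers and ignores the text prompt's own token usage:

```python
# Approximate audio tokens per second, derived from the card's stated figures:
# a 4096-token context covers roughly 54 seconds of audio.
TOKENS_PER_SECOND = 4096 / 54  # ~75


def remaining_generation_seconds(speaker_clip_seconds: float,
                                 context_tokens: int = 4096) -> float:
    """Estimate seconds of new audio that still fit in the context
    after a speaker profile of the given length is included."""
    used_tokens = speaker_clip_seconds * TOKENS_PER_SECOND
    return max(context_tokens - used_tokens, 0.0) / TOKENS_PER_SECOND


# A 15-second reference clip leaves roughly 39 seconds of generation capacity.
print(f"{remaining_generation_seconds(15):.0f} s")
```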

## Model Specifications
- **Base Model**: Qwen-2.5-0.5B
- **Parameter Count**: 500M
- **Language Support**:
  - Primary: English
  - Experimental: Chinese, Japanese, Korean
- **License**: CC BY NC 4.0

## Training Datasets
- Emilia-Dataset (CC BY NC 4.0)
- LibriTTS-R (CC BY 4.0)
- Multilingual LibriSpeech (MLS) (CC BY 4.0)

## Credits & References
- [WavTokenizer](https://github.com/jishengpeng/WavTokenizer)
- [CTC Forced Alignment](https://pytorch.org/audio/stable/tutorials/ctc_forced_alignment_api_tutorial.html)
- [Qwen-2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)