---
license: cc-by-nc-4.0
datasets:
- facebook/multilingual_librispeech
- parler-tts/libritts_r_filtered
- amphion/Emilia-Dataset
- parler-tts/mls_eng
language:
- en
- zh
- ja
- ko
pipeline_tag: text-to-speech
---
<style>
table {
border-collapse: collapse;
width: 100%;
margin-bottom: 20px;
}
th, td {
border: 1px solid #ddd;
padding: 8px;
text-align: center;
}
.best {
font-weight: bold;
text-decoration: underline;
}
.box {
text-align: center;
margin: 20px auto;
padding: 30px;
box-shadow: 0px 0px 20px 10px rgba(0, 0, 0, 0.05), 0px 1px 3px 10px rgba(255, 255, 255, 0.05);
border-radius: 10px;
}
.badges {
display: flex;
justify-content: center;
gap: 10px;
flex-wrap: wrap;
margin-top: 10px;
}
.badge {
text-decoration: none;
display: inline-block;
padding: 4px 8px;
border-radius: 5px;
color: #fff;
font-size: 12px;
font-weight: bold;
width: 250px;
}
.badge-hf-blue {
background-color: #767b81;
}
.badge-hf-pink {
background-color: #7b768a;
}
.badge-github {
background-color: #2c2b2b;
}
</style>
<div class="box">
<div style="margin-bottom: 20px;">
<h2 style="margin-bottom: 4px; margin-top: 0px;">OuteAI</h2>
<a href="https://www.outeai.com/" target="_blank" style="margin-right: 10px;">π OuteAI.com</a>
<a href="https://discord.gg/vyBM87kAmf" target="_blank" style="margin-right: 10px;">π¬ Join our Discord</a>
<a href="https://x.com/OuteAI" target="_blank">π @OuteAI</a>
</div>
<div class="badges">
<a href="https://huggingface.co/OuteAI/OuteTTS-0.2-500M" target="_blank" class="badge badge-hf-blue">π€ Hugging Face - OuteTTS 0.2 500M</a>
<a href="https://huggingface.co/OuteAI/OuteTTS-0.2-500M-GGUF" target="_blank" class="badge badge-hf-blue">π€ Hugging Face - OuteTTS 0.2 500M GGUF</a>
<a href="https://huggingface.co/spaces/OuteAI/OuteTTS-0.2-500M-Demo" target="_blank" class="badge badge-hf-pink">π€ Hugging Face - Demo Space</a>
<a href="https://github.com/edwko/OuteTTS" target="_blank" class="badge badge-github">GitHub - OuteTTS</a>
</div>
</div>
## Model Description
OuteTTS-0.2-500M is our improved successor to the v0.1 release.
The model maintains the same approach of using audio prompts without architectural changes to the foundation model itself.
Built upon Qwen-2.5-0.5B, this version was trained on larger and more diverse datasets, resulting in significant improvements across all aspects of performance.
Special thanks to **Hugging Face** for providing a GPU grant that supported the training of this model!
## Key Improvements
- **Enhanced Accuracy**: Significantly improved prompt following and output coherence compared to the previous version
- **Natural Speech**: Produces more natural and fluid speech synthesis
- **Expanded Vocabulary**: Trained on over 5 billion audio prompt tokens
- **Voice Cloning**: Improved voice cloning capabilities with greater diversity and accuracy
- **Multilingual Support**: New experimental support for Chinese, Japanese, and Korean languages
## Speech Demo
<video width="1280" height="720" controls>
<source src="https://huggingface.co/OuteAI/OuteTTS-0.2-500M-GGUF/resolve/main/media/demo.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
## Installation
[![GitHub](https://img.shields.io/badge/GitHub-OuteTTS-181717?logo=github)](https://github.com/edwko/OuteTTS)
```bash
pip install outetts --upgrade
```
**Important:**
- For GGUF support, install `llama-cpp-python` manually. [Installation Guide](https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#installation)
- For EXL2 support, install `exllamav2` manually. [Installation Guide](https://github.com/turboderp/exllamav2?tab=readme-ov-file#installation)
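For reference, installing the optional backends typically looks like the sketch below. This is only a sketch: the exact commands and build flags depend on your platform, GPU, and package versions, so prefer the linked installation guides.
```bash
# Basic CPU build of llama-cpp-python; GPU builds need extra CMake flags
# (e.g. recent versions document CMAKE_ARGS="-DGGML_CUDA=on" for CUDA) --
# check the linked installation guide for your platform and version.
pip install llama-cpp-python

# ExLlamaV2 (requires a CUDA-capable GPU and a matching PyTorch install)
pip install exllamav2
```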
## Usage
### Quick Start: Basic Full Example
```python
import outetts
# Configure the model
model_config = outetts.HFModelConfig_v1(
model_path="OuteAI/OuteTTS-0.2-500M",
language="en", # Supported languages in v0.2: en, zh, ja, ko
)
# Initialize the interface
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)
# Print available default speakers
interface.print_default_speakers()
# Load a default speaker
speaker = interface.load_default_speaker(name="male_1")
# Generate speech
output = interface.generate(
text="Speech synthesis is the artificial production of human speech.",
temperature=0.1,
repetition_penalty=1.1,
max_length=4096,
# Optional: Use a speaker profile for consistent voice characteristics
# Without a speaker profile, the model will generate a voice with random characteristics
speaker=speaker,
)
# Save the generated speech to a file
output.save("output.wav")
# Optional: Play the generated audio
# output.play()
```
### Backend-Specific Configuration
#### Hugging Face Transformers
```python
import outetts
model_config = outetts.HFModelConfig_v1(
model_path="OuteAI/OuteTTS-0.2-500M",
language="en", # Supported languages in v0.2: en, zh, ja, ko
)
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)
```
#### GGUF (llama-cpp-python)
```python
import outetts
model_config = outetts.GGUFModelConfig_v1(
model_path="local/path/to/model.gguf",
language="en", # Supported languages in v0.2: en, zh, ja, ko
n_gpu_layers=0,
)
interface = outetts.InterfaceGGUF(model_version="0.2", cfg=model_config)
```
#### ExLlamaV2
```python
import outetts
model_config = outetts.EXL2ModelConfig_v1(
model_path="local/path/to/model",
language="en", # Supported languages in v0.2: en, zh, ja, ko
)
interface = outetts.InterfaceEXL2(model_version="0.2", cfg=model_config)
```
### Speaker Creation and Management
#### Creating a Speaker
You can create a speaker profile for voice cloning, which is compatible across all backends.
```python
speaker = interface.create_speaker(
audio_path="path/to/audio/file.wav",
# If transcript is not provided, it will be automatically transcribed using Whisper
transcript=None, # Set to None to use Whisper for transcription
whisper_model="turbo", # Optional: specify Whisper model (default: "turbo")
whisper_device=None, # Optional: specify device for Whisper (default: None)
)
```
#### Saving and Loading Speaker Profiles
Speaker profiles can be saved and loaded across all supported backends.
```python
# Save speaker profile
interface.save_speaker(speaker, "speaker.json")
# Load speaker profile
speaker = interface.load_speaker("speaker.json")
```
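Putting these pieces together, a typical voice-cloning workflow might look like the following minimal sketch. The file paths (`reference.wav`, `my_speaker.json`) are placeholders; the API calls mirror the examples above.
```python
import outetts

# One-time setup: build and persist a speaker profile from a reference clip
model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en",
)
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)

speaker = interface.create_speaker(audio_path="reference.wav")  # placeholder path
interface.save_speaker(speaker, "my_speaker.json")

# Later sessions: reload the saved profile and generate in the cloned voice
speaker = interface.load_speaker("my_speaker.json")
output = interface.generate(
    text="This sentence is spoken in the cloned voice.",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker,
)
output.save("cloned_output.wav")
```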
#### Default Speaker Initialization
OuteTTS includes a set of default speaker profiles. Use them directly:
```python
# Print available default speakers
interface.print_default_speakers()
# Load a default speaker
speaker = interface.load_default_speaker(name="male_1")
```
### Text-to-Speech Generation
The generation process is consistent across all backends.
```python
output = interface.generate(
text="Speech synthesis is the artificial production of human speech.",
temperature=0.1,
repetition_penalty=1.1,
max_length=4096,
speaker=speaker, # Optional: speaker profile
)
output.save("output.wav")
# Optional: Play the audio
# output.play()
```
### Custom Backend Configuration
You can initialize custom backend configurations for specific needs.
#### Example with Flash Attention for Hugging Face Transformers
```python
import torch
import outetts

model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en",
    dtype=torch.bfloat16,
    additional_model_config={
        'attn_implementation': "flash_attention_2"
    }
)
# Initialize the interface with the custom configuration
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)
```
## Speaker Profile Recommendations
To achieve the best results when creating a speaker profile, consider the following recommendations:
1. **Audio Clip Duration:**
- Use an audio clip of around **10-15 seconds**.
- This duration provides sufficient data for the model to learn the speaker's characteristics while keeping the input manageable. The model's context length is 4096 tokens, allowing it to generate around 54 seconds of audio in total. However, when a speaker profile is included, this capacity is reduced proportionally to the length of the speaker's audio clip; a worked example follows this list.
2. **Audio Quality:**
- Ensure the audio is **clear and noise-free**. Background noise or distortions can reduce the model's ability to extract accurate voice features.
3. **Accurate Transcription:**
- Provide a highly **accurate transcription** of the audio clip. Mismatches between the audio and transcription can lead to suboptimal results.
4. **Speaker Familiarity:**
- The model performs best with voices that are similar to those seen during training. Using a voice that is **significantly different from typical training samples** (e.g., unique accents, rare vocal characteristics) might result in inaccurate replication.
- In such cases, you may need to **fine-tune the model** specifically on your target speaker's voice to achieve a better representation.
5. **Parameter Adjustments:**
- Adjust parameters like `temperature` in the `generate` function to refine the expressive quality and consistency of the synthesized voice.
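To make the context-length trade-off in point 1 concrete, here is a back-of-the-envelope calculation using the numbers above. The tokens-per-second rate is derived from 4096 tokens ≈ 54 seconds and is an approximation, not an exact property of the tokenizer; in practice the text prompt also consumes part of the context window, so treat the result as an upper bound.
```python
# Approximate audio-token budget, derived from: 4096 tokens ~= 54 s of audio
CONTEXT_TOKENS = 4096
TOKENS_PER_SECOND = CONTEXT_TOKENS / 54  # ~75.9 tokens per second (approximate)

def remaining_generation_seconds(speaker_clip_seconds: float) -> float:
    """Estimate how many seconds of speech can still be generated
    after a speaker profile of the given length is included."""
    used = speaker_clip_seconds * TOKENS_PER_SECOND
    return max(0.0, (CONTEXT_TOKENS - used) / TOKENS_PER_SECOND)

# A 15-second reference clip leaves roughly 54 - 15 = 39 seconds of output
print(f"{remaining_generation_seconds(15):.0f} s")  # -> 39 s
```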
## Model Specifications
- **Base Model**: Qwen-2.5-0.5B
- **Parameter Count**: 500M
- **Language Support**:
- Primary: English
- Experimental: Chinese, Japanese, Korean
- **License**: CC BY-NC 4.0
## Training Datasets
- Emilia-Dataset (CC BY-NC 4.0)
- LibriTTS-R (CC BY 4.0)
- Multilingual LibriSpeech (MLS) (CC BY 4.0)
## Credits & References
- [WavTokenizer](https://github.com/jishengpeng/WavTokenizer)
- [CTC Forced Alignment](https://pytorch.org/audio/stable/tutorials/ctc_forced_alignment_api_tutorial.html)
- [Qwen-2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) |