---
license: cc-by-nc-4.0
datasets:
- facebook/multilingual_librispeech
- parler-tts/libritts_r_filtered
- amphion/Emilia-Dataset
language:
- en
- zh
- ja
- ko
pipeline_tag: text-to-speech
---

<style>
table {
  border-collapse: collapse;
  width: 100%;
  margin-bottom: 20px;
}
th, td {
  border: 1px solid #ddd;
  padding: 8px;
  text-align: center;
}
.best {
  font-weight: bold;
  text-decoration: underline;
}
.box {
  text-align: center;
  margin: 20px auto;
  padding: 30px;
  box-shadow: 0px 0px 20px 10px rgba(0, 0, 0, 0.05), 0px 1px 3px 10px rgba(255, 255, 255, 0.05);
  border-radius: 10px;
}
.badges {
  display: flex;
  justify-content: center;
  gap: 10px;
  flex-wrap: wrap;
  margin-top: 10px;
}
.badge {
  text-decoration: none;
  display: inline-block;
  padding: 4px 8px;
  border-radius: 5px;
  color: #fff;
  font-size: 12px;
  font-weight: bold;
  width: 250px;
}
.badge-hf-blue {
  background-color: #767b81;
}
.badge-hf-pink {
  background-color: #7b768a;
}
.badge-github {
  background-color: #2c2b2b;
}
</style>

<div class="box">
  <div style="margin-bottom: 20px;">
    <h2 style="margin-bottom: 4px; margin-top: 0px;">OuteAI</h2>
    <a href="https://www.outeai.com/" target="_blank" style="margin-right: 10px;">OuteAI.com</a>
    <a href="https://discord.gg/vyBM87kAmf" target="_blank" style="margin-right: 10px;">Join our Discord</a>
    <a href="https://x.com/OuteAI" target="_blank">@OuteAI</a>
  </div>
  <div class="badges">
    <a href="https://huggingface.co/OuteAI/OuteTTS-0.2-500M" target="_blank" class="badge badge-hf-blue">🤗 Hugging Face - OuteTTS 0.2 500M</a>
    <a href="https://huggingface.co/OuteAI/OuteTTS-0.2-500M-GGUF" target="_blank" class="badge badge-hf-blue">🤗 Hugging Face - OuteTTS 0.2 500M GGUF</a>
    <a href="https://huggingface.co/spaces/OuteAI/OuteTTS-0.2-500M-Demo" target="_blank" class="badge badge-hf-pink">🤗 Hugging Face - Demo Space</a>
    <a href="https://github.com/edwko/OuteTTS" target="_blank" class="badge badge-github">GitHub - OuteTTS</a>
  </div>
</div>

## Model Description

OuteTTS-0.2-500M is the improved successor to the v0.1 release. The model keeps the same audio-prompt approach, with no architectural changes to the foundation model itself. Built on Qwen-2.5-0.5B, this version was trained on larger and more diverse datasets, yielding significant improvements across all aspects of performance.

## Key Improvements

- **Enhanced Accuracy**: Significantly improved prompt following and output coherence compared to the previous version
- **Natural Speech**: Produces more natural and fluid speech synthesis
- **Expanded Vocabulary**: Trained on over 5 billion audio prompt tokens
- **Voice Cloning**: Improved voice cloning with greater diversity and accuracy
- **Multilingual Support**: New experimental support for Chinese, Japanese, and Korean

## Speech Demo

<video width="1280" height="720" controls>
  <source src="https://huggingface.co/OuteAI/OuteTTS-0.2-500M-GGUF/resolve/main/media/demo.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>

## Usage

### Installation

[![GitHub](https://img.shields.io/badge/GitHub-OuteTTS-181717?logo=github)](https://github.com/edwko/OuteTTS)

```bash
pip install outetts
```

### Interface Usage

```python
import outetts

# Configure the model
model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en",  # Supported languages in v0.2: en, zh, ja, ko
)

# Initialize the interface
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)

# Optional: Create a speaker profile (use a 10-15 second audio clip)
# speaker = interface.create_speaker(
#     audio_path="path/to/audio/file",
#     transcript="Transcription of the audio file."
# )

# Optional: Save and load speaker profiles
# interface.save_speaker(speaker, "speaker.pkl")
# speaker = interface.load_speaker("speaker.pkl")

# Optional: Load a speaker from the default presets
interface.print_default_speakers()
speaker = interface.load_default_speaker(name="male_1")

output = interface.generate(
    text="Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and it can be implemented in software or hardware products.",
    # Lower temperature values may result in a more stable tone,
    # while higher values can introduce varied and expressive speech
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,

    # Optional: Use a speaker profile for consistent voice characteristics
    # Without a speaker profile, the model generates a voice with random characteristics
    speaker=speaker,
)

# Save the synthesized speech to a file
output.save("output.wav")

# Optional: Play the synthesized speech
# output.play()
```
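The `create_speaker` step above recommends a 10-15 second reference clip. A quick length check before cloning can save a failed run; the helper below is a hypothetical addition (not part of the `outetts` API) that uses only Python's standard library to validate a WAV file's duration:

```python
import wave


def clip_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / float(wav.getframerate())


def is_good_reference(path: str, low: float = 10.0, high: float = 15.0) -> bool:
    """Check that a reference clip falls within the recommended 10-15 s range."""
    return low <= clip_duration_seconds(path) <= high
```

If `is_good_reference` returns `False`, trim or extend the recording before passing it to `create_speaker`.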

### Using the GGUF Model

```python
import outetts

# Configure the GGUF model
model_config = outetts.GGUFModelConfig_v1(
    model_path="local/path/to/model.gguf",
    language="en",  # Supported languages in v0.2: en, zh, ja, ko
    n_gpu_layers=0,
)

# Initialize the GGUF interface
interface = outetts.InterfaceGGUF(model_version="0.2", cfg=model_config)
```

## Model Specifications

- **Base Model**: Qwen-2.5-0.5B
- **Parameter Count**: 500M
- **Language Support**:
  - Primary: English
  - Experimental: Chinese, Japanese, Korean
- **License**: CC BY-NC 4.0

## Training Datasets

- Emilia-Dataset (CC BY-NC 4.0)
- LibriTTS-R (CC BY 4.0)
- Multilingual LibriSpeech (MLS) (CC BY 4.0)

## Credits & References

- [WavTokenizer](https://github.com/jishengpeng/WavTokenizer)
- [CTC Forced Alignment](https://pytorch.org/audio/stable/tutorials/ctc_forced_alignment_api_tutorial.html)
- [Qwen-2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)