edwko committed · Commit eea9a07 · Parent: a36edf5

Update README.md

Files changed (1): README.md (+188 −3)
---
license: cc-by-nc-4.0
datasets:
- facebook/multilingual_librispeech
- parler-tts/libritts_r_filtered
- amphion/Emilia-Dataset
language:
- en
- zh
- ja
- ko
pipeline_tag: text-to-speech
---
<style>
  table {
    border-collapse: collapse;
    width: 100%;
    margin-bottom: 20px;
  }
  th, td {
    border: 1px solid #ddd;
    padding: 8px;
    text-align: center;
  }
  .best {
    font-weight: bold;
    text-decoration: underline;
  }
  .box {
    text-align: center;
    margin: 20px auto;
    padding: 30px;
    box-shadow: 0px 0px 20px 10px rgba(0, 0, 0, 0.05), 0px 1px 3px 10px rgba(255, 255, 255, 0.05);
    border-radius: 10px;
  }
  .badges {
    display: flex;
    justify-content: center;
    gap: 10px;
    flex-wrap: wrap;
    margin-top: 10px;
  }
  .badge {
    text-decoration: none;
    display: inline-block;
    padding: 4px 8px;
    border-radius: 5px;
    color: #fff;
    font-size: 12px;
    font-weight: bold;
    width: 250px;
  }
  .badge-hf-blue {
    background-color: #767b81;
  }
  .badge-hf-pink {
    background-color: #7b768a;
  }
  .badge-github {
    background-color: #2c2b2b;
  }
</style>

<div class="box">
  <div style="margin-bottom: 20px;">
    <h2 style="margin-bottom: 4px; margin-top: 0px;">OuteAI</h2>
    <a href="https://www.outeai.com/" target="_blank" style="margin-right: 10px;">🌎 OuteAI.com</a>
    <a href="https://discord.gg/vyBM87kAmf" target="_blank" style="margin-right: 10px;">🤝 Join our Discord</a>
    <a href="https://x.com/OuteAI" target="_blank">𝕏 @OuteAI</a>
  </div>
  <div class="badges">
    <a href="https://huggingface.co/OuteAI/OuteTTS-0.2-500M" target="_blank" class="badge badge-hf-blue">🤗 Hugging Face - OuteTTS 0.2 500M</a>
    <a href="https://huggingface.co/OuteAI/OuteTTS-0.2-500M-GGUF" target="_blank" class="badge badge-hf-blue">🤗 Hugging Face - OuteTTS 0.2 500M GGUF</a>
    <a href="https://huggingface.co/spaces/OuteAI/OuteTTS-0.2-500M-Demo" target="_blank" class="badge badge-hf-pink">🤗 Hugging Face - Demo Space</a>
    <a href="https://github.com/edwko/OuteTTS" target="_blank" class="badge badge-github">GitHub - OuteTTS</a>
  </div>
</div>

## Model Description

OuteTTS-0.2-500M is our improved successor to the v0.1 release.
The model maintains the same approach of using audio prompts, with no architectural changes to the foundation model itself.
Built on Qwen-2.5-0.5B, this version was trained on larger and more diverse datasets, resulting in significant improvements across all aspects of performance.

## Key Improvements

- **Enhanced Accuracy**: Significantly improved prompt following and output coherence compared to the previous version
- **Natural Speech**: Produces more natural and fluid speech synthesis
- **Expanded Vocabulary**: Trained on over 5 billion audio prompt tokens
- **Voice Cloning**: Improved voice cloning capabilities with greater diversity and accuracy
- **Multilingual Support**: New experimental support for Chinese, Japanese, and Korean

## Speech Demo

<video width="1280" height="720" controls>
  <source src="https://huggingface.co/OuteAI/OuteTTS-0.2-500M-GGUF/resolve/main/media/demo.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>

## Usage

### Installation

[![GitHub](https://img.shields.io/badge/GitHub-OuteTTS-181717?logo=github)](https://github.com/edwko/OuteTTS)

```bash
pip install outetts
```
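
If you plan to use the GGUF build described below, note that the interface needs a GGUF-capable backend. To our understanding this is llama-cpp-python; this is an assumption, not a requirement stated on this card, but installing it separately may help if GGUF loading fails:

```bash
# Assumed dependency for the GGUF path; not listed on this card.
pip install llama-cpp-python
```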

### Interface Usage

```python
import outetts

# Configure the model
model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en",  # Supported languages in v0.2: en, zh, ja, ko
)

# Initialize the interface
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)

# Optional: Create a speaker profile (use a 10-15 second audio clip)
# speaker = interface.create_speaker(
#     audio_path="path/to/audio/file",
#     transcript="Transcription of the audio file."
# )

# Optional: Save and load speaker profiles
# interface.save_speaker(speaker, "speaker.pkl")
# speaker = interface.load_speaker("speaker.pkl")

# Optional: Load speaker from default presets
interface.print_default_speakers()
speaker = interface.load_default_speaker(name="male_1")

output = interface.generate(
    text="Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and it can be implemented in software or hardware products.",
    # Lower temperature values may result in a more stable tone,
    # while higher values can introduce varied and expressive speech
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,

    # Optional: Use a speaker profile for consistent voice characteristics
    # Without a speaker profile, the model will generate a voice with random characteristics
    speaker=speaker,
)

# Save the synthesized speech to a file
output.save("output.wav")

# Optional: Play the synthesized speech
# output.play()
```
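
For voice cloning specifically, the optional calls commented out above combine into a short end-to-end flow. This is a minimal sketch using only calls shown on this card; the reference audio path and transcript are placeholders you would supply yourself:

```python
import outetts

model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en",
)
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)

# Build a speaker profile from your own reference clip (placeholder path and transcript)
speaker = interface.create_speaker(
    audio_path="reference.wav",  # a 10-15 second clip works best
    transcript="Exact transcription of the reference clip.",
)
interface.save_speaker(speaker, "speaker.pkl")  # reuse later without re-processing the audio

# Generate in the cloned voice
output = interface.generate(
    text="This sentence should come out in the cloned voice.",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker,
)
output.save("cloned.wav")
```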

## Using GGUF Model

```python
import outetts

# Configure the GGUF model
model_config = outetts.GGUFModelConfig_v1(
    model_path="local/path/to/model.gguf",
    language="en",  # Supported languages in v0.2: en, zh, ja, ko
    n_gpu_layers=0,  # 0 keeps inference on the CPU; raise it to offload layers to the GPU
)

# Initialize the GGUF interface
interface = outetts.InterfaceGGUF(model_version="0.2", cfg=model_config)
```
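
The examples on this card suggest the GGUF interface exposes the same generation API as the HF interface above; assuming that holds, the rest of the pipeline is unchanged. A minimal sketch continuing from the setup code:

```python
# Continues from the GGUF setup above; assumes InterfaceGGUF shares
# the generate/speaker methods shown for InterfaceHF on this card.
speaker = interface.load_default_speaker(name="male_1")

output = interface.generate(
    text="Testing the GGUF build of OuteTTS.",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker,
)
output.save("output_gguf.wav")
```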

## Model Specifications

- **Base Model**: Qwen-2.5-0.5B
- **Parameter Count**: 500M
- **Language Support**:
  - Primary: English
  - Experimental: Chinese, Japanese, Korean
- **License**: CC BY-NC 4.0

## Training Datasets

- Emilia-Dataset (CC BY-NC 4.0)
- LibriTTS-R (CC BY 4.0)
- Multilingual LibriSpeech (MLS) (CC BY 4.0)

## Credits & References

- [WavTokenizer](https://github.com/jishengpeng/WavTokenizer)
- [CTC Forced Alignment](https://pytorch.org/audio/stable/tutorials/ctc_forced_alignment_api_tutorial.html)
- [Qwen-2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)