Commit ab56b02 by edwko (1 parent: 20efb21)

Update README.md

Files changed (1): README.md +21 -0
@@ -169,6 +169,27 @@ model_config = outetts.GGUFModelConfig_v1(
```python
interface = outetts.InterfaceGGUF(model_version="0.2", cfg=model_config)
```

## Creating a Speaker for Voice Cloning

To achieve the best results when creating a speaker profile, consider the following recommendations:

1. **Audio Clip Duration:**
   - Use an audio clip of around **10-15 seconds**.
   - This duration gives the model enough data to learn the speaker's characteristics while keeping the input manageable. The model's context length is 4096 tokens, which corresponds to roughly 54 seconds of audio in total. Because the speaker profile shares that context, the capacity left for generation shrinks in proportion to the length of the speaker's audio clip.

2. **Audio Quality:**
   - Ensure the audio is **clear and noise-free**. Background noise or distortion can reduce the model's ability to extract accurate voice features.

3. **Accurate Transcription:**
   - Provide a highly **accurate transcription** of the audio clip. Mismatches between the audio and the transcription can lead to suboptimal results.

4. **Speaker Familiarity:**
   - The model performs best with voices similar to those seen during training. A voice that is **significantly different from typical training samples** (e.g., unique accents, rare vocal characteristics) might be replicated inaccurately.
   - In such cases, you may need to **fine-tune the model** on your target speaker's voice to achieve a better representation.

5. **Parameter Adjustments:**
   - Adjust parameters like `temperature` in the `generate` function to refine the expressive quality and consistency of the synthesized voice.
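
To illustrate why `temperature` influences consistency, here is a generic sketch of temperature-scaled softmax, the usual way sampling temperature works in autoregressive language models. It is not taken from the `outetts` source: the logits are made-up numbers, and `softmax_with_temperature` is a hypothetical helper used only for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative logits for three candidate tokens (assumed values).
example_logits = [2.0, 1.0, 0.5]

low = softmax_with_temperature(example_logits, 0.1)
high = softmax_with_temperature(example_logits, 1.0)

# A lower temperature concentrates probability on the top candidate.
print(f"T=0.1 top probability: {low[0]:.3f}")
print(f"T=1.0 top probability: {high[0]:.3f}")
```

Under this assumption, lower values tend to give more deterministic, consistent output, while higher values add variation that may sound more expressive but less stable.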
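
The context-budget arithmetic behind the clip-duration recommendation can be sketched as follows. This is a rough estimate only: it assumes audio consumes tokens at a constant rate implied by the stated figures (4096 tokens for about 54 seconds), and it ignores tokens used by the transcript and prompt text. The function names are hypothetical and not part of the `outetts` package.

```python
CONTEXT_TOKENS = 4096      # context length stated in the README
MAX_AUDIO_SECONDS = 54.0   # approximate total audio capacity stated in the README

# Implied average rate, roughly 76 tokens per second of audio (an estimate).
TOKENS_PER_SECOND = CONTEXT_TOKENS / MAX_AUDIO_SECONDS


def remaining_audio_seconds(speaker_clip_seconds: float) -> float:
    """Estimate the seconds of audio left for generation once a speaker
    clip of the given length occupies part of the context."""
    if speaker_clip_seconds < 0:
        raise ValueError("speaker_clip_seconds must be non-negative")
    return max(0.0, MAX_AUDIO_SECONDS - speaker_clip_seconds)


def estimated_speaker_tokens(speaker_clip_seconds: float) -> int:
    """Estimate how many context tokens the speaker clip consumes."""
    return round(speaker_clip_seconds * TOKENS_PER_SECOND)


if __name__ == "__main__":
    for clip in (10, 15, 30):
        print(
            f"{clip:>2}s clip: ~{estimated_speaker_tokens(clip)} tokens used, "
            f"~{remaining_audio_seconds(clip):.0f}s left for generation"
        )
```

By this estimate, a 15-second speaker clip leaves about 39 seconds for generated audio, which is one reason to keep the clip within the recommended 10-15 seconds.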
## Model Specifications
- **Base Model**: Qwen-2.5-0.5B
- **Parameter Count**: 500M