OuteAI/OuteTTS-0.2-500M-GGUF


edwko committed on Nov 24, 2024

Commit ab56b02 · verified · 1 Parent(s): 20efb21

Update README.md

Files changed (1)

README.md +21 -0


@@ -169,6 +169,27 @@ model_config = outetts.GGUFModelConfig_v1(
 interface = outetts.InterfaceGGUF(model_version="0.2", cfg=model_config)
 ```
+## Creating a Speaker for Voice Cloning
+To achieve the best results when creating a speaker profile, consider the following recommendations:
+1. **Audio Clip Duration:**
+   - Use an audio clip of around **10-15 seconds**.
+   - This duration gives the model enough data to learn the speaker's characteristics while keeping the input manageable. The model's context length is 4096 tokens, which corresponds to around 54 seconds of audio in total. A speaker profile occupies part of that context, so the remaining generation capacity shrinks in proportion to the length of the speaker's audio clip.
+2. **Audio Quality:**
+   - Ensure the audio is **clear and noise-free**. Background noise or distortions can reduce the model's ability to extract accurate voice features.
+3. **Accurate Transcription:**
+   - Provide a highly **accurate transcription** of the audio clip. Mismatches between the audio and transcription can lead to suboptimal results.
+4. **Speaker Familiarity:**
+   - The model performs best with voices that are similar to those seen during training. Using a voice that is **significantly different from typical training samples** (e.g., unique accents, rare vocal characteristics) might result in inaccurate replication.
+   - In such cases, you may need to **fine-tune the model** specifically on your target speaker's voice to achieve a better representation.
+5. **Parameter Adjustments:**
+   - Adjust parameters like `temperature` in the `generate` function to refine the expressive quality and consistency of the synthesized voice.
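
The duration guidance above can be sanity-checked before building a speaker profile. The following is a minimal sketch and not part of `outetts`: all helper names are hypothetical, only the Python standard library is used, and the budget estimate assumes the cost of a speaker clip is linear in its duration, reusing the 4096-token and roughly 54-second figures quoted above. It ignores the tokens consumed by the transcription and the prompt text, so treat the result as an upper bound.

```python
import wave

# Figures quoted in the recommendations above.
CONTEXT_TOKENS = 4096             # model context length
APPROX_TOTAL_SECONDS = 54.0       # approximate audio capacity of that context
RECOMMENDED_RANGE = (10.0, 15.0)  # recommended speaker clip length, in seconds

# Assumption: audio tokens scale linearly with duration.
APPROX_TOKENS_PER_SECOND = CONTEXT_TOKENS / APPROX_TOTAL_SECONDS


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_recommended_length(duration_seconds: float) -> bool:
    """Check whether a clip falls inside the recommended 10-15 second range."""
    low, high = RECOMMENDED_RANGE
    return low <= duration_seconds <= high


def estimated_clip_tokens(duration_seconds: float) -> int:
    """Rough estimate of how many context tokens a speaker clip occupies."""
    return round(duration_seconds * APPROX_TOKENS_PER_SECOND)


def remaining_generation_seconds(speaker_clip_seconds: float) -> float:
    """Rough upper bound on audio seconds left once a speaker clip is included."""
    if speaker_clip_seconds < 0:
        raise ValueError("speaker_clip_seconds must be non-negative")
    return max(0.0, APPROX_TOTAL_SECONDS - speaker_clip_seconds)
```

For example, a 12-second reference clip passes `is_recommended_length` and leaves an estimated 42 seconds for generation, while a 60-second clip would leave none under these assumptions.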
 ## Model Specifications
 - **Base Model**: Qwen-2.5-0.5B
 - **Parameter Count**: 500M