license: cc-by-nc-4.0
datasets:
- facebook/multilingual_librispeech
- parler-tts/libritts_r_filtered
- amphion/Emilia-Dataset
- parler-tts/mls_eng
language:
- en
- zh
- ja
- ko
pipeline_tag: text-to-speech
Model Description
OuteTTS-0.2-500M is our improved successor to the v0.1 release. The model keeps the same approach of using audio prompts, with no architectural changes to the foundation model itself. Built on Qwen-2.5-0.5B, this version was trained on larger and more diverse datasets, which improved prompt following, output coherence, speech naturalness, and voice cloning.
Special thanks to Hugging Face for providing a GPU grant that supported the training of this model!
Key Improvements
- Enhanced Accuracy: Significantly improved prompt following and output coherence compared to the previous version
- Natural Speech: Produces more natural and fluid speech synthesis
- Expanded Vocabulary: Trained on over 5 billion audio prompt tokens
- Voice Cloning: Improved voice cloning capabilities with greater diversity and accuracy
- Multilingual Support: New experimental support for Chinese, Japanese, and Korean languages
Installation
pip install outetts --upgrade
Important:
- For GGUF support, install llama-cpp-python manually (see its Installation Guide).
- For EXL2 support, install exllamav2 manually (see its Installation Guide).
Usage
Quick Start: Basic Full Example
import outetts
# Configure the model
model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en",  # Supported languages in v0.2: en, zh, ja, ko
)
# Initialize the interface
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)
# Print available default speakers
interface.print_default_speakers()
# Load a default speaker
speaker = interface.load_default_speaker(name="male_1")
# Generate speech
output = interface.generate(
    text="Speech synthesis is the artificial production of human speech.",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    # Optional: Use a speaker profile for consistent voice characteristics
    # Without a speaker profile, the model will generate a voice with random characteristics
    speaker=speaker,
)
# Save the generated speech to a file
output.save("output.wav")
# Optional: Play the generated audio
# output.play()
Backend-Specific Configuration
Hugging Face Transformers
import outetts
model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en",  # Supported languages in v0.2: en, zh, ja, ko
)
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)
GGUF (llama-cpp-python)
import outetts
model_config = outetts.GGUFModelConfig_v1(
    model_path="local/path/to/model.gguf",
    language="en",  # Supported languages in v0.2: en, zh, ja, ko
    n_gpu_layers=0,
)
interface = outetts.InterfaceGGUF(model_version="0.2", cfg=model_config)
ExLlamaV2
import outetts
model_config = outetts.EXL2ModelConfig_v1(
    model_path="local/path/to/model",
    language="en",  # Supported languages in v0.2: en, zh, ja, ko
)
interface = outetts.InterfaceEXL2(model_version="0.2", cfg=model_config)
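The three backends differ only in which config and interface classes they use, so the choice can be wrapped in a single helper. The following is a minimal sketch, not part of the OuteTTS library: `build_interface`, `BACKENDS`, and `SUPPORTED_LANGUAGES` are hypothetical names introduced for illustration, and only the `outetts` classes shown above are assumed to exist. Extra keyword arguments (for example `n_gpu_layers` for GGUF) are passed through to the config class unchanged.

```python
# Hypothetical helper; only the outetts classes named in this README are assumed.
SUPPORTED_LANGUAGES = ("en", "zh", "ja", "ko")  # languages listed for v0.2

# Maps a backend name to (config class name, interface class name) in outetts.
BACKENDS = {
    "hf": ("HFModelConfig_v1", "InterfaceHF"),
    "gguf": ("GGUFModelConfig_v1", "InterfaceGGUF"),
    "exl2": ("EXL2ModelConfig_v1", "InterfaceEXL2"),
}


def build_interface(backend, model_path, language="en", **config_kwargs):
    """Validate the arguments, then build the matching outetts interface."""
    if backend not in BACKENDS:
        raise ValueError(
            f"Unknown backend {backend!r}; expected one of {sorted(BACKENDS)}"
        )
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported language {language!r}; expected one of {SUPPORTED_LANGUAGES}"
        )

    import outetts  # imported lazily so validation works without the package

    config_name, interface_name = BACKENDS[backend]
    config_cls = getattr(outetts, config_name)
    interface_cls = getattr(outetts, interface_name)
    cfg = config_cls(model_path=model_path, language=language, **config_kwargs)
    return interface_cls(model_version="0.2", cfg=cfg)
```

For example, `build_interface("gguf", "local/path/to/model.gguf", n_gpu_layers=0)` would mirror the GGUF configuration above, while a typo in the backend or language name fails early with a `ValueError` instead of during model loading.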
Speaker Creation and Management
Creating a Speaker
You can create a speaker profile for voice cloning, which is compatible across all backends.
speaker = interface.create_speaker(
    audio_path="path/to/audio/file.wav",
    # If transcript is not provided, it will be automatically transcribed using Whisper
    transcript=None,  # Set to None to use Whisper for transcription
    whisper_model="turbo",  # Optional: specify Whisper model (default: "turbo")
    whisper_device=None,  # Optional: specify device for Whisper (default: None)
)
Saving and Loading Speaker Profiles
Speaker profiles can be saved and loaded across all supported backends.
# Save speaker profile
interface.save_speaker(speaker, "speaker.json")
# Load speaker profile
speaker = interface.load_speaker("speaker.json")
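Because profiles can be saved and reloaded, a common pattern is to create a profile once and reuse the saved file on later runs. The sketch below assumes only the `create_speaker`, `save_speaker`, and `load_speaker` methods shown above; `get_or_create_speaker` is a hypothetical helper name, not part of the library.

```python
import os


def get_or_create_speaker(interface, audio_path, profile_path, transcript=None):
    """Load a cached speaker profile, or create and save one from audio.

    `interface` is any initialized OuteTTS interface (HF, GGUF, or EXL2).
    """
    if os.path.exists(profile_path):
        # Reuse the saved profile so the clip is not processed again.
        return interface.load_speaker(profile_path)

    # With transcript=None, the library transcribes the clip with Whisper.
    speaker = interface.create_speaker(audio_path=audio_path, transcript=transcript)
    interface.save_speaker(speaker, profile_path)
    return speaker
```

A call such as `get_or_create_speaker(interface, "path/to/audio/file.wav", "speaker.json")` would create the profile on the first run and load `speaker.json` on subsequent runs.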
Default Speaker Initialization
OuteTTS includes a set of default speaker profiles. Use them directly:
# Print available default speakers
interface.print_default_speakers()
# Load a default speaker
speaker = interface.load_default_speaker(name="male_1")
Text-to-Speech Generation
The generation process is consistent across all backends.
output = interface.generate(
    text="Speech synthesis is the artificial production of human speech.",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker,  # Optional: speaker profile
)
output.save("output.wav")
# Optional: Play the audio
# output.play()
Custom Backend Configuration
You can initialize custom backend configurations for specific needs.
Example with Flash Attention for Hugging Face Transformers
import torch
import outetts
model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en",
    dtype=torch.bfloat16,
    additional_model_config={
        'attn_implementation': "flash_attention_2"
    }
)
Speaker Profile Recommendations
To achieve the best results when creating a speaker profile, consider the following recommendations:
Audio Clip Duration:
- Use an audio clip of around 10-15 seconds.
- This duration provides sufficient data for the model to learn the speaker's characteristics while keeping the input manageable. The model's context length is 4096 tokens, allowing it to generate around 54 seconds of audio in total. However, when a speaker profile is included, this capacity is reduced proportionally to the length of the speaker's audio clip.
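The trade-off between clip length and generation capacity can be estimated with simple arithmetic. The sketch below is a rough estimate based only on the figures quoted above (4096 tokens, about 54 seconds of audio); it assumes capacity shrinks linearly with clip length and ignores the tokens used by the text and transcript. The function names are illustrative and not part of the library.

```python
CONTEXT_TOKENS = 4096      # context length stated above
MAX_AUDIO_SECONDS = 54.0   # approximate total audio for a full context


def remaining_audio_seconds(clip_seconds):
    """Rough seconds left for new speech once a speaker clip is included."""
    if clip_seconds < 0:
        raise ValueError("clip_seconds must be non-negative")
    return max(0.0, MAX_AUDIO_SECONDS - clip_seconds)


def approx_tokens_per_second():
    """Average context tokens per second of audio implied by the figures above."""
    return CONTEXT_TOKENS / MAX_AUDIO_SECONDS


for clip in (10, 15):
    print(f"{clip}s clip -> about {remaining_audio_seconds(clip):.0f}s of new audio")
# prints:
# 10s clip -> about 44s of new audio
# 15s clip -> about 39s of new audio
```

Under these assumptions, the recommended 10-15 second clip leaves roughly 39-44 seconds for generated speech, at an average of about 76 tokens per second of audio.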
Audio Quality:
- Ensure the audio is clear and noise-free. Background noise or distortions can reduce the model's ability to extract accurate voice features.
Accurate Transcription:
- Provide a highly accurate transcription of the audio clip. Mismatches between the audio and transcription can lead to suboptimal results.
Speaker Familiarity:
- The model performs best with voices that are similar to those seen during training. Using a voice that is significantly different from typical training samples (e.g., unique accents, rare vocal characteristics) might result in inaccurate replication.
- In such cases, you may need to fine-tune the model specifically on your target speaker's voice to achieve a better representation.
Parameter Adjustments:
- Adjust parameters like temperature in the generate function to refine the expressive quality and consistency of the synthesized voice.
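One practical way to tune temperature is to generate the same sentence at several values and compare the results by ear. The following sketch assumes only the `generate` and `save` calls shown earlier; `sweep_temperature`, the output file naming, and the example temperature values are hypothetical choices for illustration, not recommended settings.

```python
def sweep_temperature(interface, text, speaker=None, temperatures=(0.1, 0.3, 0.5)):
    """Generate the same text at several temperatures and save each result.

    Returns the list of file paths written, in the order generated.
    """
    saved_paths = []
    for temperature in temperatures:
        output = interface.generate(
            text=text,
            temperature=temperature,
            repetition_penalty=1.1,
            max_length=4096,
            speaker=speaker,  # None lets the model pick random voice characteristics
        )
        path = f"output_temp_{temperature}.wav"
        output.save(path)
        saved_paths.append(path)
    return saved_paths
```

Keeping the text and speaker profile fixed across the sweep means any audible difference between the files comes from the temperature setting alone.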
Model Specifications
- Base Model: Qwen-2.5-0.5B
- Parameter Count: 500M
- Language Support:
- Primary: English
- Experimental: Chinese, Japanese, Korean
- License: CC BY-NC 4.0
Training Datasets
- Emilia-Dataset (CC BY-NC 4.0)
- LibriTTS-R (CC BY 4.0)
- Multilingual LibriSpeech (MLS) (CC BY 4.0)