README.md · OuteAI/OuteTTS-0.2-500M-GGUF at 20efb21bf0e67ab489e8f0840f6ed615a66dec44

license: cc-by-nc-4.0
datasets:
  - facebook/multilingual_librispeech
  - parler-tts/libritts_r_filtered
  - amphion/Emilia-Dataset
language:
  - en
  - zh
  - ja
  - ko
pipeline_tag: text-to-speech

OuteAI

🌎 OuteAI.com 🤝 Join our Discord 𝕏 @OuteAI

🤗 Hugging Face - OuteTTS 0.2 500M 🤗 Hugging Face - OuteTTS 0.2 500M GGUF 🤗 Hugging Face - Demo Space GitHub - OuteTTS

Model Description

OuteTTS-0.2-500M is our improved successor to the v0.1 release. The model keeps the same approach of using audio prompts, with no architectural changes to the foundation model itself. Built on Qwen-2.5-0.5B, this version was trained on larger and more diverse datasets, resulting in significant improvements in prompt following, speech naturalness, and voice cloning.

Key Improvements

Enhanced Accuracy: Significantly improved prompt following and output coherence compared to the previous version
Natural Speech: Produces more natural and fluid speech synthesis
Expanded Vocabulary: Trained on over 5 billion audio prompt tokens
Voice Cloning: Improved voice cloning capabilities with greater diversity and accuracy
Multilingual Support: New experimental support for Chinese, Japanese, and Korean languages

Usage

Installation

pip install outetts

Interface Usage

import outetts

# Configure the model
model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en",  # Supported languages in v0.2: en, zh, ja, ko
)

# Initialize the interface
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)

# Optional: Create a speaker profile (use a 10-15 second audio clip)
# speaker = interface.create_speaker(
#     audio_path="path/to/audio/file",
#     transcript="Transcription of the audio file."
# )

# Optional: Save and load speaker profiles
# interface.save_speaker(speaker, "speaker.json")
# speaker = interface.load_speaker("speaker.json")

# Optional: Load speaker from default presets
interface.print_default_speakers()
speaker = interface.load_default_speaker(name="male_1")

output = interface.generate(
    text="Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and it can be implemented in software or hardware products.",
    # Lower temperature values may result in a more stable tone,
    # while higher values can introduce varied and expressive speech
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,

    # Optional: Use a speaker profile for consistent voice characteristics
    # Without a speaker profile, the model will generate a voice with random characteristics
    speaker=speaker,
)

# Save the synthesized speech to a file
output.save("output.wav")

# Optional: Play the synthesized speech
# output.play()
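
For long passages it may help to synthesize sentence-sized chunks one at a time rather than pass everything to a single `generate` call. The sketch below assumes that `max_length` bounds the generated sequence, so very long inputs could be cut off; that is an assumption, not documented behavior. `split_into_chunks` and `synthesize_long_text` are hypothetical helpers, not part of `outetts`, and the 300-character default is an arbitrary starting point.

```python
import re
from typing import List


def split_into_chunks(text: str, max_chars: int = 300) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    # Split on whitespace after . ! ? and directly after CJK full stops.
    pattern = r"(?<=[.!?])\s+|(?<=[。！？])"
    sentences = [s.strip() for s in re.split(pattern, text) if s.strip()]

    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Sentences are rejoined with a single space.
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def synthesize_long_text(interface, speaker, text: str, prefix: str = "output") -> List[str]:
    """Write one WAV file per chunk, reusing the generate() arguments shown above."""
    paths = []
    for i, chunk in enumerate(split_into_chunks(text)):
        output = interface.generate(
            text=chunk,
            temperature=0.1,
            repetition_penalty=1.1,
            max_length=4096,
            speaker=speaker,
        )
        path = f"{prefix}_{i:03d}.wav"
        output.save(path)
        paths.append(path)
    return paths
```

Calling `synthesize_long_text(interface, speaker, long_text)` with the `interface` and `speaker` objects created above would write `output_000.wav`, `output_001.wav`, and so on. Passing the same speaker profile to every call keeps the voice consistent across chunks.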

Using GGUF Model

import outetts

# Configure the GGUF model
model_config = outetts.GGUFModelConfig_v1(
    model_path="local/path/to/model.gguf",
    language="en", # Supported languages in v0.2: en, zh, ja, ko
    n_gpu_layers=0,
)

# Initialize the GGUF interface
interface = outetts.InterfaceGGUF(model_version="0.2", cfg=model_config)
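
The snippet above expects a local `.gguf` file. One way to obtain it is through `huggingface_hub`, as sketched below. This sketch rests on several assumptions: the quantization name `Q8_0` is only an illustrative default (check the repository's file list for what is actually published), `pick_gguf_file`, `download_gguf`, and `synthesize_with_gguf` are hypothetical helpers, and `InterfaceGGUF` is assumed to expose the same `load_default_speaker` and `generate` methods as `InterfaceHF` shown earlier.

```python
from typing import Iterable, Optional


def pick_gguf_file(files: Iterable[str], preferred_quant: str = "Q8_0") -> Optional[str]:
    """Return the .gguf file whose name contains preferred_quant.

    Falls back to the first .gguf file (alphabetically), or None if there is none.
    """
    ggufs = sorted(f for f in files if f.lower().endswith(".gguf"))
    for name in ggufs:
        if preferred_quant.lower() in name.lower():
            return name
    return ggufs[0] if ggufs else None


def download_gguf(repo_id: str = "OuteAI/OuteTTS-0.2-500M-GGUF",
                  preferred_quant: str = "Q8_0") -> str:
    """Download one GGUF file from the Hub and return its local path."""
    # Imported lazily so pick_gguf_file works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download, list_repo_files

    filename = pick_gguf_file(list_repo_files(repo_id), preferred_quant)
    if filename is None:
        raise FileNotFoundError(f"No .gguf file found in {repo_id}")
    return hf_hub_download(repo_id=repo_id, filename=filename)


def synthesize_with_gguf(model_path: str, text: str, out_path: str = "output.wav") -> None:
    """Run the GGUF model end to end, mirroring the HF example above."""
    import outetts

    model_config = outetts.GGUFModelConfig_v1(
        model_path=model_path,
        language="en",
        n_gpu_layers=0,  # CPU only; raise this to offload layers to a GPU
    )
    interface = outetts.InterfaceGGUF(model_version="0.2", cfg=model_config)
    speaker = interface.load_default_speaker(name="male_1")
    output = interface.generate(
        text=text,
        temperature=0.1,
        repetition_penalty=1.1,
        max_length=4096,
        speaker=speaker,
    )
    output.save(out_path)
```

A typical call would be `synthesize_with_gguf(download_gguf(), "Hello from OuteTTS.")`.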

Model Specifications

Base Model: Qwen-2.5-0.5B
Parameter Count: 500M
Language Support:
- Primary: English
- Experimental: Chinese, Japanese, Korean
License: CC BY-NC 4.0
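
Because only English is a primary language and the other three are experimental, it can be useful to validate the language code before building a model config. The helper below is a hypothetical convenience, not part of `outetts`; the language sets are copied from the list above.

```python
import warnings

PRIMARY_LANGUAGES = {"en"}
EXPERIMENTAL_LANGUAGES = {"zh", "ja", "ko"}


def check_language(code: str) -> str:
    """Return the normalized code; warn if experimental, raise if unsupported."""
    normalized = code.strip().lower()
    if normalized in PRIMARY_LANGUAGES:
        return normalized
    if normalized in EXPERIMENTAL_LANGUAGES:
        warnings.warn(f"Language '{normalized}' is experimental in v0.2", stacklevel=2)
        return normalized
    supported = sorted(PRIMARY_LANGUAGES | EXPERIMENTAL_LANGUAGES)
    raise ValueError(f"Unsupported language '{code}'; expected one of {supported}")
```

The result can be passed straight to the config, for example `language=check_language("ja")` in `HFModelConfig_v1` or `GGUFModelConfig_v1`.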

Training Datasets

Emilia-Dataset (CC BY-NC 4.0)
LibriTTS-R (CC BY 4.0)
Multilingual LibriSpeech (MLS) (CC BY 4.0)
