Update README.md
README.md
CHANGED
@@ -65,9 +65,9 @@ th, td {
<div class="box">
<div style="margin-bottom: 20px;">
<h2 style="margin-bottom: 4px; margin-top: 0px;">OuteAI</h2>
- <a href="https://www.outeai.com/" target="_blank" style="margin-right: 10px;"
- <a href="https://discord.gg/vyBM87kAmf" target="_blank" style="margin-right: 10px;"
- <a href="https://x.com/OuteAI" target="_blank"
</div>
<div class="badges">
<a href="https://huggingface.co/OuteAI/OuteTTS-0.2-500M" target="_blank" class="badge badge-hf-blue">🤗 Hugging Face - OuteTTS 0.2 500M</a>
@@ -83,7 +83,7 @@ OuteTTS-0.2-500M is our improved successor to the v0.1 release.
The model maintains the same approach of using audio prompts without architectural changes to the foundation model itself.
Built upon the Qwen-2.5-0.5B, this version was trained on larger and more diverse datasets, resulting in significant improvements across all aspects of performance.

- Special thanks to **Hugging Face** for providing GPU grant that supported the training of this model

## Key Improvements
@@ -100,17 +100,21 @@ Special thanks to **Hugging Face** for providing GPU grant that supported the tr
Your browser does not support the video tag.
</video>

- ##
-
- ### Installation

[![GitHub](https://img.shields.io/badge/GitHub-OuteTTS-181717?logo=github)](https://github.com/edwko/OuteTTS)

```bash
- pip install outetts
```

-

```python
import outetts
@@ -118,30 +122,21 @@ import outetts
# Configure the model
model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
-     language="en", # Supported languages
)

# Initialize the interface
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)

- #
- # speaker = interface.create_speaker(
- #     audio_path="path/to/audio/file",
- #     transcript="Transcription of the audio file."
- # )
-
- # Optional: Save and load speaker profiles
- # interface.save_speaker(speaker, "speaker.json")
- # speaker = interface.load_speaker("speaker.json")
-
- # Optional: Load speaker from default presets
interface.print_default_speakers()
speaker = interface.load_default_speaker(name="male_1")

output = interface.generate(
-     text="Speech synthesis is the artificial production of human speech.
-     # Lower temperature values may result in a more stable tone,
-     # while higher values can introduce varied and expressive speech
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
@@ -151,36 +146,123 @@ output = interface.generate(
    speaker=speaker,
)

- # Save the
output.save("output.wav")

- # Optional: Play the
# output.play()
```

-

```python
-
model_config = outetts.GGUFModelConfig_v1(
    model_path="local/path/to/model.gguf",
    language="en", # Supported languages in v0.2: en, zh, ja, ko
    n_gpu_layers=0,
)

- # Initialize the GGUF interface
interface = outetts.InterfaceGGUF(model_version="0.2", cfg=model_config)
```

-

```python
import outetts
- import torch

model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
-     language="en",
    dtype=torch.bfloat16,
    additional_model_config={
        'attn_implementation': "flash_attention_2"
@@ -188,7 +270,7 @@ model_config = outetts.HFModelConfig_v1(
)
```

- ##

To achieve the best results when creating a speaker profile, consider the following recommendations:

README.md (updated)

<div class="box">
<div style="margin-bottom: 20px;">
<h2 style="margin-bottom: 4px; margin-top: 0px;">OuteAI</h2>
<a href="https://www.outeai.com/" target="_blank" style="margin-right: 10px;">🌎 OuteAI.com</a>
<a href="https://discord.gg/vyBM87kAmf" target="_blank" style="margin-right: 10px;">💬 Join our Discord</a>
<a href="https://x.com/OuteAI" target="_blank">✖️ (Twitter) @OuteAI</a>
</div>
<div class="badges">
<a href="https://huggingface.co/OuteAI/OuteTTS-0.2-500M" target="_blank" class="badge badge-hf-blue">🤗 Hugging Face - OuteTTS 0.2 500M</a>
The model maintains the same approach of using audio prompts without architectural changes to the foundation model itself.
Built upon the Qwen-2.5-0.5B, this version was trained on larger and more diverse datasets, resulting in significant improvements across all aspects of performance.

Special thanks to **Hugging Face** for providing a GPU grant that supported the training of this model!

## Key Improvements
Your browser does not support the video tag.
</video>

## Installation

[![GitHub](https://img.shields.io/badge/GitHub-OuteTTS-181717?logo=github)](https://github.com/edwko/OuteTTS)

```bash
pip install outetts --upgrade
```

**Important:**
- For GGUF support, install `llama-cpp-python` manually. [Installation Guide](https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#installation)
- For EXL2 support, install `exllamav2` manually. [Installation Guide](https://github.com/turboderp/exllamav2?tab=readme-ov-file#installation)

## Usage

### Quick Start: Basic Full Example

```python
import outetts

# Configure the model
model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en", # Supported languages: en, zh, ja, ko
)

# Initialize the interface
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)

# Print available default speakers
interface.print_default_speakers()

# Load a default speaker
speaker = interface.load_default_speaker(name="male_1")

# Generate speech
output = interface.generate(
    text="Speech synthesis is the artificial production of human speech.",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker,
)

# Save the generated speech to a file
output.save("output.wav")

# Optional: Play the generated audio
# output.play()
```
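
One usage note from me rather than the README: the interface and speaker objects are plain Python objects that can be reused across calls, so batching several lines into separate files is straightforward. The sentences and file names below are placeholders.

```python
# Hypothetical batch of lines to synthesize with the speaker loaded above
lines = [
    "Speech synthesis is the artificial production of human speech.",
    "A text-to-speech system converts normal language text into speech.",
]

for i, line in enumerate(lines):
    # Reuse the same interface and speaker for every call
    output = interface.generate(
        text=line,
        temperature=0.1,
        repetition_penalty=1.1,
        max_length=4096,
        speaker=speaker,
    )
    output.save(f"output_{i}.wav")  # placeholder naming scheme
```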

### Backend-Specific Configuration

#### Hugging Face Transformers

```python
import outetts

model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en", # Supported languages in v0.2: en, zh, ja, ko
)

interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)
```

#### GGUF (llama-cpp-python)

```python
import outetts

model_config = outetts.GGUFModelConfig_v1(
    model_path="local/path/to/model.gguf",
    language="en", # Supported languages in v0.2: en, zh, ja, ko
    n_gpu_layers=0,
)

interface = outetts.InterfaceGGUF(model_version="0.2", cfg=model_config)
```
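
A gloss from me, not the README: `n_gpu_layers` appears to be the standard llama.cpp offloading knob, so `0` keeps inference entirely on the CPU, while larger values offload that many transformer layers to the GPU for faster generation at the cost of VRAM.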

#### ExLlamaV2

```python
import outetts

model_config = outetts.EXL2ModelConfig_v1(
    model_path="local/path/to/model",
    language="en", # Supported languages in v0.2: en, zh, ja, ko
)

interface = outetts.InterfaceEXL2(model_version="0.2", cfg=model_config)
```

### Speaker Creation and Management

#### Creating a Speaker

You can create a speaker profile for voice cloning, which is compatible across all backends.

```python
speaker = interface.create_speaker(
    audio_path="path/to/audio/file.wav",

    # If transcript is not provided, it will be automatically transcribed using Whisper
    transcript=None, # Set to None to use Whisper for transcription

    whisper_model="turbo", # Optional: specify Whisper model (default: "turbo")
    whisper_device=None, # Optional: specify device for Whisper (default: None)
)
```
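
To make the cloning flow concrete, here is a minimal end-to-end sketch of mine that chains the calls documented above; the reference clip and output file names are placeholders, not part of the README.

```python
import outetts

model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en",
)
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)

# Clone a voice from a short reference recording (placeholder path)
speaker = interface.create_speaker(
    audio_path="reference.wav",
    transcript=None,  # let Whisper transcribe the clip
)

# Synthesize with the cloned voice
output = interface.generate(
    text="This sentence should come out in the cloned voice.",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker,
)
output.save("cloned.wav")
```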

#### Saving and Loading Speaker Profiles

Speaker profiles can be saved and loaded across all supported backends.

```python
# Save speaker profile
interface.save_speaker(speaker, "speaker.json")

# Load speaker profile
speaker = interface.load_speaker("speaker.json")
```
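
A common pattern, sketched by me rather than taken from the README: create the profile once and reuse the cached JSON on later runs, so the reference audio only has to be processed the first time. `SPEAKER_PATH` is a hypothetical cache location.

```python
import os

SPEAKER_PATH = "speaker.json"  # hypothetical cache location

if os.path.exists(SPEAKER_PATH):
    # Reuse the previously created profile
    speaker = interface.load_speaker(SPEAKER_PATH)
else:
    # First run: build the profile from a reference clip and cache it
    speaker = interface.create_speaker(audio_path="reference.wav", transcript=None)
    interface.save_speaker(speaker, SPEAKER_PATH)
```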

#### Default Speaker Initialization

OuteTTS includes a set of default speaker profiles. Use them directly:

```python
# Print available default speakers
interface.print_default_speakers()

# Load a default speaker
speaker = interface.load_default_speaker(name="male_1")
```

### Text-to-Speech Generation

The generation process is consistent across all backends.

```python
output = interface.generate(
    text="Speech synthesis is the artificial production of human speech.",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker, # Optional: speaker profile
)

output.save("output.wav")

# Optional: Play the audio
# output.play()
```
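
Worth carrying over from the previous revision of this example: lower temperature values may result in a more stable tone, while higher values can introduce more varied and expressive speech.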

### Custom Backend Configuration

You can initialize custom backend configurations for specific needs.

#### Example with Flash Attention for Hugging Face Transformers

```python
import outetts
import torch

model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en",
    dtype=torch.bfloat16,
    additional_model_config={
        'attn_implementation': "flash_attention_2"
    }
)
```
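
One caveat from me, not the README: `attn_implementation="flash_attention_2"` in Transformers requires the separate `flash-attn` package and a supported CUDA GPU, so install it first or drop that option to fall back to the default attention implementation.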

## Speaker Profile Recommendations

To achieve the best results when creating a speaker profile, consider the following recommendations: