Amphion Text-to-Speech (TTS) Recipe
Quick Start
We provide a beginner recipe to demonstrate how to train a cutting-edge TTS model. Specifically, it is Amphion's re-implementation of VALL-E, a zero-shot TTS architecture that uses a neural codec language model with discrete codes.
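To make the idea concrete, below is a minimal, non-authoritative sketch of a neural codec language model in PyTorch: a decoder-only Transformer predicts discrete codec tokens autoregressively, conditioned on text tokens. All class names, dimensions, and vocabulary sizes are hypothetical and do not reflect Amphion's actual implementation.

```python
# Toy sketch of the VALL-E-style idea: an autoregressive language model over
# discrete codec tokens, conditioned on text tokens. Illustrative only.
import torch
import torch.nn as nn


class ToyCodecLM(nn.Module):
    def __init__(self, text_vocab=256, codec_vocab=1024, d_model=256, n_layers=4):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d_model)
        self.codec_emb = nn.Embedding(codec_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, codec_vocab)

    def forward(self, text_ids, codec_ids):
        # Concatenate text and acoustic tokens into one sequence and apply a
        # causal mask, so each codec token attends only to the text and to the
        # codec tokens that precede it.
        x = torch.cat([self.text_emb(text_ids), self.codec_emb(codec_ids)], dim=1)
        seq_len = x.size(1)
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.backbone(x, mask=causal)
        # Logits over the acoustic part of the sequence; during training these
        # would be compared against the codec tokens shifted by one position.
        return self.head(h[:, text_ids.size(1):, :])


# Toy usage: 8 phoneme tokens conditioning 20 codec frames.
text = torch.randint(0, 256, (1, 8))
codes = torch.randint(0, 1024, (1, 20))
logits = ToyCodecLM()(text, codes)
print(logits.shape)  # torch.Size([1, 20, 1024])
```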
Supported Model Architectures
To date, Amphion TTS supports the following models and architectures:
- FastSpeech2: A non-autoregressive TTS architecture that utilizes feed-forward Transformer blocks (see the sketch after this list).
- VITS: An end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning.
- VALL-E: A zero-shot TTS architecture that uses a neural codec language model with discrete codes. This is our updated VALL-E implementation (as of June 2024), which uses Llama as its underlying architecture. The previous VALL-E release can be found here.
- NaturalSpeech2 (👨‍💻 developing): An architecture for TTS that utilizes a latent diffusion model to generate natural-sounding voices.
- Jets: An end-to-end TTS model that jointly trains FastSpeech2 and HiFi-GAN with an alignment module.
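As a concrete example of the non-autoregressive approach behind FastSpeech2, here is a minimal sketch of the length-regulator idea in PyTorch: phoneme-level hidden states are expanded according to predicted durations so that all frames can be decoded in parallel. The tensors, sizes, and durations below are invented for illustration and are not taken from Amphion's code.

```python
# Tiny sketch of a FastSpeech2-style length regulator. Illustrative only.
import torch


def length_regulate(phoneme_hidden, durations):
    """Expand (num_phonemes, d) hidden states into (num_frames, d) by repetition."""
    return torch.repeat_interleave(phoneme_hidden, durations, dim=0)


hidden = torch.randn(4, 8)              # 4 phonemes, 8-dim features (made up)
durations = torch.tensor([3, 1, 4, 2])  # predicted frames per phoneme (made up)
frames = length_regulate(hidden, durations)
print(frames.shape)  # torch.Size([10, 8])
```

Because every frame's input is known up front, a non-autoregressive decoder can generate the whole utterance in one pass rather than token by token.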
Amphion TTS Demo
Here are some TTS samples from Amphion.