Hecheng0625 committed · Commit 4949b9c · 1 Parent(s): 7ee3434

Upload README.md

README.md CHANGED
@@ -1,169 +1,183 @@

<div>
  <a href="https://arxiv.org/abs/2312.09911"><img src="https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg"></a>
  <a href="https://huggingface.co/amphion"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Amphion-pink"></a>
  <a href="https://openxlab.org.cn/usercenter/Amphion"><img src="https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg"></a>
  <a href="https://discord.com/invite/ZxxREr3Y"><img src="https://img.shields.io/badge/Discord-Join%20chat-blue.svg"></a>
  <a href="egs/tts/README.md"><img src="https://img.shields.io/badge/README-TTS-blue"></a>
  <a href="egs/svc/README.md"><img src="https://img.shields.io/badge/README-SVC-blue"></a>
  <a href="egs/tta/README.md"><img src="https://img.shields.io/badge/README-TTA-blue"></a>
  <a href="egs/vocoder/README.md"><img src="https://img.shields.io/badge/README-Vocoder-purple"></a>
  <a href="egs/metrics/README.md"><img src="https://img.shields.io/badge/README-Evaluation-yellow"></a>
  <a href="LICENSE"><img src="https://img.shields.io/badge/LICENSE-MIT-red"></a>
</div>
<br>

**Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation.** Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development. Amphion offers a unique feature: **visualizations** of classic models or architectures. We believe that these visualizations are beneficial for junior researchers and engineers who wish to gain a better understanding of the model.

**The North-Star objective of Amphion is to offer a platform for studying the conversion of any inputs into audio.** Amphion is designed to support individual generation tasks, including but not limited to,

- **TTS**: Text to Speech (⛳ supported)
- **SVS**: Singing Voice Synthesis (👨‍💻 developing)
- **VC**: Voice Conversion (👨‍💻 developing)
- **SVC**: Singing Voice Conversion (⛳ supported)
- **TTA**: Text to Audio (⛳ supported)
- **TTM**: Text to Music (👨‍💻 developing)
- more…

In addition to the specific generation tasks, Amphion includes several **vocoders** and **evaluation metrics**. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for ensuring consistent measurement across generation tasks. Moreover, Amphion is dedicated to advancing audio generation in real-world applications, such as building **large-scale datasets** for speech synthesis.

## 🚀 News

- **2024/10/19**: We release **MaskGCT**, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision. MaskGCT is trained on the Emilia dataset and achieves SOTA zero-shot TTS performance. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2409.00750) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/maskgct) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-demo-pink)](https://huggingface.co/spaces/amphion/maskgct) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](models/tts/maskgct/README.md)
- **2024/09/01**: [Amphion](https://arxiv.org/abs/2312.09911), [Emilia](https://arxiv.org/abs/2407.05361) and [DSFF-SVC](https://arxiv.org/abs/2310.11160) were accepted by IEEE SLT 2024! 🤗
- **2024/08/28**: Join Amphion's [Discord channel](https://discord.gg/drhW7ajqAG) to stay connected and engage with our community!
- **2024/08/20**: [SingVisio](https://arxiv.org/abs/2402.12660) was accepted by Computers & Graphics and is [available here](https://www.sciencedirect.com/science/article/pii/S0097849324001936)! 🎉
- **2024/08/27**: *The Emilia dataset is now publicly available!* Discover the most extensive and diverse speech generation dataset, with 101k hours of in-the-wild speech data, now at [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/amphion/Emilia-Dataset) or [![OpenDataLab](https://img.shields.io/badge/OpenDataLab-Dataset-blue)](https://opendatalab.com/Amphion/Emilia)! 🎉🎉🎉
- **2024/07/01**: Amphion now releases **Emilia**, the first open-source multilingual in-the-wild dataset for speech generation with over 101k hours of speech data, and **Emilia-Pipe**, the first open-source preprocessing pipeline designed to transform in-the-wild speech data into high-quality training data with annotations for speech generation! [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2407.05361) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/amphion/Emilia) [![demo](https://img.shields.io/badge/WebPage-Demo-red)](https://emilia-dataset.github.io/Emilia-Demo-Page/) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](preprocessors/Emilia/README.md)
- **2024/06/17**: Amphion has a new release of its **VALL-E** model! It uses Llama as its underlying architecture and has better performance, faster training speed, and more readable code compared to our first version. [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](egs/tts/VALLE_V2/README.md)
- **2024/03/12**: Amphion now supports **NaturalSpeech3 FACodec** and releases pretrained checkpoints. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2403.03100) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/naturalspeech3_facodec) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-demo-pink)](https://huggingface.co/spaces/amphion/naturalspeech3_facodec) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](models/codec/ns3_codec/README.md)
- **2024/02/22**: The first Amphion visualization tool, **SingVisio**, is released. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2402.12660) [![openxlab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/Amphion/SingVisio) [![Video](https://img.shields.io/badge/Video-Demo-orange)](https://drive.google.com/file/d/15097SGhQh-SwUNbdWDYNyWEP--YGLba5/view) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](egs/visualization/SingVisio/README.md)
- **2023/12/18**: Amphion v0.1 release. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2312.09911) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Amphion-pink)](https://huggingface.co/amphion) [![youtube](https://img.shields.io/badge/YouTube-Demo-red)](https://www.youtube.com/watch?v=1aw0HhcggvQ) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](https://github.com/open-mmlab/Amphion/pull/39)
- **2023/11/28**: Amphion alpha release. [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](https://github.com/open-mmlab/Amphion/pull/2)

## ⭐ Key Features

### TTS: Text to Speech

- Amphion achieves state-of-the-art performance compared to existing open-source repositories on text-to-speech (TTS) systems. It supports the following models or architectures:
  - [FastSpeech2](https://arxiv.org/abs/2006.04558): A non-autoregressive TTS architecture that utilizes feed-forward Transformer blocks.
  - [VITS](https://arxiv.org/abs/2106.06103): An end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning.
  - [VALL-E](https://arxiv.org/abs/2301.02111): A zero-shot TTS architecture that uses a neural codec language model with discrete codes.
  - [NaturalSpeech2](https://arxiv.org/abs/2304.09116): An architecture for TTS that utilizes a latent diffusion model to generate natural-sounding voices.
  - [Jets](Jets): An end-to-end TTS model that jointly trains FastSpeech2 and HiFi-GAN with an alignment module.
  - [MaskGCT](https://arxiv.org/abs/2409.00750): A fully non-autoregressive TTS architecture that eliminates the need for explicit alignment information between text and speech supervision.

### SVC: Singing Voice Conversion

- Amphion supports multiple content-based features from various pretrained models, including [WeNet](https://github.com/wenet-e2e/wenet), [Whisper](https://github.com/openai/whisper), and [ContentVec](https://github.com/auspicious3000/contentvec). Their specific roles in SVC have been investigated in our SLT 2024 paper. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2310.11160) [![code](https://img.shields.io/badge/README-Code-red)](egs/svc/MultipleContentsSVC)
- Amphion implements several state-of-the-art model architectures, including diffusion-, transformer-, VAE- and flow-based models. The diffusion-based architecture uses [Bidirectional dilated CNN](https://openreview.net/pdf?id=a-xFK8Ymz5J) as a backend and supports several sampling algorithms such as [DDPM](https://arxiv.org/pdf/2006.11239.pdf), [DDIM](https://arxiv.org/pdf/2010.02502.pdf), and [PNDM](https://arxiv.org/pdf/2202.09778.pdf). Additionally, it supports single-step inference based on the [Consistency Model](https://openreview.net/pdf?id=FmqFfMTNnv).

### TTA: Text to Audio

- Amphion supports TTA with a latent diffusion model. It is designed like [AudioLDM](https://arxiv.org/abs/2301.12503), [Make-an-Audio](https://arxiv.org/abs/2301.12661), and [AUDIT](https://arxiv.org/abs/2304.00830). It is also the official implementation of the text-to-audio generation part of our NeurIPS 2023 paper. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2304.00830) [![code](https://img.shields.io/badge/README-Code-red)](egs/tta/RECIPE.md)

### Vocoder

- Amphion supports various widely-used neural vocoders, including:
  - GAN-based vocoders: [MelGAN](https://arxiv.org/abs/1910.06711), [HiFi-GAN](https://arxiv.org/abs/2010.05646), [NSF-HiFiGAN](https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts), [BigVGAN](https://arxiv.org/abs/2206.04658), [APNet](https://arxiv.org/abs/2305.07952).
  - Flow-based vocoders: [WaveGlow](https://arxiv.org/abs/1811.00002).
  - Diffusion-based vocoders: [Diffwave](https://arxiv.org/abs/2009.09761).
  - Auto-regressive vocoders: [WaveNet](https://arxiv.org/abs/1609.03499), [WaveRNN](https://arxiv.org/abs/1802.08435v1).
- Amphion provides the official implementation of the [Multi-Scale Constant-Q Transform Discriminator](https://arxiv.org/abs/2311.14957) (our ICASSP 2024 paper). It can be used to enhance any GAN-based vocoder during training while keeping the inference stage (e.g., memory usage and speed) unchanged. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2311.14957) [![code](https://img.shields.io/badge/README-Code-red)](egs/vocoder/gan/tfr_enhanced_hifigan)

### Evaluation

Amphion provides a comprehensive objective evaluation of the generated audio. The evaluation metrics include:

- **F0 Modeling**: F0 Pearson Coefficients, F0 Periodicity Root Mean Square Error, F0 Root Mean Square Error, Voiced/Unvoiced F1 Score, etc.
- **Energy Modeling**: Energy Root Mean Square Error, Energy Pearson Coefficients, etc.
- **Intelligibility**: Character/Word Error Rate, which can be calculated based on [Whisper](https://github.com/openai/whisper) and more.
- **Spectrogram Distortion**: Frechet Audio Distance (FAD), Mel Cepstral Distortion (MCD), Multi-Resolution STFT Distance (MSTFT), Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), etc.
- **Speaker Similarity**: Cosine similarity, which can be calculated based on [RawNet3](https://github.com/Jungjee/RawNet), [Resemblyzer](https://github.com/resemble-ai/Resemblyzer), [WeSpeaker](https://github.com/wenet-e2e/wespeaker), [WavLM](https://github.com/microsoft/unilm/tree/master/wavlm) and more.
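
As a concrete illustration of the speaker-similarity metric, the sketch below scores two utterances with [Resemblyzer](https://github.com/resemble-ai/Resemblyzer), one of the embedding backends listed above. It is a minimal example under assumed file paths, not Amphion's own evaluation code.

```python
# Minimal sketch (not Amphion's implementation): speaker similarity as the cosine
# similarity between Resemblyzer speaker embeddings of two utterances.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

def speaker_similarity(reference_wav: str, generated_wav: str) -> float:
    encoder = VoiceEncoder()
    ref_embed = encoder.embed_utterance(preprocess_wav(reference_wav))
    gen_embed = encoder.embed_utterance(preprocess_wav(generated_wav))
    # VoiceEncoder embeddings are L2-normalized, so the dot product equals the cosine similarity.
    return float(np.dot(ref_embed, gen_embed))

# Example (placeholder paths): print(speaker_similarity("reference.wav", "generated.wav"))
```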

### Datasets

- Amphion (exclusively) supports the [**Emilia**](preprocessors/Emilia/README.md) dataset and its preprocessing pipeline **Emilia-Pipe** for in-the-wild speech data!

## 📀 Installation

Amphion can be installed through either Setup Installer or Docker Image.

### Setup Installer

```bash
git clone https://github.com/open-mmlab/Amphion.git
cd Amphion

conda create --name amphion python=3.9.15
conda activate amphion

sh env.sh
```

### Docker Image

2. Run the following commands:

```bash
git clone https://github.com/open-mmlab/Amphion.git
cd Amphion

docker pull realamphion/amphion
docker run --runtime=nvidia --gpus all -it -v .:/app realamphion/amphion
```

Mounting the dataset with the `-v` argument is necessary when using Docker. Please refer to [Mount dataset in Docker container](egs/datasets/docker.md) and [Docker Docs](https://docs.docker.com/engine/reference/commandline/container_run/#volume) for more details.

We detail the instructions for different tasks in the following recipes:

We appreciate all contributions to improve Amphion. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guidelines.

## 🙏 Acknowledgment

- [WeNet](https://github.com/wenet-e2e/wenet), [Whisper](https://github.com/openai/whisper), [ContentVec](https://github.com/auspicious3000/contentvec), and [RawNet3](https://github.com/Jungjee/RawNet) for pretrained models and inference code.
- [HiFi-GAN](https://github.com/jik876/hifi-gan) for the GAN-based vocoder's architecture design and training strategy.
- [Encodec](https://github.com/facebookresearch/encodec) for the well-organized GAN discriminator architecture and basic blocks.
- [Latent Diffusion](https://github.com/CompVis/latent-diffusion) for model architecture design.
- [TensorFlowTTS](https://github.com/TensorSpeech/TensorFlowTTS) for preparing the MFA tools.

```bibtex
@article{zhang2023amphion,
  title={Amphion: An open-source audio, music and speech generation toolkit},
  author={Zhang, Xueyao and Xue, Liumeng and Wang, Yuancheng and Gu, Yicheng and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zou, Lexiao and Wang, Chaoren and Han, Jun and others},
  journal={arXiv preprint arXiv:2312.09911},
  year={2023}
}
```

## MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer

[![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2409.00750)
[![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/maskgct)
[![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-demo-pink)](https://huggingface.co/spaces/amphion/maskgct)
[![readme](https://img.shields.io/badge/README-Key%20Features-blue)](./models/tts/maskgct/README.md)

## Overview

MaskGCT (**Mask**ed **G**enerative **C**odec **T**ransformer) is *a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision, as well as phone-level duration prediction*. MaskGCT is a two-stage model: in the first stage, the model uses text to predict semantic tokens extracted from a speech self-supervised learning (SSL) model, and in the second stage, the model predicts acoustic tokens conditioned on these semantic tokens. MaskGCT follows the *mask-and-predict* learning paradigm. During training, MaskGCT learns to predict masked semantic or acoustic tokens based on given conditions and prompts. During inference, the model generates tokens of a specified length in a parallel manner. Experiments with 100K hours of in-the-wild speech demonstrate that MaskGCT outperforms the current state-of-the-art zero-shot TTS systems in terms of quality, similarity, and intelligibility. Audio samples are available at the [demo page](https://maskgct.github.io/).

<br>
<div align="center">
<img src="./imgs/maskgct/maskgct.png" width="100%">
</div>
<br>
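
To make the mask-and-predict inference above concrete, the toy sketch below runs the kind of iterative parallel decoding the overview describes: start from a fully masked sequence of the requested length, and at each step commit the positions the model is most confident about, following a cosine unmasking schedule. The function name, the schedule, and the random stand-in model are illustrative assumptions, not MaskGCT's actual decoding code.

```python
# Toy sketch of iterative mask-and-predict decoding (illustrative only, not the MaskGCT API).
import math
import torch

def iterative_masked_decoding(predict_logits, num_tokens, mask_id, steps=10):
    """Start fully masked; each step unmasks the most confident positions."""
    tokens = torch.full((num_tokens,), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = predict_logits(tokens)              # (num_tokens, vocab_size)
        probs = torch.softmax(logits, dim=-1)
        confidence, candidates = probs.max(dim=-1)
        # Positions that are already decoded keep their tokens and never compete again.
        confidence = torch.where(tokens == mask_id, confidence, torch.full_like(confidence, -1.0))
        # Cosine schedule: total number of tokens that should be decoded after this step.
        target_decoded = math.ceil(num_tokens * (1 - math.cos(math.pi / 2 * (step + 1) / steps)))
        already_decoded = int((tokens != mask_id).sum())
        num_new = max(0, target_decoded - already_decoded)
        if num_new > 0:
            top = confidence.topk(num_new).indices
            tokens[top] = candidates[top]
    return tokens

# Stand-in "model": random logits over a vocabulary of 8 tokens (mask id = 8).
toy_model = lambda seq: torch.randn(seq.shape[0], 8)
print(iterative_masked_decoding(toy_model, num_tokens=16, mask_id=8))
```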

## News

- **2024/10/19**: We release **MaskGCT**, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision. MaskGCT is trained on the Emilia dataset and achieves SOTA zero-shot TTS performance.

## Quickstart

**Clone and install**

```bash
git clone https://github.com/open-mmlab/Amphion.git
# create env
bash ./models/tts/maskgct/env.sh
```

**Model download**

We provide the following pretrained checkpoints:

| Model Name | Description |
|-------------------|-------------|
| [Acoustic Codec](https://huggingface.co/amphion/MaskGCT/tree/main/acoustic_codec) | Converting speech to acoustic tokens and reconstructing waveform from acoustic tokens. |
| [Semantic Codec](https://huggingface.co/amphion/MaskGCT/tree/main/semantic_codec) | Converting speech to semantic tokens. |
| [MaskGCT-T2S](https://huggingface.co/amphion/MaskGCT/tree/main/t2s_model) | Predicting semantic tokens from text and prompt semantic tokens. |
| [MaskGCT-S2A](https://huggingface.co/amphion/MaskGCT/tree/main/s2a_model) | Predicting acoustic tokens conditioned on semantic tokens. |

You can download all pretrained checkpoints from [HuggingFace](https://huggingface.co/amphion/MaskGCT/tree/main) or use the `huggingface_hub` API:

```python
from huggingface_hub import hf_hub_download

# download semantic codec ckpt
semantic_code_ckpt = hf_hub_download("amphion/MaskGCT", filename="semantic_codec/model.safetensors")

# download acoustic codec ckpt
codec_encoder_ckpt = hf_hub_download("amphion/MaskGCT", filename="acoustic_codec/model.safetensors")
codec_decoder_ckpt = hf_hub_download("amphion/MaskGCT", filename="acoustic_codec/model_1.safetensors")

# download t2s model ckpt
t2s_model_ckpt = hf_hub_download("amphion/MaskGCT", filename="t2s_model/model.safetensors")

# download s2a model ckpt
s2a_1layer_ckpt = hf_hub_download("amphion/MaskGCT", filename="s2a_model/s2a_model_1layer/model.safetensors")
s2a_full_ckpt = hf_hub_download("amphion/MaskGCT", filename="s2a_model/s2a_model_full/model.safetensors")
```
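
If you would rather fetch everything at once, `snapshot_download` from the same `huggingface_hub` package can mirror the whole checkpoint repository into your local Hugging Face cache; this is an optional convenience, not a step required by this README.

```python
# Optional: download the entire amphion/MaskGCT repository in one call.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="amphion/MaskGCT")
print(local_dir)  # path of the local snapshot inside the Hugging Face cache
```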

**Basic Usage**

You can use the following code to generate speech from text and a speech prompt (the code is also provided in [inference.py](./models/tts/maskgct/maskgct_inference.py)).

```python
from models.tts.maskgct.maskgct_utils import *
from huggingface_hub import hf_hub_download
import safetensors
import soundfile as sf

if __name__ == "__main__":

    # build model
    device = torch.device("cuda:0")
    cfg_path = "./models/tts/maskgct/config/maskgct.json"
    cfg = load_config(cfg_path)
    # 1. build semantic model (w2v-bert-2.0)
    semantic_model, semantic_mean, semantic_std = build_semantic_model(device)
    # 2. build semantic codec
    semantic_codec = build_semantic_codec(cfg.model.semantic_codec, device)
    # 3. build acoustic codec
    codec_encoder, codec_decoder = build_acoustic_codec(cfg.model.acoustic_codec, device)
    # 4. build t2s model
    t2s_model = build_t2s_model(cfg.model.t2s_model, device)
    # 5. build s2a model
    s2a_model_1layer = build_s2a_model(cfg.model.s2a_model.s2a_1layer, device)
    s2a_model_full = build_s2a_model(cfg.model.s2a_model.s2a_full, device)

    # download checkpoints (see the hf_hub_download calls above)
    ...

    # load semantic codec
    safetensors.torch.load_model(semantic_codec, semantic_code_ckpt)
    # load acoustic codec
    safetensors.torch.load_model(codec_encoder, codec_encoder_ckpt)
    safetensors.torch.load_model(codec_decoder, codec_decoder_ckpt)
    # load t2s model
    safetensors.torch.load_model(t2s_model, t2s_model_ckpt)
    # load s2a model
    safetensors.torch.load_model(s2a_model_1layer, s2a_1layer_ckpt)
    safetensors.torch.load_model(s2a_model_full, s2a_full_ckpt)

    # inference
    prompt_wav_path = "./models/tts/maskgct/wav/prompt.wav"
    save_path = "[YOUR SAVE PATH]"
    prompt_text = " We do not break. We never give in. We never back down."
    target_text = "In this paper, we introduce MaskGCT, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision."
    # Specify the target duration (in seconds). If target_len = None, we use a simple rule to predict the target duration.
    target_len = 18

    maskgct_inference_pipeline = MaskGCT_Inference_Pipeline(
        semantic_model,
        semantic_codec,
        codec_encoder,
        codec_decoder,
        t2s_model,
        s2a_model_1layer,
        s2a_model_full,
        semantic_mean,
        semantic_std,
        device,
    )

    recovered_audio = maskgct_inference_pipeline.maskgct_inference(
        prompt_wav_path, prompt_text, target_text, "en", "en", target_len=target_len
    )
    sf.write(save_path, recovered_audio, 24000)
```
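
Note that the relative import (`models.tts.maskgct.maskgct_utils`) and the relative config and prompt-wav paths assume the script is run from the root of the cloned Amphion repository.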

**Jupyter Notebook**

We also provide a [Jupyter notebook](./models/tts/maskgct/maskgct_demo.ipynb) that shows more details of MaskGCT inference.

## Evaluation Results of MaskGCT

| System | SIM-O↑ | WER↓ | FSD↓ | SMOS↑ | CMOS↑ |
| :--- | :---: | :---: | :---: | :---: | :---: |
| | | **LibriSpeech test-clean** | | | |
| Ground Truth | 0.68 | 1.94 | - | 4.05±0.12 | 0.00 |
| VALL-E | 0.50 | 5.90 | - | 3.47±0.26 | -0.52±0.22 |
| VoiceBox | 0.64 | 2.03 | 0.762 | 3.80±0.17 | -0.41±0.13 |
| NaturalSpeech 3 | 0.67 | 1.94 | 0.786 | 4.26±0.10 | 0.16±0.14 |
| VoiceCraft | 0.45 | 4.68 | 0.981 | 3.52±0.21 | -0.33±0.16 |
| XTTS-v2 | 0.51 | 4.20 | 0.945 | 3.02±0.22 | -0.98±0.19 |
| MaskGCT | 0.687(0.723) | 2.634(1.976) | 0.886 | 4.27±0.14 | 0.10±0.16 |
| MaskGCT(gt length) | 0.697 | 2.012 | 0.746 | 4.33±0.11 | 0.13±0.13 |
| | | **SeedTTS test-en** | | | |
| Ground Truth | 0.730 | 2.143 | - | 3.92±0.15 | 0.00 |
| CosyVoice | 0.643 | 4.079 | 0.316 | 3.52±0.17 | -0.41±0.18 |
| XTTS-v2 | 0.463 | 3.248 | 0.484 | 3.15±0.22 | -0.86±0.19 |
| VoiceCraft | 0.470 | 7.556 | 0.226 | 3.18±0.20 | -1.08±0.15 |
| MaskGCT | 0.717(0.760) | 2.623(1.283) | 0.188 | 4.24±0.12 | 0.03±0.14 |
| MaskGCT(gt length) | 0.728 | 2.466 | 0.159 | 4.13±0.17 | 0.12±0.15 |
| | | **SeedTTS test-zh** | | | |
| Ground Truth | 0.750 | 1.254 | - | 3.86±0.17 | 0.00 |
| CosyVoice | 0.750 | 4.089 | 0.276 | 3.54±0.12 | -0.45±0.15 |
| XTTS-v2 | 0.635 | 2.876 | 0.413 | 2.95±0.18 | -0.81±0.22 |
| MaskGCT | 0.774(0.805) | 2.273(0.843) | 0.106 | 4.09±0.12 | 0.05±0.17 |
| MaskGCT(gt length) | 0.777 | 2.183 | 0.101 | 4.11±0.12 | 0.08±0.18 |

## Citations

If you use MaskGCT in your research, please cite the following papers:

```bibtex
@article{wang2024maskgct,
  title={MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer},
  author={Wang, Yuancheng and Zhan, Haoyue and Liu, Liwei and Zeng, Ruihong and Guo, Haotian and Zheng, Jiachen and Zhang, Qiang and Zhang, Shunsi and Wu, Zhizheng},
  journal={arXiv preprint arXiv:2409.00750},
  year={2024}
}

@article{zhang2023amphion,
  title={Amphion: An open-source audio, music and speech generation toolkit},
  author={Zhang, Xueyao and Xue, Liumeng and Wang, Yuancheng and Gu, Yicheng and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zou, Lexiao and Wang, Chaoren and Han, Jun and others},
  journal={arXiv preprint arXiv:2312.09911},
  year={2023}
}
```