Hecheng0625 committed on
Commit
4949b9c
•
1 Parent(s): 7ee3434

Upload README.md

Browse files
Files changed (1)
  1. README.md +156 -142
README.md CHANGED
@@ -1,169 +1,183 @@
1
- # Amphion: An Open-Source Audio, Music, and Speech Generation Toolkit
2
-
3
- <div>
4
- <a href="https://arxiv.org/abs/2312.09911"><img src="https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg"></a>
5
- <a href="https://huggingface.co/amphion"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Amphion-pink"></a>
6
- <a href="https://openxlab.org.cn/usercenter/Amphion"><img src="https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg"></a>
7
- <a href="https://discord.com/invite/ZxxREr3Y"><img src="https://img.shields.io/badge/Discord-Join%20chat-blue.svg"></a>
8
- <a href="egs/tts/README.md"><img src="https://img.shields.io/badge/README-TTS-blue"></a>
9
- <a href="egs/svc/README.md"><img src="https://img.shields.io/badge/README-SVC-blue"></a>
10
- <a href="egs/tta/README.md"><img src="https://img.shields.io/badge/README-TTA-blue"></a>
11
- <a href="egs/vocoder/README.md"><img src="https://img.shields.io/badge/README-Vocoder-purple"></a>
12
- <a href="egs/metrics/README.md"><img src="https://img.shields.io/badge/README-Evaluation-yellow"></a>
13
- <a href="LICENSE"><img src="https://img.shields.io/badge/LICENSE-MIT-red"></a>
14
- </div>
15
- <br>
16
-
17
- **Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation.** Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development. Amphion offers a unique feature: **visualizations** of classic models or architectures. We believe that these visualizations are beneficial for junior researchers and engineers who wish to gain a better understanding of the model.
18
-
19
- **The North-Star objective of Amphion is to offer a platform for studying the conversion of any inputs into audio.** Amphion is designed to support individual generation tasks, including but not limited to,
20
-
21
- - **TTS**: Text to Speech (⛳ supported)
22
- - **SVS**: Singing Voice Synthesis (👨‍💻 developing)
23
- - **VC**: Voice Conversion (👨‍💻 developing)
24
- - **SVC**: Singing Voice Conversion (⛳ supported)
25
- - **TTA**: Text to Audio (⛳ supported)
26
- - **TTM**: Text to Music (👨‍💻 developing)
27
- - more…
28
-
29
- In addition to the specific generation tasks, Amphion includes several **vocoders** and **evaluation metrics**. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for ensuring consistent metrics in generation tasks. Moreover, Amphion is dedicated to advancing audio generation in real-world applications, such as building **large-scale datasets** for speech synthesis.
30
-
31
- ## 🚀 News
32
- - **2024/10/19**: We release **MaskGCT**, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision. MaskGCT is trained on the Emilia dataset and achieves SOTA zero-shot TTS performance. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2409.00750) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/maskgct) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-demo-pink)](https://huggingface.co/spaces/amphion/maskgct) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](models/tts/maskgct/README.md)
33
- - **2024/09/01**: [Amphion](https://arxiv.org/abs/2312.09911), [Emilia](https://arxiv.org/abs/2407.05361), and [DSFF-SVC](https://arxiv.org/abs/2310.11160) were accepted by IEEE SLT 2024! 🤗
34
- - **2024/08/28**: Welcome to join Amphion's [Discord channel](https://discord.gg/drhW7ajqAG) to stay connected and engage with our community!
35
- - **2024/08/20**: [SingVisio](https://arxiv.org/abs/2402.12660) was accepted by Computers & Graphics and is [available here](https://www.sciencedirect.com/science/article/pii/S0097849324001936)! 🎉
36
- - **2024/08/27**: *The Emilia dataset is now publicly available!* Discover the most extensive and diverse speech generation dataset with 101k hours of in-the-wild speech data now at [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/amphion/Emilia-Dataset) or [![OpenDataLab](https://img.shields.io/badge/OpenDataLab-Dataset-blue)](https://opendatalab.com/Amphion/Emilia)! 👑👑👑
37
- - **2024/07/01**: Amphion now releases **Emilia**, the first open-source multilingual in-the-wild dataset for speech generation with over 101k hours of speech data, and the **Emilia-Pipe**, the first open-source preprocessing pipeline designed to transform in-the-wild speech data into high-quality training data with annotations for speech generation! [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2407.05361) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/amphion/Emilia) [![demo](https://img.shields.io/badge/WebPage-Demo-red)](https://emilia-dataset.github.io/Emilia-Demo-Page/) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](preprocessors/Emilia/README.md)
38
- - **2024/06/17**: Amphion has a new release for its **VALL-E** model! It uses Llama as its underlying architecture and has better model performance, faster training speed, and more readable code compared to our first version. [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](egs/tts/VALLE_V2/README.md)
39
- - **2024/03/12**: Amphion now supports **NaturalSpeech3 FACodec** and releases pretrained checkpoints. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2403.03100) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/naturalspeech3_facodec) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-demo-pink)](https://huggingface.co/spaces/amphion/naturalspeech3_facodec) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](models/codec/ns3_codec/README.md)
40
- - **2024/02/22**: The first Amphion visualization tool, **SingVisio**, is released. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2402.12660) [![openxlab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/Amphion/SingVisio) [![Video](https://img.shields.io/badge/Video-Demo-orange)](https://drive.google.com/file/d/15097SGhQh-SwUNbdWDYNyWEP--YGLba5/view) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](egs/visualization/SingVisio/README.md)
41
- - **2023/12/18**: Amphion v0.1 release. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2312.09911) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Amphion-pink)](https://huggingface.co/amphion) [![youtube](https://img.shields.io/badge/YouTube-Demo-red)](https://www.youtube.com/watch?v=1aw0HhcggvQ) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](https://github.com/open-mmlab/Amphion/pull/39)
42
- - **2023/11/28**: Amphion alpha release. [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](https://github.com/open-mmlab/Amphion/pull/2)
43
-
44
- ## ⭐ Key Features
45
-
46
- ### TTS: Text to Speech
47
-
48
- - Amphion achieves state-of-the-art performance when compared with existing open-source text-to-speech (TTS) repositories. It supports the following models or architectures:
49
- - [FastSpeech2](https://arxiv.org/abs/2006.04558): A non-autoregressive TTS architecture that utilizes feed-forward Transformer blocks.
50
- - [VITS](https://arxiv.org/abs/2106.06103): An end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning.
51
- - [VALL-E](https://arxiv.org/abs/2301.02111): A zero-shot TTS architecture that uses a neural codec language model with discrete codes.
52
- - [NaturalSpeech2](https://arxiv.org/abs/2304.09116): An architecture for TTS that utilizes a latent diffusion model to generate natural-sounding voices.
53
- - [Jets](Jets): An end-to-end TTS model that jointly trains FastSpeech2 and HiFi-GAN with an alignment module.
54
- - [MaskGCT](https://arxiv.org/abs/2409.00750): A fully non-autoregressive TTS architecture that eliminates the need for explicit alignment information between text and speech supervision.
55
-
56
- ### SVC: Singing Voice Conversion
57
-
58
- - Amphion supports multiple content-based features from various pretrained models, including [WeNet](https://github.com/wenet-e2e/wenet), [Whisper](https://github.com/openai/whisper), and [ContentVec](https://github.com/auspicious3000/contentvec). Their specific roles in SVC have been investigated in our SLT 2024 paper. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2310.11160) [![code](https://img.shields.io/badge/README-Code-red)](egs/svc/MultipleContentsSVC)
59
- - Amphion implements several state-of-the-art model architectures, including diffusion-, transformer-, VAE- and flow-based models. The diffusion-based architecture uses [Bidirectional dilated CNN](https://openreview.net/pdf?id=a-xFK8Ymz5J) as a backend and supports several sampling algorithms such as [DDPM](https://arxiv.org/pdf/2006.11239.pdf), [DDIM](https://arxiv.org/pdf/2010.02502.pdf), and [PNDM](https://arxiv.org/pdf/2202.09778.pdf). Additionally, it supports single-step inference based on the [Consistency Model](https://openreview.net/pdf?id=FmqFfMTNnv).
60
-
61
- ### TTA: Text to Audio
62
-
63
- - Amphion supports TTA with a latent diffusion model, designed in a similar way to [AudioLDM](https://arxiv.org/abs/2301.12503), [Make-an-Audio](https://arxiv.org/abs/2301.12661), and [AUDIT](https://arxiv.org/abs/2304.00830). It is also the official implementation of the text-to-audio generation part of our NeurIPS 2023 paper. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2304.00830) [![code](https://img.shields.io/badge/README-Code-red)](egs/tta/RECIPE.md)
64
-
65
- ### Vocoder
66
-
67
- - Amphion supports various widely-used neural vocoders, including:
68
- - GAN-based vocoders: [MelGAN](https://arxiv.org/abs/1910.06711), [HiFi-GAN](https://arxiv.org/abs/2010.05646), [NSF-HiFiGAN](https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts), [BigVGAN](https://arxiv.org/abs/2206.04658), [APNet](https://arxiv.org/abs/2305.07952).
69
- - Flow-based vocoders: [WaveGlow](https://arxiv.org/abs/1811.00002).
70
- - Diffusion-based vocoders: [DiffWave](https://arxiv.org/abs/2009.09761).
70
- - Autoregressive vocoders: [WaveNet](https://arxiv.org/abs/1609.03499), [WaveRNN](https://arxiv.org/abs/1802.08435v1).
72
- - Amphion provides the official implementation of [Multi-Scale Constant-Q Transform Discriminator](https://arxiv.org/abs/2311.14957) (our ICASSP 2024 paper). It can be used to enhance any GAN-based vocoder during training while keeping the inference stage (e.g., memory usage and speed) unchanged. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2311.14957) [![code](https://img.shields.io/badge/README-Code-red)](egs/vocoder/gan/tfr_enhanced_hifigan)
73
-
74
- ### Evaluation
75
-
76
- Amphion provides a comprehensive objective evaluation of the generated audio. The evaluation metrics include:
77
-
78
- - **F0 Modeling**: F0 Pearson Coefficients, F0 Periodicity Root Mean Square Error, F0 Root Mean Square Error, Voiced/Unvoiced F1 Score, etc.
79
- - **Energy Modeling**: Energy Root Mean Square Error, Energy Pearson Coefficients, etc.
80
- - **Intelligibility**: Character/Word Error Rate, which can be calculated based on [Whisper](https://github.com/openai/whisper) and more.
81
- - **Spectrogram Distortion**: Frechet Audio Distance (FAD), Mel Cepstral Distortion (MCD), Multi-Resolution STFT Distance (MSTFT), Perceptual Evaluation of Speech Quality (PESQ), Short Time Objective Intelligibility (STOI), etc.
82
- - **Speaker Similarity**: Cosine similarity, which can be calculated based on [RawNet3](https://github.com/Jungjee/RawNet), [Resemblyzer](https://github.com/resemble-ai/Resemblyzer), [WeSpeaker](https://github.com/wenet-e2e/wespeaker), [WavLM](https://github.com/microsoft/unilm/tree/master/wavlm) and more.
83
 
84
- ### Datasets
85
 
86
- - Amphion unifies the data preprocessing of open-source datasets, including [AudioCaps](https://audiocaps.github.io/), [LibriTTS](https://www.openslr.org/60/), [LJSpeech](https://keithito.com/LJ-Speech-Dataset/), [M4Singer](https://github.com/M4Singer/M4Singer), [Opencpop](https://wenet.org.cn/opencpop/), [OpenSinger](https://github.com/Multi-Singer/Multi-Singer.github.io), [SVCC](http://vc-challenge.org/), [VCTK](https://datashare.ed.ac.uk/handle/10283/3443), and more. The supported dataset list can be seen [here](egs/datasets/README.md) (updating).
87
- - Amphion (exclusively) supports the [**Emilia**](preprocessors/Emilia/README.md) dataset and its preprocessing pipeline **Emilia-Pipe** for in-the-wild speech data!
88
 
89
- ### Visualization
90
 
91
- Amphion provides visualization tools to interactively illustrate the internal processing mechanism of classic models. This provides an invaluable resource for educational purposes and for facilitating understandable research.
92
-
93
- Currently, Amphion supports [SingVisio](egs/visualization/SingVisio/README.md), a visualization tool of the diffusion model for singing voice conversion. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2402.12660) [![openxlab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/Amphion/SingVisio) [![Video](https://img.shields.io/badge/Video-Demo-orange)](https://drive.google.com/file/d/15097SGhQh-SwUNbdWDYNyWEP--YGLba5/view)
94
-
95
-
96
- ## 📀 Installation
97
-
98
- Amphion can be installed either through the Setup Installer or via the Docker Image.
99
-
100
- ### Setup Installer
101
-
102
- ```bash
103
- git clone https://github.com/open-mmlab/Amphion.git
104
- cd Amphion
105
 
106
- # Install Python Environment
107
- conda create --name amphion python=3.9.15
108
- conda activate amphion
109
 
110
- # Install Python Packages Dependencies
111
- sh env.sh
112
- ```
113
 
114
- ### Docker Image
115
 
116
- 1. Install [Docker](https://docs.docker.com/get-docker/), [NVIDIA Driver](https://www.nvidia.com/download/index.aspx), [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html), and [CUDA](https://developer.nvidia.com/cuda-downloads).
117
 
118
- 2. Run the following commands:
119
  ```bash
120
  git clone https://github.com/open-mmlab/Amphion.git
121
- cd Amphion
122
-
123
- docker pull realamphion/amphion
124
- docker run --runtime=nvidia --gpus all -it -v .:/app realamphion/amphion
125
  ```
126
- Mounting the dataset with the `-v` argument is necessary when using Docker. Please refer to [Mount dataset in Docker container](egs/datasets/docker.md) and [Docker Docs](https://docs.docker.com/engine/reference/commandline/container_run/#volume) for more details.
127
 
 
128
 
129
- ## 🐍 Usage in Python
130
 
131
- We detail the instructions for different tasks in the following recipes:
132
 
133
- - [Text to Speech (TTS)](egs/tts/README.md)
134
- - [Singing Voice Conversion (SVC)](egs/svc/README.md)
135
- - [Text to Audio (TTA)](egs/tta/README.md)
136
- - [Vocoder](egs/vocoder/README.md)
137
- - [Evaluation](egs/metrics/README.md)
138
- - [Visualization](egs/visualization/README.md)
139
 
140
- ## 👨‍💻 Contributing
141
- We appreciate all contributions to improve Amphion. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guideline.
142
 
143
- ## 🙏 Acknowledgement
 
144
145
 
146
- - [ming024's FastSpeech2](https://github.com/ming024/FastSpeech2) and [jaywalnut310's VITS](https://github.com/jaywalnut310/vits) for model architecture code.
147
- - [lifeiteng's VALL-E](https://github.com/lifeiteng/vall-e) for training pipeline and model architecture design.
148
- - [SpeechTokenizer](https://github.com/ZhangXInFD/SpeechTokenizer) for semantic-distilled tokenizer design.
149
- - [WeNet](https://github.com/wenet-e2e/wenet), [Whisper](https://github.com/openai/whisper), [ContentVec](https://github.com/auspicious3000/contentvec), and [RawNet3](https://github.com/Jungjee/RawNet) for pretrained models and inference code.
150
- - [HiFi-GAN](https://github.com/jik876/hifi-gan) for GAN-based Vocoder's architecture design and training strategy.
151
- - [Encodec](https://github.com/facebookresearch/encodec) for well-organized GAN Discriminator's architecture and basic blocks.
152
- - [Latent Diffusion](https://github.com/CompVis/latent-diffusion) for model architecture design.
153
- - [TensorFlowTTS](https://github.com/TensorSpeech/TensorFlowTTS) for preparing the MFA tools.
154
155
 
156
- ## ©️ License
157
 
158
- Amphion is under the [MIT License](LICENSE). It is free for both research and commercial use cases.
159
 
160
- ## 📚 Citations
161
 
162
  ```bibtex
163
- @inproceedings{amphion,
164
- author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
165
- title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
166
- booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
167
- year={2024}
168
  }
169
- ```
1
+ ## MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
2
 
3
+ [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2409.00750)
4
+ [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/maskgct)
5
+ [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-demo-pink)](https://huggingface.co/spaces/amphion/maskgct)
6
+ [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](./models/tts/maskgct/README.md)
7
 
8
+ ## Overview
 
9
 
10
+ MaskGCT (**Mask**ed **G**enerative **C**odec **T**ransformer) is *a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision, as well as phone-level duration prediction*. MaskGCT is a two-stage model: in the first stage, the model uses text to predict semantic tokens extracted from a speech self-supervised learning (SSL) model, and in the second stage, the model predicts acoustic tokens conditioned on these semantic tokens. MaskGCT follows the *mask-and-predict* learning paradigm. During training, MaskGCT learns to predict masked semantic or acoustic tokens based on given conditions and prompts. During inference, the model generates tokens of a specified length in a parallel manner. Experiments with 100K hours of in-the-wild speech demonstrate that MaskGCT outperforms the current state-of-the-art zero-shot TTS systems in terms of quality, similarity, and intelligibility. Audio samples are available on the [demo page](https://maskgct.github.io/). A schematic sketch of the mask-and-predict decoding loop is given below the figure.
11
 
12
+ <br>
13
+ <div align="center">
14
+ <img src="./imgs/maskgct/maskgct.png" width="100%">
15
+ </div>
16
+ <br>
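+ To make the *mask-and-predict* decoding described above concrete, here is a minimal, illustrative sketch of confidence-based iterative unmasking. The function, model interface, and re-masking schedule below are simplifying assumptions for explanation only; they are not the Amphion API (the actual inference entry point is shown in the Quickstart section).
+
+ ```python
+ import torch
+
+ def iterative_mask_predict(model, condition, target_len, num_steps=20, mask_id=0):
+     """Schematic parallel decoding: start fully masked, keep high-confidence tokens each step."""
+     tokens = torch.full((1, target_len), mask_id, dtype=torch.long)
+     is_masked = torch.ones(1, target_len, dtype=torch.bool)
+     for step in range(num_steps):
+         logits = model(tokens, condition)        # predict all positions in parallel: (1, L, vocab)
+         conf, pred = logits.softmax(-1).max(-1)  # per-position confidence and argmax token
+         tokens = torch.where(is_masked, pred, tokens)
+         num_remask = int((1 - (step + 1) / num_steps) * target_len)
+         if num_remask == 0:
+             break                                # all positions revealed
+         # re-mask the least confident of the freshly filled positions for the next step
+         conf = conf.masked_fill(~is_masked, float("inf"))
+         remask = conf.topk(num_remask, largest=False).indices
+         is_masked = torch.zeros_like(is_masked)
+         is_masked[0, remask] = True
+         tokens[0, remask] = mask_id
+     return tokens
+ ```
+
+ In MaskGCT, this style of decoding is applied twice: once to generate semantic tokens from text and prompt semantic tokens (T2S), and once to generate acoustic tokens from semantic tokens (S2A), after which the acoustic codec reconstructs the waveform.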
17
 
18
+ ## News
 
 
19
 
20
+ - **2024/10/19**: We release **MaskGCT**, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision. MaskGCT is trained on the Emilia dataset and achieves SOTA zero-shot TTS performance.
 
 
21
 
22
+ ## Quickstart
23
 
24
+ **Clone and install**
25
 
 
26
  ```bash
27
  git clone https://github.com/open-mmlab/Amphion.git
28
+ cd Amphion
+ # create env
29
+ bash ./models/tts/maskgct/env.sh
 
 
30
  ```
 
31
 
32
+ **Model download**
33
 
34
+ We provide the following pretrained checkpoints:
35
 
 
36
 
37
+ | Model Name | Description |
38
+ |-------------------|-------------|
39
+ | [Acoustic Codec](https://huggingface.co/amphion/MaskGCT/tree/main/acoustic_codec) | Converting speech to acoustic tokens and reconstructing waveform from acoustic tokens. |
40
+ | [Semantic Codec](https://huggingface.co/amphion/MaskGCT/tree/main/semantic_codec) | Converting speech to semantic tokens. |
41
+ | [MaskGCT-T2S](https://huggingface.co/amphion/MaskGCT/tree/main/t2s_model) | Predicting semantic tokens with text and prompt semantic tokens. |
42
+ | [MaskGCT-S2A](https://huggingface.co/amphion/MaskGCT/tree/main/s2a_model) | Predicting acoustic tokens conditioned on semantic tokens. |
43
 
44
+ You can download all pretrained checkpoints from [HuggingFace](https://huggingface.co/amphion/MaskGCT/tree/main) or use the Hugging Face Hub API, as shown below.
 
45
 
46
+ ```python
47
+ from huggingface_hub import hf_hub_download
48
 
49
+ # download semantic codec ckpt
50
+ semantic_code_ckpt = hf_hub_download("amphion/MaskGCT", filename="semantic_codec/model.safetensors")
51
 
52
+ # download acoustic codec ckpt
53
+ codec_encoder_ckpt = hf_hub_download("amphion/MaskGCT", filename="acoustic_codec/model.safetensors")
54
+ codec_decoder_ckpt = hf_hub_download("amphion/MaskGCT", filename="acoustic_codec/model_1.safetensors")
55
 
56
+ # download t2s model ckpt
57
+ t2s_model_ckpt = hf_hub_download("amphion/MaskGCT", filename="t2s_model/model.safetensors")
58
 
59
+ # download s2a model ckpt
60
+ s2a_1layer_ckpt = hf_hub_download("amphion/MaskGCT", filename="s2a_model/s2a_model_1layer/model.safetensors")
61
+ s2a_full_ckpt = hf_hub_download("amphion/MaskGCT", filename="s2a_model/s2a_model_full/model.safetensors")
62
+ ```
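+
+ Alternatively, to mirror every checkpoint in a single call, you can use `snapshot_download` from the same `huggingface_hub` package (a convenience sketch; the local subfolder layout follows the table above):
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Download all MaskGCT checkpoints into one local directory.
+ ckpt_dir = snapshot_download("amphion/MaskGCT")
+ # Individual files then live under the same subfolders as above, e.g.:
+ # f"{ckpt_dir}/semantic_codec/model.safetensors"
+ ```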
63
 
64
+ **Basic Usage**
65
+
66
+ You can use the following code to generate speech from text and a speech prompt (the code is also provided in [maskgct_inference.py](./models/tts/maskgct/maskgct_inference.py)).
67
+
68
+ ```python
69
+ from models.tts.maskgct.maskgct_utils import *
70
+ from huggingface_hub import hf_hub_download
71
+ import safetensors
72
+ import soundfile as sf
73
+
74
+ if __name__ == "__main__":
75
+
76
+ # build model
77
+ device = torch.device("cuda:0")
78
+ cfg_path = "./models/tts/maskgct/config/maskgct.json"
79
+ cfg = load_config(cfg_path)
80
+ # 1. build semantic model (w2v-bert-2.0)
81
+ semantic_model, semantic_mean, semantic_std = build_semantic_model(device)
82
+ # 2. build semantic codec
83
+ semantic_codec = build_semantic_codec(cfg.model.semantic_codec, device)
84
+ # 3. build acoustic codec
85
+ codec_encoder, codec_decoder = build_acoustic_codec(cfg.model.acoustic_codec, device)
86
+ # 4. build t2s model
87
+ t2s_model = build_t2s_model(cfg.model.t2s_model, device)
88
+ # 5. build s2a model
89
+ s2a_model_1layer = build_s2a_model(cfg.model.s2a_model.s2a_1layer, device)
90
+ s2a_model_full = build_s2a_model(cfg.model.s2a_model.s2a_full, device)
91
+
92
+ # download checkpoint
93
+ ...
94
+
95
+ # load semantic codec
96
+ safetensors.torch.load_model(semantic_codec, semantic_code_ckpt)
97
+ # load acoustic codec
98
+ safetensors.torch.load_model(codec_encoder, codec_encoder_ckpt)
99
+ safetensors.torch.load_model(codec_decoder, codec_decoder_ckpt)
100
+ # load t2s model
101
+ safetensors.torch.load_model(t2s_model, t2s_model_ckpt)
102
+ # load s2a model
103
+ safetensors.torch.load_model(s2a_model_1layer, s2a_1layer_ckpt)
104
+ safetensors.torch.load_model(s2a_model_full, s2a_full_ckpt)
105
+
106
+ # inference
107
+ prompt_wav_path = "./models/tts/maskgct/wav/prompt.wav"
108
+ save_path = "[YOUR SAVE PATH]"
109
+ prompt_text = " We do not break. We never give in. We never back down."
110
+ target_text = "In this paper, we introduce MaskGCT, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision."
111
+ # Specify the target duration (in seconds). If target_len = None, we use a simple rule to predict the target duration.
112
+ target_len = 18
113
+
114
+ maskgct_inference_pipeline = MaskGCT_Inference_Pipeline(
115
+ semantic_model,
116
+ semantic_codec,
117
+ codec_encoder,
118
+ codec_decoder,
119
+ t2s_model,
120
+ s2a_model_1layer,
121
+ s2a_model_full,
122
+ semantic_mean,
123
+ semantic_std,
124
+ device,
125
+ )
126
+
127
+ recovered_audio = maskgct_inference_pipeline.maskgct_inference(
128
+ prompt_wav_path, prompt_text, target_text, "en", "en", target_len=target_len
129
+ )
130
+ sf.write(save_path, recovered_audio, 24000)
131
+ ```
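+
+ As the comment in the script notes, `target_len` may also be left as `None`, in which case the pipeline falls back to its simple duration-prediction rule:
+
+ ```python
+ # Let the pipeline estimate the target duration instead of fixing it to 18 seconds.
+ recovered_audio = maskgct_inference_pipeline.maskgct_inference(
+     prompt_wav_path, prompt_text, target_text, "en", "en", target_len=None
+ )
+ sf.write(save_path, recovered_audio, 24000)
+ ```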
132
 
133
+ **Jupyter Notebook**
134
+
135
+ We also provide a [Jupyter notebook](./models/tts/maskgct/maskgct_demo.ipynb) that shows more details of MaskGCT inference.
136
+
137
+
138
+ ## Evaluation Results of MaskGCT
139
+
140
+ | System | SIM-O↑ | WER↓ | FSD↓ | SMOS↑ | CMOS↑ |
141
+ | :--- | :---: | :---: | :---: | :---: | :---: |
142
+ | **LibriSpeech test-clean** | | | | | |
143
+ | Ground Truth | 0.68 | 1.94 | - | 4.05±0.12 | 0.00 |
144
+ | VALL-E | 0.50 | 5.90 | - | 3.47±0.26 | -0.52±0.22 |
145
+ | VoiceBox | 0.64 | 2.03 | 0.762 | 3.80±0.17 | -0.41±0.13 |
146
+ | NaturalSpeech 3 | 0.67 | 1.94 | 0.786 | 4.26±0.10 | 0.16±0.14 |
147
+ | VoiceCraft | 0.45 | 4.68 | 0.981 | 3.52±0.21 | -0.33±0.16 |
148
+ | XTTS-v2 | 0.51 | 4.20 | 0.945 | 3.02±0.22 | -0.98±0.19 |
149
+ | MaskGCT | 0.687 (0.723) | 2.634 (1.976) | 0.886 | 4.27±0.14 | 0.10±0.16 |
150
+ | MaskGCT (gt length) | 0.697 | 2.012 | 0.746 | 4.33±0.11 | 0.13±0.13 |
151
+ | **SeedTTS test-en** | | | | | |
152
+ | Ground Truth | 0.730 | 2.143 | - | 3.92±0.15 | 0.00 |
153
+ | CosyVoice | 0.643 | 4.079 | 0.316 | 3.52±0.17 | -0.41±0.18 |
154
+ | XTTS-v2 | 0.463 | 3.248 | 0.484 | 3.15±0.22 | -0.86±0.19 |
155
+ | VoiceCraft | 0.470 | 7.556 | 0.226 | 3.18±0.20 | -1.08±0.15 |
156
+ | MaskGCT | 0.717 (0.760) | 2.623 (1.283) | 0.188 | 4.24±0.12 | 0.03±0.14 |
157
+ | MaskGCT (gt length) | 0.728 | 2.466 | 0.159 | 4.13±0.17 | 0.12±0.15 |
158
+ | **SeedTTS test-zh** | | | | | |
159
+ | Ground Truth | 0.750 | 1.254 | - | 3.86±0.17 | 0.00 |
160
+ | CosyVoice | 0.750 | 4.089 | 0.276 | 3.54±0.12 | -0.45±0.15 |
161
+ | XTTS-v2 | 0.635 | 2.876 | 0.413 | 2.95±0.18 | -0.81±0.22 |
162
+ | MaskGCT | 0.774 (0.805) | 2.273 (0.843) | 0.106 | 4.09±0.12 | 0.05±0.17 |
163
+ | MaskGCT (gt length) | 0.777 | 2.183 | 0.101 | 4.11±0.12 | 0.08±0.18 |
164
+
165
+ ## Citations
166
+
167
+ If you use MaskGCT in your research, please cite the following papers:
168
 
169
  ```bibtex
170
+ @article{wang2024maskgct,
171
+ title={MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer},
172
+ author={Wang, Yuancheng and Zhan, Haoyue and Liu, Liwei and Zeng, Ruihong and Guo, Haotian and Zheng, Jiachen and Zhang, Qiang and Zhang, Shunsi and Wu, Zhizheng},
173
+ journal={arXiv preprint arXiv:2409.00750},
174
+ year={2024}
175
  }
176
+
177
+ @article{zhang2023amphion,
178
+ title={Amphion: An open-source audio, music and speech generation toolkit},
179
+ author={Zhang, Xueyao and Xue, Liumeng and Wang, Yuancheng and Gu, Yicheng and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zou, Lexiao and Wang, Chaoren and Han, Jun and others},
180
+ journal={arXiv preprint arXiv:2312.09911},
181
+ year={2023}
182
+ }
183
+ ```