Duplicate from BSC-LT/matcha-tts-cat-multispeaker
Co-authored-by: Alex Peiró Lilja <AlexK-PL@users.noreply.huggingface.co>
- .gitattributes +35 -0
- README.md +209 -0
- checkpoint_epoch=2399.ckpt +3 -0
- config.yaml +43 -0
- matcha_multispeaker_cat_opset_15_10_steps_2399.onnx +3 -0
- pytorch_model.bin +3 -0
.gitattributes
ADDED
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
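
To see which files in this commit these rules would route through Git LFS, the patterns can be checked against the file names from the commit's file list. The following is a minimal sketch that approximates gitattributes matching with Python's `fnmatch` module. That approximation is an assumption: real gitattributes semantics (for example `**` in `saved_model/**/*`) differ, so only simple basename patterns are covered here, and `is_lfs_tracked` is a hypothetical helper name.

```python
from fnmatch import fnmatchcase

# Subset of the LFS patterns from .gitattributes that could apply to this commit.
LFS_PATTERNS = ["*.bin", "*.ckpt", "*.onnx", "*.pt", "*.pth", "*.safetensors"]

# File names taken from the commit's file list.
COMMIT_FILES = [
    ".gitattributes",
    "README.md",
    "checkpoint_epoch=2399.ckpt",
    "config.yaml",
    "matcha_multispeaker_cat_opset_15_10_steps_2399.onnx",
    "pytorch_model.bin",
]


def is_lfs_tracked(filename, patterns=LFS_PATTERNS):
    """Return True if any pattern matches the file name (basename matching only)."""
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


lfs_files = [name for name in COMMIT_FILES if is_lfs_tracked(name)]
print(lfs_files)
```

Under these assumptions, the three model artifacts are matched while the README, the config and `.gitattributes` itself stay as regular Git blobs, which agrees with the `+3 -0` pointer-sized diffs in the file list.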
README.md
ADDED
@@ -0,0 +1,209 @@
---
language:
- ca
license:
- apache-2.0
tags:
- matcha-tts
- acoustic modelling
- speech
- multispeaker
pipeline_tag: text-to-speech
datasets:
- projecte-aina/festcat_trimmed_denoised
- projecte-aina/openslr-slr69-ca-trimmed-denoised
---

# Matcha-TTS Catalan Multispeaker

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Model description](#model-description)
- [Intended uses and limitations](#intended-uses-and-limitations)
- [How to get started with the model](#how-to-get-started-with-the-model)
- [Training details](#training-details)
- [Evaluation](#evaluation)
- [Citation](#citation)
- [Additional information](#additional-information)

</details>

## Model Description

**Matcha-TTS** is an encoder-decoder architecture designed for fast acoustic modelling in TTS.
The encoder consists of a text encoder and a phoneme duration predictor that together predict averaged acoustic features.
The decoder is essentially a U-Net backbone inspired by [Grad-TTS](https://arxiv.org/pdf/2105.06337.pdf), which is based on the Transformer architecture.
In the decoder, replacing 2D CNNs with 1D CNNs yields a large reduction in memory consumption and faster synthesis.

**Matcha-TTS** is a non-autoregressive model trained with optimal-transport conditional flow matching (OT-CFM).
This yields an ODE-based decoder capable of generating high-quality output in fewer synthesis steps than models trained using score matching.
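
As an illustration of why an ODE-based decoder trained with OT-CFM can work with few steps, here is a minimal sketch of fixed-step Euler integration (the `euler` solver and `sigma_min` value appear in `config.yaml`). It is a toy, not the model's code: the learned decoder is replaced by the closed-form OT-CFM velocity for a single target point, and `toy_velocity`, `euler_sample`, `noise` and `target` are hypothetical names and values chosen for the example.

```python
SIGMA_MIN = 1e-4  # sigma_min value, taken from config.yaml in this commit


def toy_velocity(x, t, target, sigma_min=SIGMA_MIN):
    """Closed-form OT-CFM velocity for a single target point.

    In the real model this role is played by the learned decoder; the
    closed form is used only so that the sketch is self-contained.
    """
    denom = 1.0 - (1.0 - sigma_min) * t
    return [(y - (1.0 - sigma_min) * xi) / denom for xi, y in zip(x, target)]


def euler_sample(noise, target, n_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(noise)
    dt = 1.0 / n_steps
    for step in range(n_steps):
        t = step * dt
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.5, -1.2, 2.0]   # hypothetical starting sample x_0
target = [1.0, 0.0, -3.0]  # hypothetical data point x_1

few_steps = euler_sample(noise, target, n_steps=2)
many_steps = euler_sample(noise, target, n_steps=100)
print(few_steps)
print(many_steps)
```

In this toy case the path from noise to target is a straight line with constant velocity, so 2 steps and 100 steps end at the same point (the target plus `sigma_min` times the noise, up to floating-point error). With a learned velocity field the paths are only approximately straight, so the number of steps remains a quality/speed trade-off.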

## Intended Uses and Limitations

This model is intended to serve as an acoustic feature generator for multispeaker text-to-speech systems for the Catalan language.
It has been finetuned using a Catalan phonemizer; therefore, if the model is used for other languages, it may not produce intelligible samples after mapping
its output into a speech waveform.

The quality of the samples can vary depending on the speaker.
This may be due to the sensitivity of the model in learning specific frequencies and also due to the quality of samples for each speaker.

## How to Get Started with the Model

### Installation

This model has been trained using the espeak-ng open source text-to-speech software.
The espeak-ng containing the Catalan phonemizer can be found [here](https://github.com/projecte-aina/espeak-ng).

Create a virtual environment:
```bash
python -m venv /path/to/venv
```
```bash
source /path/to/venv/bin/activate
```

For training and inference with Catalan Matcha-TTS you need to compile the provided espeak-ng with the Catalan phonemizer:
```bash
git clone https://github.com/projecte-aina/espeak-ng.git

export PYTHON=/path/to/env/<env_name>/bin/python
cd /path/to/espeak-ng
./autogen.sh
./configure --prefix=/path/to/espeak-ng
make
make install

pip cache purge
pip install mecab-python3
pip install unidic-lite
```
Clone the repository:

```bash
git clone -b dev-cat https://github.com/langtech-bsc/Matcha-TTS.git
cd Matcha-TTS
```
Install the package from source:
```bash
pip install -e .
```


### For Inference

#### PyTorch

End-to-end speech inference can be done by pairing **Catalan Matcha-TTS** with the Vocos vocoder.
Both models (Catalan Matcha-TTS and Vocos) are loaded remotely from the HF hub.

First, export the following environment variables to include the installed espeak-ng version:

```bash
export PYTHON=/path/to/your/venv/bin/python
export ESPEAK_DATA_PATH=/path/to/espeak-ng/espeak-ng-data
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/espeak-ng/lib
export PATH="/path/to/espeak-ng/bin:$PATH"
```
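
Before running inference it can help to confirm that these variables are actually visible to Python. The following is a minimal sketch using only the standard library; `check_espeak_env` and `missing_vars` are hypothetical helper names, and the check only tests that each variable is set, not that the paths point at a working espeak-ng build.

```python
import os

# Variables exported in the shell snippet above.
REQUIRED_VARS = ("PYTHON", "ESPEAK_DATA_PATH", "LD_LIBRARY_PATH", "PATH")


def check_espeak_env(env):
    """Map each required variable to True if it is set to a non-empty value."""
    return {name: bool(env.get(name)) for name in REQUIRED_VARS}


def missing_vars(env):
    """Return the required variables that are unset or empty."""
    return [name for name, ok in check_espeak_env(env).items() if not ok]


if __name__ == "__main__":
    missing = missing_vars(os.environ)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All expected environment variables are set.")
```

Passing the environment as an argument (rather than reading `os.environ` inside the helpers) keeps the check easy to try against a plain dictionary.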
Then you can run the inference script:
```bash
cd Matcha-TTS
python3 matcha_vocos_inference.py --output_path=/output/path --text_input="Bon dia Manel, avui anem a la muntanya."
```
You can also modify the length scale (speech rate) and the temperature of the generated sample:
```bash
python3 matcha_vocos_inference.py --output_path=/output/path --text_input="Bon dia Manel, avui anem a la muntanya." --length_scale=0.8 --temperature=0.7
```
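
To give an intuition for the two options, here is a rough sketch of what they typically control in this family of models. It assumes, without confirmation from the inference script, that the length scale multiplies the predicted per-phoneme durations and that the temperature scales the standard deviation of the initial noise. The durations, `scaled_frames`, `frames_to_seconds` and `initial_noise` are hypothetical and exist only for this example.

```python
import math
import random

SAMPLE_RATE = 22050  # from the training hyperparameters
HOP_LENGTH = 256     # from the training hyperparameters


def scaled_frames(durations, length_scale):
    """Total mel frames after scaling per-phoneme durations (assumed behaviour)."""
    return sum(math.ceil(d * length_scale) for d in durations)


def frames_to_seconds(n_frames):
    """Convert a number of mel frames to seconds of audio."""
    return n_frames * HOP_LENGTH / SAMPLE_RATE


def initial_noise(n_values, temperature, seed=0):
    """Gaussian noise whose standard deviation is scaled by the temperature."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) * temperature for _ in range(n_values)]


durations = [4.2, 7.9, 3.1, 10.5]  # hypothetical predicted durations, in frames
for scale in (1.0, 0.8):
    n_frames = scaled_frames(durations, scale)
    print(scale, n_frames, round(frames_to_seconds(n_frames), 3))

print(initial_noise(3, temperature=0.7))
```

Under these assumptions a length scale below 1 produces fewer frames (faster speech), and a lower temperature starts the sampler from noise that is closer to zero.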

#### ONNX

We also release an ONNX version of the model.
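
The name of the released file, `matcha_multispeaker_cat_opset_15_10_steps_2399.onnx`, appears to encode the export settings. The sketch below assumes a naming convention that the source does not state: that `opset_15` is the ONNX opset version, `10_steps` the number of ODE solver steps fixed at export time, and `2399` the checkpoint epoch (which does match `checkpoint_epoch=2399.ckpt`). `parse_export_name` is a hypothetical helper.

```python
import re

ONNX_FILENAME = "matcha_multispeaker_cat_opset_15_10_steps_2399.onnx"

# Assumed naming convention: ..._opset_<opset>_<steps>_steps_<epoch>.onnx
PATTERN = re.compile(
    r"opset_(?P<opset>\d+)_(?P<steps>\d+)_steps_(?P<epoch>\d+)\.onnx$"
)


def parse_export_name(filename):
    """Extract the numeric fields from an export file name as integers."""
    match = PATTERN.search(filename)
    if match is None:
        raise ValueError(f"unrecognised export name: {filename}")
    return {key: int(value) for key, value in match.groupdict().items()}


export_info = parse_export_name(ONNX_FILENAME)
print(export_info)
```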

### For Training

The entire checkpoint is also released to continue training or finetuning.
See the [repo instructions](https://github.com/langtech-bsc/Matcha-TTS/tree/dev-cat).


## Training Details

### Training data

The model was trained on two **Catalan** speech datasets:

| Dataset | Language | Hours | Num. Speakers |
|---------------------|----------|---------|-----------------|
| [Festcat](https://huggingface.co/datasets/projecte-aina/festcat_trimmed_denoised) | ca | 22 | 11 |
| [OpenSLR69](https://huggingface.co/datasets/projecte-aina/openslr-slr69-ca-trimmed-denoised) | ca | 5 | 36 |

### Training procedure

***Catalan Matcha-TTS*** was finetuned from the English multispeaker checkpoint,
which was trained with the [VCTK dataset](https://huggingface.co/datasets/vctk) and provided by the model authors.

The embedding layer was initialized with the number of Catalan speakers (47) and the original hyperparameters were kept.
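
The speaker count of 47 follows directly from the two datasets. As a small worked check using only the figures from the training data table (the per-speaker averages are derived here, not reported in the source):

```python
# Figures copied from the training data table.
DATASETS = {
    "Festcat": {"hours": 22, "speakers": 11},
    "OpenSLR69": {"hours": 5, "speakers": 36},
}
N_SPKS = 47  # n_spks in config.yaml

total_hours = sum(d["hours"] for d in DATASETS.values())
total_speakers = sum(d["speakers"] for d in DATASETS.values())

# Average hours of audio per speaker in each dataset (derived, approximate).
hours_per_speaker = {
    name: d["hours"] / d["speakers"] for name, d in DATASETS.items()
}

print(total_hours, total_speakers)
print(hours_per_speaker)
```

The totals come to 27 hours and 47 speakers. The averages differ by more than an order of magnitude between the two datasets, which may be one of the reasons sample quality varies by speaker.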

### Training Hyperparameters

* batch size: 32 (x2 GPUs)
* learning rate: 1e-4
* number of speakers: 47
* n_fft: 1024
* n_feats: 80
* sample_rate: 22050
* hop_length: 256
* win_length: 1024
* f_min: 0
* f_max: 8000
* data_statistics:
  * mel_mean: -6.578195
  * mel_std: 2.538758
* number of samples: 13340
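
The audio settings above determine the time and frequency resolution of the mel-spectrograms. The following worked example derives a few quantities from them; the `mel_mean` and `mel_std` values are taken from `config.yaml`, and the `normalize`/`denormalize` helpers assume the usual mean/standard-deviation scaling, which the source does not spell out.

```python
SAMPLE_RATE = 22050
HOP_LENGTH = 256
WIN_LENGTH = 1024
N_FFT = 1024
MEL_MEAN = -6.578195  # data_statistics.mel_mean in config.yaml
MEL_STD = 2.538758    # data_statistics.mel_std in config.yaml

frames_per_second = SAMPLE_RATE / HOP_LENGTH   # mel frames per second of audio
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE       # time step between frames
window_ms = 1000 * WIN_LENGTH / SAMPLE_RATE    # duration of one analysis window
n_freq_bins = N_FFT // 2 + 1                   # STFT bins before the mel filterbank
nyquist_hz = SAMPLE_RATE / 2                   # highest representable frequency


def normalize(mel_value):
    """Scale a mel value using the dataset statistics (assumed convention)."""
    return (mel_value - MEL_MEAN) / MEL_STD


def denormalize(scaled_value):
    """Invert normalize()."""
    return scaled_value * MEL_STD + MEL_MEAN


print(round(frames_per_second, 2), round(hop_ms, 2), round(window_ms, 2))
print(n_freq_bins, nyquist_hz)
print(normalize(-4.0))
```

Each mel frame therefore covers roughly 11.6 ms, and `f_max: 8000` sits below the 11025 Hz Nyquist limit of 22.05 kHz audio.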

## Evaluation

Validation values obtained from TensorBoard at epoch 2399*:

* val_dur_loss_epoch: 0.38
* val_prior_loss_epoch: 0.97
* val_diff_loss_epoch: 2.195

(*Note that the finetuning started from epoch 1864, as the previous epochs were trained with the VCTK dataset.)
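
For a sense of scale, the length of the finetuning run can be estimated from the figures reported in this card. The sketch below is a rough estimate under stated assumptions: that all 13340 samples are used for training, that the effective batch size is 32 per GPU times 2 GPUs, and that the last partial batch of each epoch is kept. None of these is confirmed by the source.

```python
import math

FIRST_FINETUNE_EPOCH = 1864  # finetuning start, from the note above
LAST_EPOCH = 2399            # epoch of the released checkpoint
N_SAMPLES = 13340            # "number of samples" hyperparameter
BATCH_SIZE_PER_GPU = 32
N_GPUS = 2

# Epochs elapsed between the start of finetuning and the released checkpoint.
finetune_epochs = LAST_EPOCH - FIRST_FINETUNE_EPOCH

# Assumed effective batch size across both GPUs.
effective_batch = BATCH_SIZE_PER_GPU * N_GPUS

# Optimizer steps per epoch if the final partial batch is kept (assumption).
steps_per_epoch = math.ceil(N_SAMPLES / effective_batch)

approx_finetune_steps = finetune_epochs * steps_per_epoch
print(finetune_epochs, effective_batch, steps_per_epoch, approx_finetune_steps)
```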

## Citation

If this code contributes to your research, please cite the work:

```
@misc{mehta2024matchatts,
      title={Matcha-TTS: A fast TTS architecture with conditional flow matching},
      author={Shivam Mehta and Ruibo Tu and Jonas Beskow and Éva Székely and Gustav Eje Henter},
      year={2024},
      eprint={2309.03199},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}
```

## Additional Information

### Author
The Language Technologies Unit from Barcelona Supercomputing Center.

### Contact
For further information, please send an email to <langtech@bsc.es>.

### Copyright
Copyright (c) 2023 by Language Technologies Unit, Barcelona Supercomputing Center.

### License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding
This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).
checkpoint_epoch=2399.ckpt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:352de98d658b7096f2b270da79e398217045a566c33e0496be3df21efd217b55
size 250638851
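
The three lines above are a Git LFS pointer: the repository stores this small text file, while the actual checkpoint lives in LFS storage and is identified by its SHA-256 digest and size in bytes. As a minimal sketch, the pointer can be parsed with the standard library; `parse_lfs_pointer` is a hypothetical helper, and the pointer text is copied from this commit.

```python
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:352de98d658b7096f2b270da79e398217045a566c33e0496be3df21efd217b55
size 250638851
"""


def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file into a dict."""
    fields = dict(
        line.split(" ", 1) for line in text.splitlines() if line.strip()
    )
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
size_mib = pointer["size_bytes"] / 2**20  # bytes to MiB
print(pointer["hash_algorithm"], len(pointer["digest"]), round(size_mib, 1))
```

The same function applies to the ONNX and `pytorch_model.bin` pointers in this commit, since they share the format.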
config.yaml
ADDED
@@ -0,0 +1,43 @@
cfm:
  name: CFM
  sigma_min: 0.0001
  solver: euler
data_statistics:
  mel_mean: -6.578195
  mel_std: 2.538758
decoder:
  act_fn: snakebeta
  attention_head_dim: 64
  channels:
  - 256
  - 256
  dropout: 0.05
  n_blocks: 1
  num_heads: 2
  num_mid_blocks: 2
encoder:
  duration_predictor_params:
    filter_channels_dp: 256
    kernel_size: 3
    p_dropout: 0.1
  encoder_params:
    filter_channels: 768
    filter_channels_dp: 256
    kernel_size: 3
    n_channels: 192
    n_feats: 80
    n_heads: 2
    n_layers: 6
    n_spks: 47
    p_dropout: 0.1
    prenet: true
    spk_emb_dim: 64
  encoder_type: RoPE Encoder
n_feats: 80
n_spks: 47
n_vocab: 178
optimizer: null
out_size: null
prior_loss: true
scheduler: null
spk_emb_dim: 64
matcha_multispeaker_cat_opset_15_10_steps_2399.onnx
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:87faec2df4837126ca324a72b16b53cf378a230dc0dc86f1781431388e714a94
size 86049399
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:44b9640678f0d3be86a09484bbcf2cd55c9c4d2a92fc0eb3fb193ada6b5d01aa
size 83535314