zyingt committed on
Commit
0636c80
·
1 Parent(s): 1e5e83f

Delete egs/tts/vits_hifitts_libritts

egs/tts/vits_hifitts_libritts/README.md DELETED
@@ -1,135 +0,0 @@
1
-
2
- # vits_hifitts_libritts Recipe
3
-
4
- In this recipe, we will show how to train [VITS](https://arxiv.org/abs/2106.06103) using Amphion's infrastructure. VITS is an end-to-end TTS architecture that uses a conditional variational autoencoder with adversarial learning.
5
-
6
- There are four stages in total:
7
-
8
- 1. Data preparation
9
- 2. Features extraction
10
- 3. Training
11
- 4. Inference
12
-
13
- > **NOTE:** You need to run every command of this recipe in the `Amphion` root path:
14
- > ```bash
15
- > cd Amphion
16
- > ```
17
-
18
- ## 1. Data Preparation
19
-
20
- ### Dataset Download
21
- You can train the TTS model on commonly used TTS datasets, e.g., LJSpeech, VCTK, and LibriTTS. We strongly recommend using LJSpeech when training a TTS model for the first time. Dataset download instructions are detailed [here](../../datasets/README.md).
22
-
23
- ### Configuration
24
-
25
- After downloading the dataset, you can set the dataset paths in `exp_config.json`. Note that you can change the `dataset` list to use your preferred datasets.
26
-
27
- ```json
28
- "dataset": [
29
- "LJSpeech",
30
- ],
31
- "dataset_path": {
32
- // TODO: Fill in your dataset path
33
- "LJSpeech": "[LJSpeech dataset path]",
34
- },
35
- ```
36
-
37
- ## 2. Features Extraction
38
-
39
- ### Configuration
40
-
41
- Specify the `processed_dir` and the `log_dir` for saving the processed data and the checkpoints in `exp_config.json`:
42
-
43
- ```json
44
- // TODO: Fill in the output log path. The default value is "Amphion/ckpts/tts"
45
- "log_dir": "ckpts/tts",
46
- "preprocess": {
47
- // TODO: Fill in the output data path. The default value is "Amphion/data"
48
- "processed_dir": "data",
49
- ...
50
- },
51
- ```
52
-
53
- ### Run
54
-
55
- Run `run.sh` for the preprocessing stage (set `--stage 1`):
56
-
57
- ```bash
58
- sh egs/tts/vits_hifitts_libritts/run.sh --stage 1
59
- ```
60
-
61
- > **NOTE:** `CUDA_VISIBLE_DEVICES` is set to `"0"` by default. You can change it when running `run.sh` by specifying, for example, `--gpu "1"`.
62
-
63
- ## 3. Training
64
-
65
- ### Configuration
66
-
67
- We provide the default hyperparameters in `exp_config.json`. They work on a single NVIDIA GPU with 24 GB of memory. You can adjust them to fit your GPUs.
68
-
69
- ```json
70
- "train": {
71
- "batch_size": 16,
72
- }
73
- ```
74
-
75
- ### Run
76
-
77
- Run `run.sh` for the training stage (set `--stage 2`). Specify an experiment name when running the following command. The TensorBoard logs and checkpoints will be saved in `Amphion/ckpts/tts/[YourExptName]`.
78
-
79
- ```bash
80
- sh egs/tts/vits_hifitts_libritts/run.sh --stage 2 --name [YourExptName]
81
- ```
82
-
83
- > **NOTE:** `CUDA_VISIBLE_DEVICES` is set to `"0"` by default. You can change it when running `run.sh` by specifying, for example, `--gpu "0,1,2,3"`.
84
-
85
-
86
- ## 4. Inference
87
-
88
- ### Configuration
89
-
90
- For inference, you need to specify the following configurations when running `run.sh`:
91
-
92
-
93
- | Parameters | Description | Example |
94
- | --------------------- | -------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
95
- | `--infer_expt_dir` | The experimental directory which contains `checkpoint` | `Amphion/ckpts/tts/[YourExptName]` |
96
- | `--infer_output_dir` | The output directory to save inferred audios. | `Amphion/ckpts/tts/[YourExptName]/result` |
97
- | `--infer_mode` | The inference mode, e.g., "`single`", "`batch`". | "`single`" to generate a clip of speech, "`batch`" to generate a batch of speech at a time. |
98
- | `--infer_dataset` | The dataset used for inference. | For LJSpeech dataset, the inference dataset would be `LJSpeech`. |
99
- | `--infer_testing_set` | The subset of the inference dataset used for inference, e.g., train, test, golden_test | For LJSpeech dataset, the testing set would be  "`test`" split from LJSpeech at the feature extraction, or "`golden_test`" cherry-picked from test set as template testing set. |
100
- | `--infer_text` | The text to be synthesized. | "`This is a clip of generated speech with the given text from a TTS model.`" |
101
-
102
- ### Run
103
- For example, to generate speech for the entire test set split from LJSpeech, run:
104
-
105
- ```bash
106
- sh egs/tts/vits_hifitts_libritts/run.sh --stage 3 --gpu "0" \
107
- --infer_expt_dir Amphion/ckpts/tts/[YourExptName] \
108
- --infer_output_dir Amphion/ckpts/tts/[YourExptName]/result \
109
- --infer_mode "batch" \
110
- --infer_dataset "LJSpeech" \
111
- --infer_testing_set "test"
112
- ```
113
-
114
- Or, if you want to generate a single clip of speech from a given text, just run:
115
-
116
- ```bash
117
- sh egs/tts/vits_hifitts_libritts/run.sh --stage 3 --gpu "0" \
118
- --infer_expt_dir Amphion/ckpts/tts/[YourExptName] \
119
- --infer_output_dir Amphion/ckpts/tts/[YourExptName]/result \
120
- --infer_mode "single" \
121
- --infer_text "This is a clip of generated speech with the given text from a TTS model."
122
- ```
123
-
124
- We will release a pre-trained VITS model trained on LJSpeech, which you can download to generate speech by following the inference instructions above.
125
-
126
-
127
- ```bibtex
128
- @inproceedings{kim2021conditional,
129
- title={Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech},
130
- author={Kim, Jaehyeon and Kong, Jungil and Son, Juhee},
131
- booktitle={International Conference on Machine Learning},
132
- pages={5530--5540},
133
- year={2021},
134
- }
135
- ```
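The three `run.sh` invocations described in the README above can be collected into one dry-run helper. This is a minimal sketch, not part of the deleted recipe: the function name `stage_command` is hypothetical, and it only prints the command line for a stage, using the README's own example values, instead of executing anything.

```bash
#!/bin/sh
# Hypothetical dry-run helper (not part of the recipe): prints the run.sh
# command line that the README gives for each stage.
recipe="egs/tts/vits_hifitts_libritts/run.sh"

stage_command() {
    stage=$1
    name=$2   # experiment name; only used by stages 2 and 3
    case "$stage" in
        1) echo "sh $recipe --stage 1" ;;
        2) echo "sh $recipe --stage 2 --name $name" ;;
        3) echo "sh $recipe --stage 3 --gpu 0" \
                "--infer_expt_dir Amphion/ckpts/tts/$name" \
                "--infer_output_dir Amphion/ckpts/tts/$name/result" \
                "--infer_mode batch --infer_dataset LJSpeech --infer_testing_set test" ;;
        *) echo "unknown stage: $stage" >&2; return 1 ;;
    esac
}
```

For example, `stage_command 2 my_vits_run` prints `sh egs/tts/vits_hifitts_libritts/run.sh --stage 2 --name my_vits_run`. Note that the script's stage numbers (1 to 3) cover feature extraction, training, and inference; dataset download is a manual step.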
 
egs/tts/vits_hifitts_libritts/exp_config.json DELETED
@@ -1,33 +0,0 @@
1
- {
2
- "base_config": "config/vits.json",
3
- "model_type": "VITS",
4
- "dataset": [
5
- "hifitts",
6
- "libritts"
7
- ],
8
- "dataset_path": {
9
- // TODO: Fill in your dataset path
10
- "hifitts": "/mnt/workspace/xueliumeng/data/hifitts/hi_fi_tts_v0",
11
- "libritts": "/mnt/workspace/xueliumeng/data/libritts/raw/LibriTTS"
12
- },
13
- // TODO: Fill in the output log path. The default value is "Amphion/ckpts/tts"
14
- "log_dir": "/mnt/workspace/xueliumeng/data/vits_on_libritts_hifitts/logs",
15
- "preprocess": {
16
- "extract_audio": true,
17
- "use_phone": true,
18
- // linguistic features
19
- "extract_phone": true,
20
- "phone_extractor": "espeak", // "espeak, pypinyin, pypinyin_initials_finals, lexicon (only for language=en-us right now)"
21
- // TODO: Fill in the output data path. The default value is "Amphion/data"
22
- "processed_dir": "/mnt/workspace/xueliumeng/data/vits_on_libritts_hifitts/processed_data",
23
- "sample_rate": 24000,
24
- "train_file": "train_all.json",
25
- "valid_file": "valid.json", // validation set
26
- "use_spkid": true, // True: use speaker id for multi-speaker dataset
27
- },
28
- "train": {
29
- "batch_size": 16,
30
- "multi_speaker_training": true, // True: train a multi-speaker model; False: train a single-speaker model
31
- "n_speakers": 2500, // number of speakers; will be set automatically if n_speakers is 0 and multi_speaker_training is true
32
- }
33
- }
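The `dataset_path` entries in this config are absolute paths from the author's machine, so they would not resolve anywhere else until replaced. A minimal pre-flight sketch, assuming you pass the directories by hand: the helper name `check_dataset_dirs` is hypothetical, and it does not parse `exp_config.json`.

```bash
#!/bin/sh
# Hypothetical pre-flight check (not part of the recipe): report every
# dataset directory that does not exist and return non-zero if any is missing.
check_dataset_dirs() {
    status=0
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "[Error] dataset directory not found: $dir" >&2
            status=1
        fi
    done
    return "$status"
}
```

A possible use is to gate the preprocessing stage on it, e.g. `check_dataset_dirs "$HIFITTS_DIR" "$LIBRITTS_DIR" && sh egs/tts/vits_hifitts_libritts/run.sh --stage 1`, where both variables are placeholders for your own paths.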
 
egs/tts/vits_hifitts_libritts/run.sh DELETED
@@ -1,146 +0,0 @@
1
- # Copyright (c) 2023 Amphion.
2
- #
3
- # This source code is licensed under the MIT license found in the
4
- # LICENSE file in the root directory of this source tree.
5
-
6
- ######## Build Experiment Environment ###########
7
- exp_dir=$(cd "$(dirname "$0")" && pwd)
8
- work_dir=$(dirname "$(dirname "$(dirname "$exp_dir")")")
9
-
10
- export WORK_DIR=$work_dir
11
- export PYTHONPATH=$work_dir
12
- export PYTHONIOENCODING=UTF-8
13
-
14
- cd $work_dir/modules/monotonic_align
15
- mkdir -p monotonic_align
16
- python setup.py build_ext --inplace
17
- cd $work_dir
18
-
19
- ######## Parse the Given Parameters from the Command ###########
20
- # options=$(getopt -o c:n:s --long gpu:,config:,infer_expt_dir:,infer_output_dir:,infer_source_file:,infer_source_audio_dir:,infer_target_speaker:,infer_key_shift:,infer_vocoder_dir:,name:,stage: -- "$@")
21
- options=$(getopt -o c:n:s: --long gpu:,config:,infer_expt_dir:,infer_output_dir:,infer_mode:,infer_dataset:,infer_testing_set:,infer_text:,infer_speaker_name:,name:,stage: -- "$@")
22
- eval set -- "$options"
23
-
24
- while true; do
25
- case $1 in
26
- # Experimental Configuration File
27
- -c | --config) shift; exp_config=$1 ; shift ;;
28
- # Experimental Name
29
- -n | --name) shift; exp_name=$1 ; shift ;;
30
- # Running Stage
31
- -s | --stage) shift; running_stage=$1 ; shift ;;
32
- # Visible GPU machines. The default value is "0".
33
- --gpu) shift; gpu=$1 ; shift ;;
34
-
35
- # [Only for Inference] The experiment dir. The value is like "[Your path to save logs and checkpoints]/[YourExptName]"
36
- --infer_expt_dir) shift; infer_expt_dir=$1 ; shift ;;
37
- # [Only for Inference] The output dir to save inferred audios. Its default value is "$infer_expt_dir/result"
38
- --infer_output_dir) shift; infer_output_dir=$1 ; shift ;;
39
- # [Only for Inference] The inference mode. It can be "batch" to generate speech by batch, or "single" to generate a single clip of speech.
40
- --infer_mode) shift; infer_mode=$1 ; shift ;;
41
- # [Only for Inference] The inference dataset. It is only used when the inference mode is "batch".
42
- --infer_dataset) shift; infer_dataset=$1 ; shift ;;
43
- # [Only for Inference] The inference testing set. It is only used when the inference mode is "batch". It can be the "test" set split from the dataset, or "golden_test", carefully selected from the testing set.
44
- --infer_testing_set) shift; infer_testing_set=$1 ; shift ;;
45
- # [Only for Inference] The text to be synthesized. It is only used when the inference mode is "single".
46
- --infer_text) shift; infer_text=$1 ; shift ;;
47
- # [Only for Inference] The speaker voice to be delivered in the synthesized speech. It is only used when the inference mode is "single".
48
- --infer_speaker_name) shift; infer_speaker_name=$1 ; shift ;;
49
-
50
- --) shift ; break ;;
51
- *) echo "Invalid option: $1"; exit 1 ;;
52
- esac
53
- done
54
-
55
-
56
- ### Value check ###
57
- if [ -z "$running_stage" ]; then
58
- echo "[Error] Please specify the running stage"
59
- exit 1
60
- fi
61
-
62
- if [ -z "$exp_config" ]; then
63
- exp_config="${exp_dir}"/exp_config.json
64
- fi
65
- echo "Experimental Configuration File: $exp_config"
66
-
67
- if [ -z "$gpu" ]; then
68
- gpu="0"
69
- fi
70
-
71
- ######## Features Extraction ###########
72
- if [ $running_stage -eq 1 ]; then
73
- CUDA_VISIBLE_DEVICES=$gpu python "${work_dir}"/bins/tts/preprocess.py \
74
- --config=$exp_config \
75
- --num_workers=4
76
- fi
77
-
78
- ######## Training ###########
79
- if [ $running_stage -eq 2 ]; then
80
- if [ -z "$exp_name" ]; then
81
- echo "[Error] Please specify the experiment name"
82
- exit 1
83
- fi
84
- echo "Experimental Name: $exp_name"
85
-
86
- CUDA_VISIBLE_DEVICES=$gpu accelerate launch "${work_dir}"/bins/tts/train.py \
87
- --config $exp_config \
88
- --exp_name $exp_name \
89
- --log_level debug
90
- fi
91
-
92
- ######## Inference ###########
93
- if [ $running_stage -eq 3 ]; then
94
- if [ -z "$infer_expt_dir" ]; then
95
- echo "[Error] Please specify the experimental directory. The value is like [Your path to save logs and checkpoints]/[YourExptName]"
96
- exit 1
97
- fi
98
-
99
- if [ -z "$infer_output_dir" ]; then
100
- infer_output_dir="$infer_expt_dir/result"
101
- fi
102
-
103
- if [ -z "$infer_mode" ]; then
104
- echo "[Error] Please specify the inference mode, e.g., \"batch\", \"single\""
105
- exit 1
106
- fi
107
-
108
- if [ "$infer_mode" = "batch" ] && [ -z "$infer_dataset" ]; then
109
- echo "[Error] Please specify the dataset used in inference when the inference mode is batch"
110
- exit 1
111
- fi
112
-
113
- if [ "$infer_mode" = "batch" ] && [ -z "$infer_testing_set" ]; then
114
- echo "[Error] Please specify the testing set used in inference when the inference mode is batch"
115
- exit 1
116
- fi
117
-
118
- if [ "$infer_mode" = "single" ] && [ -z "$infer_text" ]; then
119
- echo "[Error] Please specify the text to be synthesized when the inference mode is single"
120
- exit 1
121
- fi
122
-
123
- if [ "$infer_mode" = "single" ]; then
124
- echo 'Text: ' ${infer_text}
125
- infer_dataset=None
126
- infer_testing_set=None
127
- elif [ "$infer_mode" = "batch" ]; then
128
- infer_text=''
129
- infer_speaker_name=None
130
- fi
131
-
132
-
133
- CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/tts/inference.py \
134
- --config $exp_config \
135
- --acoustics_dir $infer_expt_dir \
136
- --output_dir $infer_output_dir \
137
- --mode $infer_mode \
138
- --dataset $infer_dataset \
139
- --testing_set $infer_testing_set \
140
- --text "$infer_text" \
141
- --speaker_name $infer_speaker_name \
142
- --log_level debug
143
-
144
-
145
-
146
- fi
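The stage 3 value checks in the script above can be expressed as a single function. The sketch below is an illustrative refactor, not the script's actual code: `validate_infer_args` is a hypothetical name, and unlike the original it also rejects any mode other than `batch` or `single`.

```bash
#!/bin/sh
# Hypothetical refactor of the stage 3 value checks in run.sh.
# Arguments: mode, dataset, testing set, text.
validate_infer_args() {
    mode=$1; dataset=$2; testing_set=$3; text=$4
    case "$mode" in
        batch)
            if [ -z "$dataset" ]; then
                echo "[Error] Please specify the dataset used in inference when the inference mode is batch" >&2
                return 1
            fi
            if [ -z "$testing_set" ]; then
                echo "[Error] Please specify the testing set used in inference when the inference mode is batch" >&2
                return 1
            fi
            ;;
        single)
            if [ -z "$text" ]; then
                echo "[Error] Please specify the text to be synthesized when the inference mode is single" >&2
                return 1
            fi
            ;;
        *)
            # Stricter than the original, which only rejects an empty mode.
            echo "[Error] Please specify the inference mode: batch or single" >&2
            return 1
            ;;
    esac
    return 0
}
```

Inside `run.sh` this could be called as `validate_infer_args "$infer_mode" "$infer_dataset" "$infer_testing_set" "$infer_text" || exit 1` before launching `inference.py`.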