Siddhant committed on
Commit
264bf42
1 Parent(s): fc5c51e

import from zenodo

Files changed (18)
  1. README.md +50 -0
  2. exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/config.yaml +292 -0
  3. exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/images/backward_time.png +0 -0
  4. exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/images/duration_loss.png +0 -0
  5. exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/images/energy_loss.png +0 -0
  6. exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/images/forward_time.png +0 -0
  7. exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/images/iter_time.png +0 -0
  8. exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/images/l1_loss.png +0 -0
  9. exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/images/loss.png +0 -0
  10. exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/images/lr_0.png +0 -0
  11. exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/images/optim_step_time.png +0 -0
  12. exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/images/pitch_loss.png +0 -0
  13. exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/images/train_time.png +0 -0
  14. exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/train.loss.ave_5best.pth +3 -0
  15. exp/tts_train_tacotron2_raw_phn_jaconv_pyopenjtalk_accent_with_pause/decode_use_teacher_forcingtrue_train.loss.ave/stats/train/energy_stats.npz +0 -0
  16. exp/tts_train_tacotron2_raw_phn_jaconv_pyopenjtalk_accent_with_pause/decode_use_teacher_forcingtrue_train.loss.ave/stats/train/feats_stats.npz +0 -0
  17. exp/tts_train_tacotron2_raw_phn_jaconv_pyopenjtalk_accent_with_pause/decode_use_teacher_forcingtrue_train.loss.ave/stats/train/pitch_stats.npz +0 -0
  18. meta.yaml +8 -0
README.md ADDED
@@ -0,0 +1,50 @@
+ ---
+ tags:
+ - espnet
+ - audio
+ - text-to-speech
+ language: ja
+ datasets:
+ - jsut
+ license: cc-by-4.0
+ ---
+ ## Example ESPnet2 TTS model
+ ### `kan-bayashi/jsut_conformer_fastspeech2_accent_with_pause`
+ ♻️ Imported from https://zenodo.org/record/4436448/
+
+ This model was trained by kan-bayashi using the jsut/tts1 recipe in [espnet](https://github.com/espnet/espnet/).
+ ### Demo: How to use in ESPnet2
+ ```python
+ # coming soon
+ ```
+ ### Citing ESPnet
+ ```bibtex
+ @inproceedings{watanabe2018espnet,
+   author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
+   title={{ESPnet}: End-to-End Speech Processing Toolkit},
+   year={2018},
+   booktitle={Proceedings of Interspeech},
+   pages={2207--2211},
+   doi={10.21437/Interspeech.2018-1456},
+   url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
+ }
+ @inproceedings{hayashi2020espnet,
+   title={{ESPnet-TTS}: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit},
+   author={Hayashi, Tomoki and Yamamoto, Ryuichi and Inoue, Katsuki and Yoshimura, Takenori and Watanabe, Shinji and Toda, Tomoki and Takeda, Kazuya and Zhang, Yu and Tan, Xu},
+   booktitle={Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+   pages={7654--7658},
+   year={2020},
+   organization={IEEE}
+ }
+ ```
+ or arXiv:
+ ```bibtex
+ @misc{watanabe2018espnet,
+   title={ESPnet: End-to-End Speech Processing Toolkit},
+   author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Enrique Yalta Soplin and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
+   year={2018},
+   eprint={1804.00015},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
+ ```
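Since the demo block in the card above is still a placeholder, here is an illustrative inference sketch, not the authors' official snippet. It assumes a recent espnet release (where `Text2Speech.from_pretrained` exists) plus `espnet_model_zoo`, and the Hugging Face model id below is an assumption based on this repository's naming:

```python
# Hedged sketch: the model id and the availability of from_pretrained
# are assumptions; adjust both to the actual repository/espnet version.
import soundfile
from espnet2.bin.tts_inference import Text2Speech

tts = Text2Speech.from_pretrained(
    "espnet/kan-bayashi_jsut_conformer_fastspeech2_accent_with_pause"
)
# The model predicts mel spectrograms; without an external vocoder,
# Text2Speech falls back to Griffin-Lim for waveform reconstruction.
wav = tts("こんにちは、世界。")["wav"]
soundfile.write("out.wav", wav.numpy(), tts.fs, "PCM_16")
```

For higher-quality audio, a neural vocoder (e.g. a Parallel WaveGAN trained on JSUT at 24 kHz) would normally replace the Griffin-Lim fallback.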
exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/config.yaml ADDED
@@ -0,0 +1,292 @@
+ config: conf/tuning/train_conformer_fastspeech2.yaml
+ print_config: false
+ log_level: INFO
+ dry_run: false
+ iterator_type: sequence
+ output_dir: exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause
+ ngpu: 1
+ seed: 0
+ num_workers: 1
+ num_att_plot: 3
+ dist_backend: nccl
+ dist_init_method: env://
+ dist_world_size: 4
+ dist_rank: 0
+ local_rank: 0
+ dist_master_addr: localhost
+ dist_master_port: 55944
+ dist_launcher: null
+ multiprocessing_distributed: true
+ cudnn_enabled: true
+ cudnn_benchmark: false
+ cudnn_deterministic: true
+ collect_stats: false
+ write_collected_feats: false
+ max_epoch: 200
+ patience: null
+ val_scheduler_criterion:
+ - valid
+ - loss
+ early_stopping_criterion:
+ - valid
+ - loss
+ - min
+ best_model_criterion:
+ -   - valid
+     - loss
+     - min
+ -   - train
+     - loss
+     - min
+ keep_nbest_models: 5
+ grad_clip: 1.0
+ grad_clip_type: 2.0
+ grad_noise: false
+ accum_grad: 1
+ no_forward_run: false
+ resume: true
+ train_dtype: float32
+ use_amp: false
+ log_interval: null
+ unused_parameters: false
+ use_tensorboard: true
+ use_wandb: false
+ wandb_project: null
+ wandb_id: null
+ pretrain_path: null
+ init_param: []
+ freeze_param: []
+ num_iters_per_epoch: 1000
+ batch_size: 20
+ valid_batch_size: null
+ batch_bins: 12000000
+ valid_batch_bins: null
+ train_shape_file:
+ - exp/tts_train_tacotron2_raw_phn_jaconv_pyopenjtalk_accent_with_pause/decode_use_teacher_forcingtrue_train.loss.ave/stats/train/text_shape.phn
+ - exp/tts_train_tacotron2_raw_phn_jaconv_pyopenjtalk_accent_with_pause/decode_use_teacher_forcingtrue_train.loss.ave/stats/train/speech_shape
+ valid_shape_file:
+ - exp/tts_train_tacotron2_raw_phn_jaconv_pyopenjtalk_accent_with_pause/decode_use_teacher_forcingtrue_train.loss.ave/stats/valid/text_shape.phn
+ - exp/tts_train_tacotron2_raw_phn_jaconv_pyopenjtalk_accent_with_pause/decode_use_teacher_forcingtrue_train.loss.ave/stats/valid/speech_shape
+ batch_type: numel
+ valid_batch_type: null
+ fold_length:
+ - 150
+ - 240000
+ sort_in_batch: descending
+ sort_batch: descending
+ multiple_iterator: false
+ chunk_length: 500
+ chunk_shift_ratio: 0.5
+ num_cache_chunks: 1024
+ train_data_path_and_name_and_type:
+ -   - dump/raw/tr_no_dev/text
+     - text
+     - text
+ -   - exp/tts_train_tacotron2_raw_phn_jaconv_pyopenjtalk_accent_with_pause/decode_use_teacher_forcingtrue_train.loss.ave/tr_no_dev/durations
+     - durations
+     - text_int
+ -   - dump/raw/tr_no_dev/wav.scp
+     - speech
+     - sound
+ valid_data_path_and_name_and_type:
+ -   - dump/raw/dev/text
+     - text
+     - text
+ -   - exp/tts_train_tacotron2_raw_phn_jaconv_pyopenjtalk_accent_with_pause/decode_use_teacher_forcingtrue_train.loss.ave/dev/durations
+     - durations
+     - text_int
+ -   - dump/raw/dev/wav.scp
+     - speech
+     - sound
+ allow_variable_data_keys: false
+ max_cache_size: 0.0
+ max_cache_fd: 32
+ valid_max_cache_size: null
+ optim: adam
+ optim_conf:
+     lr: 1.0
+ scheduler: noamlr
+ scheduler_conf:
+     model_size: 384
+     warmup_steps: 4000
+ token_list:
+ - <blank>
+ - <unk>
+ - '1'
+ - '2'
+ - '0'
+ - '3'
+ - '4'
+ - '-1'
+ - '5'
+ - a
+ - o
+ - '-2'
+ - i
+ - '-3'
+ - u
+ - e
+ - k
+ - n
+ - t
+ - '6'
+ - r
+ - '-4'
+ - s
+ - N
+ - m
+ - pau
+ - '7'
+ - sh
+ - d
+ - g
+ - w
+ - '8'
+ - U
+ - '-5'
+ - I
+ - cl
+ - h
+ - y
+ - b
+ - '9'
+ - j
+ - ts
+ - ch
+ - '-6'
+ - z
+ - p
+ - '-7'
+ - f
+ - ky
+ - ry
+ - '-8'
+ - gy
+ - '-9'
+ - hy
+ - ny
+ - '-10'
+ - by
+ - my
+ - '-11'
+ - '-12'
+ - '-13'
+ - py
+ - '-14'
+ - '-15'
+ - v
+ - '10'
+ - '-16'
+ - '-17'
+ - '11'
+ - '-21'
+ - '-20'
+ - '12'
+ - '-19'
+ - '13'
+ - '-18'
+ - '14'
+ - dy
+ - '15'
+ - ty
+ - '-22'
+ - '16'
+ - '18'
+ - '19'
+ - '17'
+ - <sos/eos>
+ odim: null
+ model_conf: {}
+ use_preprocessor: true
+ token_type: phn
+ bpemodel: null
+ non_linguistic_symbols: null
+ cleaner: jaconv
+ g2p: pyopenjtalk_accent_with_pause
+ feats_extract: fbank
+ feats_extract_conf:
+     fs: 24000
+     fmin: 80
+     fmax: 7600
+     n_mels: 80
+     hop_length: 300
+     n_fft: 2048
+     win_length: 1200
+ normalize: global_mvn
+ normalize_conf:
+     stats_file: exp/tts_train_tacotron2_raw_phn_jaconv_pyopenjtalk_accent_with_pause/decode_use_teacher_forcingtrue_train.loss.ave/stats/train/feats_stats.npz
+ tts: fastspeech2
+ tts_conf:
+     adim: 384
+     aheads: 2
+     elayers: 4
+     eunits: 1536
+     dlayers: 4
+     dunits: 1536
+     positionwise_layer_type: conv1d
+     positionwise_conv_kernel_size: 3
+     duration_predictor_layers: 2
+     duration_predictor_chans: 256
+     duration_predictor_kernel_size: 3
+     postnet_layers: 5
+     postnet_filts: 5
+     postnet_chans: 256
+     use_masking: true
+     encoder_normalize_before: true
+     decoder_normalize_before: true
+     reduction_factor: 1
+     encoder_type: conformer
+     decoder_type: conformer
+     conformer_pos_enc_layer_type: rel_pos
+     conformer_self_attn_layer_type: rel_selfattn
+     conformer_activation_type: swish
+     use_macaron_style_in_conformer: true
+     use_cnn_in_conformer: true
+     conformer_enc_kernel_size: 7
+     conformer_dec_kernel_size: 31
+     init_type: xavier_uniform
+     transformer_enc_dropout_rate: 0.2
+     transformer_enc_positional_dropout_rate: 0.2
+     transformer_enc_attn_dropout_rate: 0.2
+     transformer_dec_dropout_rate: 0.2
+     transformer_dec_positional_dropout_rate: 0.2
+     transformer_dec_attn_dropout_rate: 0.2
+     pitch_predictor_layers: 5
+     pitch_predictor_chans: 256
+     pitch_predictor_kernel_size: 5
+     pitch_predictor_dropout: 0.5
+     pitch_embed_kernel_size: 1
+     pitch_embed_dropout: 0.0
+     stop_gradient_from_pitch_predictor: true
+     energy_predictor_layers: 2
+     energy_predictor_chans: 256
+     energy_predictor_kernel_size: 3
+     energy_predictor_dropout: 0.5
+     energy_embed_kernel_size: 1
+     energy_embed_dropout: 0.0
+     stop_gradient_from_energy_predictor: false
+ pitch_extract: dio
+ pitch_extract_conf:
+     fs: 24000
+     n_fft: 2048
+     hop_length: 300
+     f0max: 400
+     f0min: 80
+     reduction_factor: 1
+ pitch_normalize: global_mvn
+ pitch_normalize_conf:
+     stats_file: exp/tts_train_tacotron2_raw_phn_jaconv_pyopenjtalk_accent_with_pause/decode_use_teacher_forcingtrue_train.loss.ave/stats/train/pitch_stats.npz
+ energy_extract: energy
+ energy_extract_conf:
+     fs: 24000
+     n_fft: 2048
+     hop_length: 300
+     win_length: 1200
+     reduction_factor: 1
+ energy_normalize: global_mvn
+ energy_normalize_conf:
+     stats_file: exp/tts_train_tacotron2_raw_phn_jaconv_pyopenjtalk_accent_with_pause/decode_use_teacher_forcingtrue_train.loss.ave/stats/train/energy_stats.npz
+ required:
+ - output_dir
+ - token_list
+ distributed: true
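The config above is plain YAML, so its values can be sanity-checked with any YAML parser. A minimal sketch (assuming PyYAML is available) using an excerpt of the feature-extraction block — with `fs: 24000` and `hop_length: 300`, each mel frame covers 12.5 ms, i.e. 80 frames per second:

```python
import yaml  # PyYAML, assumed installed

# Excerpt of the feats_extract settings from the config.yaml above.
excerpt = """
feats_extract: fbank
feats_extract_conf:
    fs: 24000
    fmin: 80
    fmax: 7600
    n_mels: 80
    hop_length: 300
    n_fft: 2048
    win_length: 1200
"""

conf = yaml.safe_load(excerpt)
fe = conf["feats_extract_conf"]

# Frame rate implied by the sampling rate and hop size.
frames_per_second = fe["fs"] / fe["hop_length"]
print(frames_per_second)  # 80.0
```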
exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/images/backward_time.png ADDED
exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/images/duration_loss.png ADDED
exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/images/energy_loss.png ADDED
exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/images/forward_time.png ADDED
exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/images/iter_time.png ADDED
exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/images/l1_loss.png ADDED
exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/images/loss.png ADDED
exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/images/lr_0.png ADDED
exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/images/optim_step_time.png ADDED
exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/images/pitch_loss.png ADDED
exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/images/train_time.png ADDED
exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/train.loss.ave_5best.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9e72634b7d48b5d5cb88e0382e33a832b4cbe0f108cfaabd5176f340f366af86
+ size 281518122
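The `.pth` checkpoint itself is stored via Git LFS, so the diff shows only the pointer stanza above: a few space-separated key/value lines (`version`, `oid`, `size`). A minimal sketch of parsing such a pointer:

```python
# Git LFS pointer files are tiny "key value" text stanzas; the string
# below reproduces the pointer shown in the diff above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:9e72634b7d48b5d5cb88e0382e33a832b4cbe0f108cfaabd5176f340f366af86
size 281518122
"""

# Split each line on the first space into a key/value pair.
fields = dict(line.split(" ", 1) for line in pointer.strip().splitlines())

algo, digest = fields["oid"].split(":", 1)  # hash algorithm and hex digest
size_mb = int(fields["size"]) / 1e6

print(algo)                # sha256
print(round(size_mb, 1))   # 281.5
```

So the checkpoint is roughly 282 MB; the actual bytes live in LFS storage, addressed by the SHA-256 digest.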
exp/tts_train_tacotron2_raw_phn_jaconv_pyopenjtalk_accent_with_pause/decode_use_teacher_forcingtrue_train.loss.ave/stats/train/energy_stats.npz ADDED
Binary file (770 Bytes).
exp/tts_train_tacotron2_raw_phn_jaconv_pyopenjtalk_accent_with_pause/decode_use_teacher_forcingtrue_train.loss.ave/stats/train/feats_stats.npz ADDED
Binary file (1.4 kB).
exp/tts_train_tacotron2_raw_phn_jaconv_pyopenjtalk_accent_with_pause/decode_use_teacher_forcingtrue_train.loss.ave/stats/train/pitch_stats.npz ADDED
Binary file (770 Bytes).
meta.yaml ADDED
@@ -0,0 +1,8 @@
+ espnet: 0.8.0
+ files:
+     model_file: exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/train.loss.ave_5best.pth
+ python: "3.7.3 (default, Mar 27 2019, 22:11:17) \n[GCC 7.3.0]"
+ timestamp: 1610543773.910821
+ torch: 1.5.1
+ yaml_files:
+     train_config: exp/tts_train_conformer_fastspeech2_tacotron2_teacher_raw_phn_jaconv_pyopenjtalk_accent_with_pause/config.yaml