ESPnet
jp
audio
singing-voice-synthesis
ftshijt committed
Commit 1c6dc1a · 1 Parent(s): 032a586

Update model

Files changed (30)
  1. README.md +509 -1
  2. exp/svs_stats_raw_phn_pyopenjtalk_jp/train/feats_stats.npz +3 -0
  3. exp/svs_stats_raw_phn_pyopenjtalk_jp/train/pitch_stats.npz +3 -0
  4. exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/200epoch.pth +3 -0
  5. exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/config.yaml +428 -0
  6. exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/discriminator_backward_time.png +0 -0
  7. exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/discriminator_fake_loss.png +0 -0
  8. exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/discriminator_forward_time.png +0 -0
  9. exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/discriminator_loss.png +0 -0
  10. exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/discriminator_optim_step_time.png +0 -0
  11. exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/discriminator_real_loss.png +0 -0
  12. exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/discriminator_train_time.png +0 -0
  13. exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/generator_adv_loss.png +0 -0
  14. exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/generator_backward_time.png +0 -0
  15. exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/generator_feat_match_loss.png +0 -0
  16. exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/generator_forward_time.png +0 -0
  17. exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/generator_kl_loss.png +0 -0
  18. exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/generator_loss.png +0 -0
  19. exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/generator_mel_loss.png +0 -0
  20. exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/generator_optim_step_time.png +0 -0
  21. exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/generator_phn_dur_loss.png +0 -0
  22. exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/generator_pitch_loss.png +0 -0
  23. exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/generator_score_dur_loss.png +0 -0
  24. exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/generator_train_time.png +0 -0
  25. exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/gpu_max_cached_mem_GB.png +0 -0
  26. exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/iter_time.png +0 -0
  27. exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/optim0_lr0.png +0 -0
  28. exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/optim1_lr0.png +0 -0
  29. exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/train_time.png +0 -0
  30. meta.yaml +8 -0
README.md CHANGED
@@ -1,3 +1,511 @@
  ---
- license: apache-2.0
+ tags:
+ - espnet
+ - audio
+ - singing-voice-synthesis
+ language: jp
+ datasets:
+ - kiritan
+ license: cc-by-4.0
  ---
+
+ ## ESPnet2 SVS model
+
+ ### `espnet/kiritan_svs_visinger`
+
+ This model was trained by ftshijt using the kiritan recipe in [espnet](https://github.com/espnet/espnet/).
+
+ ### Demo: How to use in ESPnet2
+
+ Follow the [ESPnet installation instructions](https://espnet.github.io/espnet/installation.html)
+ if you haven't done that already.
+
+ ```bash
+ cd espnet
+ git checkout 5c4d7cf7feba8461de2e1080bf82182f0efaef38
+ pip install -e .
+ cd egs2/kiritan/svs1
+ ./run.sh --skip_data_prep false --skip_train true --download_model espnet/kiritan_svs_visinger
+ ```
+
+ ## SVS config
+
+ <details><summary>expand</summary>
+
+ ```yaml
+ config: conf/tuning/train_visinger_24.yaml
+ print_config: false
+ log_level: INFO
+ drop_last_iter: false
+ dry_run: false
+ iterator_type: sequence
+ valid_iterator_type: null
+ output_dir: exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp
+ ngpu: 1
+ seed: 777
+ num_workers: 2
+ num_att_plot: 3
+ dist_backend: nccl
+ dist_init_method: env://
+ dist_world_size: null
+ dist_rank: null
+ local_rank: 0
+ dist_master_addr: null
+ dist_master_port: null
+ dist_launcher: null
+ multiprocessing_distributed: false
+ unused_parameters: true
+ sharded_ddp: false
+ cudnn_enabled: true
+ cudnn_benchmark: false
+ cudnn_deterministic: false
+ collect_stats: false
+ write_collected_feats: false
+ max_epoch: 200
+ patience: null
+ val_scheduler_criterion:
+ - valid
+ - loss
+ early_stopping_criterion:
+ - valid
+ - loss
+ - min
+ best_model_criterion:
+ - - train
+ - total_count
+ - max
+ keep_nbest_models: 10
+ nbest_averaging_interval: 0
+ grad_clip: -1
+ grad_clip_type: 2.0
+ grad_noise: false
+ accum_grad: 1
+ no_forward_run: false
+ resume: true
+ train_dtype: float32
+ use_amp: false
+ log_interval: 50
+ use_matplotlib: true
+ use_tensorboard: true
+ create_graph_in_tensorboard: false
+ use_wandb: false
+ wandb_project: null
+ wandb_id: null
+ wandb_entity: null
+ wandb_name: null
+ wandb_model_log_interval: -1
+ detect_anomaly: false
+ use_lora: false
+ save_lora_only: true
+ lora_conf: {}
+ pretrain_path: null
+ init_param: []
+ ignore_init_mismatch: false
+ freeze_param: []
+ num_iters_per_epoch: 1000
+ batch_size: 4
+ valid_batch_size: null
+ batch_bins: 1000000
+ valid_batch_bins: null
+ train_shape_file:
+ - exp/svs_stats_raw_phn_pyopenjtalk_jp/train/text_shape.phn
+ - exp/svs_stats_raw_phn_pyopenjtalk_jp/train/singing_shape
+ valid_shape_file:
+ - exp/svs_stats_raw_phn_pyopenjtalk_jp/valid/text_shape.phn
+ - exp/svs_stats_raw_phn_pyopenjtalk_jp/valid/singing_shape
+ batch_type: sorted
+ valid_batch_type: null
+ fold_length:
+ - 150
+ - 240000
+ sort_in_batch: descending
+ shuffle_within_batch: false
+ sort_batch: descending
+ multiple_iterator: false
+ chunk_length: 500
+ chunk_shift_ratio: 0.5
+ num_cache_chunks: 1024
+ chunk_excluded_key_prefixes: []
+ chunk_default_fs: null
+ train_data_path_and_name_and_type:
+ - - dump/raw/tr_no_dev/text
+ - text
+ - text
+ - - dump/raw/tr_no_dev/wav.scp
+ - singing
+ - sound
+ - - dump/raw/tr_no_dev/label
+ - label
+ - duration
+ - - dump/raw/tr_no_dev/score.scp
+ - score
+ - score
+ valid_data_path_and_name_and_type:
+ - - dump/raw/dev/text
+ - text
+ - text
+ - - dump/raw/dev/wav.scp
+ - singing
+ - sound
+ - - dump/raw/dev/label
+ - label
+ - duration
+ - - dump/raw/dev/score.scp
+ - score
+ - score
+ allow_variable_data_keys: false
+ max_cache_size: 0.0
+ max_cache_fd: 32
+ allow_multi_rates: false
+ valid_max_cache_size: null
+ exclude_weight_decay: false
+ exclude_weight_decay_conf: {}
+ optim: adamw
+ optim_conf:
+ lr: 0.0002
+ betas:
+ - 0.8
+ - 0.99
+ eps: 1.0e-09
+ weight_decay: 0.0
+ scheduler: exponentiallr
+ scheduler_conf:
+ gamma: 0.998
+ optim2: adamw
+ optim2_conf:
+ lr: 0.0002
+ betas:
+ - 0.8
+ - 0.99
+ eps: 1.0e-09
+ weight_decay: 0.0
+ scheduler2: exponentiallr
+ scheduler2_conf:
+ gamma: 0.998
+ generator_first: false
+ token_list:
+ - <blank>
+ - <unk>
+ - pau
+ - a
+ - i
+ - o
+ - e
+ - u
+ - k
+ - n
+ - r
+ - t
+ - m
+ - d
+ - s
+ - N
+ - sh
+ - g
+ - y
+ - b
+ - w
+ - cl
+ - ts
+ - z
+ - ch
+ - j
+ - h
+ - f
+ - p
+ - ky
+ - ry
+ - hy
+ - py
+ - ny
+ - <sos/eos>
+ odim: null
+ model_conf: {}
+ use_preprocessor: true
+ token_type: phn
+ bpemodel: null
+ non_linguistic_symbols: null
+ cleaner: null
+ g2p: pyopenjtalk
+ fs: 24000
+ score_feats_extract: syllable_score_feats
+ score_feats_extract_conf:
+ fs: 24000
+ n_fft: 2048
+ win_length: 1200
+ hop_length: 300
+ feats_extract: fbank
+ feats_extract_conf:
+ n_fft: 2048
+ hop_length: 300
+ win_length: 1200
+ fs: 24000
+ fmin: 80
+ fmax: 7600
+ n_mels: 80
+ normalize: global_mvn
+ normalize_conf:
+ stats_file: exp/svs_stats_raw_phn_pyopenjtalk_jp/train/feats_stats.npz
+ svs: vits
+ svs_conf:
+ generator_type: visinger
+ vocoder_generator_type: hifigan
+ generator_params:
+ hidden_channels: 192
+ spks: -1
+ global_channels: -1
+ segment_size: 20
+ text_encoder_attention_heads: 2
+ text_encoder_ffn_expand: 4
+ text_encoder_blocks: 6
+ text_encoder_positionwise_layer_type: conv1d
+ text_encoder_positionwise_conv_kernel_size: 3
+ text_encoder_positional_encoding_layer_type: rel_pos
+ text_encoder_self_attention_layer_type: rel_selfattn
+ text_encoder_activation_type: swish
+ text_encoder_normalize_before: true
+ text_encoder_dropout_rate: 0.1
+ text_encoder_positional_dropout_rate: 0.0
+ text_encoder_attention_dropout_rate: 0.1
+ use_macaron_style_in_text_encoder: true
+ use_conformer_conv_in_text_encoder: false
+ text_encoder_conformer_kernel_size: -1
+ decoder_kernel_size: 7
+ decoder_channels: 512
+ decoder_upsample_scales:
+ - 5
+ - 5
+ - 4
+ - 3
+ decoder_upsample_kernel_sizes:
+ - 10
+ - 10
+ - 8
+ - 6
+ decoder_resblock_kernel_sizes:
+ - 3
+ - 7
+ - 11
+ decoder_resblock_dilations:
+ - - 1
+ - 3
+ - 5
+ - - 1
+ - 3
+ - 5
+ - - 1
+ - 3
+ - 5
+ use_weight_norm_in_decoder: true
+ posterior_encoder_kernel_size: 3
+ posterior_encoder_layers: 8
+ posterior_encoder_stacks: 1
+ posterior_encoder_base_dilation: 1
+ posterior_encoder_dropout_rate: 0.0
+ use_weight_norm_in_posterior_encoder: true
+ flow_flows: -1
+ flow_kernel_size: 5
+ flow_base_dilation: 1
+ flow_layers: 4
+ flow_dropout_rate: 0.0
+ use_weight_norm_in_flow: true
+ use_only_mean_in_flow: true
+ use_phoneme_predictor: false
+ vocabs: 35
+ aux_channels: 80
+ generator_type: visinger
+ vocoder_generator_type: hifigan
+ fs: 24000
+ hop_length: 300
+ win_length: 1024
+ n_fft: 2048
+ discriminator_type: visinger2
+ discriminator_params:
+ scales: 1
+ scale_downsample_pooling: AvgPool1d
+ scale_downsample_pooling_params:
+ kernel_size: 4
+ stride: 2
+ padding: 2
+ scale_discriminator_params:
+ in_channels: 1
+ out_channels: 1
+ kernel_sizes:
+ - 15
+ - 41
+ - 5
+ - 3
+ channels: 128
+ max_downsample_channels: 1024
+ max_groups: 256
+ bias: true
+ downsample_scales:
+ - 4
+ - 4
+ - 4
+ - 4
+ nonlinear_activation: LeakyReLU
+ nonlinear_activation_params:
+ negative_slope: 0.1
+ use_weight_norm: true
+ use_spectral_norm: false
+ follow_official_norm: false
+ periods:
+ - 2
+ - 3
+ - 5
+ - 7
+ - 11
+ period_discriminator_params:
+ in_channels: 1
+ out_channels: 1
+ kernel_sizes:
+ - 5
+ - 3
+ channels: 32
+ downsample_scales:
+ - 3
+ - 3
+ - 3
+ - 3
+ - 1
+ max_downsample_channels: 1024
+ bias: true
+ nonlinear_activation: LeakyReLU
+ nonlinear_activation_params:
+ negative_slope: 0.1
+ use_weight_norm: true
+ use_spectral_norm: false
+ multi_freq_disc_params:
+ hidden_channels:
+ - 256
+ - 256
+ - 256
+ - 256
+ - 256
+ domain: double
+ mel_scale: true
+ divisors:
+ - 32
+ - 16
+ - 8
+ - 4
+ - 2
+ - 1
+ - 1
+ strides:
+ - 1
+ - 2
+ - 1
+ - 2
+ - 1
+ - 2
+ - 1
+ sample_rate: 24000
+ hop_lengths:
+ - 110
+ - 220
+ - 330
+ - 441
+ - 551
+ - 661
+ generator_adv_loss_params:
+ average_by_discriminators: false
+ loss_type: mse
+ discriminator_adv_loss_params:
+ average_by_discriminators: false
+ loss_type: mse
+ feat_match_loss_params:
+ average_by_discriminators: false
+ average_by_layers: false
+ include_final_outputs: true
+ mel_loss_params:
+ fs: 24000
+ n_fft: 2048
+ hop_length: 300
+ win_length: 1024
+ window: hann
+ n_mels: 80
+ fmin: 0
+ fmax: 7600
+ log_base: null
+ lambda_adv: 1.0
+ lambda_mel: 45.0
+ lambda_feat_match: 2.0
+ lambda_dur: 0.1
+ lambda_pitch: 10.0
+ lambda_phoneme: 1.0
+ lambda_kl: 1.0
+ sampling_rate: 24000
+ cache_generator_outputs: true
+ pitch_extract: dio
+ pitch_extract_conf:
+ use_token_averaged_f0: false
+ use_log_f0: false
+ fs: 24000
+ n_fft: 2048
+ hop_length: 300
+ f0max: 800
+ f0min: 80
+ pitch_normalize: null
+ pitch_normalize_conf:
+ stats_file: exp/svs_stats_raw_phn_pyopenjtalk_jp/train/pitch_stats.npz
+ ying_extract: null
+ ying_extract_conf: {}
+ energy_extract: null
+ energy_extract_conf: {}
+ energy_normalize: null
+ energy_normalize_conf: {}
+ required:
+ - output_dir
+ - token_list
+ version: '202310'
+ distributed: false
+ ```
+
+ </details>
+
+ ### Citing ESPnet
+
+ ```BibTex
+ @inproceedings{watanabe2018espnet,
+ author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
+ title={{ESPnet}: End-to-End Speech Processing Toolkit},
+ year={2018},
+ booktitle={Proceedings of Interspeech},
+ pages={2207--2211},
+ doi={10.21437/Interspeech.2018-1456},
+ url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
+ }
+
+ @inproceedings{shi22d_interspeech,
+ author={Jiatong Shi and Shuai Guo and Tao Qian and Tomoki Hayashi and Yuning Wu and Fangzheng Xu and Xuankai Chang and Huazhe Li and Peter Wu and Shinji Watanabe and Qin Jin},
+ title={{Muskits: an End-to-end Music Processing Toolkit for Singing Voice Synthesis}},
+ year=2022,
+ booktitle={Proc. Interspeech 2022},
+ pages={4277--4281},
+ doi={10.21437/Interspeech.2022-10039}
+ }
+ ```
+
+ or arXiv:
+
+ ```bibtex
+ @misc{watanabe2018espnet,
+ title={ESPnet: End-to-End Speech Processing Toolkit},
+ author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
+ year={2018},
+ eprint={1804.00015},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+ }
+ ```
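Beyond the recipe-level `run.sh` demo in the README above, the packed model can also be fetched programmatically. The sketch below is an editorial addition, not part of the original card: it assumes `espnet` and `espnet_model_zoo` are installed and that the downloader can resolve the Hugging Face model id directly; the returned paths are the ones the ESPnet2 SVS inference entry point (`espnet2.bin.svs_inference`) expects.

```python
# Hedged sketch: download and unpack espnet/kiritan_svs_visinger with espnet_model_zoo.
from espnet_model_zoo.downloader import ModelDownloader

d = ModelDownloader()
# Typically returns a dict with "train_config" and "model_file" paths,
# which can then be handed to the ESPnet2 SVS inference interface.
files = d.download_and_unpack("espnet/kiritan_svs_visinger")
print(files)
```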
exp/svs_stats_raw_phn_pyopenjtalk_jp/train/feats_stats.npz ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b3f8760a03a38e9f6aedceafc940853e135de03be88dc0f400f80111012ae2f4
+ size 1402
exp/svs_stats_raw_phn_pyopenjtalk_jp/train/pitch_stats.npz ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:472f44554816456248d361dd80b0a2a3d17c6dc420486a72fd7a0eedb2144f99
+ size 770
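These two `.npz` files are the statistics referenced by `normalize_conf.stats_file` and `pitch_normalize_conf.stats_file` in the config above. A minimal sketch (editorial addition) for inspecting them after `git lfs pull`; since the exact key layout of ESPnet's collect_stats output can differ between versions, it only lists what is actually stored rather than assuming specific keys:

```python
# Minimal sketch: list the arrays stored in the normalization statistics files.
import numpy as np

for path in [
    "exp/svs_stats_raw_phn_pyopenjtalk_jp/train/feats_stats.npz",
    "exp/svs_stats_raw_phn_pyopenjtalk_jp/train/pitch_stats.npz",
]:
    stats = np.load(path)
    print(path)
    for key in stats.files:
        print(f"  {key}: shape={stats[key].shape}, dtype={stats[key].dtype}")
```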
exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/200epoch.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:53481b7b2dd9609cefa8bc63adafaf6f5fecda8ef2a62b9705b53cb976108a1a
+ size 422717147
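The three lines above are a Git LFS pointer: the checkpoint itself is roughly 423 MB and is identified by the SHA-256 `oid`. A quick way to confirm the object was actually fetched (and not just the pointer) is to recompute the hash locally; a minimal sketch, added editorially, with the path taken relative to the repository root:

```python
# Minimal sketch: verify the LFS-tracked checkpoint against the oid in its pointer file.
import hashlib

path = "exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/200epoch.pth"
expected = "53481b7b2dd9609cefa8bc63adafaf6f5fecda8ef2a62b9705b53cb976108a1a"

h = hashlib.sha256()
with open(path, "rb") as f:
    # Read in 1 MiB chunks so the 400+ MB file is never fully loaded into memory.
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)

print("OK" if h.hexdigest() == expected else "Checksum mismatch (is the LFS object fetched?)")
```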
exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/config.yaml ADDED
@@ -0,0 +1,428 @@
+ config: conf/tuning/train_visinger_24.yaml
+ print_config: false
+ log_level: INFO
+ drop_last_iter: false
+ dry_run: false
+ iterator_type: sequence
+ valid_iterator_type: null
+ output_dir: exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp
+ ngpu: 1
+ seed: 777
+ num_workers: 2
+ num_att_plot: 3
+ dist_backend: nccl
+ dist_init_method: env://
+ dist_world_size: null
+ dist_rank: null
+ local_rank: 0
+ dist_master_addr: null
+ dist_master_port: null
+ dist_launcher: null
+ multiprocessing_distributed: false
+ unused_parameters: true
+ sharded_ddp: false
+ cudnn_enabled: true
+ cudnn_benchmark: false
+ cudnn_deterministic: false
+ collect_stats: false
+ write_collected_feats: false
+ max_epoch: 200
+ patience: null
+ val_scheduler_criterion:
+ - valid
+ - loss
+ early_stopping_criterion:
+ - valid
+ - loss
+ - min
+ best_model_criterion:
+ - - train
+ - total_count
+ - max
+ keep_nbest_models: 10
+ nbest_averaging_interval: 0
+ grad_clip: -1
+ grad_clip_type: 2.0
+ grad_noise: false
+ accum_grad: 1
+ no_forward_run: false
+ resume: true
+ train_dtype: float32
+ use_amp: false
+ log_interval: 50
+ use_matplotlib: true
+ use_tensorboard: true
+ create_graph_in_tensorboard: false
+ use_wandb: false
+ wandb_project: null
+ wandb_id: null
+ wandb_entity: null
+ wandb_name: null
+ wandb_model_log_interval: -1
+ detect_anomaly: false
+ use_lora: false
+ save_lora_only: true
+ lora_conf: {}
+ pretrain_path: null
+ init_param: []
+ ignore_init_mismatch: false
+ freeze_param: []
+ num_iters_per_epoch: 1000
+ batch_size: 4
+ valid_batch_size: null
+ batch_bins: 1000000
+ valid_batch_bins: null
+ train_shape_file:
+ - exp/svs_stats_raw_phn_pyopenjtalk_jp/train/text_shape.phn
+ - exp/svs_stats_raw_phn_pyopenjtalk_jp/train/singing_shape
+ valid_shape_file:
+ - exp/svs_stats_raw_phn_pyopenjtalk_jp/valid/text_shape.phn
+ - exp/svs_stats_raw_phn_pyopenjtalk_jp/valid/singing_shape
+ batch_type: sorted
+ valid_batch_type: null
+ fold_length:
+ - 150
+ - 240000
+ sort_in_batch: descending
+ shuffle_within_batch: false
+ sort_batch: descending
+ multiple_iterator: false
+ chunk_length: 500
+ chunk_shift_ratio: 0.5
+ num_cache_chunks: 1024
+ chunk_excluded_key_prefixes: []
+ chunk_default_fs: null
+ train_data_path_and_name_and_type:
+ - - dump/raw/tr_no_dev/text
+ - text
+ - text
+ - - dump/raw/tr_no_dev/wav.scp
+ - singing
+ - sound
+ - - dump/raw/tr_no_dev/label
+ - label
+ - duration
+ - - dump/raw/tr_no_dev/score.scp
+ - score
+ - score
+ valid_data_path_and_name_and_type:
+ - - dump/raw/dev/text
+ - text
+ - text
+ - - dump/raw/dev/wav.scp
+ - singing
+ - sound
+ - - dump/raw/dev/label
+ - label
+ - duration
+ - - dump/raw/dev/score.scp
+ - score
+ - score
+ allow_variable_data_keys: false
+ max_cache_size: 0.0
+ max_cache_fd: 32
+ allow_multi_rates: false
+ valid_max_cache_size: null
+ exclude_weight_decay: false
+ exclude_weight_decay_conf: {}
+ optim: adamw
+ optim_conf:
+ lr: 0.0002
+ betas:
+ - 0.8
+ - 0.99
+ eps: 1.0e-09
+ weight_decay: 0.0
+ scheduler: exponentiallr
+ scheduler_conf:
+ gamma: 0.998
+ optim2: adamw
+ optim2_conf:
+ lr: 0.0002
+ betas:
+ - 0.8
+ - 0.99
+ eps: 1.0e-09
+ weight_decay: 0.0
+ scheduler2: exponentiallr
+ scheduler2_conf:
+ gamma: 0.998
+ generator_first: false
+ token_list:
+ - <blank>
+ - <unk>
+ - pau
+ - a
+ - i
+ - o
+ - e
+ - u
+ - k
+ - n
+ - r
+ - t
+ - m
+ - d
+ - s
+ - N
+ - sh
+ - g
+ - y
+ - b
+ - w
+ - cl
+ - ts
+ - z
+ - ch
+ - j
+ - h
+ - f
+ - p
+ - ky
+ - ry
+ - hy
+ - py
+ - ny
+ - <sos/eos>
+ odim: null
+ model_conf: {}
+ use_preprocessor: true
+ token_type: phn
+ bpemodel: null
+ non_linguistic_symbols: null
+ cleaner: null
+ g2p: pyopenjtalk
+ fs: 24000
+ score_feats_extract: syllable_score_feats
+ score_feats_extract_conf:
+ fs: 24000
+ n_fft: 2048
+ win_length: 1200
+ hop_length: 300
+ feats_extract: fbank
+ feats_extract_conf:
+ n_fft: 2048
+ hop_length: 300
+ win_length: 1200
+ fs: 24000
+ fmin: 80
+ fmax: 7600
+ n_mels: 80
+ normalize: global_mvn
+ normalize_conf:
+ stats_file: exp/svs_stats_raw_phn_pyopenjtalk_jp/train/feats_stats.npz
+ svs: vits
+ svs_conf:
+ generator_type: visinger
+ vocoder_generator_type: hifigan
+ generator_params:
+ hidden_channels: 192
+ spks: -1
+ global_channels: -1
+ segment_size: 20
+ text_encoder_attention_heads: 2
+ text_encoder_ffn_expand: 4
+ text_encoder_blocks: 6
+ text_encoder_positionwise_layer_type: conv1d
+ text_encoder_positionwise_conv_kernel_size: 3
+ text_encoder_positional_encoding_layer_type: rel_pos
+ text_encoder_self_attention_layer_type: rel_selfattn
+ text_encoder_activation_type: swish
+ text_encoder_normalize_before: true
+ text_encoder_dropout_rate: 0.1
+ text_encoder_positional_dropout_rate: 0.0
+ text_encoder_attention_dropout_rate: 0.1
+ use_macaron_style_in_text_encoder: true
+ use_conformer_conv_in_text_encoder: false
+ text_encoder_conformer_kernel_size: -1
+ decoder_kernel_size: 7
+ decoder_channels: 512
+ decoder_upsample_scales:
+ - 5
+ - 5
+ - 4
+ - 3
+ decoder_upsample_kernel_sizes:
+ - 10
+ - 10
+ - 8
+ - 6
+ decoder_resblock_kernel_sizes:
+ - 3
+ - 7
+ - 11
+ decoder_resblock_dilations:
+ - - 1
+ - 3
+ - 5
+ - - 1
+ - 3
+ - 5
+ - - 1
+ - 3
+ - 5
+ use_weight_norm_in_decoder: true
+ posterior_encoder_kernel_size: 3
+ posterior_encoder_layers: 8
+ posterior_encoder_stacks: 1
+ posterior_encoder_base_dilation: 1
+ posterior_encoder_dropout_rate: 0.0
+ use_weight_norm_in_posterior_encoder: true
+ flow_flows: -1
+ flow_kernel_size: 5
+ flow_base_dilation: 1
+ flow_layers: 4
+ flow_dropout_rate: 0.0
+ use_weight_norm_in_flow: true
+ use_only_mean_in_flow: true
+ use_phoneme_predictor: false
+ vocabs: 35
+ aux_channels: 80
+ generator_type: visinger
+ vocoder_generator_type: hifigan
+ fs: 24000
+ hop_length: 300
+ win_length: 1024
+ n_fft: 2048
+ discriminator_type: visinger2
+ discriminator_params:
+ scales: 1
+ scale_downsample_pooling: AvgPool1d
+ scale_downsample_pooling_params:
+ kernel_size: 4
+ stride: 2
+ padding: 2
+ scale_discriminator_params:
+ in_channels: 1
+ out_channels: 1
+ kernel_sizes:
+ - 15
+ - 41
+ - 5
+ - 3
+ channels: 128
+ max_downsample_channels: 1024
+ max_groups: 256
+ bias: true
+ downsample_scales:
+ - 4
+ - 4
+ - 4
+ - 4
+ nonlinear_activation: LeakyReLU
+ nonlinear_activation_params:
+ negative_slope: 0.1
+ use_weight_norm: true
+ use_spectral_norm: false
+ follow_official_norm: false
+ periods:
+ - 2
+ - 3
+ - 5
+ - 7
+ - 11
+ period_discriminator_params:
+ in_channels: 1
+ out_channels: 1
+ kernel_sizes:
+ - 5
+ - 3
+ channels: 32
+ downsample_scales:
+ - 3
+ - 3
+ - 3
+ - 3
+ - 1
+ max_downsample_channels: 1024
+ bias: true
+ nonlinear_activation: LeakyReLU
+ nonlinear_activation_params:
+ negative_slope: 0.1
+ use_weight_norm: true
+ use_spectral_norm: false
+ multi_freq_disc_params:
+ hidden_channels:
+ - 256
+ - 256
+ - 256
+ - 256
+ - 256
+ domain: double
+ mel_scale: true
+ divisors:
+ - 32
+ - 16
+ - 8
+ - 4
+ - 2
+ - 1
+ - 1
+ strides:
+ - 1
+ - 2
+ - 1
+ - 2
+ - 1
+ - 2
+ - 1
+ sample_rate: 24000
+ hop_lengths:
+ - 110
+ - 220
+ - 330
+ - 441
+ - 551
+ - 661
+ generator_adv_loss_params:
+ average_by_discriminators: false
+ loss_type: mse
+ discriminator_adv_loss_params:
+ average_by_discriminators: false
+ loss_type: mse
+ feat_match_loss_params:
+ average_by_discriminators: false
+ average_by_layers: false
+ include_final_outputs: true
+ mel_loss_params:
+ fs: 24000
+ n_fft: 2048
+ hop_length: 300
+ win_length: 1024
+ window: hann
+ n_mels: 80
+ fmin: 0
+ fmax: 7600
+ log_base: null
+ lambda_adv: 1.0
+ lambda_mel: 45.0
+ lambda_feat_match: 2.0
+ lambda_dur: 0.1
+ lambda_pitch: 10.0
+ lambda_phoneme: 1.0
+ lambda_kl: 1.0
+ sampling_rate: 24000
+ cache_generator_outputs: true
+ pitch_extract: dio
+ pitch_extract_conf:
+ use_token_averaged_f0: false
+ use_log_f0: false
+ fs: 24000
+ n_fft: 2048
+ hop_length: 300
+ f0max: 800
+ f0min: 80
+ pitch_normalize: null
+ pitch_normalize_conf:
+ stats_file: exp/svs_stats_raw_phn_pyopenjtalk_jp/train/pitch_stats.npz
+ ying_extract: null
+ ying_extract_conf: {}
+ energy_extract: null
+ energy_extract_conf: {}
+ energy_normalize: null
+ energy_normalize_conf: {}
+ required:
+ - output_dir
+ - token_list
+ version: '202310'
+ distributed: false
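This is the same training configuration that inference reads back from the experiment directory. A minimal sketch (editorial addition, assuming PyYAML and a local copy of the file; note that the diff viewer flattens YAML indentation above, while the file on disk is ordinary nested YAML):

```python
# Minimal sketch: sanity-check a few fields of the shipped training config.
import yaml

with open("exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["fs"], cfg["g2p"], cfg["svs"])        # 24000, pyopenjtalk, vits
print(len(cfg["token_list"]))                   # 35 tokens incl. <blank>/<unk>/<sos/eos>
print(cfg["svs_conf"]["discriminator_type"])    # visinger2
```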
exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/discriminator_backward_time.png ADDED
exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/discriminator_fake_loss.png ADDED
exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/discriminator_forward_time.png ADDED
exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/discriminator_loss.png ADDED
exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/discriminator_optim_step_time.png ADDED
exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/discriminator_real_loss.png ADDED
exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/discriminator_train_time.png ADDED
exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/generator_adv_loss.png ADDED
exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/generator_backward_time.png ADDED
exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/generator_feat_match_loss.png ADDED
exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/generator_forward_time.png ADDED
exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/generator_kl_loss.png ADDED
exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/generator_loss.png ADDED
exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/generator_mel_loss.png ADDED
exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/generator_optim_step_time.png ADDED
exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/generator_phn_dur_loss.png ADDED
exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/generator_pitch_loss.png ADDED
exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/generator_score_dur_loss.png ADDED
exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/generator_train_time.png ADDED
exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/gpu_max_cached_mem_GB.png ADDED
exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/iter_time.png ADDED
exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/optim0_lr0.png ADDED
exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/optim1_lr0.png ADDED
exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/images/train_time.png ADDED
meta.yaml ADDED
@@ -0,0 +1,8 @@
+ espnet: '202310'
+ files:
+ model_file: exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/200epoch.pth
+ python: "3.9.16 (main, Mar 8 2023, 14:00:05) \n[GCC 11.2.0]"
+ timestamp: 1703352763.255802
+ torch: 1.13.1+cu117
+ yaml_files:
+ train_config: exp/svs_train_visinger_24_raw_phn_pyopenjtalk_jp/config.yaml
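`meta.yaml` is the packing manifest that points at the checkpoint and its training config. A minimal sketch for reading it (editorial addition; it assumes the flattened lines above nest as `files -> model_file` and `yaml_files -> train_config`, which is the usual ESPnet packing layout, so check the actual file if the keys differ):

```python
# Minimal sketch: resolve the model and config paths from the packing manifest.
import yaml

with open("meta.yaml") as f:
    meta = yaml.safe_load(f)

model_file = meta["files"]["model_file"]          # assumed nesting, see note above
train_config = meta["yaml_files"]["train_config"]  # assumed nesting, see note above
print(meta["espnet"], meta["torch"])               # '202310', 1.13.1+cu117
print(model_file, train_config)
```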