Yuning Wu committed
Commit ed470d8 · 1 Parent(s): 4e6a55a

Update model

Files changed (30)
  1. README.md +531 -0
  2. exp/44k/svs_train_visinger2_raw_phn_None_zh/100epoch.pth +3 -0
  3. exp/44k/svs_train_visinger2_raw_phn_None_zh/config.yaml +458 -0
  4. exp/44k/svs_train_visinger2_raw_phn_None_zh/images/discriminator_backward_time.png +0 -0
  5. exp/44k/svs_train_visinger2_raw_phn_None_zh/images/discriminator_fake_loss.png +0 -0
  6. exp/44k/svs_train_visinger2_raw_phn_None_zh/images/discriminator_forward_time.png +0 -0
  7. exp/44k/svs_train_visinger2_raw_phn_None_zh/images/discriminator_loss.png +0 -0
  8. exp/44k/svs_train_visinger2_raw_phn_None_zh/images/discriminator_optim_step_time.png +0 -0
  9. exp/44k/svs_train_visinger2_raw_phn_None_zh/images/discriminator_real_loss.png +0 -0
  10. exp/44k/svs_train_visinger2_raw_phn_None_zh/images/discriminator_train_time.png +0 -0
  11. exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_adv_loss.png +0 -0
  12. exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_backward_time.png +0 -0
  13. exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_feat_match_loss.png +0 -0
  14. exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_forward_time.png +0 -0
  15. exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_kl_loss.png +0 -0
  16. exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_loss.png +0 -0
  17. exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_mel_am_loss.png +0 -0
  18. exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_mel_ddsp_loss.png +0 -0
  19. exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_mel_loss.png +0 -0
  20. exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_optim_step_time.png +0 -0
  21. exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_phn_dur_loss.png +0 -0
  22. exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_pitch_loss.png +0 -0
  23. exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_score_dur_loss.png +0 -0
  24. exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_train_time.png +0 -0
  25. exp/44k/svs_train_visinger2_raw_phn_None_zh/images/gpu_max_cached_mem_GB.png +0 -0
  26. exp/44k/svs_train_visinger2_raw_phn_None_zh/images/iter_time.png +0 -0
  27. exp/44k/svs_train_visinger2_raw_phn_None_zh/images/optim0_lr0.png +0 -0
  28. exp/44k/svs_train_visinger2_raw_phn_None_zh/images/optim1_lr0.png +0 -0
  29. exp/44k/svs_train_visinger2_raw_phn_None_zh/images/train_time.png +0 -0
  30. meta.yaml +8 -0
README.md ADDED
@@ -0,0 +1,531 @@
+ ---
+ tags:
+ - espnet
+ - audio
+ - singing-voice-synthesis
+ language: zh
+ datasets:
+ - opencpop
+ license: cc-by-4.0
+ ---
+
+ ## ESPnet2 SVS model
+
+ ### `AQuarterMile/opencpop-visinger2`
+
+ This model was trained by Yuning Wu using the opencpop recipe in [espnet](https://github.com/espnet/espnet/).
+
+ ### Demo: How to use in ESPnet2
+
+ Follow the [ESPnet installation instructions](https://espnet.github.io/espnet/installation.html)
+ if you haven't done that already.
+
+ ```bash
+ cd espnet
+ git checkout 13b986ce0595b4366c6e3381a981379600b958a4
+ pip install -e .
+ cd egs2/opencpop/svs1
+ ./run.sh --skip_data_prep false --skip_train true --download_model AQuarterMile/opencpop-visinger2
+ ```
+
+
+
+ ## SVS config
+
+ <details><summary>expand</summary>
+
+ ```
+ config: ./conf/tuning/train_visinger2.yaml
+ print_config: false
+ log_level: INFO
+ dry_run: false
+ iterator_type: sequence
+ output_dir: exp/44k/svs_train_visinger2_raw_phn_None_zh
+ ngpu: 1
+ seed: 777
+ num_workers: 4
+ num_att_plot: 3
+ dist_backend: nccl
+ dist_init_method: env://
+ dist_world_size: null
+ dist_rank: null
+ local_rank: 0
+ dist_master_addr: null
+ dist_master_port: null
+ dist_launcher: null
+ multiprocessing_distributed: false
+ unused_parameters: true
+ sharded_ddp: false
+ cudnn_enabled: true
+ cudnn_benchmark: false
+ cudnn_deterministic: false
+ collect_stats: false
+ write_collected_feats: false
+ max_epoch: 100
+ patience: null
+ val_scheduler_criterion:
+ - valid
+ - loss
+ early_stopping_criterion:
+ - valid
+ - loss
+ - min
+ best_model_criterion:
+ - - train
+ - total_count
+ - max
+ keep_nbest_models: 10
+ nbest_averaging_interval: 0
+ grad_clip: -1
+ grad_clip_type: 2.0
+ grad_noise: false
+ accum_grad: 1
+ no_forward_run: false
+ resume: true
+ train_dtype: float32
+ use_amp: false
+ log_interval: 50
+ use_matplotlib: true
+ use_tensorboard: true
+ create_graph_in_tensorboard: false
+ use_wandb: false
+ wandb_project: null
+ wandb_id: null
+ wandb_entity: null
+ wandb_name: null
+ wandb_model_log_interval: -1
+ detect_anomaly: false
+ pretrain_path: null
+ init_param: []
+ ignore_init_mismatch: false
+ freeze_param: []
+ num_iters_per_epoch: 1000
+ batch_size: 8
+ valid_batch_size: null
+ batch_bins: 1000000
+ valid_batch_bins: null
+ train_shape_file:
+ - exp/44k/svs_stats_raw_phn_None_zh/train/text_shape.phn
+ - exp/44k/svs_stats_raw_phn_None_zh/train/singing_shape
+ valid_shape_file:
+ - exp/44k/svs_stats_raw_phn_None_zh/valid/text_shape.phn
+ - exp/44k/svs_stats_raw_phn_None_zh/valid/singing_shape
+ batch_type: sorted
+ valid_batch_type: null
+ fold_length:
+ - 150
+ - 409600
+ sort_in_batch: descending
+ sort_batch: descending
+ multiple_iterator: false
+ chunk_length: 500
+ chunk_shift_ratio: 0.5
+ num_cache_chunks: 1024
+ chunk_excluded_key_prefixes: []
+ train_data_path_and_name_and_type:
+ - - dump/44k/raw/tr_no_dev/text
+ - text
+ - text
+ - - dump/44k/raw/tr_no_dev/wav.scp
+ - singing
+ - sound
+ - - dump/44k/raw/tr_no_dev/label
+ - label
+ - duration
+ - - dump/44k/raw/tr_no_dev/score.scp
+ - score
+ - score
+ - - exp/44k/svs_stats_raw_phn_None_zh/train/collect_feats/pitch.scp
+ - pitch
+ - npy
+ - - exp/44k/svs_stats_raw_phn_None_zh/train/collect_feats/feats.scp
+ - feats
+ - npy
+ valid_data_path_and_name_and_type:
+ - - dump/44k/raw/dev/text
+ - text
+ - text
+ - - dump/44k/raw/dev/wav.scp
+ - singing
+ - sound
+ - - dump/44k/raw/dev/label
+ - label
+ - duration
+ - - dump/44k/raw/dev/score.scp
+ - score
+ - score
+ - - exp/44k/svs_stats_raw_phn_None_zh/valid/collect_feats/pitch.scp
+ - pitch
+ - npy
+ - - exp/44k/svs_stats_raw_phn_None_zh/valid/collect_feats/feats.scp
+ - feats
+ - npy
+ allow_variable_data_keys: false
+ max_cache_size: 0.0
+ max_cache_fd: 32
+ valid_max_cache_size: null
+ exclude_weight_decay: false
+ exclude_weight_decay_conf: {}
+ optim: adamw
+ optim_conf:
+ lr: 0.0002
+ betas:
+ - 0.8
+ - 0.99
+ eps: 1.0e-09
+ weight_decay: 0.0
+ scheduler: exponentiallr
+ scheduler_conf:
+ gamma: 0.998
+ optim2: adamw
+ optim2_conf:
+ lr: 0.0002
+ betas:
+ - 0.8
+ - 0.99
+ eps: 1.0e-09
+ weight_decay: 0.0
+ scheduler2: exponentiallr
+ scheduler2_conf:
+ gamma: 0.998
+ generator_first: false
+ token_list:
+ - <blank>
+ - <unk>
+ - SP
+ - i
+ - AP
+ - e
+ - y
+ - d
+ - w
+ - sh
+ - ai
+ - n
+ - x
+ - j
+ - ian
+ - u
+ - l
+ - h
+ - b
+ - o
+ - zh
+ - an
+ - ou
+ - m
+ - q
+ - z
+ - en
+ - g
+ - ing
+ - ei
+ - ao
+ - ang
+ - uo
+ - eng
+ - t
+ - a
+ - ong
+ - ui
+ - k
+ - f
+ - r
+ - iang
+ - ch
+ - v
+ - in
+ - iao
+ - ie
+ - iu
+ - c
+ - s
+ - van
+ - p
+ - ve
+ - uan
+ - uang
+ - ia
+ - ua
+ - uai
+ - un
+ - er
+ - vn
+ - iong
+ - <sos/eos>
+ odim: null
+ model_conf: {}
+ use_preprocessor: true
+ token_type: phn
+ bpemodel: null
+ non_linguistic_symbols: null
+ cleaner: null
+ g2p: null
+ fs: 44100
+ score_feats_extract: syllable_score_feats
+ score_feats_extract_conf:
+ fs: 44100
+ n_fft: 2048
+ win_length: 2048
+ hop_length: 512
+ feats_extract: fbank
+ feats_extract_conf:
+ n_fft: 2048
+ hop_length: 512
+ win_length: 2048
+ fs: 44100
+ fmin: 0
+ fmax: 22050
+ n_mels: 80
+ normalize: null
+ normalize_conf: {}
+ svs: vits
+ svs_conf:
+ generator_type: visinger2
+ vocoder_generator_type: visinger2
+ generator_params:
+ hidden_channels: 192
+ spks: -1
+ global_channels: -1
+ segment_size: 20
+ text_encoder_attention_heads: 2
+ text_encoder_ffn_expand: 4
+ text_encoder_blocks: 6
+ text_encoder_positionwise_layer_type: conv1d
+ text_encoder_positionwise_conv_kernel_size: 3
+ text_encoder_positional_encoding_layer_type: rel_pos
+ text_encoder_self_attention_layer_type: rel_selfattn
+ text_encoder_activation_type: swish
+ text_encoder_normalize_before: true
+ text_encoder_dropout_rate: 0.1
+ text_encoder_positional_dropout_rate: 0.0
+ text_encoder_attention_dropout_rate: 0.1
+ use_macaron_style_in_text_encoder: true
+ use_conformer_conv_in_text_encoder: false
+ text_encoder_conformer_kernel_size: -1
+ decoder_kernel_size: 7
+ decoder_channels: 256
+ decoder_upsample_scales:
+ - 8
+ - 8
+ - 4
+ - 2
+ decoder_upsample_kernel_sizes:
+ - 16
+ - 16
+ - 8
+ - 4
+ n_harmonic: 64
+ decoder_resblock_kernel_sizes:
+ - 3
+ - 7
+ - 11
+ decoder_resblock_dilations:
+ - - 1
+ - 3
+ - 5
+ - - 1
+ - 3
+ - 5
+ - - 1
+ - 3
+ - 5
+ use_weight_norm_in_decoder: true
+ posterior_encoder_kernel_size: 3
+ posterior_encoder_layers: 8
+ posterior_encoder_stacks: 1
+ posterior_encoder_base_dilation: 1
+ posterior_encoder_dropout_rate: 0.0
+ use_weight_norm_in_posterior_encoder: true
+ flow_flows: -1
+ flow_kernel_size: 5
+ flow_base_dilation: 1
+ flow_layers: 4
+ flow_dropout_rate: 0.0
+ use_weight_norm_in_flow: true
+ use_only_mean_in_flow: true
+ use_phoneme_predictor: false
+ vocabs: 63
+ aux_channels: 80
+ generator_type: visinger2
+ vocoder_generator_type: visinger2
+ fs: 44100
+ hop_length: 512
+ win_length: 2048
+ n_fft: 2048
+ discriminator_type: visinger2
+ discriminator_params:
+ scales: 1
+ scale_downsample_pooling: AvgPool1d
+ scale_downsample_pooling_params:
+ kernel_size: 4
+ stride: 2
+ padding: 2
+ scale_discriminator_params:
+ in_channels: 1
+ out_channels: 1
+ kernel_sizes:
+ - 15
+ - 41
+ - 5
+ - 3
+ channels: 128
+ max_downsample_channels: 1024
+ max_groups: 256
+ bias: true
+ downsample_scales:
+ - 4
+ - 4
+ - 4
+ - 4
+ nonlinear_activation: LeakyReLU
+ nonlinear_activation_params:
+ negative_slope: 0.1
+ use_weight_norm: true
+ use_spectral_norm: false
+ follow_official_norm: false
+ periods:
+ - 2
+ - 3
+ - 5
+ - 7
+ - 11
+ period_discriminator_params:
+ in_channels: 1
+ out_channels: 1
+ kernel_sizes:
+ - 5
+ - 3
+ channels: 32
+ downsample_scales:
+ - 3
+ - 3
+ - 3
+ - 3
+ - 1
+ max_downsample_channels: 1024
+ bias: true
+ nonlinear_activation: LeakyReLU
+ nonlinear_activation_params:
+ negative_slope: 0.1
+ use_weight_norm: true
+ use_spectral_norm: false
+ multi_freq_disc_params:
+ hidden_channels:
+ - 256
+ - 256
+ - 256
+ - 256
+ - 256
+ domain: double
+ mel_scale: true
+ divisors:
+ - 32
+ - 16
+ - 8
+ - 4
+ - 2
+ - 1
+ - 1
+ strides:
+ - 1
+ - 2
+ - 1
+ - 2
+ - 1
+ - 2
+ - 1
+ hop_lengths:
+ - 55
+ - 110
+ - 165
+ - 220
+ - 275
+ - 330
+ generator_adv_loss_params:
+ average_by_discriminators: false
+ loss_type: mse
+ discriminator_adv_loss_params:
+ average_by_discriminators: false
+ loss_type: mse
+ feat_match_loss_params:
+ average_by_discriminators: false
+ average_by_layers: false
+ include_final_outputs: true
+ mel_loss_params:
+ fs: 44100
+ n_fft: 2048
+ hop_length: 512
+ win_length: 2048
+ window: hann
+ n_mels: 80
+ fmin: 0
+ fmax: 22050
+ log_base: null
+ lambda_adv: 1.0
+ lambda_mel: 45.0
+ lambda_feat_match: 2.0
+ lambda_dur: 0.1
+ lambda_pitch: 10.0
+ lambda_phoneme: 1.0
+ lambda_kl: 1.0
+ sampling_rate: 44100
+ cache_generator_outputs: true
+ pitch_extract: dio
+ pitch_extract_conf:
+ use_token_averaged_f0: false
+ use_log_f0: false
+ fs: 44100
+ n_fft: 2048
+ hop_length: 512
+ f0max: 800
+ f0min: 80
+ pitch_normalize: null
+ pitch_normalize_conf: {}
+ ying_extract: null
+ ying_extract_conf: {}
+ energy_extract: null
+ energy_extract_conf: {}
+ energy_normalize: null
+ energy_normalize_conf: {}
+ required:
+ - output_dir
+ - token_list
+ version: '202301'
+ distributed: false
+ ```
+
+ </details>
+
+
+
+ ### Citing ESPnet
+
+ ```BibTex
+ @inproceedings{watanabe2018espnet,
+ author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
+ title={{ESPnet}: End-to-End Speech Processing Toolkit},
+ year={2018},
+ booktitle={Proceedings of Interspeech},
+ pages={2207--2211},
+ doi={10.21437/Interspeech.2018-1456},
+ url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
+ }
+
+
+
+
+ ```
+
+ or arXiv:
+
+ ```bibtex
+ @misc{watanabe2018espnet,
+ title={ESPnet: End-to-End Speech Processing Toolkit},
+ author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
+ year={2018},
+ eprint={1804.00015},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+ }
+ ```
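Reviewer note: several signal-processing values in the SVS config above depend on one another. The sketch below derives the frame timing from `fs`, `hop_length`, `decoder_upsample_scales` and `segment_size`, and checks that the decoder's total upsampling factor equals the hop length. The helper name `frame_stats` is hypothetical, and reading `segment_size` as a count of frames is an assumption, not something the config states.

```python
from math import prod

# Values copied from the SVS config in the README diff above.
FS = 44100
HOP_LENGTH = 512
DECODER_UPSAMPLE_SCALES = [8, 8, 4, 2]
SEGMENT_SIZE = 20  # assumption: counted in frames, not samples


def frame_stats(fs, hop_length, upsample_scales, segment_frames):
    """Hypothetical helper: derive frame timing from the config values."""
    total_upsampling = prod(upsample_scales)
    return {
        "total_upsampling": total_upsampling,
        # The decoder turns one frame into `total_upsampling` samples,
        # so this should equal the analysis hop length.
        "matches_hop": total_upsampling == hop_length,
        "frames_per_second": fs / hop_length,
        "frame_shift_ms": 1000.0 * hop_length / fs,
        "segment_samples": segment_frames * hop_length,
        "segment_seconds": segment_frames * hop_length / fs,
    }


stats = frame_stats(FS, HOP_LENGTH, DECODER_UPSAMPLE_SCALES, SEGMENT_SIZE)
for name, value in stats.items():
    print(f"{name}: {value}")
```

If `matches_hop` were false, the generated waveform length would not line up with the mel-spectrogram frames, so this is a cheap sanity check before editing either value.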
exp/44k/svs_train_visinger2_raw_phn_None_zh/100epoch.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7d9ad81e9ae016b6db45bdb83fda3264823a917593a0b1ebf82e597a964f85f5
+ size 430659035
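Reviewer note: the three lines added for `100epoch.pth` are a Git LFS pointer, not the checkpoint itself. As a minimal sketch, the pointer can be split into its key/value fields to read the hash algorithm, digest and payload size. The function name `parse_lfs_pointer` is hypothetical, and the parser assumes the simple "key, space, value" layout seen here.

```python
# Pointer text copied from the diff above.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:7d9ad81e9ae016b6db45bdb83fda3264823a917593a0b1ebf82e597a964f85f5
size 430659035
"""


def parse_lfs_pointer(text):
    """Hypothetical helper: split 'key value' lines of an LFS pointer."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid field has the form '<hash algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
size_mib = pointer["size_bytes"] / 2**20
print(pointer["hash_algo"], pointer["digest"][:12], f"{size_mib:.1f} MiB")
```

After downloading the real file, comparing its SHA-256 digest and byte size against these fields is one way to confirm the download is complete.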
exp/44k/svs_train_visinger2_raw_phn_None_zh/config.yaml ADDED
@@ -0,0 +1,458 @@
+ config: ./conf/tuning/train_visinger2.yaml
+ print_config: false
+ log_level: INFO
+ dry_run: false
+ iterator_type: sequence
+ output_dir: exp/44k/svs_train_visinger2_raw_phn_None_zh
+ ngpu: 1
+ seed: 777
+ num_workers: 4
+ num_att_plot: 3
+ dist_backend: nccl
+ dist_init_method: env://
+ dist_world_size: null
+ dist_rank: null
+ local_rank: 0
+ dist_master_addr: null
+ dist_master_port: null
+ dist_launcher: null
+ multiprocessing_distributed: false
+ unused_parameters: true
+ sharded_ddp: false
+ cudnn_enabled: true
+ cudnn_benchmark: false
+ cudnn_deterministic: false
+ collect_stats: false
+ write_collected_feats: false
+ max_epoch: 100
+ patience: null
+ val_scheduler_criterion:
+ - valid
+ - loss
+ early_stopping_criterion:
+ - valid
+ - loss
+ - min
+ best_model_criterion:
+ - - train
+ - total_count
+ - max
+ keep_nbest_models: 10
+ nbest_averaging_interval: 0
+ grad_clip: -1
+ grad_clip_type: 2.0
+ grad_noise: false
+ accum_grad: 1
+ no_forward_run: false
+ resume: true
+ train_dtype: float32
+ use_amp: false
+ log_interval: 50
+ use_matplotlib: true
+ use_tensorboard: true
+ create_graph_in_tensorboard: false
+ use_wandb: false
+ wandb_project: null
+ wandb_id: null
+ wandb_entity: null
+ wandb_name: null
+ wandb_model_log_interval: -1
+ detect_anomaly: false
+ pretrain_path: null
+ init_param: []
+ ignore_init_mismatch: false
+ freeze_param: []
+ num_iters_per_epoch: 1000
+ batch_size: 8
+ valid_batch_size: null
+ batch_bins: 1000000
+ valid_batch_bins: null
+ train_shape_file:
+ - exp/44k/svs_stats_raw_phn_None_zh/train/text_shape.phn
+ - exp/44k/svs_stats_raw_phn_None_zh/train/singing_shape
+ valid_shape_file:
+ - exp/44k/svs_stats_raw_phn_None_zh/valid/text_shape.phn
+ - exp/44k/svs_stats_raw_phn_None_zh/valid/singing_shape
+ batch_type: sorted
+ valid_batch_type: null
+ fold_length:
+ - 150
+ - 409600
+ sort_in_batch: descending
+ sort_batch: descending
+ multiple_iterator: false
+ chunk_length: 500
+ chunk_shift_ratio: 0.5
+ num_cache_chunks: 1024
+ chunk_excluded_key_prefixes: []
+ train_data_path_and_name_and_type:
+ - - dump/44k/raw/tr_no_dev/text
+ - text
+ - text
+ - - dump/44k/raw/tr_no_dev/wav.scp
+ - singing
+ - sound
+ - - dump/44k/raw/tr_no_dev/label
+ - label
+ - duration
+ - - dump/44k/raw/tr_no_dev/score.scp
+ - score
+ - score
+ - - exp/44k/svs_stats_raw_phn_None_zh/train/collect_feats/pitch.scp
+ - pitch
+ - npy
+ - - exp/44k/svs_stats_raw_phn_None_zh/train/collect_feats/feats.scp
+ - feats
+ - npy
+ valid_data_path_and_name_and_type:
+ - - dump/44k/raw/dev/text
+ - text
+ - text
+ - - dump/44k/raw/dev/wav.scp
+ - singing
+ - sound
+ - - dump/44k/raw/dev/label
+ - label
+ - duration
+ - - dump/44k/raw/dev/score.scp
+ - score
+ - score
+ - - exp/44k/svs_stats_raw_phn_None_zh/valid/collect_feats/pitch.scp
+ - pitch
+ - npy
+ - - exp/44k/svs_stats_raw_phn_None_zh/valid/collect_feats/feats.scp
+ - feats
+ - npy
+ allow_variable_data_keys: false
+ max_cache_size: 0.0
+ max_cache_fd: 32
+ valid_max_cache_size: null
+ exclude_weight_decay: false
+ exclude_weight_decay_conf: {}
+ optim: adamw
+ optim_conf:
+ lr: 0.0002
+ betas:
+ - 0.8
+ - 0.99
+ eps: 1.0e-09
+ weight_decay: 0.0
+ scheduler: exponentiallr
+ scheduler_conf:
+ gamma: 0.998
+ optim2: adamw
+ optim2_conf:
+ lr: 0.0002
+ betas:
+ - 0.8
+ - 0.99
+ eps: 1.0e-09
+ weight_decay: 0.0
+ scheduler2: exponentiallr
+ scheduler2_conf:
+ gamma: 0.998
+ generator_first: false
+ token_list:
+ - <blank>
+ - <unk>
+ - SP
+ - i
+ - AP
+ - e
+ - y
+ - d
+ - w
+ - sh
+ - ai
+ - n
+ - x
+ - j
+ - ian
+ - u
+ - l
+ - h
+ - b
+ - o
+ - zh
+ - an
+ - ou
+ - m
+ - q
+ - z
+ - en
+ - g
+ - ing
+ - ei
+ - ao
+ - ang
+ - uo
+ - eng
+ - t
+ - a
+ - ong
+ - ui
+ - k
+ - f
+ - r
+ - iang
+ - ch
+ - v
+ - in
+ - iao
+ - ie
+ - iu
+ - c
+ - s
+ - van
+ - p
+ - ve
+ - uan
+ - uang
+ - ia
+ - ua
+ - uai
+ - un
+ - er
+ - vn
+ - iong
+ - <sos/eos>
+ odim: null
+ model_conf: {}
+ use_preprocessor: true
+ token_type: phn
+ bpemodel: null
+ non_linguistic_symbols: null
+ cleaner: null
+ g2p: null
+ fs: 44100
+ score_feats_extract: syllable_score_feats
+ score_feats_extract_conf:
+ fs: 44100
+ n_fft: 2048
+ win_length: 2048
+ hop_length: 512
+ feats_extract: fbank
+ feats_extract_conf:
+ n_fft: 2048
+ hop_length: 512
+ win_length: 2048
+ fs: 44100
+ fmin: 0
+ fmax: 22050
+ n_mels: 80
+ normalize: null
+ normalize_conf: {}
+ svs: vits
+ svs_conf:
+ generator_type: visinger2
+ vocoder_generator_type: visinger2
+ generator_params:
+ hidden_channels: 192
+ spks: -1
+ global_channels: -1
+ segment_size: 20
+ text_encoder_attention_heads: 2
+ text_encoder_ffn_expand: 4
+ text_encoder_blocks: 6
+ text_encoder_positionwise_layer_type: conv1d
+ text_encoder_positionwise_conv_kernel_size: 3
+ text_encoder_positional_encoding_layer_type: rel_pos
+ text_encoder_self_attention_layer_type: rel_selfattn
+ text_encoder_activation_type: swish
+ text_encoder_normalize_before: true
+ text_encoder_dropout_rate: 0.1
+ text_encoder_positional_dropout_rate: 0.0
+ text_encoder_attention_dropout_rate: 0.1
+ use_macaron_style_in_text_encoder: true
+ use_conformer_conv_in_text_encoder: false
+ text_encoder_conformer_kernel_size: -1
+ decoder_kernel_size: 7
+ decoder_channels: 256
+ decoder_upsample_scales:
+ - 8
+ - 8
+ - 4
+ - 2
+ decoder_upsample_kernel_sizes:
+ - 16
+ - 16
+ - 8
+ - 4
+ n_harmonic: 64
+ decoder_resblock_kernel_sizes:
+ - 3
+ - 7
+ - 11
+ decoder_resblock_dilations:
+ - - 1
+ - 3
+ - 5
+ - - 1
+ - 3
+ - 5
+ - - 1
+ - 3
+ - 5
+ use_weight_norm_in_decoder: true
+ posterior_encoder_kernel_size: 3
+ posterior_encoder_layers: 8
+ posterior_encoder_stacks: 1
+ posterior_encoder_base_dilation: 1
+ posterior_encoder_dropout_rate: 0.0
+ use_weight_norm_in_posterior_encoder: true
+ flow_flows: -1
+ flow_kernel_size: 5
+ flow_base_dilation: 1
+ flow_layers: 4
+ flow_dropout_rate: 0.0
+ use_weight_norm_in_flow: true
+ use_only_mean_in_flow: true
+ use_phoneme_predictor: false
+ vocabs: 63
+ aux_channels: 80
+ generator_type: visinger2
+ vocoder_generator_type: visinger2
+ fs: 44100
+ hop_length: 512
+ win_length: 2048
+ n_fft: 2048
+ discriminator_type: visinger2
+ discriminator_params:
+ scales: 1
+ scale_downsample_pooling: AvgPool1d
+ scale_downsample_pooling_params:
+ kernel_size: 4
+ stride: 2
+ padding: 2
+ scale_discriminator_params:
+ in_channels: 1
+ out_channels: 1
+ kernel_sizes:
+ - 15
+ - 41
+ - 5
+ - 3
+ channels: 128
+ max_downsample_channels: 1024
+ max_groups: 256
+ bias: true
+ downsample_scales:
+ - 4
+ - 4
+ - 4
+ - 4
+ nonlinear_activation: LeakyReLU
+ nonlinear_activation_params:
+ negative_slope: 0.1
+ use_weight_norm: true
+ use_spectral_norm: false
+ follow_official_norm: false
+ periods:
+ - 2
+ - 3
+ - 5
+ - 7
+ - 11
+ period_discriminator_params:
+ in_channels: 1
+ out_channels: 1
+ kernel_sizes:
+ - 5
+ - 3
+ channels: 32
+ downsample_scales:
+ - 3
+ - 3
+ - 3
+ - 3
+ - 1
+ max_downsample_channels: 1024
+ bias: true
+ nonlinear_activation: LeakyReLU
+ nonlinear_activation_params:
+ negative_slope: 0.1
+ use_weight_norm: true
+ use_spectral_norm: false
+ multi_freq_disc_params:
+ hidden_channels:
+ - 256
+ - 256
+ - 256
+ - 256
+ - 256
+ domain: double
+ mel_scale: true
+ divisors:
+ - 32
+ - 16
+ - 8
+ - 4
+ - 2
+ - 1
+ - 1
+ strides:
+ - 1
+ - 2
+ - 1
+ - 2
+ - 1
+ - 2
+ - 1
+ hop_lengths:
+ - 55
+ - 110
+ - 165
+ - 220
+ - 275
+ - 330
+ generator_adv_loss_params:
+ average_by_discriminators: false
+ loss_type: mse
+ discriminator_adv_loss_params:
+ average_by_discriminators: false
+ loss_type: mse
+ feat_match_loss_params:
+ average_by_discriminators: false
+ average_by_layers: false
+ include_final_outputs: true
+ mel_loss_params:
+ fs: 44100
+ n_fft: 2048
+ hop_length: 512
+ win_length: 2048
+ window: hann
+ n_mels: 80
+ fmin: 0
+ fmax: 22050
+ log_base: null
+ lambda_adv: 1.0
+ lambda_mel: 45.0
+ lambda_feat_match: 2.0
+ lambda_dur: 0.1
+ lambda_pitch: 10.0
+ lambda_phoneme: 1.0
+ lambda_kl: 1.0
+ sampling_rate: 44100
+ cache_generator_outputs: true
+ pitch_extract: dio
+ pitch_extract_conf:
+ use_token_averaged_f0: false
+ use_log_f0: false
+ fs: 44100
+ n_fft: 2048
+ hop_length: 512
+ f0max: 800
+ f0min: 80
+ pitch_normalize: null
+ pitch_normalize_conf: {}
+ ying_extract: null
+ ying_extract_conf: {}
+ energy_extract: null
+ energy_extract_conf: {}
+ energy_normalize: null
+ energy_normalize_conf: {}
+ required:
+ - output_dir
+ - token_list
+ version: '202301'
+ distributed: false
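Reviewer note: both optimizers in this config use `lr: 0.0002` with an `exponentiallr` scheduler at `gamma: 0.998`, and training runs for `max_epoch: 100`. The sketch below shows what that decay implies, using the standard exponential rule `lr_k = lr_0 * gamma**k`. It assumes the scheduler is stepped once per epoch, which the config alone does not confirm; the function name is hypothetical.

```python
# Values copied from optim_conf / scheduler_conf / max_epoch in the config above.
BASE_LR = 0.0002
GAMMA = 0.998
MAX_EPOCH = 100


def exponential_lr(base_lr, gamma, step):
    """Hypothetical helper: learning rate after `step` scheduler steps."""
    return base_lr * gamma ** step


# Assumption: one scheduler step per epoch, so index k is the rate
# in effect after k completed epochs.
schedule = [exponential_lr(BASE_LR, GAMMA, k) for k in range(MAX_EPOCH + 1)]

for k in (0, 25, 50, 100):
    print(f"after {k:3d} steps: lr = {schedule[k]:.6e}")
```

Under that assumption the rate falls by roughly 18% over the full run, so the schedule is a gentle decay rather than an aggressive anneal.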
exp/44k/svs_train_visinger2_raw_phn_None_zh/images/discriminator_backward_time.png ADDED
exp/44k/svs_train_visinger2_raw_phn_None_zh/images/discriminator_fake_loss.png ADDED
exp/44k/svs_train_visinger2_raw_phn_None_zh/images/discriminator_forward_time.png ADDED
exp/44k/svs_train_visinger2_raw_phn_None_zh/images/discriminator_loss.png ADDED
exp/44k/svs_train_visinger2_raw_phn_None_zh/images/discriminator_optim_step_time.png ADDED
exp/44k/svs_train_visinger2_raw_phn_None_zh/images/discriminator_real_loss.png ADDED
exp/44k/svs_train_visinger2_raw_phn_None_zh/images/discriminator_train_time.png ADDED
exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_adv_loss.png ADDED
exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_backward_time.png ADDED
exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_feat_match_loss.png ADDED
exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_forward_time.png ADDED
exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_kl_loss.png ADDED
exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_loss.png ADDED
exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_mel_am_loss.png ADDED
exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_mel_ddsp_loss.png ADDED
exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_mel_loss.png ADDED
exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_optim_step_time.png ADDED
exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_phn_dur_loss.png ADDED
exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_pitch_loss.png ADDED
exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_score_dur_loss.png ADDED
exp/44k/svs_train_visinger2_raw_phn_None_zh/images/generator_train_time.png ADDED
exp/44k/svs_train_visinger2_raw_phn_None_zh/images/gpu_max_cached_mem_GB.png ADDED
exp/44k/svs_train_visinger2_raw_phn_None_zh/images/iter_time.png ADDED
exp/44k/svs_train_visinger2_raw_phn_None_zh/images/optim0_lr0.png ADDED
exp/44k/svs_train_visinger2_raw_phn_None_zh/images/optim1_lr0.png ADDED
exp/44k/svs_train_visinger2_raw_phn_None_zh/images/train_time.png ADDED
meta.yaml ADDED
@@ -0,0 +1,8 @@
+ espnet: '202304'
+ files:
+ model_file: exp/44k/svs_train_visinger2_raw_phn_None_zh/100epoch.pth
+ python: "3.9.15 (main, Nov 24 2022, 14:31:59) \n[GCC 11.2.0]"
+ timestamp: 1685077143.490357
+ torch: 2.0.0.dev20230206+cu118
+ yaml_files:
+ train_config: exp/44k/svs_train_visinger2_raw_phn_None_zh/config.yaml
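Reviewer note: `meta.yaml` ties the checkpoint to its training config and records when the model was packed. As a small sketch, the values can be checked with the standard library alone. The dictionary below is hand-copied from the diff rather than parsed from YAML, and treating `timestamp` as Unix epoch seconds is an assumption.

```python
import posixpath
from datetime import datetime, timezone

# Hand-copied from meta.yaml above (not parsed, to stay stdlib-only).
META = {
    "espnet": "202304",
    "timestamp": 1685077143.490357,
    "model_file": "exp/44k/svs_train_visinger2_raw_phn_None_zh/100epoch.pth",
    "train_config": "exp/44k/svs_train_visinger2_raw_phn_None_zh/config.yaml",
}

# Assumption: the timestamp is seconds since the Unix epoch.
packed_at = datetime.fromtimestamp(META["timestamp"], tz=timezone.utc)

# The checkpoint and its config should live in the same experiment directory.
same_dir = posixpath.dirname(META["model_file"]) == posixpath.dirname(
    META["train_config"]
)

print("packed at (UTC):", packed_at.isoformat())
print("model and config share a directory:", same_dir)
```

Note that `meta.yaml` reports `espnet: '202304'` while the training config records `version: '202301'`, so the model appears to have been packed with a newer ESPnet release than the one that wrote the config.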