ESPnet
multilingual
audio
codec
ftshijt committed bdf70f6 (1 parent: 0cb2bc7)

Update model

Files changed (29)
  1. README.md +343 -3
  2. exp/codec_train_soundstream4_raw_fs24000/120epoch.pth +3 -0
  3. exp/codec_train_soundstream4_raw_fs24000/config.yaml +268 -0
  4. exp/codec_train_soundstream4_raw_fs24000/images/adv_loss.png +0 -0
  5. exp/codec_train_soundstream4_raw_fs24000/images/codec_commit_loss.png +0 -0
  6. exp/codec_train_soundstream4_raw_fs24000/images/codec_loss.png +0 -0
  7. exp/codec_train_soundstream4_raw_fs24000/images/codec_quantization_loss.png +0 -0
  8. exp/codec_train_soundstream4_raw_fs24000/images/discriminator_backward_time.png +0 -0
  9. exp/codec_train_soundstream4_raw_fs24000/images/discriminator_forward_time.png +0 -0
  10. exp/codec_train_soundstream4_raw_fs24000/images/discriminator_loss.png +0 -0
  11. exp/codec_train_soundstream4_raw_fs24000/images/discriminator_optim_step_time.png +0 -0
  12. exp/codec_train_soundstream4_raw_fs24000/images/discriminator_train_time.png +0 -0
  13. exp/codec_train_soundstream4_raw_fs24000/images/fake_loss.png +0 -0
  14. exp/codec_train_soundstream4_raw_fs24000/images/feat_match_loss.png +0 -0
  15. exp/codec_train_soundstream4_raw_fs24000/images/generator_backward_time.png +0 -0
  16. exp/codec_train_soundstream4_raw_fs24000/images/generator_forward_time.png +0 -0
  17. exp/codec_train_soundstream4_raw_fs24000/images/generator_optim_step_time.png +0 -0
  18. exp/codec_train_soundstream4_raw_fs24000/images/generator_train_time.png +0 -0
  19. exp/codec_train_soundstream4_raw_fs24000/images/gpu_max_cached_mem_GB.png +0 -0
  20. exp/codec_train_soundstream4_raw_fs24000/images/iter_time.png +0 -0
  21. exp/codec_train_soundstream4_raw_fs24000/images/loss.png +0 -0
  22. exp/codec_train_soundstream4_raw_fs24000/images/mel_loss.png +0 -0
  23. exp/codec_train_soundstream4_raw_fs24000/images/mel_loss_real.png +0 -0
  24. exp/codec_train_soundstream4_raw_fs24000/images/optim0_lr0.png +0 -0
  25. exp/codec_train_soundstream4_raw_fs24000/images/optim1_lr0.png +0 -0
  26. exp/codec_train_soundstream4_raw_fs24000/images/real_loss.png +0 -0
  27. exp/codec_train_soundstream4_raw_fs24000/images/reconstruct_loss.png +0 -0
  28. exp/codec_train_soundstream4_raw_fs24000/images/train_time.png +0 -0
  29. meta.yaml +8 -0
README.md CHANGED
@@ -1,3 +1,343 @@
- ---
- license: apache-2.0
- ---
+ ---
+ tags:
+ - espnet
+ - audio
+ - codec
+ language: multilingual
+ datasets:
+ - libritts
+ license: cc-by-4.0
+ ---
+
+ ## ESPnet2 Codec model
+
+ ### `espnet/libritts_soundstream24k`
+
+ This model was trained by ftshijt using libritts recipe in [espnet](https://github.com/espnet/espnet/).
+
+ ### Demo: How to use in ESPnet2
+
+ Follow the [ESPnet installation instructions](https://espnet.github.io/espnet/installation.html)
+ if you haven't done that already.
+
+ ```bash
+ cd espnet
+ git checkout 734f1235b3dd3c444822b6337fbb2e417e75e321
+ pip install -e .
+ cd egs2/libritts/codec1
+ ./run.sh --skip_data_prep false --skip_train true --download_model espnet/libritts_soundstream24k
+ ```
+
+
+
+ ## Codec config
+
+ <details><summary>expand</summary>
+
+ ```
+ config: conf/train_soundstream4.yaml
+ print_config: false
+ log_level: INFO
+ drop_last_iter: false
+ dry_run: false
+ iterator_type: chunk
+ valid_iterator_type: null
+ output_dir: exp/codec_train_soundstream4_raw_fs24000
+ ngpu: 1
+ seed: 777
+ num_workers: 1
+ num_att_plot: 0
+ dist_backend: nccl
+ dist_init_method: env://
+ dist_world_size: null
+ dist_rank: null
+ local_rank: 0
+ dist_master_addr: null
+ dist_master_port: null
+ dist_launcher: null
+ multiprocessing_distributed: false
+ unused_parameters: true
+ sharded_ddp: false
+ cudnn_enabled: true
+ cudnn_benchmark: false
+ cudnn_deterministic: false
+ collect_stats: false
+ write_collected_feats: false
+ max_epoch: 120
+ patience: null
+ val_scheduler_criterion:
+ - valid
+ - loss
+ early_stopping_criterion:
+ - valid
+ - loss
+ - min
+ best_model_criterion:
+ -   - valid
+     - mel_loss
+     - min
+ -   - train
+     - mel_loss
+     - min
+ -   - train
+     - total_count
+     - max
+ keep_nbest_models: 5
+ nbest_averaging_interval: 0
+ grad_clip: -1
+ grad_clip_type: 2.0
+ grad_noise: false
+ accum_grad: 1
+ no_forward_run: false
+ resume: true
+ train_dtype: float32
+ use_amp: false
+ log_interval: 50
+ use_matplotlib: true
+ use_tensorboard: true
+ create_graph_in_tensorboard: false
+ use_wandb: false
+ wandb_project: null
+ wandb_id: null
+ wandb_entity: null
+ wandb_name: null
+ wandb_model_log_interval: -1
+ detect_anomaly: false
+ use_adapter: false
+ adapter: lora
+ save_strategy: all
+ adapter_conf: {}
+ pretrain_path: null
+ init_param: []
+ ignore_init_mismatch: false
+ freeze_param: []
+ num_iters_per_epoch: 5000
+ batch_size: 8
+ valid_batch_size: null
+ batch_bins: 1000000
+ valid_batch_bins: null
+ train_shape_file:
+ - exp/codec_stats_raw/train/audio_shape
+ valid_shape_file:
+ - exp/codec_stats_raw/valid/audio_shape
+ batch_type: unsorted
+ valid_batch_type: null
+ fold_length:
+ - 256000
+ sort_in_batch: descending
+ shuffle_within_batch: false
+ sort_batch: descending
+ multiple_iterator: false
+ chunk_length: 24000
+ chunk_shift_ratio: 0.5
+ num_cache_chunks: 64
+ chunk_excluded_key_prefixes: []
+ chunk_default_fs: null
+ train_data_path_and_name_and_type:
+ -   - dump/raw/train-clean-460/wav.scp
+     - audio
+     - sound
+ valid_data_path_and_name_and_type:
+ -   - dump/raw/dev-clean/wav.scp
+     - audio
+     - sound
+ allow_variable_data_keys: false
+ max_cache_size: 0.0
+ max_cache_fd: 32
+ allow_multi_rates: false
+ valid_max_cache_size: null
+ exclude_weight_decay: false
+ exclude_weight_decay_conf: {}
+ optim: adam
+ optim_conf:
+     lr: 0.0002
+     betas:
+     - 0.5
+     - 0.9
+     eps: 1.0e-09
+     weight_decay: 0.0
+ scheduler: exponentiallr
+ scheduler_conf:
+     gamma: 0.999875
+ optim2: adam
+ optim2_conf:
+     lr: 0.0002
+     betas:
+     - 0.5
+     - 0.9
+     eps: 1.0e-09
+     weight_decay: 0.0
+ scheduler2: exponentiallr
+ scheduler2_conf:
+     gamma: 0.999875
+ generator_first: true
+ model_conf: {}
+ use_preprocessor: true
+ codec: soundstream
+ codec_conf:
+     sampling_rate: 24000
+     generator_params:
+         hidden_dim: 512
+         encdec_channels: 1
+         encdec_n_filters: 32
+         encdec_n_residual_layers: 3
+         encdec_ratios:
+         - 8
+         - 5
+         - 4
+         - 2
+         encdec_activation: ELU
+         encdec_activation_params:
+             alpha: 1.0
+         encdec_norm: weight_norm
+         encdec_kernel_size: 7
+         encdec_residual_kernel_size: 7
+         encdec_last_kernel_size: 7
+         encdec_dilation_base: 2
+         encdec_causal: false
+         encdec_pad_mode: reflect
+         encdec_true_skip: false
+         encdec_compress: 2
+         encdec_lstm: 2
+         decoder_trim_right_ratio: 1.0
+         decoder_final_activation: null
+         decoder_final_activation_params: null
+         quantizer_n_q: 32
+         quantizer_bins: 1024
+         quantizer_decay: 0.99
+         quantizer_kmeans_init: true
+         quantizer_kmeans_iters: 50
+         quantizer_threshold_ema_dead_code: 2
+         quantizer_target_bandwidth:
+         - 2
+         - 4
+         - 8
+         - 16
+         - 32
+         sample_rate: 24000
+     discriminator_params:
+         scales: 3
+         scale_downsample_pooling: AvgPool1d
+         scale_downsample_pooling_params:
+             kernel_size: 4
+             stride: 2
+             padding: 2
+         scale_discriminator_params:
+             in_channels: 1
+             out_channels: 1
+             kernel_sizes:
+             - 15
+             - 41
+             - 5
+             - 3
+             channels: 128
+             max_downsample_channels: 1024
+             max_groups: 16
+             bias: true
+             downsample_scales:
+             - 2
+             - 2
+             - 4
+             - 4
+             - 1
+             nonlinear_activation: LeakyReLU
+             nonlinear_activation_params:
+                 negative_slope: 0.1
+         scale_follow_official_norm: false
+         complexstft_discriminator_params:
+             in_channels: 1
+             channels: 32
+             strides:
+             -   - 1
+                 - 2
+             -   - 2
+                 - 2
+             -   - 1
+                 - 2
+             -   - 2
+                 - 2
+             -   - 1
+                 - 2
+             -   - 2
+                 - 2
+             chan_mults:
+             - 1
+             - 2
+             - 4
+             - 4
+             - 8
+             - 8
+             n_fft: 1024
+             hop_length: 256
+             win_length: 1024
+             stft_normalized: false
+     generator_adv_loss_params:
+         average_by_discriminators: false
+         loss_type: mse
+     discriminator_adv_loss_params:
+         average_by_discriminators: false
+         loss_type: mse
+     use_feat_match_loss: true
+     feat_match_loss_params:
+         average_by_discriminators: false
+         average_by_layers: false
+         include_final_outputs: true
+     use_mel_loss: true
+     mel_loss_params:
+         range_start: 6
+         range_end: 11
+         window: hann
+         n_mels: 80
+         fmin: 0
+         fmax: null
+         log_base: null
+         fs: 24000
+     lambda_quantization: 0.0
+     lambda_commit: 1.0
+     lambda_reconstruct: 1.0
+     lambda_adv: 1.0
+     lambda_mel: 45.0
+     lambda_feat_match: 2.0
+     cache_generator_outputs: true
+ required:
+ - output_dir
+ version: '202402'
+ distributed: false
+ ```
+
+ </details>
+
+
+
+ ### Citing ESPnet
+
+ ```BibTex
+ @inproceedings{watanabe2018espnet,
+ author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
+ title={{ESPnet}: End-to-End Speech Processing Toolkit},
+ year={2018},
+ booktitle={Proceedings of Interspeech},
+ pages={2207--2211},
+ doi={10.21437/Interspeech.2018-1456},
+ url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
+ }
+
+
+
+
+
+
+ ```
+
+ or arXiv:
+
+ ```bibtex
+ @misc{watanabe2018espnet,
+ title={ESPnet: End-to-End Speech Processing Toolkit},
+ author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
+ year={2018},
+ eprint={1804.00015},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+ }
+ ```
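Reviewer note: the codec settings in the README config imply a few derived quantities that are easy to check by hand. The Python sketch below computes them from `sampling_rate`, `encdec_ratios`, `quantizer_bins`, `quantizer_n_q`, and `quantizer_target_bandwidth`. It assumes an EnCodec-style accounting in which each codebook costs `frame_rate * log2(bins)` bits per second and the target bandwidths are given in kbps; this commit states neither. `quantizers_for_bandwidth` is a hypothetical helper for illustration, not an ESPnet function.

```python
import math

# Values copied from the codec config in the README diff above.
sampling_rate = 24000
encdec_ratios = [8, 5, 4, 2]
quantizer_n_q = 32
quantizer_bins = 1024
target_bandwidths_kbps = [2, 4, 8, 16, 32]

# Total downsampling factor of the encoder: samples per latent frame.
hop_length = math.prod(encdec_ratios)
# Latent frames produced per second of audio.
frame_rate = sampling_rate / hop_length
# Bits needed to index one entry of a single codebook.
bits_per_codebook = math.log2(quantizer_bins)
# Bitrate contributed by one codebook (assumed EnCodec-style accounting).
kbps_per_codebook = frame_rate * bits_per_codebook / 1000


def quantizers_for_bandwidth(bandwidth_kbps: float) -> int:
    """Hypothetical helper: codebooks that fit a bandwidth budget.

    The cap at quantizer_n_q is an assumption of this sketch.
    """
    n_q = math.floor(bandwidth_kbps / kbps_per_codebook)
    return max(1, min(quantizer_n_q, n_q))


print(f"hop length: {hop_length} samples, frame rate: {frame_rate} Hz")
print(f"per-codebook bitrate: {kbps_per_codebook} kbps")
for bw in target_bandwidths_kbps:
    print(f"{bw} kbps -> {quantizers_for_bandwidth(bw)} codebooks")
```

Under these assumptions, the full stack of 32 codebooks corresponds to 24 kbps, so a 32 kbps target would simply use every codebook.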
exp/codec_train_soundstream4_raw_fs24000/120epoch.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9dd8cb738fcd77cce6a067ac373558535365112e827c7825b812dcd14de370c2
+ size 354547787
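Reviewer note: the three lines above are a Git LFS pointer, not the checkpoint itself. A downloaded `120epoch.pth` can be checked against the pointer's `oid` and `size`. The Python sketch below shows one way to do that with the standard library; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the local file path is whatever the reader downloaded.

```python
import hashlib

# Pointer text copied from the diff above.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:9dd8cb738fcd77cce6a067ac373558535365112e827c7825b812dcd14de370c2
size 354547787
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a pointer file into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Hash the file in chunks and compare size and digest with the pointer."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
```

Calling `matches_pointer("120epoch.pth", pointer)` on a local copy should return `True` only if the download is complete and unmodified.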
exp/codec_train_soundstream4_raw_fs24000/config.yaml ADDED
@@ -0,0 +1,268 @@
+ config: conf/train_soundstream4.yaml
+ print_config: false
+ log_level: INFO
+ drop_last_iter: false
+ dry_run: false
+ iterator_type: chunk
+ valid_iterator_type: null
+ output_dir: exp/codec_train_soundstream4_raw_fs24000
+ ngpu: 1
+ seed: 777
+ num_workers: 1
+ num_att_plot: 0
+ dist_backend: nccl
+ dist_init_method: env://
+ dist_world_size: null
+ dist_rank: null
+ local_rank: 0
+ dist_master_addr: null
+ dist_master_port: null
+ dist_launcher: null
+ multiprocessing_distributed: false
+ unused_parameters: true
+ sharded_ddp: false
+ cudnn_enabled: true
+ cudnn_benchmark: false
+ cudnn_deterministic: false
+ collect_stats: false
+ write_collected_feats: false
+ max_epoch: 120
+ patience: null
+ val_scheduler_criterion:
+ - valid
+ - loss
+ early_stopping_criterion:
+ - valid
+ - loss
+ - min
+ best_model_criterion:
+ -   - valid
+     - mel_loss
+     - min
+ -   - train
+     - mel_loss
+     - min
+ -   - train
+     - total_count
+     - max
+ keep_nbest_models: 5
+ nbest_averaging_interval: 0
+ grad_clip: -1
+ grad_clip_type: 2.0
+ grad_noise: false
+ accum_grad: 1
+ no_forward_run: false
+ resume: true
+ train_dtype: float32
+ use_amp: false
+ log_interval: 50
+ use_matplotlib: true
+ use_tensorboard: true
+ create_graph_in_tensorboard: false
+ use_wandb: false
+ wandb_project: null
+ wandb_id: null
+ wandb_entity: null
+ wandb_name: null
+ wandb_model_log_interval: -1
+ detect_anomaly: false
+ use_adapter: false
+ adapter: lora
+ save_strategy: all
+ adapter_conf: {}
+ pretrain_path: null
+ init_param: []
+ ignore_init_mismatch: false
+ freeze_param: []
+ num_iters_per_epoch: 5000
+ batch_size: 8
+ valid_batch_size: null
+ batch_bins: 1000000
+ valid_batch_bins: null
+ train_shape_file:
+ - exp/codec_stats_raw/train/audio_shape
+ valid_shape_file:
+ - exp/codec_stats_raw/valid/audio_shape
+ batch_type: unsorted
+ valid_batch_type: null
+ fold_length:
+ - 256000
+ sort_in_batch: descending
+ shuffle_within_batch: false
+ sort_batch: descending
+ multiple_iterator: false
+ chunk_length: 24000
+ chunk_shift_ratio: 0.5
+ num_cache_chunks: 64
+ chunk_excluded_key_prefixes: []
+ chunk_default_fs: null
+ train_data_path_and_name_and_type:
+ -   - dump/raw/train-clean-460/wav.scp
+     - audio
+     - sound
+ valid_data_path_and_name_and_type:
+ -   - dump/raw/dev-clean/wav.scp
+     - audio
+     - sound
+ allow_variable_data_keys: false
+ max_cache_size: 0.0
+ max_cache_fd: 32
+ allow_multi_rates: false
+ valid_max_cache_size: null
+ exclude_weight_decay: false
+ exclude_weight_decay_conf: {}
+ optim: adam
+ optim_conf:
+     lr: 0.0002
+     betas:
+     - 0.5
+     - 0.9
+     eps: 1.0e-09
+     weight_decay: 0.0
+ scheduler: exponentiallr
+ scheduler_conf:
+     gamma: 0.999875
+ optim2: adam
+ optim2_conf:
+     lr: 0.0002
+     betas:
+     - 0.5
+     - 0.9
+     eps: 1.0e-09
+     weight_decay: 0.0
+ scheduler2: exponentiallr
+ scheduler2_conf:
+     gamma: 0.999875
+ generator_first: true
+ model_conf: {}
+ use_preprocessor: true
+ codec: soundstream
+ codec_conf:
+     sampling_rate: 24000
+     generator_params:
+         hidden_dim: 512
+         encdec_channels: 1
+         encdec_n_filters: 32
+         encdec_n_residual_layers: 3
+         encdec_ratios:
+         - 8
+         - 5
+         - 4
+         - 2
+         encdec_activation: ELU
+         encdec_activation_params:
+             alpha: 1.0
+         encdec_norm: weight_norm
+         encdec_kernel_size: 7
+         encdec_residual_kernel_size: 7
+         encdec_last_kernel_size: 7
+         encdec_dilation_base: 2
+         encdec_causal: false
+         encdec_pad_mode: reflect
+         encdec_true_skip: false
+         encdec_compress: 2
+         encdec_lstm: 2
+         decoder_trim_right_ratio: 1.0
+         decoder_final_activation: null
+         decoder_final_activation_params: null
+         quantizer_n_q: 32
+         quantizer_bins: 1024
+         quantizer_decay: 0.99
+         quantizer_kmeans_init: true
+         quantizer_kmeans_iters: 50
+         quantizer_threshold_ema_dead_code: 2
+         quantizer_target_bandwidth:
+         - 2
+         - 4
+         - 8
+         - 16
+         - 32
+         sample_rate: 24000
+     discriminator_params:
+         scales: 3
+         scale_downsample_pooling: AvgPool1d
+         scale_downsample_pooling_params:
+             kernel_size: 4
+             stride: 2
+             padding: 2
+         scale_discriminator_params:
+             in_channels: 1
+             out_channels: 1
+             kernel_sizes:
+             - 15
+             - 41
+             - 5
+             - 3
+             channels: 128
+             max_downsample_channels: 1024
+             max_groups: 16
+             bias: true
+             downsample_scales:
+             - 2
+             - 2
+             - 4
+             - 4
+             - 1
+             nonlinear_activation: LeakyReLU
+             nonlinear_activation_params:
+                 negative_slope: 0.1
+         scale_follow_official_norm: false
+         complexstft_discriminator_params:
+             in_channels: 1
+             channels: 32
+             strides:
+             -   - 1
+                 - 2
+             -   - 2
+                 - 2
+             -   - 1
+                 - 2
+             -   - 2
+                 - 2
+             -   - 1
+                 - 2
+             -   - 2
+                 - 2
+             chan_mults:
+             - 1
+             - 2
+             - 4
+             - 4
+             - 8
+             - 8
+             n_fft: 1024
+             hop_length: 256
+             win_length: 1024
+             stft_normalized: false
+     generator_adv_loss_params:
+         average_by_discriminators: false
+         loss_type: mse
+     discriminator_adv_loss_params:
+         average_by_discriminators: false
+         loss_type: mse
+     use_feat_match_loss: true
+     feat_match_loss_params:
+         average_by_discriminators: false
+         average_by_layers: false
+         include_final_outputs: true
+     use_mel_loss: true
+     mel_loss_params:
+         range_start: 6
+         range_end: 11
+         window: hann
+         n_mels: 80
+         fmin: 0
+         fmax: null
+         log_base: null
+         fs: 24000
+     lambda_quantization: 0.0
+     lambda_commit: 1.0
+     lambda_reconstruct: 1.0
+     lambda_adv: 1.0
+     lambda_mel: 45.0
+     lambda_feat_match: 2.0
+     cache_generator_outputs: true
+ required:
+ - output_dir
+ version: '202402'
+ distributed: false
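Reviewer note: both optimizers in this config use `lr: 0.0002` with an `exponentiallr` scheduler and `gamma: 0.999875`. The Python sketch below evaluates the resulting learning rate as `lr * gamma ** steps`. Whether the trainer steps the scheduler once per epoch or once per iteration is not stated in this diff, so the step count is left as a parameter; `lr_after` is a hypothetical helper, not an ESPnet function.

```python
# Values copied from config.yaml above.
base_lr = 0.0002
gamma = 0.999875
max_epoch = 120
num_iters_per_epoch = 5000

# Upper bound on training iterations implied by the config.
total_iters = max_epoch * num_iters_per_epoch


def lr_after(steps: int) -> float:
    """Learning rate after `steps` scheduler steps of exponential decay."""
    return base_lr * gamma ** steps


# Compare the two possible readings of "one scheduler step".
print(f"total iterations: {total_iters}")
print(f"lr if stepped per epoch, after {max_epoch} steps: {lr_after(max_epoch):.3e}")
print(f"lr if stepped per iteration, after {total_iters} steps: {lr_after(total_iters):.3e}")
```

With per-epoch stepping the decay over 120 epochs is only about 1.5%, whereas per-iteration stepping would drive the rate to effectively zero, which is why the stepping granularity matters when reproducing this run.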
exp/codec_train_soundstream4_raw_fs24000/images/adv_loss.png ADDED
exp/codec_train_soundstream4_raw_fs24000/images/codec_commit_loss.png ADDED
exp/codec_train_soundstream4_raw_fs24000/images/codec_loss.png ADDED
exp/codec_train_soundstream4_raw_fs24000/images/codec_quantization_loss.png ADDED
exp/codec_train_soundstream4_raw_fs24000/images/discriminator_backward_time.png ADDED
exp/codec_train_soundstream4_raw_fs24000/images/discriminator_forward_time.png ADDED
exp/codec_train_soundstream4_raw_fs24000/images/discriminator_loss.png ADDED
exp/codec_train_soundstream4_raw_fs24000/images/discriminator_optim_step_time.png ADDED
exp/codec_train_soundstream4_raw_fs24000/images/discriminator_train_time.png ADDED
exp/codec_train_soundstream4_raw_fs24000/images/fake_loss.png ADDED
exp/codec_train_soundstream4_raw_fs24000/images/feat_match_loss.png ADDED
exp/codec_train_soundstream4_raw_fs24000/images/generator_backward_time.png ADDED
exp/codec_train_soundstream4_raw_fs24000/images/generator_forward_time.png ADDED
exp/codec_train_soundstream4_raw_fs24000/images/generator_optim_step_time.png ADDED
exp/codec_train_soundstream4_raw_fs24000/images/generator_train_time.png ADDED
exp/codec_train_soundstream4_raw_fs24000/images/gpu_max_cached_mem_GB.png ADDED
exp/codec_train_soundstream4_raw_fs24000/images/iter_time.png ADDED
exp/codec_train_soundstream4_raw_fs24000/images/loss.png ADDED
exp/codec_train_soundstream4_raw_fs24000/images/mel_loss.png ADDED
exp/codec_train_soundstream4_raw_fs24000/images/mel_loss_real.png ADDED
exp/codec_train_soundstream4_raw_fs24000/images/optim0_lr0.png ADDED
exp/codec_train_soundstream4_raw_fs24000/images/optim1_lr0.png ADDED
exp/codec_train_soundstream4_raw_fs24000/images/real_loss.png ADDED
exp/codec_train_soundstream4_raw_fs24000/images/reconstruct_loss.png ADDED
exp/codec_train_soundstream4_raw_fs24000/images/train_time.png ADDED
meta.yaml ADDED
@@ -0,0 +1,8 @@
+ espnet: '202402'
+ files:
+   model_file: exp/codec_train_soundstream4_raw_fs24000/120epoch.pth
+ python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
+ timestamp: 1718868523.56438
+ torch: 2.2.2+cu118
+ yaml_files:
+   train_config: exp/codec_train_soundstream4_raw_fs24000/config.yaml
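Reviewer note: the `timestamp` field in `meta.yaml` looks like a Unix epoch value in seconds. Assuming that interpretation (the file does not say so), the Python sketch below converts it to a UTC date, which gives a rough idea of when the model was packaged.

```python
from datetime import datetime, timezone

# `timestamp` copied from meta.yaml; assumed to be seconds since the Unix epoch.
timestamp = 1718868523.56438

# Convert to a timezone-aware UTC datetime.
packed_at = datetime.fromtimestamp(timestamp, tz=timezone.utc)
print(packed_at.isoformat())
```

Under that assumption the value falls on 2024-06-20 (UTC), a few months after the Python build date recorded in the same file.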