Description

These models separate reverb and delay effects from vocals; they can also remove most harmonies. Because I applied a random high cut after the reverb and delay effects in the dataset, these models' handling of high frequencies is not particularly aggressive.
You can try listening to the performance of these models here!

How to use the model?

Try it with ZFTurbo's Music-Source-Separation-Training

Models

===Note: The following models are only effective for vocals!===

1. Fused Models (I personally recommend using this model)

I used a model fusion script to fuse three models that share the same architecture. The three models and their fusion ratios are as follows:
0.5 * dereverb_echo_mbr_v2_sdr_dry_13.4843.ckpt + 0.25 * de_big_reverb_mbr_ep_362.ckpt + 0.25 * de_super_big_reverb_mbr_ep_346.ckpt
The fused model can therefore remove both small and large reverbs at once. However, I did not carefully tune the fusion ratio of each model. If any experts are willing to help me tune it carefully, I would be very grateful!

config (shared with the v2 and big reverb models): config_dereverb_echo_mbr_v2.yaml
fused_model: dereverb_echo_mbr_fused_0.5_v2_0.25_big_0.25_super.ckpt
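
The fusion step above amounts to a weighted average of the three checkpoints' parameters. The actual fusion script is not published here; this is a minimal sketch, assuming the .ckpt files are plain state dicts loadable with torch.load:

```python
def fuse_state_dicts(states, weights):
    """Weighted average of parameter dicts; weights should sum to 1.

    Works on any dicts whose values support scalar multiplication and
    addition, e.g. torch tensors loaded from checkpoints.
    """
    fused = {}
    for key in states[0]:
        fused[key] = sum(w * s[key] for s, w in zip(states, weights))
    return fused

# Usage with the real checkpoints (file names from this model card;
# torch.load/torch.save handling is an assumption):
#   import torch
#   states = [torch.load(p, map_location="cpu") for p in (
#       "dereverb_echo_mbr_v2_sdr_dry_13.4843.ckpt",
#       "de_big_reverb_mbr_ep_362.ckpt",
#       "de_super_big_reverb_mbr_ep_346.ckpt")]
#   torch.save(fuse_state_dicts(states, [0.5, 0.25, 0.25]),
#              "dereverb_echo_mbr_fused_0.5_v2_0.25_big_0.25_super.ckpt")
```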

2. Big reverb Models

There are two models for removing large reverberation: de_big_reverb_mbr_ep_362.ckpt and de_super_big_reverb_mbr_ep_346.ckpt. For most large reverbs, the de_big_reverb_mbr model is sufficient; the de_super_big_reverb_mbr model is trained for extremely large reverbs and is less commonly needed. Both models share the v2 model's configuration file, and both are finetuned from dereverb_echo_mbr_v2_sdr_dry_13.4843.ckpt.

config: config_dereverb_echo_mbr_v2.yaml
Model_de_big_reverb: de_big_reverb_mbr_ep_362.ckpt
Model_de_super_big_reverb: de_super_big_reverb_mbr_ep_346.ckpt

To better validate model performance, I added two metrics, f0_fitness and uv_fitness. They measure the F0 and voiced/unvoiced (UV) agreement between a reference and an estimated audio signal, and are only meaningful for vocals.

F0 fitness measures how similar the fundamental frequency (F0) tracks of the reference and estimated signals are, while UV fitness evaluates the agreement of voiced/unvoiced detection between the two signals. Both are computed by extracting F0 and UV information with pitch analysis and then taking the Pearson correlation between the corresponding F0 and UV sequences. F0 fitness can also be used to compare the completeness of the extracted F0 for vocal signals. Both metrics range from -1 to 1; the closer the value is to 1, the better the fit.
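
The exact metric implementation is not published here; this is a minimal sketch, assuming frame-wise F0 tracks have already been extracted (e.g. with librosa.pyin, which marks unvoiced frames as NaN):

```python
import numpy as np

def _pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    a = np.asarray(a, float) - np.mean(a)
    b = np.asarray(b, float) - np.mean(b)
    denom = np.sqrt(np.sum(a * a) * np.sum(b * b))
    return float(np.sum(a * b) / denom)

def f0_uv_fitness(f0_ref, f0_est):
    """F0 fitness: Pearson r of the two F0 tracks (unvoiced frames as 0).
    UV fitness: Pearson r of the binary voiced/unvoiced masks."""
    f0_ref = np.nan_to_num(np.asarray(f0_ref, float))  # NaN (unvoiced) -> 0
    f0_est = np.nan_to_num(np.asarray(f0_est, float))
    uv_ref = (f0_ref > 0).astype(float)
    uv_est = (f0_est > 0).astype(float)
    return _pearson(f0_ref, f0_est), _pearson(uv_ref, uv_est)
```

An estimate identical to the reference scores 1.0 on both metrics; reverb tails that smear voiced regions into unvoiced ones lower uv_fitness.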

Because I used different validation sets for these two models, their SDR values are not directly comparable. The validation results are as follows:

de_big_reverb_mbr_ep_362.ckpt
Num overlap: 2
Instr dry sdr: 14.0030 (Std: 2.9492)
Instr dry bleedless: 43.6501 (Std: 10.1362)
Instr dry fullness: 21.7776 (Std: 5.9445)
Instr dry f0_fitness: 0.8405 (Std: 0.1520)
Instr dry uv_fitness: 0.9759 (Std: 0.0162)

de_super_big_reverb_mbr_ep_346.ckpt
Num overlap: 2
Instr dry sdr: 11.3164 (Std: 2.4877)
Instr dry bleedless: 43.3989 (Std: 10.7918)
Instr dry fullness: 17.5554 (Std: 4.0178)
Instr dry f0_fitness: 0.7845 (Std: 0.1864)
Instr dry uv_fitness: 0.9662 (Std: 0.0172)

3. V2 Models

Config: config_dereverb_echo_mbr_v2.yaml
Model: dereverb_echo_mbr_v2_sdr_dry_13.4843.ckpt
Instr dry sdr: 13.4843 (Std: 4.8675)

Finetuned from: dereverb-echo_mel_band_roformer_sdr_10.0169.ckpt
Finetuned on 1000+ songs.

4. V1 Models

Configs: config_dereverb-echo_mel_band_roformer.yaml
Model: dereverb-echo_mel_band_roformer_sdr_10.0169.ckpt
Instr dry sdr: 13.1507, Instr other sdr: 6.8830, Metric avg sdr: 10.0169

Instruments: [dry, other]
Finetuned from: model_mel_band_roformer_ep_3005_sdr_11.4360.ckpt
Datasets:

  • Training datasets: 270 songs from opencpop and GTSinger
  • Validation datasets: 30 songs from my own collection
  • All random reverb and delay effects were generated by this Python script and organized into the MUSDB18 dataset format.

Thanks
