Spaces:
Build error
SoftVC VITS Singing Voice Conversion
Updates
According to incomplete statistics, it seems that training with multiple speakers may lead to worsened leaking of voice timbre. It is not recommended to train models with more than 5 speakers. The current suggestion is to try to train models with only a single speaker if you want to achieve a voice timbre that is more similar to the target. Fixed the issue with unwanted staccato, improving audio quality by a decent amount.
The 2.0 version has been moved to the 2.0 branch.
Version 3.0 uses the code structure of FreeVC, which isn't compatible with older versions.
Compared to DiffSVC , diffsvc performs much better when the training data is of extremely high quality, but this repository may perform better on datasets with lower quality. Additionally, this repository is much faster in terms of inference speed compared to diffsvc.
Model Overview
A singing voice coversion (SVC) model, using the SoftVC encoder to extract features from the input audio, sent into VITS along with the F0 to replace the original input to acheive a voice conversion effect. Additionally, changing the vocoder to NSF HiFiGAN to fix the issue with unwanted staccato.
Notice
- The current branch is the 32kHz version, which requires less vram during inferencing, as well as faster inferencing speeds, and datasets for said branch take up less disk space. Thus the 32 kHz branch is recommended for use.
- If you want to train 48 kHz variant models, switch to the main branch.
Colab notebook script for dataset creation and training.
Required models
- soft vc hubert:hubert-soft-0d54a1f4.pt
- Place under
hubert
.
- Place under
- Pretrained models G_0.pth and D_0.pth
- Place under
logs/32k
. - Pretrained models are required, because from experiments, training from scratch can be rather unpredictable to say the least, and training with a pretrained model can greatly improve training speeds.
- The pretrained model includes云灏, 即霜, 辉宇·星AI, 派蒙, and 绫地宁宁, covering the common ranges of both male and female voices, and so it can be seen as a rather universal pretrained model.
- The pretrained model exludes the
optimizer speaker_embedding
section, rendering it only usable for pretraining and incapable of inferencing with.
- Place under
# For simple downloading.
# hubert
wget -P hubert/ https://github.com/bshall/hubert/releases/download/v0.1/hubert-soft-0d54a1f4.pt
# G&D pretrained models
wget -P logs/32k/ https://huggingface.co/innnky/sovits_pretrained/resolve/main/G_0.pth
wget -P logs/32k/ https://huggingface.co/innnky/sovits_pretrained/resolve/main/D_0.pth
Dataset preparation
All that is required is that the data be put under the dataset_raw
folder in the structure format provided below.
dataset_raw
├───speaker0
│ ├───xxx1-xxx1.wav
│ ├───...
│ └───Lxx-0xx8.wav
└───speaker1
├───xx2-0xxx2.wav
├───...
└───xxx7-xxx007.wav
Data pre-processing.
- Resample to 32khz
python resample.py
- Automatically sort out training set, validation set, test set, and automatically generate configuration files.
python preprocess_flist_config.py
# Notice.
# The n_speakers value in the config will be set automatically according to the amount of speakers in the dataset.
# To reserve space for additionally added speakers in the dataset, the n_speakers value will be be set to twice the actual amount.
# If you want even more space for adding more data, you can edit the n_speakers value in the config after runing this step.
# This can not be changed after training starts.
- Generate hubert and F0 features/
python preprocess_hubert_f0.py
After running the step above, the dataset
folder will contain all the pre-processed data, you can delete the dataset_raw
folder after that.
Training.
python train.py -c configs/config.json -m 32k
Inferencing.
- Edit
model_path
to your newest checkpoint. - Place the input audio under the
raw
folder. - Change
clean_names
to the output file name. - Use
trans
to edit the pitch shifting amount (semitones). - Change
spk_list
to the speaker name.