# Real-Time Voice Cloning v2
### What is this?
It is an improved version of [Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning). Our emotion voice cloning implementation is [here](https://github.com/liuhaozhe6788/voice-cloning-collab/tree/add_emotion)!
## Installation
1. Install [ffmpeg](https://ffmpeg.org/download.html#get-packages). This is necessary for reading audio files.
2. Create a new conda environment with
```
conda create -n rtvc python=3.7.13
```
3. Install [PyTorch](https://download.pytorch.org/whl/torch_stable.html). Pick the proposed CUDA version if you have a GPU, otherwise pick CPU. A quick way to verify the install is shown after this list.
My torch version: `torch=1.9.1+cu111`
`torchvision=0.10.1+cu111`
4. Install the remaining requirements with
```
pip install -r requirements.txt
```
5. Install the spaCy model en_core_web_sm with
`python -m spacy download en_core_web_sm`
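
To verify the PyTorch install from step 3, you can run this minimal check; it only assumes a standard PyTorch build and no project code:
```python
# Optional sanity check for the PyTorch install.
import torch

print(torch.__version__)          # e.g. 1.9.1+cu111
print(torch.cuda.is_available())  # True if the CUDA build found a usable GPU
```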
## Training
### Encoder
**Download dataset:**
1. [LibriSpeech](https://www.openslr.org/12): train-other-500 for training, dev-other for validation
(extract as <datasets_root>/LibriSpeech/<dataset_name>)
2. [VoxCeleb1](https://mm.kaist.ac.kr/datasets/voxceleb/): Dev A - D for training, Test for validation, as well as the metadata file `vox1_meta.csv` (extract as <datasets_root>/VoxCeleb1/ and <datasets_root>/VoxCeleb1/vox1_meta.csv)
3. [VoxCeleb2](https://mm.kaist.ac.kr/datasets/voxceleb/): Dev A - H for training, Test for validation
(extract as <datasets_root>/VoxCeleb2/; the resulting layout is sketched below)
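
Putting the three datasets together, <datasets_root> should look roughly like this (the subset names are the ones listed above):
```
<datasets_root>/
├── LibriSpeech/
│   ├── train-other-500/
│   └── dev-other/
├── VoxCeleb1/
│   ├── vox1_meta.csv
│   └── ...          # extracted Dev A - D and Test audio
└── VoxCeleb2/
    └── ...          # extracted Dev A - H and Test audio
```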
**Encoder preprocessing:**
```
python encoder_preprocess.py <datasets_root>
```
**Encoder training:**
It is recommended to start a visdom server to monitor training with
```
visdom
```
then start training with
```
python encoder_train.py <model_id> <datasets_root>/SV2TTS/encoder
```
### Synthesizer
**Download dataset:**
1. [LibriSpeech](https://www.openslr.org/12): train-clean-100 and train-clean-360 for training, dev-clean for validation (extract as <datasets_root>/LibriSpeech/<dataset_name>)
2. [LibriSpeech alignments](https://drive.google.com/file/d/1WYfgr31T-PPwMcxuAq09XZfHQO5Mw8fE/view?usp=sharing): merge the directory structure with the LibriSpeech datasets you have downloaded (do not include alignments for datasets you have not downloaded, or the scripts will assume those datasets are present)
3. [VCTK](https://datashare.ed.ac.uk/handle/10283/3443): used for training and validation
**Synthesizer preprocessing:**
```
python synthesizer_preprocess_audio.py <datasets_root>
python synthesizer_preprocess_embeds.py <datasets_root>/SV2TTS/synthesizer
```
**Synthesizer training:**
```
python synthesizer_train.py <model_id> <datasets_root>/SV2TTS/synthesizer --use_tb
```
If you want to monitor the training progress, run
```
tensorboard --logdir log/vc/synthesizer --host localhost --port 8088
```
### Vocoder
**Download dataset:**
The same as the synthesizer. You can skip this step if you have already downloaded the synthesizer training datasets.
**Vocoder preprocessing:**
```
python vocoder_preprocess.py <datasets_root>
```
**Vocoder training:**
```
python vocoder_train.py <model_id> <datasets_root> --use_tb
```
If you want to monitor the training progress, run
```
tensorboard --logdir log/vc/vocoder --host localhost --port 8080
```
**Note:**
Training checkpoints are saved periodically, so you can rerun the training command and resume from the latest checkpoint if one exists.
## Inference
**Terminal:**
```
python demo_cli.py
```
First enter the number of reference audio files, then the audio file paths, and finally the text to synthesize. The attention alignments and mel spectrograms are stored in syn_results/. The generated audio is stored in out_audios/.
**GUI demo:**
```
python demo_toolbox.py
```
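
You can also run cloning programmatically. The sketch below assumes this fork keeps the upstream Real-Time-Voice-Cloning inference API (`encoder.embed_utterance`, `Synthesizer.synthesize_spectrograms`, `vocoder.infer_waveform`) and that the pretrained weights sit under saved_models/default with the upstream file names; adjust the model paths, reference file, and text to your setup:
```python
# Minimal programmatic cloning sketch; assumes the upstream
# Real-Time-Voice-Cloning inference API and default model paths.
from pathlib import Path

import soundfile as sf

from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

# Load the three pretrained models (paths are an assumption, see above).
encoder.load_model(Path("saved_models/default/encoder.pt"))
synthesizer = Synthesizer(Path("saved_models/default/synthesizer.pt"))
vocoder.load_model(Path("saved_models/default/vocoder.pt"))

# Compute the speaker embedding from a reference recording.
wav = encoder.preprocess_wav(Path("reference.wav"))
embed = encoder.embed_utterance(wav)

# Synthesize a mel spectrogram conditioned on the embedding, then vocode it.
specs = synthesizer.synthesize_spectrograms(["Hello, this is a cloned voice."], [embed])
generated_wav = vocoder.infer_waveform(specs[0])

# Write the result to disk at the synthesizer's sample rate.
sf.write("cloned.wav", generated_wav, synthesizer.sample_rate)
```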
## Dimension reduction visualization
**Download dataset:**
[LibriSpeech](https://www.openslr.org/12): test-other
(extract as <datasets_root>/LibriSpeech/<dataset_name>)
**Preprocessing:**
```
python encoder_test_preprocess.py <datasets_root>
```
**Visualization:**
```
python encoder_test_visualization.py <model_id> <datasets_root>
```
The results are saved in dim_reduction_results/.
## Pretrained models
You can download the pretrained models from [this link](https://drive.google.com/drive/folders/11DFU_JBGet_HEwUoPZGDfe-fDZ42eqiG) and extract them as saved_models/default.
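
After extraction, the scripts expect a layout along these lines; the file names below follow the upstream Real-Time-Voice-Cloning convention and are an assumption, so keep whatever names the downloaded archive uses if they differ:
```
saved_models/
└── default/
    ├── encoder.pt      # speaker encoder weights
    ├── synthesizer.pt  # synthesizer weights
    └── vocoder.pt      # vocoder weights
```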
## Demo results
The audio results are [here](https://liuhaozhe6788.github.io/voice-cloning-collab/index.html).