# VALL-E
## Introduction
This is an unofficial PyTorch implementation of VALL-E, a zero-shot voice cloning model via neural codec language modeling ([paper link](https://arxiv.org/abs/2301.02111)).
If trained properly, this model could match the performance specified in the original paper.
## Change notes
This is a refined version of the first VALL-E implementation in Amphion: we have switched the underlying implementation to Llama,
which provides better model performance, faster training speed, and more readable code.
It can be a great resource if you want to learn about speech language models and their implementation.
## Installation requirement
Set up your environment as in the Amphion README (you will need a conda environment, and we recommend using Linux). A GPU is recommended if you want to train this model yourself.
For inference with our pretrained models, you can generate samples even without a GPU.
To ensure your transformers library can run the code, we recommend additionally running:
```bash
pip install -U transformers==4.41.2
```
## Inferencing pretrained VALL-E models
### Download pretrained weights
You need to download our pretrained weights from Hugging Face.
Script to download the AR and NAR model checkpoints:
```bash
huggingface-cli download amphion/valle valle_ar_mls_196000.bin valle_nar_mls_164000.bin --local-dir ckpts
```
Script to download codec model (SpeechTokenizer) checkpoint:
```bash
mkdir -p ckpts/speechtokenizer_hubert_avg && huggingface-cli download amphion/valle SpeechTokenizer.pt config.json --local-dir ckpts/speechtokenizer_hubert_avg
```
If you cannot access Hugging Face, consider downloading through the Hugging Face mirror:
```bash
HF_ENDPOINT=https://hf-mirror.com huggingface-cli download amphion/valle valle_ar_mls_196000.bin valle_nar_mls_164000.bin --local-dir ckpts
```
```bash
mkdir -p ckpts/speechtokenizer_hubert_avg && HF_ENDPOINT=https://hf-mirror.com huggingface-cli download amphion/valle SpeechTokenizer.pt config.json --local-dir ckpts/speechtokenizer_hubert_avg
```
### Inference in IPython notebook
We provide a pretrained VALL-E model trained on 45k hours of the MLS dataset, which contains 10-20s English speech clips.
The `demo.ipynb` file provides a working example of inferencing our pretrained VALL-E model. Give it a try!
## Examining the model files
Examining the model files of VALL-E is a great way to learn how it works.
We provide examples that allow you to overfit a single batch (so no dataset download is required).
The AR model is essentially a causal language model that "continues" a speech prompt. The NAR model is a modification of the AR model that allows bidirectional attention.
The files `valle_ar.py` and `valle_nar.py` in the `models/tts/valle_v2` folder are the model files; they can be run directly via `python -m models.tts.valle_v2.valle_ar` (or `python -m models.tts.valle_v2.valle_nar`).
This will invoke a test that overfits the model to a single example.
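To make the AR/NAR distinction concrete, here is a small illustrative sketch of the two attention patterns in plain PyTorch (this is not the repository's code, just the standard causal-vs-bidirectional mask idea):

```python
import torch

seq_len = 4

# AR model: causal mask, each position attends only to itself and the past
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# NAR model: bidirectional attention, every position attends to every position
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

# Under the causal mask, position 0 cannot see position 1;
# under the bidirectional mask, it can.
print(causal_mask)
print(bidirectional_mask)
```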
## Training VALL-E from scratch
### Preparing LibriTTS or LibriTTS-R dataset files
We have tested our training script on LibriTTS and LibriTTS-R.
You could download LibriTTS-R at [this link](https://www.openslr.org/141/) and LibriTTS at [this link](https://www.openslr.org/60).
The "train-clean-360" split is currently used by our configuration.
You can test `dataset.py` by running `python -m models.tts.valle_v2.libritts_dataset`.
For your reference, our unzipped dataset files have a file structure like this:
```
/path/to/LibriTTS_R
├── BOOKS.txt
├── CHAPTERS.txt
├── dev-clean
│   └── 2412
│       └── 153947
│           ├── 2412_153947_000014_000000.normalized.txt
│           ├── 2412_153947_000014_000000.original.txt
│           ├── 2412_153947_000014_000000.wav
│           ├── 2412_153947_000017_000001.normalized.txt
│           ├── 2412_153947_000017_000001.original.txt
│           ├── 2412_153947_000017_000001.wav
│           └── 2412_153947_000017_000005.normalized.txt
└── train-clean-360
    └── 422
        └── 122949
            ├── 422_122949_000009_000007.normalized.txt
            ├── 422_122949_000009_000007.original.txt
            ├── 422_122949_000009_000007.wav
            ├── 422_122949_000013_000010.normalized.txt
            ├── 422_122949_000013_000010.original.txt
            ├── 422_122949_000013_000010.wav
            ├── 422_122949.book.tsv
            └── 422_122949.trans.tsv
```
Alternatively, you could write your own dataloader for your dataset.
You can reference the `__getitem__` method in `models/tts/valle_v2/mls_dataset.py`.
It should return a dict with a 1-dimensional tensor `speech`, which is a 16kHz speech waveform, and a 1-dimensional tensor `phone`, which is the phoneme sequence of the speech.
As long as your dataset returns these in `__getitem__`, it should work.
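For illustration, a minimal custom dataset following this contract might look like the sketch below. The class name and the random placeholder data are made up for the example; in practice you would load real 16kHz audio and phoneme IDs:

```python
import torch
from torch.utils.data import Dataset

class MyTTSDataset(Dataset):
    """Placeholder dataset returning the dict format described above."""

    def __init__(self):
        # In practice, collect (audio, transcript) pairs here.
        # Random tensors stand in for real 16kHz audio and phoneme IDs.
        self.items = [
            (torch.randn(16000 * 3), torch.randint(0, 100, (20,))),  # ~3 s clip
            (torch.randn(16000 * 5), torch.randint(0, 100, (35,))),  # ~5 s clip
        ]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        speech, phone = self.items[idx]
        return {
            "speech": speech,  # 1-D tensor, 16kHz waveform
            "phone": phone,    # 1-D tensor, phoneme ID sequence
        }

sample = MyTTSDataset()[0]
print(sample["speech"].shape, sample["phone"].shape)
```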
### Changing batch size and dataset path in configuration file
Our configuration file for training the VALL-E AR model is at `egs/tts/VALLE_V2/exp_ar_libritts.json`, and the NAR model's at `egs/tts/VALLE_V2/exp_nar_libritts.json`.
To train your model, you need to modify the `dataset` variable in the json configurations.
Currently it's at line 40; you should modify the `data_dir` entry to point to your dataset's root directory.
```
"dataset": {
"dataset_list":["train-clean-360"], // You can also change to other splits like "dev-clean"
"data_dir": "/path/to/your/LibriTTS_R",
},
```
You should also select a reasonable batch size in the `batch_size` entry (currently it is set to 5).
You can change other experiment settings in `egs/tts/VALLE_V2/exp_ar_libritts.json`, such as the learning rate, optimizer, and dataset.
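If you prefer to edit the configuration programmatically, note that the snippet above contains `//` comments, which the standard `json` module rejects. A minimal sketch (assuming comments only appear outside string values) could strip them before parsing:

```python
import json
import re

# Abridged config text in the same shape as the snippet above
config_text = """
{
    "dataset": {
        "dataset_list": ["train-clean-360"], // or "dev-clean"
        "data_dir": "/path/to/your/LibriTTS_R"
    },
    "batch_size": 5
}
"""

# Naive //-comment stripping: assumes no "//" occurs inside string values
stripped = re.sub(r"//[^\n]*", "", config_text)
config = json.loads(stripped)

# Point the config at your dataset and pick a batch size
config["dataset"]["data_dir"] = "/data/LibriTTS_R"
config["batch_size"] = 8
print(json.dumps(config, indent=4))
```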
### Run the command to Train AR model
(Make sure your current directory is at the Amphion root directory).
Run:
```sh
sh egs/tts/VALLE_V2/train_ar_libritts.sh
```
Your initial model checkpoints will be saved to paths such as `ckpt/VALLE_V2/ar_libritts/checkpoint/epoch-0000_step-0000000_loss-7.397293/pytorch_model.bin`.
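The checkpoint directory names encode the epoch, step, and loss. If you need to locate the most recent checkpoint, one way (a sketch, not part of the repository) is to parse the step out of the directory name:

```python
import re

def latest_checkpoint(dir_names):
    """Return the checkpoint directory name with the highest training step."""
    pattern = re.compile(r"epoch-(\d+)_step-(\d+)_loss-([\d.]+)")
    best = None
    for name in dir_names:
        m = pattern.fullmatch(name)
        if m and (best is None or int(m.group(2)) > int(best[1].group(2))):
            best = (name, m)
    return best[0] if best else None

# Example directory names following the scheme above (values made up)
dirs = [
    "epoch-0000_step-0000000_loss-7.397293",
    "epoch-0001_step-0012000_loss-3.102001",
]
print(latest_checkpoint(dirs))
```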
### Resume from existing checkpoint
Our framework supports resuming from an existing checkpoint.
Run:
```sh
sh egs/tts/VALLE_V2/train_ar_libritts.sh --resume
```
### Finetuning based on our AR model
We provide our AR model's optimizer and random_states checkpoints to support finetuning (no need to download these files if you are only running inference with the pretrained model). First rename the files to `pytorch_model.bin`, `optimizer.bin`, and `random_states_0.pkl`; then you can resume from these checkpoints. [Link to AR optimizer checkpoint](https://huggingface.co/amphion/valle/blob/main/optimizer_valle_ar_mls_196000.bin) and [link to random_states_0.pkl](https://huggingface.co/amphion/valle/blob/main/random_states_0.pkl).
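As a sketch, the renaming step could be scripted like this. The local directory name is an assumption; adjust it to wherever you downloaded the checkpoint files:

```python
from pathlib import Path

# Hypothetical directory holding the downloaded finetuning checkpoints
ckpt_dir = Path("ckpts/finetune_ar")
ckpt_dir.mkdir(parents=True, exist_ok=True)

# Downloaded filename -> name the training framework expects
renames = {
    "valle_ar_mls_196000.bin": "pytorch_model.bin",
    "optimizer_valle_ar_mls_196000.bin": "optimizer.bin",
    # random_states_0.pkl already has the expected name
}

for src, dst in renames.items():
    src_path = ckpt_dir / src
    if src_path.exists():  # skip any file that was not downloaded
        src_path.rename(ckpt_dir / dst)
```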
### Run the command to Train NAR model
(Make sure your current directory is at the Amphion root directory).
Run:
```sh
sh egs/tts/VALLE_V2/train_nar_libritts.sh
```
### Inference your models
Since our inference script is already given, you can change the paths
from our pretrained models to your newly trained models and run inference.
## Future plans
- [ ] Support more languages
- [ ] More are coming...