---
license: mit
pipeline_tag: audio-to-audio
tags:
- vocos
- hifigan
- tts
- melspectrogram
- vocoder
- mel
---

### Model Description

**Vocos** is a fast neural vocoder designed to synthesize audio waveforms from acoustic features.
Unlike other typical GAN-based vocoders, Vocos does not model audio samples in the time domain.
Instead, it generates spectral coefficients, facilitating rapid audio reconstruction through
inverse Fourier transform.

This version of Vocos uses 80-bin mel spectrograms as acoustic features, which have been widespread
in the TTS domain since the introduction of [hifi-gan](https://github.com/jik876/hifi-gan/blob/master/meldataset.py).
The goal of this model is to provide an alternative to hifi-gan that is faster and compatible with the
acoustic output of several TTS models.

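If you need to produce compatible features yourself, a minimal sketch using torchaudio follows. The 80 mel bins and the log-clamp compression mirror hifi-gan's `meldataset.py`; the 44100 Hz rate is taken from the model name, while `n_fft`, `hop_length`, and the input file are illustrative assumptions that must match this checkpoint's training configuration.

```python
import torch
import torchaudio

# Assumed analysis parameters -- verify against the checkpoint's training config.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=44100,  # per the model name
    n_fft=2048,         # assumption
    hop_length=512,     # assumption
    n_mels=80,          # fixed by this model
    power=1.0,          # magnitude spectrogram, as in hifi-gan
    norm="slaney",      # librosa-style filterbank, as in hifi-gan
    mel_scale="slaney",
)

wav, sr = torchaudio.load("speech.wav")  # hypothetical mono 44100 Hz input
mel = torch.log(torch.clamp(mel_transform(wav), min=1e-5))  # hifi-gan-style log compression
print(mel.shape)  # (1, 80, frames)
```
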
## Intended Uses and Limitations

The model is intended to serve as a vocoder that synthesizes audio waveforms from mel spectrograms.
It is trained to generate speech; if it is used on other audio domains, it will likely not produce
high-quality samples.

### Installation

To use Vocos only in inference mode, install it using:

```bash
pip install git+https://github.com/langtech-bsc/vocos.git@matcha
```

### Reconstruct audio from mel-spectrogram

```python
import torch

from vocos import Vocos

vocos = Vocos.from_pretrained("patriotyk/vocos-mel-hifigan-compat-44100khz")

mel = torch.randn(1, 80, 256)  # B, C, T - random values standing in for a real log-mel spectrogram
audio = vocos.decode(mel)
```
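
`decode` returns a waveform tensor of shape `(batch, samples)`. Assuming 44100 Hz output, per the model name, the result can be written to disk with torchaudio:

```python
import torchaudio

# `audio` comes from the snippet above and has shape (1, samples)
torchaudio.save("reconstructed.wav", audio.cpu(), 44100)
```
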
### Training Data

The model was trained on a private dataset of 800+ hours of Ukrainian audiobooks, prepared with the [narizaka](https://github.com/patriotyk/narizaka) tool.

### Training Procedure

The model was trained for 2.0M steps (210 epochs) with a batch size of 20. We used a cosine scheduler
with an initial learning rate of 3e-4. Training ran on two RTX 3090 video cards and took about one
month of continuous training.

#### Training Hyperparameters

* initial_learning_rate: 3e-4
* scheduler: cosine without warmup or restarts (sketched below)
* mel_loss_coeff: 45
* mrd_loss_coeff: 1.0
* batch_size: 20
* num_samples: 32768

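A minimal PyTorch sketch of that schedule; the AdamW optimizer and the toy loop are illustrative stand-ins, since the card only specifies the schedule shape, the initial learning rate, and the 2.0M-step horizon:

```python
import torch

params = [torch.nn.Parameter(torch.zeros(1))]   # placeholder model parameters
optimizer = torch.optim.AdamW(params, lr=3e-4)  # initial_learning_rate from above

# Cosine annealing without warmup or restarts, decaying over the full 2.0M steps.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=2_000_000)

for step in range(3):  # stand-in for the real training loop
    optimizer.step()
    scheduler.step()
```
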
## Evaluation

Evaluation was done using the metrics from the original repo. After 210 epochs we achieved:

* val_loss: 3.703
* f1_score: 0.950
* mel_loss: 0.248
* periodicity_loss: 0.127
* pesq_score: 3.399 (scoring sketched below)
* pitch_loss: 38.26
* utmos_score: 3.146

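As a rough illustration of how a PESQ figure like the one above can be computed, the `pesq` package (an assumption; the repo's exact evaluation code may differ) scores a reconstruction against its reference after resampling to 16 kHz, since PESQ is only defined at 8 and 16 kHz. File names are hypothetical:

```python
import torchaudio
from pesq import pesq  # pip install pesq

ref, sr = torchaudio.load("reference.wav")     # hypothetical ground-truth audio
deg, _ = torchaudio.load("reconstructed.wav")  # vocoded copy of the same utterance

resample = torchaudio.transforms.Resample(sr, 16000)
ref16, deg16 = resample(ref), resample(deg)

# Wideband PESQ on mono numpy arrays.
score = pesq(16000, ref16[0].numpy(), deg16[0].numpy(), "wb")
print(score)
```
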
## Citation

If this code contributes to your research, please cite the work:

```
@article{siuzdak2023vocos,
  title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
  author={Siuzdak, Hubert},
  journal={arXiv preprint arXiv:2306.00814},
  year={2023}
}
```