A StyleTTS2 fine-tune, designed for expressiveness.
Vokan features:
- A diverse dataset for more authentic zero-shot performance
- Training on 6+ days' worth of audio from 672 diverse and expressive speakers
- Training on 1x H100 for 300 hours and on 1x 3090 for an additional 600 hours
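Inference follows the standard StyleTTS2 pipeline. As a rough sketch, the third-party `styletts2` Python package (`pip install styletts2`) wraps that pipeline in a `StyleTTS2` class whose `inference` method performs zero-shot cloning from a reference clip; the checkpoint, config, and reference-audio paths below are placeholders, and loading this fine-tune's checkpoint through that wrapper is an assumption, not something verified here.
```python
# Hedged sketch: zero-shot inference via the third-party `styletts2` wrapper
# (pip install styletts2). All paths are placeholders -- point them at the
# downloaded Vokan checkpoint/config and a short clip of the target voice.
from styletts2 import tts

# Load the fine-tuned weights instead of the package's default StyleTTS2 model.
vokan = tts.StyleTTS2(
    model_checkpoint_path="path/to/vokan_checkpoint.pth",  # placeholder
    config_path="path/to/vokan_config.yml",                # placeholder
)

# Clone the style/timbre of a reference clip and synthesize to a WAV file.
vokan.inference(
    "Vokan is a StyleTTS2 fine-tune designed for expressiveness.",
    target_voice_path="reference_speaker.wav",  # short clip of the target voice
    output_wav_file="vokan_sample.wav",
)
```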
### Audio Examples
### Demo Spaces
Coming soon...
## This model was made possible thanks to
- [DagsHub](https://dagshub.com), who sponsored our GPU compute (with special thanks to Dean!)
- [camenduru](https://github.com/camenduru), who assisted with cloud infrastructure and model training
## Citations
```bibtex
@misc{li2023styletts,
  title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author={Yinghao Aaron Li and Cong Han and Vinay S. Raghavan and Gavin Mischler and Nima Mesgarani},
  year={2023},
  eprint={2306.07691},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}
@misc{zen2019libritts,
  title={LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech},
  author={Heiga Zen and Viet Dang and Rob Clark and Yu Zhang and Ron J. Weiss and Ye Jia and Zhifeng Chen and Yonghui Wu},
  year={2019},
  eprint={1904.02882},
  archivePrefix={arXiv},
  primaryClass={cs.SD}
}
@misc{veaux2017vctk,
  title={CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit},
  author={Christophe Veaux and Junichi Yamagishi and Kirsten MacDonald},
  year={2017},
  publisher={The Centre for Speech Technology Research (CSTR), University of Edinburgh}
}
```
## License
MIT
Stay tuned for Vokan V2!