license: other
VALL-E Korean Model
Introduction
The VALL-E Korean model is an implementation of the VALL-E architecture for the Korean language. It is a zero-shot text-to-speech synthesizer that generates natural-sounding speech from Korean text input. Text is converted to phonemes with the espeak phonemizer using the language='ko' option, and audio is tokenized with the EnCodec audio tokenizer from Facebook Research's EnCodec repository.
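To give a sense of what the EnCodec tokenizer produces, the sketch below works out the token budget for an audio clip. It assumes the 24 kHz EnCodec model at 6 kbps (75 frames per second, 8 codebooks of 1024 entries each), which is the setting used in the VALL-E paper. This model card does not state the bandwidth, so treat the constants as assumptions; the function name is illustrative only.

```python
import math
from typing import Tuple

# Assumed EnCodec settings (24 kHz model at 6 kbps, as in the VALL-E paper).
# The model card does not state them, so adjust if the checkpoint differs.
FRAME_RATE_HZ = 75     # codec frames per second of audio
NUM_CODEBOOKS = 8      # residual quantizers active at 6 kbps
CODEBOOK_SIZE = 1024   # entries per codebook


def encodec_token_shape(duration_s: float) -> Tuple[int, int]:
    """Return (codebooks, frames) for a clip of the given length in seconds."""
    frames = math.ceil(duration_s * FRAME_RATE_HZ)
    return NUM_CODEBOOKS, frames


# Each token indexes one of 1024 entries, i.e. log2(1024) = 10 bits.
bits_per_token = int(math.log2(CODEBOOK_SIZE))
bitrate_bps = FRAME_RATE_HZ * NUM_CODEBOOKS * bits_per_token

print(encodec_token_shape(3.0))  # token grid for a 3-second prompt
print(bitrate_bps)               # bits per second implied by the settings
```

Under these assumptions a 3-second prompt becomes an 8 x 225 grid of integer tokens, and the implied bitrate matches the 6 kbps setting.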
Model Details
- Architecture: The VALL-E Korean model consists of both ar (autoregressive) and nar (non-autoregressive) models.
- Hidden Dimensions: The model has a hidden dimension of 1024.
- Transformer Layers: It comprises 12 transformer layers.
- Attention Heads: Each layer has 16 attention heads.
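The hyperparameters above can be collected into a small configuration object, sketched below. The class name and the parameter estimate are illustrative and not taken from the model's code; the estimate assumes a standard transformer block with a feed-forward width of four times the hidden dimension, and it ignores biases, normalization layers, and embeddings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValleKoConfig:
    """Hyperparameters listed in the model card (class name is illustrative)."""
    hidden_dim: int = 1024
    num_layers: int = 12
    num_heads: int = 16

    @property
    def head_dim(self) -> int:
        # Multi-head attention splits the hidden dimension evenly across heads.
        assert self.hidden_dim % self.num_heads == 0
        return self.hidden_dim // self.num_heads

    def approx_layer_params(self, ffn_mult: int = 4) -> int:
        # Assumption: standard transformer block with a feed-forward width of
        # ffn_mult * hidden_dim; biases, norms, and embeddings are ignored.
        attention = 4 * self.hidden_dim ** 2  # Q, K, V, and output projections
        feed_forward = 2 * ffn_mult * self.hidden_dim ** 2
        return self.num_layers * (attention + feed_forward)


cfg = ValleKoConfig()
print(cfg.head_dim)               # per-head dimension
print(cfg.approx_layer_params())  # rough weight count of the layer stack
```

With these values each attention head works on 64 dimensions, and the 12-layer stack holds roughly 151 million weights. If the ar and nar models both use these settings, that estimate applies to each of them separately.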
Training Data
The VALL-E Korean model was trained on approximately 2,000 hours of Korean audio-text pairs, sourced from the AI-Hub 한국인 대화음성 (Korean Conversational Speech) dataset.
Example Usage
The provided Google Colab notebook, tester_colab.ipynb, demonstrates text-to-speech synthesis with the VALL-E Korean model. The example also uses the vocos decoder, following Plachtaa's VALL-E-X repository, to convert the generated audio tokens into a waveform.
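As a rough illustration of the decoding step, the sketch below turns EnCodec codes into audio with the public Vocos EnCodec checkpoint. It is not the notebook's code: the checkpoint name, the 6 kbps default, and both helper names are assumptions, and `decode_codes` needs `torch` and `vocos` installed.

```python
# EnCodec bandwidths (kbps) supported by the pretrained Vocos EnCodec decoder;
# Vocos selects one by index through its `bandwidth_id` argument.
VOCOS_BANDWIDTHS_KBPS = [1.5, 3.0, 6.0, 12.0]


def bandwidth_id_for(kbps: float) -> int:
    """Map an EnCodec bandwidth in kbps to the Vocos bandwidth index."""
    return VOCOS_BANDWIDTHS_KBPS.index(kbps)


def decode_codes(codes, kbps: float = 6.0):
    """Decode EnCodec codes of shape (codebooks, frames) into a waveform.

    Requires `torch` and `vocos`, imported lazily so the helper above works
    without them. The checkpoint is the public Vocos EnCodec model, which may
    differ from what the notebook loads (assumption).
    """
    import torch
    from vocos import Vocos

    vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")
    features = vocos.codes_to_features(codes)
    bandwidth_id = torch.tensor([bandwidth_id_for(kbps)])
    return vocos.decode(features, bandwidth_id=bandwidth_id)


print(bandwidth_id_for(6.0))  # index passed to Vocos for 6 kbps codes
```

Calling `decode_codes` on an integer tensor of shape (8, frames) would return a 24 kHz waveform tensor, provided the codes were produced at the matching bandwidth.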
References
- Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
- VALL-E Repository by lifeiteng
- VALL-E Repository by enhuiz
- VALL-E-X Repository by Plachtaa
- Vocos
For more details on using the model, see the references listed above.