metadata

license: other

VALL-E Korean Model

Introduction

The VALL-E Korean model is an implementation of the VALL-E architecture for the Korean language. It is a zero-shot text-to-speech synthesizer that generates natural-sounding Korean speech from text input. Text is converted to phonemes with the espeak phonemizer (language='ko' option), and audio is tokenized with the EnCodec audio tokenizer from Facebook Research's EnCodec repository.
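The two front-end components named above can be sketched as follows. This is a minimal illustration, not the code used to train or run this model: it assumes the `phonemizer` package (with espeak or espeak-ng installed on the system), the `encodec` package, `torch`, and `torchaudio`. The function names, the 6 kbps bandwidth, and the file name `prompt.wav` are assumptions made for the example.

```python
def phonemize_korean(text: str) -> str:
    """Convert Korean text to a phoneme string using the espeak backend."""
    # Imported lazily so this module loads even if phonemizer is not installed.
    from phonemizer.backend import EspeakBackend

    backend = EspeakBackend(language="ko")
    # phonemize() takes a list of utterances and returns a list of strings.
    return backend.phonemize([text], strip=True)[0]


def encode_prompt(wav_path: str, bandwidth_kbps: float = 6.0):
    """Encode an audio file into EnCodec codes shaped [1, n_codebooks, n_frames]."""
    import torch
    import torchaudio
    from encodec import EncodecModel
    from encodec.utils import convert_audio

    model = EncodecModel.encodec_model_24khz()
    # ASSUMPTION: 6 kbps (8 codebooks); the card does not state the bandwidth.
    model.set_target_bandwidth(bandwidth_kbps)

    wav, sample_rate = torchaudio.load(wav_path)
    # Resample and remix to the rate and channel count the codec expects.
    wav = convert_audio(wav, sample_rate, model.sample_rate, model.channels)
    with torch.no_grad():
        encoded_frames = model.encode(wav.unsqueeze(0))
    # Each frame is a (codes, scale) pair; concatenate the codes along time.
    return torch.cat([codes for codes, _ in encoded_frames], dim=-1)


if __name__ == "__main__":
    print(phonemize_korean("안녕하세요"))
    print(encode_prompt("prompt.wav").shape)  # hypothetical prompt recording
```

In a zero-shot setup, the phoneme string conditions the model on what to say, while the codes from a short prompt recording condition it on the target voice.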

Model Details

Architecture: The VALL-E Korean model consists of two models, an autoregressive (ar) model and a non-autoregressive (nar) model.
Hidden Dimensions: The model has a hidden dimension of 1024.
Transformer Layers: It comprises 12 transformer layers.
Attention Heads: Each layer has 16 attention heads.
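To make these numbers concrete, the sketch below derives the per-head dimension and a rough weight count for one transformer stack. The class and field names are illustrative, and the feed-forward width of 4 times the hidden dimension is an assumption that the card does not state. The card also does not say whether the ar and nar models each use these settings, so the estimate is for a single 12-layer stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderConfig:
    """Hyperparameters listed in the model card (field names are illustrative)."""
    d_model: int = 1024      # hidden dimension
    num_layers: int = 12     # transformer layers
    num_heads: int = 16      # attention heads per layer
    ffn_multiplier: int = 4  # ASSUMPTION: feed-forward width = 4 * d_model

    @property
    def head_dim(self) -> int:
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        return self.d_model // self.num_heads

    def approx_weight_count(self) -> int:
        """Weight matrices only: excludes biases, layer norms, and embeddings."""
        attention = 4 * self.d_model ** 2  # Q, K, V, and output projections
        feed_forward = 2 * self.ffn_multiplier * self.d_model ** 2  # up and down
        return self.num_layers * (attention + feed_forward)


cfg = DecoderConfig()
print(cfg.head_dim)               # 64
print(cfg.approx_weight_count())  # 150994944
```

Under these assumptions each attention head works on a 64-dimensional slice, and one stack holds roughly 151 million weights before embeddings and output layers are counted.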

Training Data

The training data for the VALL-E Korean model consists of approximately 2,000 hours of Korean audio-text pairs, sourced from the AI-Hub dataset 한국인 대화음성 (Korean Conversational Speech).
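For a sense of scale, the dataset size can be converted into a count of acoustic tokens. The sketch below is back-of-the-envelope arithmetic only. It assumes the 24 kHz EnCodec model, which emits 75 frames per second, and 8 codebooks per frame (the 6 kbps setting used in the original VALL-E paper). The card does not confirm either setting for this model.

```python
FRAME_RATE_HZ = 75     # EnCodec 24 kHz model: 24000 samples/s with a hop of 320 samples
NUM_CODEBOOKS = 8      # ASSUMPTION: 6 kbps setting, not confirmed by the card
TRAINING_HOURS = 2000  # approximate figure from the model card


def acoustic_token_counts(hours: float,
                          frame_rate_hz: int = FRAME_RATE_HZ,
                          num_codebooks: int = NUM_CODEBOOKS) -> tuple[int, int]:
    """Return (frames, tokens) for the given amount of audio."""
    frames = int(hours * 3600 * frame_rate_hz)
    return frames, frames * num_codebooks


frames, tokens = acoustic_token_counts(TRAINING_HOURS)
print(f"{frames:,} frames")  # 540,000,000 frames
print(f"{tokens:,} tokens")  # 4,320,000,000 tokens
```

Under these assumptions, the autoregressive model, which predicts only the first codebook, would see about 540 million acoustic tokens, and the full set of codebooks amounts to about 4.3 billion.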

Example Usage

For a usage example, see the provided Google Colab notebook, tester_colab.ipynb, which demonstrates text-to-speech synthesis with the model. The example also uses the vocos decoder from Plachtaa's VALL-E repository.

References

For more information and details on using the model, please refer to the provided references and resources.