---
license: other
---

# VALL-E Korean Model

## Introduction

VALL-E Korean is an implementation of the VALL-E architecture for the Korean language. It is a zero-shot text-to-speech synthesizer that generates natural-sounding speech from Korean text input. The model relies on two front-end components: the espeak text phonemizer (with the `language='ko'` option) and the EnCodec audio tokenizer from [Facebook Research's EnCodec repository](https://github.com/facebookresearch/encodec).
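
As a rough illustration of these two front-end steps, the sketch below phonemizes Korean text with the `phonemizer` package's espeak backend and extracts discrete EnCodec codes from a reference waveform. This is an assumption-level sketch using the public `phonemizer` and `encodec` packages; the exact preprocessing in this repository and its tester notebook may differ.

```python
# Sketch of the two front-end steps, assuming the public `phonemizer`
# and `encodec` packages; the repository's own preprocessing may differ.
import torch
import torchaudio
from phonemizer import phonemize
from encodec import EncodecModel
from encodec.utils import convert_audio

# 1) Text -> phonemes via espeak with the Korean language option.
phonemes = phonemize("안녕하세요", language="ko", backend="espeak")

# 2) Reference audio -> discrete EnCodec codes (24 kHz model).
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # 6 kbps -> 8 codebooks

wav, sr = torchaudio.load("reference.wav")  # hypothetical prompt file
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    encoded_frames = model.encode(wav.unsqueeze(0))  # list of (codes, scale)
codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)  # [B, 8, T]
```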

## Model Details

- **Architecture**: VALL-E Korean consists of an AR (autoregressive) model and an NAR (non-autoregressive) model; an illustrative sketch of the listed sizes follows this list.
- **Hidden Dimension**: 1024.
- **Transformer Layers**: 12.
- **Attention Heads**: 16 per layer.
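
Purely to visualize these dimensions, here is a plain PyTorch transformer stack with hidden size 1024, 12 layers, and 16 heads. It is not the repository's actual AR/NAR implementation, and the feed-forward width (4096) is an assumed 4x expansion not stated above.

```python
# Illustrative only: a transformer stack with the README's dimensions
# (hidden size 1024, 12 layers, 16 heads). The real AR/NAR models add
# text/audio embeddings, masking, and other details not shown here.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(
    d_model=1024,           # hidden dimension
    nhead=16,               # attention heads per layer
    dim_feedforward=4096,   # assumed 4x expansion; not stated in the README
    batch_first=True,
)
stack = nn.TransformerEncoder(layer, num_layers=12)

x = torch.randn(1, 50, 1024)  # (batch, sequence, hidden)
print(stack(x).shape)         # torch.Size([1, 50, 1024])
```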

## Training Data

The model was trained on approximately 2,000 hours of Korean audio-text pairs sourced from [AI-Hub 한국인 대화음성 (Korean conversational speech)](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=130).

## Example Usage

For an end-to-end example, see the Google Colab notebook [tester_colab.ipynb](https://huggingface.co/LearnItAnyway/vall-e_korean/blob/main/tester_colab.ipynb), which demonstrates text-to-speech synthesis with this model. The notebook decodes the generated EnCodec tokens to audio with the Vocos decoder, following [Plachtaa's VALL-E-X repository](https://github.com/Plachtaa/VALL-E-X).
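
For the final decoding step specifically, the sketch below shows how the public `vocos` package can turn EnCodec codes into a waveform using the `charactr/vocos-encodec-24khz` checkpoint; the notebook's actual decoding code may differ.

```python
# Minimal sketch: EnCodec codes -> waveform with Vocos, assuming the
# public `vocos` package; the tester notebook may decode differently.
import torch
from vocos import Vocos

vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")

# In practice these codes would come from the VALL-E AR/NAR models;
# here we use dummy codes (8 codebooks, 200 frames) for illustration.
codes = torch.randint(low=0, high=1024, size=(8, 200))

features = vocos.codes_to_features(codes)
bandwidth_id = torch.tensor([2])  # index into [1.5, 3, 6, 12] kbps -> 6 kbps
audio = vocos.decode(features, bandwidth_id=bandwidth_id)
```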

## References

- [Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers](https://arxiv.org/abs/2301.02111)
- [VALL-E repository by lifeiteng](https://github.com/lifeiteng/vall-e)
- [VALL-E repository by enhuiz](https://github.com/enhuiz/vall-e)
- [VALL-E-X repository by Plachtaa](https://github.com/Plachtaa/VALL-E-X)
- [Vocos](https://github.com/charactr-platform/vocos)

For more details on using the model, please refer to the references above.