---
language:
- ja
---

# Japanese GSLM

This is a Japanese implementation of the [Generative Spoken Language Model](https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/gslm) (GSLM) to support textless NLP in Japanese.
Submitted to the Acoustical Society of Japan, 2023 Spring meeting.

## How to use
- PyTorch version >= 1.10.0
- Python version >= 3.8

### Install requirements
You first need to install the [fairseq](https://github.com/facebookresearch/fairseq/) library and all of its requirements.

```
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./

pip install librosa unidecode inflect
```

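As a quick sanity check that the editable install is visible to Python (this check is an addition, not part of the original setup):

```
python -c "import fairseq, librosa; print('fairseq and librosa import OK')"
```
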
## Re-synthesis of voice signal

### speech2unit

The procedure for speech2unit is the same as the GSLM example in [fairseq](https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/gslm/speech2unit).

You can convert a Japanese voice signal into discrete units with this [pre-trained quantization model](https://huggingface.co/nonmetal/gslm-japanese/resolve/main/hubert200_JPN.bin); point `KM_MODEL_PATH` at the downloaded file.

This model replaces the `HuBERT Base + KM200` model provided by fairseq, so you still need to download the `HuBERT-Base` model as the pretrained acoustic model.

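For example, the quantization model can be fetched directly from the link above (the use of `wget` and the local filename are only an illustration; any download method works):

```
wget https://huggingface.co/nonmetal/gslm-japanese/resolve/main/hubert200_JPN.bin
KM_MODEL_PATH=hubert200_JPN.bin
```
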
```
TYPE='hubert'
CKPT_PATH=<path_of_pretrained_acoustic_model>
LAYER=6
KM_MODEL_PATH=<path_of_downloaded_kmeans_model>
MANIFEST=<tab_separated_manifest_of_audio_files_to_quantize>
OUT_QUANTIZED_FILE=<output_quantized_audio_file_path>

python examples/textless_nlp/gslm/speech2unit/clustering/quantize_with_kmeans.py \
    --feature_type $TYPE \
    --kmeans_model_path $KM_MODEL_PATH \
    --acoustic_model_path $CKPT_PATH \
    --layer $LAYER \
    --manifest_path $MANIFEST \
    --out_quantized_file_path $OUT_QUANTIZED_FILE \
    --extension ".wav"
```

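The file passed as `MANIFEST` follows the usual fairseq audio manifest layout: the first line is the root directory of the audio files, and each subsequent line is a tab-separated relative path and its sample count. A minimal sketch, with made-up paths and sample counts:

```
# Illustrative only: build a two-file manifest with explicit tab separators.
printf '%s\n' '/path/to/japanese_wavs' > manifest.tsv
printf '%s\t%s\n' 'utt001.wav' '160000' >> manifest.tsv
printf '%s\t%s\n' 'utt002.wav' '240000' >> manifest.tsv
MANIFEST=manifest.tsv
```
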
### unit2speech

The unit2speech model is a modified Tacotron2 model that learns to synthesize speech from discrete speech units.
You can convert discrete units into synthesized speech with this [model](https://huggingface.co/nonmetal/gslm-japanese/resolve/main/checkpoint_125k.pt). You also need to download the [WaveGlow checkpoint](https://dl.fbaipublicfiles.com/textless_nlp/gslm/waveglow_256channels_new.pt) used as the vocoder.

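Both checkpoints can be fetched from the links above, for example with `wget` (the local filenames are simply those the URLs resolve to):

```
wget https://huggingface.co/nonmetal/gslm-japanese/resolve/main/checkpoint_125k.pt
wget https://dl.fbaipublicfiles.com/textless_nlp/gslm/waveglow_256channels_new.pt
```
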
```
TTS_MODEL_PATH=<unit2speech_model_file_path>
OUT_DIR=<dir_to_dump_synthesized_audio_files>
WAVEGLOW_PATH=<path_where_you_have_downloaded_waveglow_checkpoint>

python unit2speech_ja.py \
    --tts_model_path $TTS_MODEL_PATH \
    --out_audio_dir $OUT_DIR \
    --waveglow_path $WAVEGLOW_PATH
```

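To sanity-check the results, the files written to `OUT_DIR` can be loaded with librosa (installed earlier); this snippet is illustrative and not part of the original pipeline:

```
python - << EOF
import glob
import librosa

# Print sample rate and duration of each synthesized file in OUT_DIR.
for path in sorted(glob.glob("$OUT_DIR/*.wav")):
    wav, sr = librosa.load(path, sr=None)
    print(path, sr, "Hz,", round(len(wav) / sr, 2), "s")
EOF
```
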
## References
- Lakhotia, Kushal et al. On Generative Spoken Language Modeling from Raw Audio. Transactions of the Association for Computational Linguistics, 9:1336–1354, 2021.
- Ott, Myle et al. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, 2019.