---
language:
- ja
---

# Japanese GSLM

This is a Japanese implementation of the [Generative Spoken Language Model](https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/gslm) (GSLM) to support textless NLP in Japanese.
Submitted to the Acoustical Society of Japan, 2023 Spring meeting.

## How to use
- PyTorch version >= 1.10.0
- Python version >= 3.8

### Install requirements
You first need to install the [fairseq](https://github.com/facebookresearch/fairseq/) library and all of its requirements.

```
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./

pip install librosa unidecode inflect
```

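As a quick sanity check that the editable install is visible to Python (this check is an addition, not part of the original setup):

```
python -c "import fairseq, librosa; print('fairseq and librosa import OK')"
```
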
## Re-synthesis of voice signal

### speech2unit

The procedure for speech2unit is the same as the GSLM example in [fairseq](https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/gslm/speech2unit).

You can convert a Japanese voice signal into discrete units with this [pre-trained quantization model](https://huggingface.co/nonmetal/gslm-japanese/resolve/main/hubert200_JPN.bin); point `KM_MODEL_PATH` at the downloaded file.

This model replaces the `HuBERT Base + KM200` model provided by fairseq, so you still need to download the `HuBERT-Base` model as the pretrained acoustic model.

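For example, the quantization model can be fetched directly from the link above (the use of `wget` and the local filename are only an illustration; any download method works):

```
wget https://huggingface.co/nonmetal/gslm-japanese/resolve/main/hubert200_JPN.bin
KM_MODEL_PATH=hubert200_JPN.bin
```
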
```
TYPE='hubert'
CKPT_PATH=<path_of_pretrained_acoustic_model>
LAYER=6
KM_MODEL_PATH=<path_of_downloaded_kmeans_model>
MANIFEST=<tab_separated_manifest_of_audio_files_to_quantize>
OUT_QUANTIZED_FILE=<output_quantized_audio_file_path>

python examples/textless_nlp/gslm/speech2unit/clustering/quantize_with_kmeans.py \
    --feature_type $TYPE \
    --kmeans_model_path $KM_MODEL_PATH \
    --acoustic_model_path $CKPT_PATH \
    --layer $LAYER \
    --manifest_path $MANIFEST \
    --out_quantized_file_path $OUT_QUANTIZED_FILE \
    --extension ".wav"
```

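The file passed as `MANIFEST` follows the usual fairseq audio manifest layout: the first line is the root directory of the audio files, and each subsequent line is a tab-separated relative path and its sample count. A minimal sketch, with made-up paths and sample counts:

```
# Illustrative only: build a two-file manifest with explicit tab separators.
printf '%s\n' '/path/to/japanese_wavs' > manifest.tsv
printf '%s\t%s\n' 'utt001.wav' '160000' >> manifest.tsv
printf '%s\t%s\n' 'utt002.wav' '240000' >> manifest.tsv
MANIFEST=manifest.tsv
```
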
### unit2speech

The unit2speech model is a modified Tacotron2 model that learns to synthesize speech from discrete speech units.
You can convert discrete units into synthesized speech with this [model](https://huggingface.co/nonmetal/gslm-japanese/resolve/main/checkpoint_125k.pt). You also need to download the [WaveGlow checkpoint](https://dl.fbaipublicfiles.com/textless_nlp/gslm/waveglow_256channels_new.pt) used as the vocoder.

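Both checkpoints can be fetched from the links above, for example with `wget` (the local filenames are simply those the URLs resolve to):

```
wget https://huggingface.co/nonmetal/gslm-japanese/resolve/main/checkpoint_125k.pt
wget https://dl.fbaipublicfiles.com/textless_nlp/gslm/waveglow_256channels_new.pt
```
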
```
TTS_MODEL_PATH=<unit2speech_model_file_path>
OUT_DIR=<dir_to_dump_synthesized_audio_files>
WAVEGLOW_PATH=<path_where_you_have_downloaded_waveglow_checkpoint>

python unit2speech_ja.py \
    --tts_model_path $TTS_MODEL_PATH \
    --out_audio_dir $OUT_DIR \
    --waveglow_path $WAVEGLOW_PATH
```

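To sanity-check the results, the files written to `OUT_DIR` can be loaded with librosa (installed earlier); this snippet is illustrative and not part of the original pipeline:

```
python - << EOF
import glob
import librosa

# Print sample rate and duration of each synthesized file in OUT_DIR.
for path in sorted(glob.glob("$OUT_DIR/*.wav")):
    wav, sr = librosa.load(path, sr=None)
    print(path, sr, "Hz,", round(len(wav) / sr, 2), "s")
EOF
```
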
## References
- Lakhotia, Kushal et al. On Generative Spoken Language Modeling from Raw Audio. Transactions of the Association for Computational Linguistics, 9:1336–1354, 2021.
- Ott, Myle et al. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, 2019.