A Vietnamese TTS
================

🔔 **Notice**: This project is no longer being updated. Please refer to the new project, [LightSpeed](https://github.com/NTT123/light-speed), which includes [a new male voice](https://huggingface.co/spaces/ntt123/Vietnam-male-voice-TTS).

Duration model + acoustic model + HiFiGAN vocoder for a Vietnamese text-to-speech application.

Online demo at https://huggingface.co/spaces/ntt123/vietTTS.

A synthesized audio clip: [clip.wav](assets/infore/clip.wav). A Colab notebook: [notebook](https://colab.research.google.com/drive/1oczrWOQOr1Y_qLdgis1twSlNZlfPVXoY?usp=sharing).

Check out the experimental `multi-speaker` branch (`git checkout multi-speaker`) for multi-speaker support.

Install
-------

```sh
git clone https://github.com/NTT123/vietTTS.git
cd vietTTS
pip3 install -e .
```

Quick start using pretrained models
-----------------------------------

```sh
bash ./scripts/quick_start.sh
```

Download InfoRe dataset
-----------------------

```sh
python ./scripts/download_aligned_infore_dataset.py
```

**Note**: this is a denoised and aligned version of the original dataset, which was donated by the InfoRe Technology company (see [here](https://www.facebook.com/groups/j2team.community/permalink/1010834009248719/)). You can download the original dataset (**InfoRe Technology 1**) [here](https://github.com/TensorSpeech/TensorFlowASR/blob/main/README.md#vietnamese).

See `notebooks/denoise_infore_dataset.ipynb` for instructions on how to denoise the dataset.

We use the Montreal Forced Aligner (MFA) to align transcripts and speech (TextGrid files). See `notebooks/align_text_audio_infore_mfa.ipynb` for instructions on how to create the TextGrid files; a minimal example invocation is also sketched at the end of this README.

Train duration model
--------------------

```sh
python -m vietTTS.nat.duration_trainer
```

Train acoustic model
--------------------

```sh
python -m vietTTS.nat.acoustic_trainer
```

Train HiFiGAN vocoder
---------------------

We use the original implementation from the HiFiGAN authors at https://github.com/jik876/hifi-gan. Use the config file at `assets/hifigan/config.json` to train your model.

```sh
git clone https://github.com/jik876/hifi-gan.git

# create a dataset in hifi-gan format
ln -sf `pwd`/train_data hifi-gan/data
cd hifi-gan/data
ls -1 *.TextGrid | sed -e 's/\.TextGrid$//' > files.txt
cd ..
head -n 100 data/files.txt > val_files.txt
tail -n +101 data/files.txt > train_files.txt
rm data/files.txt

# training
python train.py \
  --config ../assets/hifigan/config.json \
  --input_wavs_dir=data \
  --input_training_file=train_files.txt \
  --input_validation_file=val_files.txt
```

Fine-tune on ground-truth-aligned (GTA) mel-spectrograms:

```sh
cd /path/to/vietTTS                                        # go to the vietTTS directory
python -m vietTTS.nat.zero_silence_segments -o train_data  # zero out all [sil, sp, spn] segments
python -m vietTTS.nat.gta -o /path/to/hifi-gan/ft_dataset  # create GTA mel-spectrograms in the hifi-gan/ft_dataset directory

# fine-tune the vocoder
cd /path/to/hifi-gan
python train.py \
  --fine_tuning True \
  --config ../assets/hifigan/config.json \
  --input_wavs_dir=data \
  --input_training_file=train_files.txt \
  --input_validation_file=val_files.txt
```

Then, use the following command to convert the PyTorch model to Haiku format:

```sh
cd ..
python -m vietTTS.hifigan.convert_torch_model_to_haiku \
  --config-file=assets/hifigan/config.json \
  --checkpoint-file=hifi-gan/cp_hifigan/g_[latest_checkpoint]
```

Synthesize speech
-----------------

```sh
python -m vietTTS.synthesizer \
  --lexicon-file=train_data/lexicon.txt \
  --text="hôm qua em tới trường" \
  --output=clip.wav
```
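The synthesizer handles one sentence per invocation, so batch synthesis is just a shell loop over the same CLI. A minimal sketch; `sentences.txt` (one sentence per line) and the numbered output file names are assumptions, not part of the repo:

```sh
# Batch synthesis sketch: read one sentence per line from sentences.txt
# (an assumed input file) and write clip_0.wav, clip_1.wav, ...
i=0
while IFS= read -r line; do
  python -m vietTTS.synthesizer \
    --lexicon-file=train_data/lexicon.txt \
    --text="$line" \
    --output="clip_${i}.wav"
  i=$((i+1))
done < sentences.txt
```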
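For reference, the MFA alignment step mentioned in the dataset section above typically boils down to a single command. This is a hedged sketch assuming a recent MFA 2.x release is installed; the corpus, lexicon, and acoustic-model paths are placeholders, and `notebooks/align_text_audio_infore_mfa.ipynb` remains the authoritative walkthrough:

```sh
# Hedged sketch of aligning audio and transcripts with MFA 2.x.
# All paths are placeholders: wavs/ holds .wav files with matching .txt or .lab
# transcripts, lexicon.txt maps words to phonemes, and acoustic_model.zip is a
# pretrained (or self-trained) MFA acoustic model. The output directory will
# contain one .TextGrid file per utterance.
mfa align wavs/ lexicon.txt acoustic_model.zip textgrids/
```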