LightSpeech MFA SW v4
LightSpeech MFA SW v4 is a text-to-mel-spectrogram model based on the LightSpeech architecture. It was fine-tuned from LightSpeech MFA SW v1 and trained on real and synthetic audio datasets. The speakers include:
- sw-TZ-Victoria
- sw-TZ-Victoria-syllables-word
- sw-TZ-Victoria-v2
- sw-TZ-VictoriaNeural-upsampled-48kHz
We trained a Swahili acoustic model on our speech corpus using Montreal Forced Aligner v3.0.0 and used it as the duration extractor. That model, and consequently ours, uses the IPA phone set for Swahili. We used gruut for phonemization. We followed these steps to perform duration extraction.
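As an illustration of what duration extraction produces, the sketch below converts forced-alignment phone intervals (in seconds) into per-phone mel-frame counts, using the sample rate and hop size listed for this model. It is a minimal sketch of a common approach, not necessarily the authors' script: the function name `intervals_to_durations` and the example alignment are hypothetical, and the intervals are assumed to be contiguous and to start at 0 seconds.

```python
from typing import List, Tuple

SAMPLE_RATE = 44100  # Hz, from the model table
HOP_SIZE = 512       # samples, from the feature extraction setting


def intervals_to_durations(
    intervals: List[Tuple[str, float, float]],
    sample_rate: int = SAMPLE_RATE,
    hop_size: int = HOP_SIZE,
) -> List[Tuple[str, int]]:
    """Convert (phone, start_sec, end_sec) intervals to (phone, n_frames).

    Each interval's end time is rounded to a frame boundary, and the duration
    is the difference between consecutive boundaries. This keeps the sum of
    durations equal to the rounded frame index of the final end time.
    Assumes contiguous intervals that start at 0 seconds.
    """
    durations = []
    prev_frame = 0
    for phone, _start, end in intervals:
        end_frame = int(round(end * sample_rate / hop_size))
        durations.append((phone, end_frame - prev_frame))
        prev_frame = end_frame
    return durations


# Hypothetical alignment, for illustration only.
alignment = [("h", 0.00, 0.08), ("a", 0.08, 0.20), ("b", 0.20, 0.26)]
durations = intervals_to_durations(alignment)
total_frames = sum(n for _, n in durations)
```

Rounding boundaries rather than individual interval lengths avoids accumulating rounding error, so the durations stay consistent with the length of the mel spectrogram.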
This model was trained using the TensorFlowTTS framework. All training was done on an RTX 4090 GPU. All scripts used for training can be found in this GitHub Fork, as well as the Training metrics logged via TensorBoard.
Model
Model | Config | SR (Hz) | Mel range (Hz) | FFT / Hop / Win (pt) | #steps |
---|---|---|---|---|---|
lightspeech-mfa-sw-v4 | Link | 44.1K | 20-11025 | 2048 / 512 / None | 200K |
Training Procedure
Feature Extraction Setting
hop_size: 512 # Hop size.
format: "npy"
Network Architecture Setting
model_type: lightspeech
lightspeech_params:
    dataset: "swahiliipa"
    n_speakers: 1
    encoder_hidden_size: 256
    encoder_num_hidden_layers: 3
    encoder_num_attention_heads: 2
    encoder_attention_head_size: 16
    encoder_intermediate_size: 1024
    encoder_intermediate_kernel_size:
        - 5
        - 25
        - 13
        - 9
    encoder_hidden_act: "mish"
    decoder_hidden_size: 256
    decoder_num_hidden_layers: 3
    decoder_num_attention_heads: 2
    decoder_attention_head_size: 16
    decoder_intermediate_size: 1024
    decoder_intermediate_kernel_size:
        - 17
        - 21
        - 9
        - 13
    decoder_hidden_act: "mish"
    variant_prediction_num_conv_layers: 2
    variant_predictor_filter: 256
    variant_predictor_kernel_size: 3
    variant_predictor_dropout_rate: 0.5
    num_mels: 80
    hidden_dropout_prob: 0.2
    attention_probs_dropout_prob: 0.1
    max_position_embeddings: 2048
    initializer_range: 0.02
    output_attentions: False
    output_hidden_states: False
Data Loader Setting
batch_size: 16 # Batch size per GPU, assuming gradient_accumulation_steps == 1.
eval_batch_size: 16
remove_short_samples: true # Whether to remove samples shorter than batch_max_steps.
allow_cache: true # Whether to cache the dataset. If true, it requires CPU memory.
mel_length_threshold: 32 # Remove all targets with mel_length <= 32.
is_shuffle: true # Shuffle the dataset after each epoch.
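The effect of `mel_length_threshold` can be sketched as a simple filter over the dataset. This is an illustrative sketch based only on the config comment (targets with `mel_length <= 32` are removed), not the framework's actual data loader. The `samples` list and its `utt_id` values are hypothetical. At a hop size of 512 and a 44.1 kHz sample rate, 32 frames correspond to roughly 0.37 seconds of audio.

```python
MEL_LENGTH_THRESHOLD = 32  # from the data loader setting


def keep_sample(mel_length: int, threshold: int = MEL_LENGTH_THRESHOLD) -> bool:
    """Return True when a target is long enough to be kept for training."""
    # Targets with mel_length <= threshold are removed, so keep strictly longer ones.
    return mel_length > threshold


# Hypothetical utterances, for illustration only.
samples = [
    {"utt_id": "utt-a", "mel_length": 20},
    {"utt_id": "utt-b", "mel_length": 32},
    {"utt_id": "utt-c", "mel_length": 33},
    {"utt_id": "utt-d", "mel_length": 410},
]
kept_ids = [s["utt_id"] for s in samples if keep_sample(s["mel_length"])]
```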
Optimizer & Scheduler Setting
optimizer_params:
    initial_learning_rate: 0.0001
    end_learning_rate: 0.00005
    decay_steps: 150000 # A value below train_max_steps is recommended.
    warmup_proportion: 0.02
    weight_decay: 0.001
gradient_accumulation_steps: 1
var_train_expr: null # Trainable-variable expression (e.g. 'embeddings|encoder|decoder'),
                     # with names separated by |. If null, all variables are trained.
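To make the optimizer settings concrete, the sketch below evaluates a learning-rate schedule built from these values. It rests on several assumptions that the config alone does not confirm: linear warmup, a linear (power 1) polynomial decay evaluated at the global step, and a warmup length of 0.02 × 200,000 = 4,000 steps derived from `warmup_proportion` and `train_max_steps`. The function name `learning_rate` is hypothetical, and the schedule in the training code may differ in detail.

```python
def learning_rate(
    step: int,
    initial_lr: float = 1e-4,    # initial_learning_rate
    end_lr: float = 5e-5,        # end_learning_rate
    decay_steps: int = 150_000,  # decay_steps
    warmup_steps: int = 4_000,   # assumed: warmup_proportion * train_max_steps
    power: float = 1.0,          # assumed: linear decay
) -> float:
    """Approximate learning rate at a given training step."""
    if step < warmup_steps:
        # Linear warmup from 0 toward initial_lr.
        return initial_lr * step / warmup_steps
    # Polynomial decay from initial_lr to end_lr, held constant after decay_steps.
    progress = min(step, decay_steps) / decay_steps
    return (initial_lr - end_lr) * (1.0 - progress) ** power + end_lr


for step in (0, 2_000, 4_000, 75_000, 150_000, 200_000):
    print(f"step {step:>7}: lr = {learning_rate(step):.3e}")
```

Under these assumptions the rate reaches its floor of 0.00005 at step 150,000 and stays there for the final 50,000 steps.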
Interval Setting
train_max_steps: 200000 # Number of training steps.
save_interval_steps: 5000 # Interval steps to save checkpoint.
eval_interval_steps: 5000 # Interval steps to evaluate the network.
log_interval_steps: 200 # Interval steps to record the training log.
delay_f0_energy_steps: 3 # 2 steps use LR outputs only, then 1 step uses LR + F0 + Energy.
Other Setting
num_save_intermediate_results: 1 # Number of batches to be saved as intermediate results.
How to Use
import tensorflow as tf
from tensorflow_tts.inference import TFAutoModel, AutoProcessor

# Load the pretrained acoustic model and its text processor.
lightspeech = TFAutoModel.from_pretrained("bookbot/lightspeech-mfa-sw-v4")
processor = AutoProcessor.from_pretrained("bookbot/lightspeech-mfa-sw-v4")

text, speaker_name = "Hello World", "sw-TZ-Victoria"
# Convert the input text to a sequence of input IDs.
input_ids = processor.text_to_sequence(text)

# Run inference on a batch of one utterance with default speed, F0, and energy ratios.
mel, duration_outputs, _ = lightspeech.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor(
        [processor.speakers_map[speaker_name]], dtype=tf.int32
    ),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
)
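The model outputs a mel spectrogram rather than audio, so it can be useful to estimate how long the synthesized speech will be before running a vocoder. The sketch below does this from the number of mel frames, using the hop size and sample rate listed for this model. It assumes `mel` has shape `[batch, frames, num_mels]`, which the source does not state; the helper name `mel_frames_to_seconds` is hypothetical.

```python
HOP_SIZE = 512       # samples per mel frame, from the feature extraction setting
SAMPLE_RATE = 44100  # Hz, from the model table


def mel_frames_to_seconds(
    n_frames: int, hop_size: int = HOP_SIZE, sample_rate: int = SAMPLE_RATE
) -> float:
    """Approximate audio duration, in seconds, covered by n_frames mel frames."""
    return n_frames * hop_size / sample_rate


# With the inference output above, assuming shape [batch, frames, num_mels]:
#   n_frames = int(mel.shape[1])
n_frames = 86  # hypothetical frame count, for illustration only
seconds = mel_frames_to_seconds(n_frames)
```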
Disclaimer
Biases present in the pre-training datasets may carry over into the outputs of this model. Take them into account when using it.
Authors
LightSpeech MFA SW v4 was trained and evaluated by David Samuel Setiawan and Wilson Wongso. All computation and development were done on local machines.
Framework versions
- TensorFlowTTS 1.8
- TensorFlow 2.12.0