LightSpeech MFA SW v4

LightSpeech MFA SW v4 is a text-to-mel-spectrogram model based on the LightSpeech architecture. This model was fine-tuned from LightSpeech MFA SW v1 and trained on real and synthetic audio datasets. The list of speakers includes:

sw-TZ-Victoria
sw-TZ-Victoria-syllables-word
sw-TZ-Victoria-v2
sw-TZ-VictoriaNeural-upsampled-48kHz

We trained a Swahili acoustic model on our speech corpus using Montreal Forced Aligner v3.0.0 and used it as the duration extractor. That model, and consequently our model, uses the IPA phone set for Swahili. We used gruut for phonemization. We followed these steps to perform duration extraction.
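As an illustration of what duration extraction produces, the sketch below converts phone-level alignment intervals (in seconds) into per-phone mel-frame counts. It is a minimal example and not the authors' extraction script: the function name `intervals_to_frame_durations` and the sample alignment are hypothetical, and the sample rate and hop size are taken from the model table below.

```python
def intervals_to_frame_durations(intervals, sample_rate=44100, hop_size=512):
    """Convert (phone, start_sec, end_sec) intervals into per-phone frame counts.

    Interval boundaries are rounded to frame indices first, so contiguous
    intervals always sum to the rounded frame index of the final boundary.
    """
    durations = []
    for phone, start, end in intervals:
        start_frame = round(start * sample_rate / hop_size)
        end_frame = round(end * sample_rate / hop_size)
        durations.append((phone, end_frame - start_frame))
    return durations


# Hypothetical alignment for the first three phones of a word (times in seconds).
alignment = [("h", 0.0, 0.1), ("a", 0.1, 0.25), ("b", 0.25, 0.5)]
frame_durations = intervals_to_frame_durations(alignment)
total_frames = sum(n for _, n in frame_durations)
print(frame_durations, total_frames)
```

Rounding the boundaries rather than the individual lengths keeps the summed durations aligned with the mel-spectrogram length, which is what a duration predictor is trained against.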

This model was trained using the TensorFlowTTS framework. All training was done on an RTX 4090 GPU. All scripts used for training can be found in this GitHub fork, along with the training metrics logged via TensorBoard.

Model

Model	Config	SR (Hz)	Mel range (Hz)	FFT / Hop / Win (pt)	#steps
`lightspeech-mfa-sw-v4`	Link	44.1K	20-11025	2048 / 512 / None	200K

Training Procedure

Feature Extraction Setting

hop_size: 512 # Hop size.
format: "npy"

Network Architecture Setting

model_type: lightspeech
lightspeech_params:
    dataset: "swahiliipa"
    n_speakers: 1
    encoder_hidden_size: 256
    encoder_num_hidden_layers: 3
    encoder_num_attention_heads: 2
    encoder_attention_head_size: 16
    encoder_intermediate_size: 1024
    encoder_intermediate_kernel_size:
        - 5
        - 25
        - 13
        - 9
    encoder_hidden_act: "mish"
    decoder_hidden_size: 256
    decoder_num_hidden_layers: 3
    decoder_num_attention_heads: 2
    decoder_attention_head_size: 16
    decoder_intermediate_size: 1024
    decoder_intermediate_kernel_size:
        - 17
        - 21
        - 9
        - 13
    decoder_hidden_act: "mish"
    variant_prediction_num_conv_layers: 2
    variant_predictor_filter: 256
    variant_predictor_kernel_size: 3
    variant_predictor_dropout_rate: 0.5
    num_mels: 80
    hidden_dropout_prob: 0.2
    attention_probs_dropout_prob: 0.1
    max_position_embeddings: 2048
    initializer_range: 0.02
    output_attentions: False
    output_hidden_states: False

Data Loader Setting

batch_size: 16 # Batch size per GPU, assuming gradient_accumulation_steps == 1.
eval_batch_size: 16
remove_short_samples: true # Whether to remove samples whose length is less than batch_max_steps.
allow_cache: true # Whether to allow caching in the dataset. If true, it requires CPU memory.
mel_length_threshold: 32 # Remove all targets with mel_length <= 32.
is_shuffle: true # Shuffle the dataset after each epoch.

Optimizer & Scheduler Setting

optimizer_params:
    initial_learning_rate: 0.0001
    end_learning_rate: 0.00005
    decay_steps: 150000 # A value < train_max_steps is recommended.
    warmup_proportion: 0.02
    weight_decay: 0.001

gradient_accumulation_steps: 1
var_train_expr:
    null # Trainable variable expression (e.g. 'embeddings|encoder|decoder').
    # Entries must be separated by |. If var_train_expr is null,
    # all variables are trained.
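The optimizer values above describe a short warmup followed by a decay from 1e-4 to 5e-5. The sketch below computes the learning rate at a given step under stated assumptions: linear warmup over `warmup_proportion * train_max_steps` steps (with `train_max_steps` of 200000 from the interval settings), then a linear polynomial decay evaluated at the global step. The exact schedule implemented by the training script may differ, and `learning_rate` is a hypothetical helper, not a TensorFlowTTS function.

```python
def learning_rate(
    step,
    initial_lr=1e-4,
    end_lr=5e-5,
    decay_steps=150_000,
    warmup_proportion=0.02,
    train_max_steps=200_000,
):
    """Approximate learning rate at `step` (assumed linear warmup + linear decay)."""
    warmup_steps = round(train_max_steps * warmup_proportion)  # 4000 steps here
    if step < warmup_steps:
        # Assumption: ramp linearly from 0 up to initial_lr during warmup.
        return initial_lr * step / warmup_steps
    # Assumption: linear decay on the global step, clamped at decay_steps.
    progress = min(step, decay_steps) / decay_steps
    return (initial_lr - end_lr) * (1.0 - progress) + end_lr


for step in (0, 2_000, 75_000, 150_000, 200_000):
    print(step, learning_rate(step))
```

Because `decay_steps` is smaller than `train_max_steps`, the last 50000 steps run at the constant `end_learning_rate` under these assumptions.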

Interval Setting

train_max_steps: 200000 # Number of training steps.
save_interval_steps: 5000 # Interval steps to save checkpoint.
eval_interval_steps: 5000 # Interval steps to evaluate the network.
log_interval_steps: 200 # Interval steps to record the training log.
delay_f0_energy_steps: 3 # 2 steps use LR outputs only, then 1 step uses LR + F0 + Energy.

Other Setting

num_save_intermediate_results: 1 # Number of batches to be saved as intermediate results.

How to Use

import tensorflow as tf
from tensorflow_tts.inference import TFAutoModel, AutoProcessor

lightspeech = TFAutoModel.from_pretrained("bookbot/lightspeech-mfa-sw-v4")
processor = AutoProcessor.from_pretrained("bookbot/lightspeech-mfa-sw-v4")

text, speaker_name = "Hello World", "sw-TZ-Victoria"
input_ids = processor.text_to_sequence(text)

mel, duration_outputs, _ = lightspeech.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor(
        [processor.speakers_map[speaker_name]], dtype=tf.int32
    ),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
)
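The returned `mel` is a sequence of 80-bin mel frames, and a separate vocoder is needed to turn it into a waveform. As a quick sanity check before vocoding, the sketch below estimates how long the synthesized audio will be from the number of mel frames, using the sample rate and hop size listed in the model table. `mel_frames_to_seconds` is a hypothetical helper introduced for illustration.

```python
def mel_frames_to_seconds(n_frames, sample_rate=44100, hop_size=512):
    """Estimate audio duration from a mel-frame count (one frame per hop)."""
    return n_frames * hop_size / sample_rate


# With the inference output above, the frame count would be mel.shape[1].
# A plain integer is used here so the example runs without the model.
n_frames = 441
seconds = mel_frames_to_seconds(n_frames)
print(f"{n_frames} frames is about {seconds:.2f} s of audio")
```

At 44.1 kHz with a hop of 512 samples, each frame covers roughly 11.6 ms, so the summed `duration_outputs` (in frames) can be converted to seconds the same way.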

Disclaimer

Do consider that biases from the pre-training datasets may carry over into the results of this model.

Authors

LightSpeech MFA SW v4 was trained and evaluated by David Samuel Setiawan and Wilson Wongso. All computation and development were done on local machines.

Framework versions

TensorFlowTTS 1.8
TensorFlow 2.12.0
