Overview
This F5-TTS model is fine-tuned on the LJSpeech dataset with an emphasis on stability, so that it avoids choppiness, mispronunciations, repetitions, and skipped words.
Differences from the original model: the text input is converted to phonemes rather than used as raw text; phoneme alignment is used during training, while a duration predictor is used during inference (see the illustration after the links below).
Source code for phoneme alignment: https://github.com/sinhprous/F5-TTS/blob/main/src/f5_tts/train/datasets/utils_alignment.py
Source code for duration predictor: https://github.com/sinhprous/F5-TTS/blob/main/src/f5_tts/model/duration_predictor.py
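To make the phoneme front end concrete, here is a minimal sketch. It uses the third-party g2p_en package purely as a stand-in grapheme-to-phoneme converter; the phoneme set and forced-alignment code actually used for this model are in the files linked above.

```python
# Illustration only: convert input text to phonemes instead of feeding raw characters.
# g2p_en is a stand-in here; the phoneme conversion and alignment used for training
# are defined in utils_alignment.py (linked above).
from g2p_en import G2p

g2p = G2p()
text = "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo."
phonemes = g2p(text)  # ARPAbet tokens, e.g. ['B', 'AH1', 'F', 'AH0', 'L', 'OW2', ...]
print(phonemes)

# Training: phonemes are force-aligned to audio frames, giving the model explicit durations.
# Inference: a duration predictor estimates how many frames each phoneme should occupy.
```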
Colab demo: colab
Audio samples
Outputs from the original model were generated using https://huggingface.co/spaces/mrfakename/E2-F5-TTS. The original model usually skips words in these hard texts.
Data - driven AI systems said, "Key data is the key, data is key, data is key, data is the key, and the key to the data is key, the data key is the key to the data that is key to the key". Can you keep up?
Original model:
Finetuned model:
Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.
Original model:
Finetuned model:
Call one two three - one two three - one two three four who call one two three - one two three - one two three four who call one two three - one two three - one two three four who call one two three - one two three - one two three four.
Original model:
Finetuned model:
License
This model is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) license, which allows free use, modification, and redistribution for non-commercial purposes, provided attribution is given and derivatives are shared under the same license.
Model Information
Base Model: SWivid/F5-TTS
Total Training Duration: 130,000 steps
Training Configuration:
"exp_name": "F5TTS_Base",
"learning_rate": 1e-05,
"batch_size_per_gpu": 2000,
"batch_size_type": "frame",
"max_samples": 64,
"grad_accumulation_steps": 1,
"max_grad_norm": 1,
"epochs": 144,
"num_warmup_updates": 5838,
"save_per_updates": 11676,
"last_per_steps": 2918,
"finetune": true,
"file_checkpoint_train": "",
"tokenizer_type": "char",
"tokenizer_file": "",
"mixed_precision": "fp16",
"logger": "wandb",
"bnb_optimizer": true
Usage Instructions
Follow the installation and inference instructions in the base repository, SWivid/F5-TTS, using this checkpoint in place of the original weights.
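A hedged example of loading this checkpoint with the inference helper from the base repo is shown below. The checkpoint filename and the exact argument names are assumptions (check this repo's files and the base repo's README), and this fork's duration predictor may require its own entry point.

```python
# Hedged sketch: run inference with the base repo's helper class, pointed at this
# fine-tuned checkpoint. Treat names as a starting point, not the exact API.
from huggingface_hub import hf_hub_download
from f5_tts.api import F5TTS  # provided by the base repo (SWivid/F5-TTS)

# Checkpoint filename is assumed; check the "Files" tab of this repo for the real name.
ckpt_path = hf_hub_download("sinhprous/F5TTS-stabilized-LJSpeech", "model.safetensors")

tts = F5TTS(ckpt_file=ckpt_path)
wav, sr, _ = tts.infer(
    ref_file="reference.wav",                      # short reference clip to clone
    ref_text="Transcript of the reference clip.",  # its transcript
    gen_text="Data driven AI systems said the key to the data is key.",
    file_wave="output.wav",
)
```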
To do
- Multi-speaker model
Other links