Overview
This F5-TTS model is fine-tuned on the LJSpeech dataset with an emphasis on stability, so that it avoids choppiness, mispronunciations, repetitions, and skipped words.
Differences from the original model: the text input is converted to phonemes rather than used as raw text; phoneme alignment is used during training, while a duration predictor is used during inference (see the illustration after the links below).
Source code for phoneme alignment: https://github.com/sinhprous/F5-TTS/blob/main/src/f5_tts/train/datasets/utils_alignment.py
Source code for duration predictor: https://github.com/sinhprous/F5-TTS/blob/main/src/f5_tts/model/duration_predictor.py
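To make the phoneme front end concrete, here is a minimal sketch. It uses the third-party g2p_en package purely as a stand-in grapheme-to-phoneme converter; the phoneme set and forced-alignment code actually used for this model are in the files linked above.

```python
# Illustration only: convert input text to phonemes instead of feeding raw characters.
# g2p_en is a stand-in here; the phoneme conversion and alignment used for training
# are defined in utils_alignment.py (linked above).
from g2p_en import G2p

g2p = G2p()
text = "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo."
phonemes = g2p(text)  # ARPAbet tokens, e.g. ['B', 'AH1', 'F', 'AH0', 'L', 'OW2', ...]
print(phonemes)

# Training: phonemes are force-aligned to audio frames, giving the model explicit durations.
# Inference: a duration predictor estimates how many frames each phoneme should occupy.
```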
Colab demo: colab
Audio samples
Outputs from the original model were generated using https://huggingface.co/spaces/mrfakename/E2-F5-TTS. The original model usually skips words in these hard texts.
Data - driven AI systems said, "Key data is the key, data is key, data is key, data is the key, and the key to the data is key, the data key is the key to the data that is key to the key". Can you keep up?
Original model:
Finetuned model:
Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.
Original model:
Finetuned model:
Call one two three - one two three - one two three four who call one two three - one two three - one two three four who call one two three - one two three - one two three four who call one two three - one two three - one two three four.
Original model:
Finetuned model:
License
This model is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) license, which allows free use, modification, and redistribution for non-commercial purposes, provided attribution is given and derivatives are shared under the same license.
Model Information
Base Model: SWivid/F5-TTS
Total Training Duration: 130,000 steps
Training Configuration:
"exp_name": "F5TTS_Base",
"learning_rate": 1e-05,
"batch_size_per_gpu": 2000,
"batch_size_type": "frame",
"max_samples": 64,
"grad_accumulation_steps": 1,
"max_grad_norm": 1,
"epochs": 144,
"num_warmup_updates": 5838,
"save_per_updates": 11676,
"last_per_steps": 2918,
"finetune": true,
"file_checkpoint_train": "",
"tokenizer_type": "char",
"tokenizer_file": "",
"mixed_precision": "fp16",
"logger": "wandb",
"bnb_optimizer": true
Usage Instructions
Follow the installation and inference instructions in the base repository, SWivid/F5-TTS, using this checkpoint in place of the original weights.
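A hedged example of loading this checkpoint with the inference helper from the base repo is shown below. The checkpoint filename and the exact argument names are assumptions (check this repo's files and the base repo's README), and this fork's duration predictor may require its own entry point.

```python
# Hedged sketch: run inference with the base repo's helper class, pointed at this
# fine-tuned checkpoint. Treat names as a starting point, not the exact API.
from huggingface_hub import hf_hub_download
from f5_tts.api import F5TTS  # provided by the base repo (SWivid/F5-TTS)

# Checkpoint filename is assumed; check the "Files" tab of this repo for the real name.
ckpt_path = hf_hub_download("sinhprous/F5TTS-stabilized-LJSpeech", "model.safetensors")

tts = F5TTS(ckpt_file=ckpt_path)
wav, sr, _ = tts.infer(
    ref_file="reference.wav",                      # short reference clip to clone
    ref_text="Transcript of the reference clip.",  # its transcript
    gen_text="Data driven AI systems said the key to the data is key.",
    file_wave="output.wav",
)
```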
To do
- Multi-speaker model
Other links