---
datasets:
- ylacombe/cml-tts
language:
- it
base_model:
- SWivid/F5-TTS
pipeline_tag: text-to-speech
license: cc-by-4.0
library_name: f5-tts
---

This is an Italian fine-tune of F5-TTS.

It was trained on Italian only, so it cannot speak English properly.

It was trained on 73+ hours of the "train" split of the ylacombe/cml-tts dataset with 8x RTX 4090 GPUs, using the Gradio fine-tuning app. Training is still in progress. The following settings were used:

```
exp_name="F5TTS_Base"
learning_rate=0.00001
batch_size_per_gpu=10000
batch_size_type="frame"
max_samples=64
grad_accumulation_steps=1
max_grad_norm=1
epochs=300
num_warmup_updates=2000
save_per_updates=600
last_per_steps=300
finetune=true
file_checkpoint_train=""
tokenizer_type="char"
tokenizer_file=""
mixed_precision="fp16"
logger="wandb"
bnb_optimizer=false
```
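
Because `batch_size_type="frame"`, the batch size is counted in mel frames rather than in utterances. The sketch below gives a rough idea of how much audio that budget corresponds to. The 24 kHz sample rate and the hop length of 256 are assumptions that are not stated in this card, so treat the result as an estimate. It is also an upper bound, since a batch is capped at `max_samples=64` utterances and is not always full.

```python
# Settings copied from the list above.
batch_size_per_gpu = 10_000      # frames, because batch_size_type="frame"
grad_accumulation_steps = 1
num_gpus = 8                     # 8x RTX 4090

# Assumptions (not stated in this card): audio sample rate and mel hop length.
sample_rate_hz = 24_000
hop_length = 256

frames_per_second = sample_rate_hz / hop_length
frames_per_update = batch_size_per_gpu * num_gpus * grad_accumulation_steps

seconds_per_gpu_batch = batch_size_per_gpu / frames_per_second
minutes_per_update = frames_per_update / frames_per_second / 60

print(f"frames per second of audio: {frames_per_second:.2f}")
print(f"audio per GPU batch (upper bound): {seconds_per_gpu_batch:.1f} s")
print(f"audio per optimizer update (upper bound): {minutes_per_update:.1f} min")
```

Under these assumptions, one GPU batch holds at most a little under two minutes of audio.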

# Preprocessing

The transcriptions extracted from the dataset have been preprocessed.

As far as I understand, punctuation matters because the model learns pauses and proper intonation from it, so it has been preserved.

The original Italian "text" field also contained dialogue dashes (long dashes that mark direct speech), which have been preserved as well. However, it also contained hyphens that split a word across a line break (I don't know which process was used to create the transcriptions in the original dataset). I removed those hyphens and merged the two parts of each word; otherwise the model would have been trained on text artifacts that have no counterpart in the speech.

This applies only to the Italian data in cml-tts; I don't know whether other languages are affected too.
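
The hyphen cleanup described above can be sketched as follows. This is not the exact script used for this model. It assumes that a line-break hyphen appears in the transcription as a letter, an ASCII hyphen, some whitespace, and then another letter (for example `pa- rola`); the function name `merge_split_words` is hypothetical. Long dialogue dashes are a different character, so they are left untouched.

```python
import re

# Assumption: a word split across a line break looks like "pa- rola",
# i.e. letter + ASCII hyphen + whitespace + letter.
# [^\W\d_] matches any Unicode letter (including accented Italian vowels).
_LINE_BREAK_HYPHEN = re.compile(r"(?<=[^\W\d_])-\s+(?=[^\W\d_])")


def merge_split_words(text: str) -> str:
    """Remove line-break hyphens and join the two halves of the word.

    Long dialogue dashes and all other punctuation are not modified.
    """
    return _LINE_BREAK_HYPHEN.sub("", text)


if __name__ == "__main__":
    sample = "\u2014 Buongiorno, disse la si- gnora, come sta?"
    print(merge_split_words(sample))
```

A pattern like this would also join a legitimate hyphen that happens to be followed by a space, so it is worth checking a sample of the changed lines by hand.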

# Current most trained model

model_159600.safetensors (~290 epochs)
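
As a quick sanity check on these numbers, the arithmetic below relates the checkpoint name to the training settings. It assumes that the number in the file name is the count of optimizer updates, which this card does not state explicitly.

```python
# Values taken from this card.
checkpoint_updates = 159_600   # from the file name model_159600.safetensors
approx_epochs = 290            # "~290" as stated above
save_per_updates = 600         # from the training settings

# Assumption: the number in the file name counts optimizer updates.
updates_per_epoch = checkpoint_updates / approx_epochs
saves_so_far = checkpoint_updates // save_per_updates

print(f"approx. updates per epoch: {updates_per_epoch:.0f}")
print(f"checkpoints saved up to this one: {saves_so_far}")
```

This works out to roughly 550 updates per epoch, and 159600 is an exact multiple of `save_per_updates`, which is consistent with the assumption.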

## Known problems

- Catastrophic forgetting: being fine-tuned on Italian only, the model lost its English ability. A proper multilingual dataset should be used instead of a single-language one.
- Pronunciation is not perfect.
- Numbers must be written out in words to be pronounced in Italian.
- A better dataset with more diverse voices would help improve zero-shot cloning.
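
To work around the number problem listed above, digits can be spelled out before the text is passed to the model. The sketch below is a minimal, self-contained illustration that only handles whole numbers from 0 to 999 and leaves larger numbers unchanged; the function names `italian_number` and `spell_out_numbers` are hypothetical and not part of F5-TTS.

```python
import re

_UNITS = [
    "zero", "uno", "due", "tre", "quattro", "cinque", "sei", "sette", "otto",
    "nove", "dieci", "undici", "dodici", "tredici", "quattordici", "quindici",
    "sedici", "diciassette", "diciotto", "diciannove",
]
_TENS = {
    2: "venti", 3: "trenta", 4: "quaranta", 5: "cinquanta",
    6: "sessanta", 7: "settanta", 8: "ottanta", 9: "novanta",
}


def _below_100(n: int, compound: bool = False) -> str:
    """Spell out 0..99; `compound` is True when a hundreds prefix precedes."""
    if n < 20:
        # A final "tre" takes an accent inside a compound (centotré).
        return "tré" if (n == 3 and compound) else _UNITS[n]
    tens, unit = divmod(n, 10)
    word = _TENS[tens]
    if unit == 0:
        return word
    if unit in (1, 8):
        word = word[:-1]  # drop the final vowel: ventuno, ventotto
    return word + ("tré" if unit == 3 else _UNITS[unit])


def italian_number(n: int) -> str:
    """Spell out an integer between 0 and 999 in Italian."""
    if not 0 <= n <= 999:
        raise ValueError("only 0..999 is supported in this sketch")
    if n < 100:
        return _below_100(n)
    hundreds, rest = divmod(n, 100)
    prefix = "cento" if hundreds == 1 else _UNITS[hundreds] + "cento"
    if rest == 0:
        return prefix
    if 80 <= rest <= 89:
        prefix = prefix[:-1]  # centottanta rather than centoottanta
    return prefix + _below_100(rest, compound=True)


def spell_out_numbers(text: str) -> str:
    """Replace each run of digits up to 999 with Italian words."""
    def replace(match: re.Match) -> str:
        value = int(match.group())
        return italian_number(value) if value <= 999 else match.group()
    return re.sub(r"\d+", replace, text)


if __name__ == "__main__":
    print(spell_out_numbers("Ho 21 anni e 3 gatti"))
```

Decimals, dates, ordinals and numbers above 999 are not handled here; a dedicated number-to-words library is likely a better fit for real input.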

### Checkpoints folder

Contains the weights of the checkpoints saved at specific steps; the higher the number, the further into training the checkpoint is.

The weights in this folder can be used as a starting point to continue training.
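
Since the step number is what orders the checkpoints, a small helper can pick the most trained one from a local copy of the folder. This is a sketch that assumes the files are named `model_<step>.safetensors`, like the checkpoint mentioned above; the helper names and the local path are hypothetical.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: checkpoint files are named like model_159600.safetensors.
_CHECKPOINT_NAME = re.compile(r"^model_(\d+)\.safetensors$")


def checkpoint_step(path: Path) -> Optional[int]:
    """Return the step encoded in the file name, or None if it does not match."""
    match = _CHECKPOINT_NAME.match(path.name)
    return int(match.group(1)) if match else None


def latest_checkpoint(folder: Path) -> Optional[Path]:
    """Return the checkpoint with the highest step number in `folder`."""
    candidates = []
    for path in folder.iterdir():
        step = checkpoint_step(path)
        if step is not None:
            candidates.append((step, path))
    if not candidates:
        return None
    # Compare steps as integers: a plain string sort would rank
    # model_600 after model_159600.
    return max(candidates, key=lambda item: item[0])[1]


if __name__ == "__main__":
    folder = Path("checkpoints")  # hypothetical local copy of the folder
    if folder.is_dir():
        print(latest_checkpoint(folder))
```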

Let me know if you manage to fine-tune it further and reach a better result.