---
datasets:
- ylacombe/cml-tts
language:
- it
base_model:
- SWivid/F5-TTS
pipeline_tag: text-to-speech
license: cc-by-4.0
library_name: f5-tts
---

This is an Italian fine-tune of F5-TTS.

It was trained on Italian only, so it cannot speak English properly.

Trained on 73+ hours of the "train" split of the ylacombe/cml-tts dataset with 8x RTX 4090 GPUs. Training is still in progress. It uses the Gradio fine-tuning app with the following settings:

```
{
  "exp_name": "F5TTS_Base",
  "learning_rate": 0.00001,
  "batch_size_per_gpu": 10000,
  "batch_size_type": "frame",
  "max_samples": 64,
  "grad_accumulation_steps": 1,
  "max_grad_norm": 1,
  "epochs": 100,
  "num_warmup_updates": 2000,
  "save_per_updates": 600,
  "last_per_steps": 300,
  "finetune": true,
  "file_checkpoint_train": "",
  "tokenizer_type": "char",
  "tokenizer_file": "",
  "mixed_precision": "fp16",
  "logger": "wandb",
  "bnb_optimizer": false
}
```
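
To give a feel for the frame-based batch size, here is a rough sketch that converts `batch_size_per_gpu` into seconds of audio. It assumes 24 kHz audio and a mel hop length of 256 samples. Those two values are not stated in this card, so treat them as assumptions. The result is an upper bound, since batches are not always fully packed.

```python
# Assumptions (not stated in this card): 24 kHz audio, mel hop length of 256 samples.
SAMPLE_RATE_HZ = 24_000
HOP_LENGTH = 256

BATCH_SIZE_PER_GPU = 10_000   # frames, from the settings above
NUM_GPUS = 8                  # 8x RTX 4090
GRAD_ACCUMULATION_STEPS = 1   # from the settings above


def frames_to_seconds(frames: int) -> float:
    """Convert a number of mel frames to seconds of audio."""
    return frames * HOP_LENGTH / SAMPLE_RATE_HZ


per_gpu_s = frames_to_seconds(BATCH_SIZE_PER_GPU)
per_update_s = per_gpu_s * NUM_GPUS * GRAD_ACCUMULATION_STEPS

print(f"up to {per_gpu_s:.1f} s of audio per GPU per batch")
print(f"up to {per_update_s:.1f} s of audio per update")
```

Under these assumptions, each update sees at most about 14 minutes of audio across the 8 GPUs.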

# Pre-processing

The transcriptions extracted from the dataset were preprocessed before training.

From my understanding, punctuation matters because it teaches the model pauses and proper intonation, so it has been preserved.

The original Italian "text" field also contained direct-dialogue markers (long dashes), which have been preserved as well. However, it also contained hyphens used to split a word across a line break (I don't know which process was used on the original dataset to create the transcriptions). I removed those hyphens and merged the two parts of each word; otherwise the model would have been trained on artifacts that have no effect on the speech.

This applies only to the Italian data in cml-tts; I don't know whether other languages are affected too.
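
The hyphen clean-up described above could look roughly like the sketch below. It is not the exact script used for this model. It assumes a split word appears as `first- second` (an ASCII hyphen followed by whitespace), and `merge_split_words` is a hypothetical helper name. Long dialogue dashes are a different character, so they are left untouched.

```python
import re

# Assumption: a word split at a line break appears as "first- second",
# i.e. an ASCII hyphen followed by whitespace, between two word characters.
_SPLIT_WORD = re.compile(r"(?<=\w)-\s+(?=\w)")


def merge_split_words(text: str) -> str:
    """Join word halves separated by a line-break hyphen.

    Only the ASCII hyphen is matched, so long dialogue dashes (U+2014)
    and all other punctuation are preserved.
    """
    return _SPLIT_WORD.sub("", text)


# "\u2014" is the long dash that opens direct dialogue.
sample = "\u2014 Buon- giorno, disse; per- ch\u00e9 no?"
print(merge_split_words(sample))
```

A genuine compound written with a hyphen followed by a space would also be merged by this pattern, so it is worth spot-checking the output on a sample of the data.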

# Current most trained model

model_25200.safetensors (45 epochs)

### checkpoints folder

Contains the weights of the checkpoints saved at specific steps. The higher the number, the further that checkpoint is into training.

The weights in this folder can be used as a starting point to continue training.