---
license: cc-by-nc-4.0
datasets:
- amphion/Emilia-Dataset
- mozilla-foundation/common_voice_12_0
language:
- el
- en
base_model:
- SWivid/F5-TTS
pipeline_tag: text-to-speech
---

# F5-TTS-Greek

## F5-TTS model finetuned to speak Greek

(This work is under development and currently in beta.)

Finetuned on Greek speech datasets and a small part of the Emilia-EN dataset to prevent catastrophic forgetting of English.

The model can generate speech for Greek text with a Greek reference audio, for English text with an English reference audio, and for a mix of Greek and English (quality for mixed input still needs improvement, and several runs may be needed to get good results).

#### NOTE: For Greek text, the model currently skips uppercase characters, so use lowercase characters only! (A preprocessing sketch is shown below.)
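
A minimal preprocessing sketch in plain Python (the function name is my own, not part of this repo); `str.lower()` handles Greek capitals, including accented letters:

```python
# Minimal preprocessing sketch: the model skips uppercase Greek
# characters, so lowercase the text before inference.
def prepare_greek_text(text: str) -> str:
    # str.lower() handles Greek capitals, including accented
    # letters (e.g. 'Ά' -> 'ά').
    return text.lower()

print(prepare_greek_text("Καλημέρα, ΕΛΛΆΔΑ!"))  # -> "καλημέρα, ελλάδα!"
```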

#### NOTE 2: Because the training data contained short reference audios, the best reference length is around 6-9 seconds instead of the 15 seconds of the original model.
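
If your reference clip is longer, one simple way to trim it is with `soundfile` (a hedged sketch; the library choice, file names, and 9-second cutoff are my own assumptions, not part of this repo):

```python
import soundfile as sf

# Trim a reference clip to at most 9 seconds, matching the
# 6-9 second range that works best with this model.
MAX_SECONDS = 9

audio, sr = sf.read("reference.wav")   # audio: numpy array, sr: sample rate
audio = audio[: MAX_SECONDS * sr]      # keep only the first 9 seconds
sf.write("reference_trimmed.wav", audio, sr)
```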

## Datasets used:

- Common Voice 12.0 (All Greek Splits) (https://huggingface.co/datasets/mozilla-foundation/common_voice_12_0)
- Greek Single Speaker Speech Dataset (https://www.kaggle.com/datasets/bryanpark/greek-single-speaker-speech-dataset)
- Small part of Emilia Dataset (https://huggingface.co/datasets/amphion/Emilia-Dataset) (EN-B000049.tar)

## Training

Training was done on a single RTX 3090.

After some manual evaluation, these two checkpoints produced the best results:
- 225K steps ([model_225000.safetensors](https://huggingface.co/PetrosStav/F5-TTS-Greek/resolve/main/model_225000.safetensors?download=true))
- 325K steps ([model_325000.safetensors](https://huggingface.co/PetrosStav/F5-TTS-Greek/resolve/main/model_325000.safetensors?download=true))

## How to use

As of the [dcd9a19 commit](https://github.com/SWivid/F5-TTS/commit/dcd9a19889147481d0a6f4b34505cdf75a1f3b90) of the main GitHub project, you can use custom models directly in the `infer_gradio` page:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/62d01102fb896639b296b9d3/wd5GEqB86Ny7ZM_L830Af.png)

You can either download the models and use their local paths, or use the Hugging Face paths of this repo directly (a download sketch follows the list):
- hf://PetrosStav/F5-TTS-Greek/model_325000.safetensors
- hf://PetrosStav/F5-TTS-Greek/vocab.txt
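
For example, you can fetch the checkpoint and vocabulary locally with `huggingface_hub` (a sketch; `hf_hub_download` caches the files and returns their local paths, which you can then use in the `infer_gradio` custom-model fields):

```python
from huggingface_hub import hf_hub_download

# Download the 325K-step checkpoint and the vocabulary file;
# the returned local paths can be used in place of the hf:// paths.
ckpt_path = hf_hub_download(
    repo_id="PetrosStav/F5-TTS-Greek",
    filename="model_325000.safetensors",
)
vocab_path = hf_hub_download(
    repo_id="PetrosStav/F5-TTS-Greek",
    filename="vocab.txt",
)
print(ckpt_path, vocab_path)
```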

### Training Arguments

- Learning Rate: 0.00001
- Batch Size per GPU: 3200
- Max Samples: 64
- Gradient Accumulation Steps: 1
- Max Gradient Norm: 1
- Epochs: 277
- Warmup Updates: 1274
- Save per Updates: 25000
- Last per Steps: 1000
- Mixed Precision: fp16


## Links:

- GitHub: https://github.com/SWivid/F5-TTS
- Paper: [F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching](https://arxiv.org/abs/2410.06885)