---
language:
- id
license: mit
base_model: microsoft/speecht5_tts
tags:
- text-to-speech
datasets:
- mozilla-foundation/common_voice_16_1
model-index:
- name: speecht5_finetuned_commonvoice_id
  results: []
---


# speecht5_finetuned_commonvoice_id

This model is a fine-tuned version of [microsoft/speecht5_tts](https://huggingface.co/microsoft/speecht5_tts) on the mozilla-foundation/common_voice_16_1 dataset.
It achieves the following results on the evaluation set:
- Loss: 0.4675

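## How to use (inference)
Follow the example below and adapt it to your own needs. The script synthesizes an Indonesian sentence, using a speaker embedding computed from one sample of the Common Voice test split.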

```python
# ft_t5_id_inference.py


import sounddevice as sd
import torch
import torchaudio
from datasets import Audio, load_dataset
from transformers import (
    SpeechT5ForTextToSpeech,
    SpeechT5HifiGan,
    SpeechT5Processor,
)
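# NOTE: `utils` is assumed to be a local module (not included in this card) that
# provides create_speaker_embedding(), returning an x-vector for a waveform
from utils import create_speaker_embedding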

# load dataset and pre-trained model
dataset = load_dataset(
    "mozilla-foundation/common_voice_16_1", "id", split="test")
model = SpeechT5ForTextToSpeech.from_pretrained(
    "Bagus/speecht5_finetuned_commonvoice_id")


# load the processor (tokenizer and feature extractor) from the base checkpoint

checkpoint = "microsoft/speecht5_tts"
processor = SpeechT5Processor.from_pretrained(checkpoint)

sampling_rate = processor.feature_extractor.sampling_rate
dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))


def prepare_dataset(example):
    audio = example["audio"]

    example = processor(
        text=example["sentence"],
        audio_target=audio["array"],
        sampling_rate=audio["sampling_rate"],
        return_attention_mask=False,
    )

    # strip off the batch dimension
    example["labels"] = example["labels"][0]

    # use SpeechBrain to obtain x-vector
    example["speaker_embeddings"] = create_speaker_embedding(audio["array"])

    return example


# prepare the speaker embeddings from the dataset and text
example = prepare_dataset(dataset[30])
speaker_embeddings = torch.tensor(example["speaker_embeddings"]).unsqueeze(0)

# prepare text to be converted to speech
text = "Saya suka baju yang berwarna merah tua."

inputs = processor(text=text, return_tensors="pt")


vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
speech = model.generate_speech(
    inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

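# play the audio at the sampling rate in the processor's feature extractor config;
# the tensor is converted to a NumPy array for sounddevice
sd.play(speech.cpu().numpy(), samplerate=sampling_rate, blocking=True)

# save the audio; torchaudio expects a 2D tensor of shape (channels, samples)
torchaudio.save("output_t5_ft_cv16_id.wav", speech.unsqueeze(0).cpu(), sampling_rate)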

```

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 4
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 4000
- mixed_precision_training: Native AMP
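
Two of the values above follow from the others, and a short sketch may make the relationship easier to check. The code below is not taken from the training script: the function names are illustrative, a single training device is assumed (which is consistent with 4 × 8 = 32), and the schedule is the usual linear warmup followed by linear decay to zero, which is what a `linear` scheduler with warmup steps is generally understood to mean.

```python
# Illustrative sketch only; names are hypothetical, not from the training code.


def effective_batch_size(per_device_batch: int, accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Number of samples that contribute to one optimizer step."""
    return per_device_batch * accumulation_steps * num_devices


def linear_schedule_lr(step: int, base_lr: float = 1e-5,
                       warmup_steps: int = 500, total_steps: int = 4000) -> float:
    """Linear warmup from 0 to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    # train_batch_size=4 and gradient_accumulation_steps=8 from the list above
    print("effective batch size:", effective_batch_size(4, 8))
    for step in (0, 250, 500, 2250, 4000):
        print(step, linear_schedule_lr(step))
```

Under these assumptions the learning rate peaks at the configured 1e-05 at step 500 and reaches zero at step 4000, the last training step.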

### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 0.5394        | 4.28  | 1000 | 0.4908          |
| 0.5062        | 8.56  | 2000 | 0.4730          |
| 0.5074        | 12.83 | 3000 | 0.4700          |
| 0.5023        | 17.11 | 4000 | 0.4675          |
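
The table can also be read for two derived quantities: roughly how many optimizer steps make up one epoch, and how much the validation loss changes between checkpoints. The sketch below only does arithmetic on the rows above. The helper names are illustrative, and the per-epoch sample estimate assumes each optimizer step consumes 32 examples, so it is an approximation rather than a reported dataset size.

```python
# Rows copied from the training results table: (epoch, step, validation_loss).
ROWS = [
    (4.28, 1000, 0.4908),
    (8.56, 2000, 0.4730),
    (12.83, 3000, 0.4700),
    (17.11, 4000, 0.4675),
]

TOTAL_TRAIN_BATCH_SIZE = 32  # from the hyperparameter list


def steps_per_epoch(epoch: float, step: int) -> float:
    """Estimate optimizer steps per epoch from one logged (epoch, step) pair."""
    return step / epoch


def loss_deltas(losses: list) -> list:
    """Change in validation loss between consecutive checkpoints."""
    return [later - earlier for earlier, later in zip(losses, losses[1:])]


estimates = [steps_per_epoch(epoch, step) for epoch, step, _ in ROWS]
deltas = loss_deltas([loss for _, _, loss in ROWS])

if __name__ == "__main__":
    mean_steps = sum(estimates) / len(estimates)
    print("estimated steps per epoch:", round(mean_steps, 1))
    print("approx. samples per epoch:", round(mean_steps * TOTAL_TRAIN_BATCH_SIZE))
    print("validation loss deltas:", [round(d, 4) for d in deltas])
```

Each delta is negative and smaller in magnitude than the one before, so the validation loss is still decreasing at step 4000 but by progressively smaller amounts.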


### Framework versions

- Transformers 4.35.2
- Pytorch 2.1.1+cu121
- Datasets 2.15.0
- Tokenizers 0.15.0