---
language:
- id
license: mit
base_model: microsoft/speecht5_tts
tags:
- text-to-speech
datasets:
- mozilla-foundation/common_voice_16_1
model-index:
- name: speecht5_finetuned_commonvoice_id
results: []
---
# speecht5_finetuned_commonvoice_id
This model is a fine-tuned version of [microsoft/speecht5_tts](https://huggingface.co/microsoft/speecht5_tts) on the mozilla-foundation/common_voice_16_1 dataset.
It achieves the following results on the evaluation set:
- Loss: 0.4675
## How to use (inference)
Follow the example below and adapt it to your needs. The script imports `create_speaker_embedding` from a local `utils` module; a sketch of that helper is given after the code.
```python
# ft_t5_id_inference.py
import sounddevice as sd
import torch
import torchaudio
from datasets import Audio, load_dataset
from transformers import (
    SpeechT5ForTextToSpeech,
    SpeechT5HifiGan,
    SpeechT5Processor,
)

from utils import create_speaker_embedding  # local helper, sketched below

# load the dataset and the fine-tuned model
dataset = load_dataset(
    "mozilla-foundation/common_voice_16_1", "id", split="test")
model = SpeechT5ForTextToSpeech.from_pretrained(
    "Bagus/speecht5_finetuned_commonvoice_id")

# load the processor from the base checkpoint
checkpoint = "microsoft/speecht5_tts"
processor = SpeechT5Processor.from_pretrained(checkpoint)

# resample the audio to the rate the model expects (16 kHz)
sampling_rate = processor.feature_extractor.sampling_rate
dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))


def prepare_dataset(example):
    audio = example["audio"]
    example = processor(
        text=example["sentence"],
        audio_target=audio["array"],
        sampling_rate=audio["sampling_rate"],
        return_attention_mask=False,
    )
    # strip off the batch dimension
    example["labels"] = example["labels"][0]
    # use SpeechBrain to obtain an x-vector speaker embedding
    example["speaker_embeddings"] = create_speaker_embedding(audio["array"])
    return example


# take the speaker embedding from one example of the dataset
example = prepare_dataset(dataset[30])
speaker_embeddings = torch.tensor(example["speaker_embeddings"]).unsqueeze(0)

# the text to be converted to speech
text = "Saya suka baju yang berwarna merah tua."
inputs = processor(text=text, return_tensors="pt")

# generate the waveform with the HiFi-GAN vocoder
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
speech = model.generate_speech(
    inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

# play the audio
sd.play(speech.numpy(), samplerate=sampling_rate, blocking=True)

# save the audio; the signal needs to be a 2D tensor
torchaudio.save("output_t5_ft_cv16_id.wav", speech.unsqueeze(0), sampling_rate)
```
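Note that `mozilla-foundation/common_voice_16_1` is a gated dataset: you need to accept its terms on the Hub and authenticate (e.g. via `huggingface-cli login`) before `load_dataset` will work.

The `create_speaker_embedding` helper is not shown in the card, so below is a minimal sketch of what `utils.py` could look like, assuming the SpeechBrain x-vector encoder (`speechbrain/spkrec-xvect-voxceleb`) used in the Hugging Face audio course; the cache directory is also an assumption.

```python
# utils.py (sketch; the encoder choice and savedir are assumptions)
import os

import torch
from speechbrain.pretrained import EncoderClassifier

spk_model_name = "speechbrain/spkrec-xvect-voxceleb"
device = "cuda" if torch.cuda.is_available() else "cpu"
speaker_model = EncoderClassifier.from_hparams(
    source=spk_model_name,
    run_opts={"device": device},
    savedir=os.path.join("/tmp", spk_model_name),
)


def create_speaker_embedding(waveform):
    # encode the raw waveform into a 512-dim x-vector, then L2-normalize it
    with torch.no_grad():
        speaker_embeddings = speaker_model.encode_batch(torch.tensor(waveform))
        speaker_embeddings = torch.nn.functional.normalize(
            speaker_embeddings, dim=2)
        speaker_embeddings = speaker_embeddings.squeeze().cpu().numpy()
    return speaker_embeddings
```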
### Training hyperparameters
The following hyperparameters were used during training (a sketch mapping them onto `Seq2SeqTrainingArguments` follows the list):
- learning_rate: 1e-05
- train_batch_size: 4
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 4000
- mixed_precision_training: Native AMP
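For readers reproducing the run, these settings map onto transformers' `Seq2SeqTrainingArguments` roughly as follows. This is a hedged sketch, not the actual training script: `output_dir` and the evaluation/save cadence (every 1000 steps, inferred from the results table) are assumptions.

```python
from transformers import Seq2SeqTrainingArguments

# sketch only: output_dir and the eval/save cadence are assumptions
training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_finetuned_commonvoice_id",
    learning_rate=1e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,  # 4 * 8 = total train batch size 32
    warmup_steps=500,
    max_steps=4000,
    lr_scheduler_type="linear",
    seed=42,
    fp16=True,  # "Native AMP" mixed precision
    evaluation_strategy="steps",
    eval_steps=1000,
    save_steps=1000,
    label_names=["labels"],
)
```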
### Training results
| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 0.5394 | 4.28 | 1000 | 0.4908 |
| 0.5062 | 8.56 | 2000 | 0.4730 |
| 0.5074 | 12.83 | 3000 | 0.4700 |
| 0.5023 | 17.11 | 4000 | 0.4675 |
### Framework versions
- Transformers 4.35.2
- Pytorch 2.1.1+cu121
- Datasets 2.15.0
- Tokenizers 0.15.0