SpeechT5 Text-to-Speech (TTS) Model for ONNX
Fine-tuned version of SpeechT5 TTS, exported to ONNX using the Optimum library.
Usage with txtai
txtai has a built-in Text to Speech (TTS) pipeline that makes using this model easy.
Note: the following example requires txtai >= 7.5.
import numpy as np
import soundfile as sf
from txtai.pipeline import TextToSpeech
# Build pipeline
tts = TextToSpeech("NeuML/txtai-speecht5-onnx")
# Generate speech
speech, rate = tts("Say something here")
# Write to file
sf.write("out.wav", speech, rate)
# Generate speech with custom speaker
speech, rate = tts("Say something here", speaker=np.array(...))
Model training
This model was fine-tuned using the code in this Hugging Face article and a custom set of WAV files.
The ONNX export uses the following code, which requires installing optimum.
import os
from optimum.exporters.onnx import main_export
from optimum.onnx import merge_decoders
# Params
model = "txtai-speecht5-tts"
output = "txtai-speecht5-onnx"
# ONNX Export
main_export(
    task="text-to-audio",
    model_name_or_path=model,
    model_kwargs={
        "vocoder": "microsoft/speecht5_hifigan"
    },
    output=output
)
# Merge into single decoder model
merge_decoders(
    f"{output}/decoder_model.onnx",
    f"{output}/decoder_with_past_model.onnx",
    save_path=f"{output}/decoder_model_merged.onnx",
    strict=False
)
# Remove unnecessary files
os.remove(f"{output}/decoder_model.onnx")
os.remove(f"{output}/decoder_with_past_model.onnx")
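The cleanup step above raises FileNotFoundError if it is run a second time, and it deletes the split decoders whether or not the merge produced a file. A minimal sketch of a more defensive cleanup follows, assuming the same file names as the export code. The helper name remove_split_decoders is hypothetical and not part of Optimum.

```python
from pathlib import Path


def remove_split_decoders(output):
    """
    Deletes the split decoder files once the merged decoder exists.

    Hypothetical helper, not part of Optimum. The file names are the
    ones used by the export code above.

    Args:
        output: export directory

    Returns:
        list of file names that were removed
    """
    output = Path(output)
    merged = output / "decoder_model_merged.onnx"

    # Keep the originals if the merge did not produce a file
    if not merged.is_file():
        raise FileNotFoundError(f"Merged decoder not found: {merged}")

    removed = []
    for name in ("decoder_model.onnx", "decoder_with_past_model.onnx"):
        path = output / name

        # Skip files that are already gone so the call can be repeated
        if path.is_file():
            path.unlink()
            removed.append(name)

    return removed
```

Calling remove_split_decoders("txtai-speecht5-onnx") would then stand in for the two os.remove calls.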
Custom speaker embeddings
When no speaker argument is passed in, the default speaker embeddings are used. The default speaker is David Mezzetti, the primary developer of txtai.
It's possible to build custom speaker embeddings as shown below. Fine-tuning the model with a new voice leads to the best results but zero-shot speaker embeddings are OK in some cases.
The following code requires installing torchaudio and speechbrain.
import os
import numpy as np
import torchaudio
from speechbrain.inference import EncoderClassifier
def speaker(path):
    """
    Extracts a speaker embedding from an audio file.
    Args:
        path: file path
    Returns:
        speaker embeddings
    """
    # Load the x-vector speaker encoder
    model = "speechbrain/spkrec-xvect-voxceleb"
    encoder = EncoderClassifier.from_hparams(model,
                                             savedir=os.path.join("/tmp", model),
                                             run_opts={"device": "cuda"})
    # Load audio, normalize the first channel and encode it
    samples, sr = torchaudio.load(path)
    samples = encoder.audio_normalizer(samples[0], sr)
    embedding = encoder.encode_batch(samples.unsqueeze(0))
    # Return a single embedding with a leading batch dimension
    return embedding[0, 0].to("cuda").unsqueeze(0)
# Build the embedding from a reference recording and save it
embedding = speaker("reference.wav")
np.save("speaker.npy", embedding.cpu().numpy(), allow_pickle=False)
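The code above hard-codes "cuda" as the device, which fails on machines without a GPU. One possible way to fall back to the CPU is sketched below. The pick_device helper is hypothetical and only relies on torch.cuda.is_available().

```python
def pick_device():
    """
    Returns "cuda" when a CUDA device is available, otherwise "cpu".

    Hypothetical helper for illustration. Falls back to "cpu" when
    torch is not installed.
    """
    try:
        import torch
    except ImportError:
        return "cpu"

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
```

The result could then replace both "cuda" literals, for example run_opts={"device": device} and .to(device).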
Then load the saved embeddings and pass them to the pipeline as shown below.
speech, rate = tts("Say something here", speaker=np.load("speaker.npy"))
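A malformed embedding file tends to surface as an opaque error at inference time, so it can help to check the array before passing it in. The sketch below assumes 512-dimensional x-vector embeddings stored as float32 with shape (1, 512), which is what the extraction code above produces. The check_speaker helper is hypothetical and not part of txtai.

```python
import numpy as np


def check_speaker(embedding, size=512):
    """
    Validates a speaker embedding and returns it as float32 with shape (1, size).

    Hypothetical helper, not part of txtai. The default size assumes
    x-vector embeddings like the ones built above.
    """
    embedding = np.asarray(embedding, dtype=np.float32)

    # Accept a flat vector and add the leading batch dimension
    if embedding.ndim == 1:
        embedding = embedding[np.newaxis, :]

    if embedding.shape != (1, size):
        raise ValueError(f"Expected shape (1, {size}), got {embedding.shape}")

    # NaN or infinite values would corrupt the generated audio
    if not np.isfinite(embedding).all():
        raise ValueError("Speaker embedding contains NaN or infinite values")

    return embedding
```

For example: speech, rate = tts("Say something here", speaker=check_speaker(np.load("speaker.npy")))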
Speaker embeddings from the original SpeechT5 TTS training set are supported. See the README for more.