SpeechT5 Text-to-Speech (TTS) Model for ONNX

Fine-tuned version of SpeechT5 TTS, exported to ONNX using the Optimum library.

Usage with txtai

txtai has a built-in Text to Speech (TTS) pipeline that makes using this model easy.

Note: the following example requires txtai >= 7.5.

import numpy as np
import soundfile as sf

from txtai.pipeline import TextToSpeech

# Build pipeline
tts = TextToSpeech("NeuML/txtai-speecht5-onnx")

# Generate speech
speech, rate = tts("Say something here")

# Write to file
sf.write("out.wav", speech, rate)

# Generate speech with custom speaker
speech, rate = tts("Say something here", speaker=np.array(...))
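For longer text, one option is to generate speech sentence by sentence and join the results before writing a single file. The helper below is a minimal sketch, not part of txtai: join_segments and its pause parameter are hypothetical names, and it assumes each call to the pipeline returns a 1-D NumPy array at the same sample rate.

```python
import numpy as np

def join_segments(segments, rate, pause=0.25):
    """
    Concatenates 1-D audio arrays, inserting silence between them.

    Args:
        segments: list of 1-D NumPy arrays (assumed to share one sample rate)
        rate: sample rate in Hz
        pause: seconds of silence between segments (hypothetical parameter)

    Returns:
        single 1-D NumPy array
    """

    if not segments:
        return np.zeros(0, dtype=np.float32)

    # Silence block with the same dtype as the audio
    silence = np.zeros(int(rate * pause), dtype=segments[0].dtype)

    parts = []
    for i, segment in enumerate(segments):
        if i > 0:
            parts.append(silence)
        parts.append(segment)

    return np.concatenate(parts)

# Possible usage with the pipeline above (not run here):
#   results = [tts(sentence) for sentence in sentences]
#   speech = join_segments([s for s, _ in results], results[0][1])
#   sf.write("out.wav", speech, results[0][1])
```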

Model training

This model was fine-tuned using the code in this Hugging Face article and a custom set of WAV files.

The ONNX export uses the following code, which requires installing optimum.

import os

from optimum.exporters.onnx import main_export
from optimum.onnx import merge_decoders

# Params
model = "txtai-speecht5-tts"
output = "txtai-speecht5-onnx"

# ONNX Export
main_export(
    task="text-to-audio",
    model_name_or_path=model,
    model_kwargs={
        "vocoder": "microsoft/speecht5_hifigan"
    },
    output=output
)

# Merge into single decoder model
merge_decoders(
    f"{output}/decoder_model.onnx",
    f"{output}/decoder_with_past_model.onnx",
    save_path=f"{output}/decoder_model_merged.onnx",
    strict=False
)

# Remove unnecessary files
os.remove(f"{output}/decoder_model.onnx")
os.remove(f"{output}/decoder_with_past_model.onnx")
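After the export and cleanup, it can help to confirm that the output directory holds the files expected at load time. The sketch below uses only the standard library. The file names in EXPECTED are assumptions: decoder_model_merged.onnx comes from the save_path above, while the encoder and postnet/vocoder names are what Optimum is assumed to write for this task. missing_files is a hypothetical helper.

```python
from pathlib import Path

# Assumed file names, adjust if the export directory differs
EXPECTED = (
    "encoder_model.onnx",
    "decoder_model_merged.onnx",
    "decoder_postnet_and_vocoder.onnx",
)

def missing_files(output, expected=EXPECTED):
    """
    Lists expected files that are not present in the export directory.

    Args:
        output: export directory path
        expected: file names to look for

    Returns:
        list of missing file names, empty when all are present
    """

    directory = Path(output)
    return [name for name in expected if not (directory / name).is_file()]

# Possible usage after the export above (not run here):
#   missing = missing_files("txtai-speecht5-onnx")
#   if missing:
#       raise FileNotFoundError(f"Missing ONNX files: {missing}")
```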

Custom speaker embeddings

When no speaker argument is passed in, the default speaker embeddings are used. The default speaker is David Mezzetti, the primary developer of txtai.

It's possible to build custom speaker embeddings as shown below. Fine-tuning the model with a new voice gives the best results, but zero-shot speaker embeddings can be acceptable in some cases.

The following code requires installing torchaudio and speechbrain.

import os

import numpy as np
import torchaudio

from speechbrain.inference import EncoderClassifier

def speaker(path):
    """
    Extracts a speaker embedding from an audio file.

    Args:
        path: file path

    Returns:
        speaker embeddings
    """

    model = "speechbrain/spkrec-xvect-voxceleb"
    encoder = EncoderClassifier.from_hparams(model,
                                             savedir=os.path.join("/tmp", model),
                                             run_opts={"device": "cuda"})

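    # Load the audio and normalize the first channel for the encoder
    samples, sr = torchaudio.load(path)
    samples = encoder.audio_normalizer(samples[0], sr)

    # Compute the embedding for a batch of one
    embedding = encoder.encode_batch(samples.unsqueeze(0))

    # Return with a leading batch dimension
    return embedding[0, 0].to("cuda").unsqueeze(0)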

embedding = speaker("reference.wav")
np.save("speaker.npy", embedding.cpu().numpy(), allow_pickle=False)

Then load as shown below.

speech, rate = tts("Say something here", speaker=np.load("speaker.npy"))
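When loading embeddings saved by other tools, a quick shape check can catch mismatches before they reach the model. The sketch below is a hypothetical helper, not part of txtai. It assumes the model expects a float32 array of shape (1, 512), which matches what the speaker function above saves if the x-vector encoder produces 512-dimensional embeddings.

```python
import numpy as np

def load_speaker(path, size=512):
    """
    Loads a saved speaker embedding and checks its shape.

    Args:
        path: path to a .npy file
        size: expected embedding size (512 is an assumption)

    Returns:
        float32 NumPy array with shape (1, size)
    """

    embedding = np.load(path, allow_pickle=False)

    # Accept a flat vector and add the leading batch dimension
    if embedding.ndim == 1:
        embedding = embedding[np.newaxis, :]

    if embedding.shape != (1, size):
        raise ValueError(f"Expected shape (1, {size}), got {embedding.shape}")

    return embedding.astype(np.float32)

# Possible usage with the pipeline above (not run here):
#   speech, rate = tts("Say something here", speaker=load_speaker("speaker.npy"))
```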

Speaker embeddings from the original SpeechT5 TTS training set are supported. See the README for more.
