---
tags:
- audio
- text-to-speech
- onnx
inference: false
language: en
license: apache-2.0
library_name: txtai
---

# SpeechT5 Text-to-Speech (TTS) Model for ONNX

Fine-tuned version of [SpeechT5 TTS](https://huggingface.co/microsoft/speecht5_tts) exported to ONNX. This model was exported to ONNX using the [Optimum](https://github.com/huggingface/optimum) library.

## Usage with txtai

[txtai](https://github.com/neuml/txtai) has a built-in Text to Speech (TTS) pipeline that makes using this model easy.

_Note: the following example requires txtai >= 7.5_

```python
import numpy as np
import soundfile as sf

from txtai.pipeline import TextToSpeech

# Build pipeline
tts = TextToSpeech("NeuML/txtai-speecht5-onnx")

# Generate speech
speech, rate = tts("Say something here")

# Write to file
sf.write("out.wav", speech, rate)

# Generate speech with custom speaker embeddings
speech, rate = tts("Say something here", speaker=np.array(...))
```

## Model training

This model was fine-tuned using the code in this [Hugging Face article](https://huggingface.co/learn/audio-course/en/chapter6/fine-tuning) and a custom set of WAV files.

The ONNX export uses the following code, which requires installing `optimum`.

```python
import os

from optimum.exporters.onnx import main_export
from optimum.onnx import merge_decoders

# Params
model = "txtai-speecht5-tts"
output = "txtai-speecht5-onnx"

# ONNX export
main_export(
    task="text-to-audio",
    model_name_or_path=model,
    model_kwargs={"vocoder": "microsoft/speecht5_hifigan"},
    output=output
)

# Merge into a single decoder model
merge_decoders(
    f"{output}/decoder_model.onnx",
    f"{output}/decoder_with_past_model.onnx",
    save_path=f"{output}/decoder_model_merged.onnx",
    strict=False
)

# Remove unnecessary files
os.remove(f"{output}/decoder_model.onnx")
os.remove(f"{output}/decoder_with_past_model.onnx")
```

## Custom speaker embeddings

When no speaker argument is passed in, the default speaker embeddings are used. The default speaker is David Mezzetti, the primary developer of txtai.

It's possible to build custom speaker embeddings as shown below. Fine-tuning the model with a new voice produces the best results, but zero-shot speaker embeddings can be acceptable in some cases.

The following code requires installing `torchaudio` and `speechbrain`.

```python
import os

import numpy as np
import torchaudio

from speechbrain.inference import EncoderClassifier

def speaker(path):
    """
    Extracts a speaker embedding from an audio file.

    Args:
        path: file path

    Returns:
        speaker embeddings
    """

    # Load the x-vector speaker encoder
    model = "speechbrain/spkrec-xvect-voxceleb"
    encoder = EncoderClassifier.from_hparams(model, savedir=os.path.join("/tmp", model), run_opts={"device": "cuda"})

    # Read and normalize the reference audio
    samples, sr = torchaudio.load(path)
    samples = encoder.audio_normalizer(samples[0], sr)

    # Encode as a (1, 512) speaker embedding
    embedding = encoder.encode_batch(samples.unsqueeze(0))
    return embedding[0, 0].unsqueeze(0)

embedding = speaker("reference.wav")
np.save("speaker.npy", embedding.cpu().numpy(), allow_pickle=False)
```

Then load as shown below.

```python
speech, rate = tts("Say something here", speaker=np.load("speaker.npy"))
```

Speaker embeddings from the original SpeechT5 TTS training set are also supported. See the [README](https://huggingface.co/microsoft/speecht5_tts#%F0%9F%A4%97-transformers-usage) for more.
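
For example, the [CMU Arctic x-vectors](https://huggingface.co/datasets/Matthijs/cmu-arctic-xvectors) dataset referenced in the SpeechT5 documentation can be loaded with the `datasets` library. The following is a minimal sketch, assuming the pipeline accepts a `(1, 512)` NumPy array like the one saved by the code above; index `7306` (a US English female voice) comes from the standard SpeechT5 example.

```python
import numpy as np

from datasets import load_dataset

# Load x-vector speaker embeddings from the CMU Arctic dataset
embeddings = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")

# Index 7306 is the US English female voice used in the SpeechT5 examples
speaker = np.array(embeddings[7306]["xvector"], dtype=np.float32).reshape(1, -1)

# Generate speech with this speaker
speech, rate = tts("Say something here", speaker=speaker)
```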