---
tags:
- audio
- text-to-speech
- onnx
inference: false
language: en
license: apache-2.0
library_name: txtai
---

# SpeechT5 Text-to-Speech (TTS) Model for ONNX

Fine-tuned version of [SpeechT5 TTS](https://huggingface.co/microsoft/speecht5_tts) exported to ONNX. This model was exported to ONNX using the [Optimum](https://github.com/huggingface/optimum) library.

## Usage with txtai

[txtai](https://github.com/neuml/txtai) has a built-in Text to Speech (TTS) pipeline that makes using this model easy.

_Note: the following example requires txtai >= 7.5_

```python
import numpy as np
import soundfile as sf

from txtai.pipeline import TextToSpeech

# Build pipeline
tts = TextToSpeech("NeuML/txtai-speecht5-onnx")

# Generate speech
speech, rate = tts("Say something here")

# Write to file
sf.write("out.wav", speech, rate)

# Generate speech with custom speaker embeddings
speech, rate = tts("Say something here", speaker=np.array(...))
```

## Model training

This model was fine-tuned using the code in this [Hugging Face article](https://huggingface.co/learn/audio-course/en/chapter6/fine-tuning) and a custom set of WAV files.

The ONNX export uses the following code, which requires installing `optimum`.

```python
import os

from optimum.exporters.onnx import main_export
from optimum.onnx import merge_decoders

# Params
model = "txtai-speecht5-tts"
output = "txtai-speecht5-onnx"

# ONNX export
main_export(
    task="text-to-audio",
    model_name_or_path=model,
    model_kwargs={"vocoder": "microsoft/speecht5_hifigan"},
    output=output
)

# Merge into a single decoder model
merge_decoders(
    f"{output}/decoder_model.onnx",
    f"{output}/decoder_with_past_model.onnx",
    save_path=f"{output}/decoder_model_merged.onnx",
    strict=False
)

# Remove unnecessary files
os.remove(f"{output}/decoder_model.onnx")
os.remove(f"{output}/decoder_with_past_model.onnx")
```

## Custom speaker embeddings

When no speaker argument is passed in, the default speaker embeddings are used. The default speaker is David Mezzetti, the primary developer of txtai.

It's possible to build custom speaker embeddings as shown below. Fine-tuning the model with a new voice produces the best results, but zero-shot speaker embeddings can be acceptable in some cases.

The following code requires installing `torchaudio` and `speechbrain`.

```python
import os

import numpy as np
import torchaudio

from speechbrain.inference import EncoderClassifier

def speaker(path):
    """
    Extracts a speaker embedding from an audio file.

    Args:
        path: file path

    Returns:
        speaker embeddings
    """

    # Load the x-vector speaker encoder
    model = "speechbrain/spkrec-xvect-voxceleb"
    encoder = EncoderClassifier.from_hparams(model, savedir=os.path.join("/tmp", model), run_opts={"device": "cuda"})

    # Read and normalize the reference audio
    samples, sr = torchaudio.load(path)
    samples = encoder.audio_normalizer(samples[0], sr)

    # Encode as a (1, 512) speaker embedding
    embedding = encoder.encode_batch(samples.unsqueeze(0))
    return embedding[0, 0].unsqueeze(0)

embedding = speaker("reference.wav")
np.save("speaker.npy", embedding.cpu().numpy(), allow_pickle=False)
```

Then load as shown below.

```python
speech, rate = tts("Say something here", speaker=np.load("speaker.npy"))
```

Speaker embeddings from the original SpeechT5 TTS training set are also supported. See the [README](https://huggingface.co/microsoft/speecht5_tts#%F0%9F%A4%97-transformers-usage) for more.
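
For example, the [CMU Arctic x-vectors](https://huggingface.co/datasets/Matthijs/cmu-arctic-xvectors) dataset referenced in the SpeechT5 documentation can be loaded with the `datasets` library. The following is a minimal sketch, assuming the pipeline accepts a `(1, 512)` NumPy array like the one saved by the code above; index `7306` (a US English female voice) comes from the standard SpeechT5 example.

```python
import numpy as np

from datasets import load_dataset

# Load x-vector speaker embeddings from the CMU Arctic dataset
embeddings = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")

# Index 7306 is the US English female voice used in the SpeechT5 examples
speaker = np.array(embeddings[7306]["xvector"], dtype=np.float32).reshape(1, -1)

# Generate speech with this speaker
speech, rate = tts("Say something here", speaker=speaker)
```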