Initial commit

Files changed (10)

README.md +127 -0
added_tokens.json +4 -0
decoder_model_merged.onnx +3 -0
decoder_postnet_and_vocoder.onnx +3 -0
encoder_model.onnx +3 -0
preprocessor_config.json +19 -0
speaker.npy +3 -0
special_tokens_map.json +37 -0
spm_char.model +3 -0
tokenizer_config.json +63 -0

README.md ADDED

	@@ -0,0 +1,127 @@

+---
+tags:
+- audio
+- text-to-speech
+- onnx
+inference: false
+language: en
+license: apache-2.0
+library_name: txtai
+---
+# SpeechT5 Text-to-Speech (TTS) Model for ONNX
+Fine-tuned version of [SpeechT5 TTS](https://huggingface.co/microsoft/speecht5_tts), exported to ONNX using the [Optimum](https://github.com/huggingface/optimum) library.
+## Usage with txtai
+[txtai](https://github.com/neuml/txtai) has a built-in Text to Speech (TTS) pipeline that makes using this model easy.
+_Note: the following example requires txtai >= 7.5._
+```python
+import numpy as np
+import soundfile as sf
+from txtai.pipeline import TextToSpeech
+# Build pipeline
+tts = TextToSpeech("NeuML/txtai-speecht5-onnx")
+# Generate speech
+speech, rate = tts("Say something here")
+# Write to file
+sf.write("out.wav", speech, rate)
+# Generate speech with custom speaker
+speech, rate = tts("Say something here", speaker=np.array(...))
+```
+## Model training
+This model was fine-tuned using the code in this [Hugging Face article](https://huggingface.co/learn/audio-course/en/chapter6/fine-tuning) and a custom set of WAV files.
+The ONNX export uses the following code, which requires installing `optimum`.
+```python
+import os
+from optimum.exporters.onnx import main_export
+from optimum.onnx import merge_decoders
+# Params
+model = "txtai-speecht5-tts"
+output = "txtai-speecht5-onnx"
+# ONNX Export
+main_export(
+    task="text-to-audio",
+    model_name_or_path=model,
+    model_kwargs={
+        "vocoder": "microsoft/speecht5_hifigan"
+    },
+    output=output
+)
+# Merge into single decoder model
+merge_decoders(
+    f"{output}/decoder_model.onnx",
+    f"{output}/decoder_with_past_model.onnx",
+    save_path=f"{output}/decoder_model_merged.onnx",
+    strict=False
+)
+# Remove unnecessary files
+os.remove(f"{output}/decoder_model.onnx")
+os.remove(f"{output}/decoder_with_past_model.onnx")
+```
+## Custom speaker embeddings
+When no speaker argument is passed in, the default speaker embeddings are used. The default speaker is David Mezzetti, the primary developer of txtai.
+It's possible to build custom speaker embeddings as shown below. Fine-tuning the model with a new voice gives the best results, but zero-shot speaker embeddings work acceptably in some cases.
+The following code requires installing `torchaudio` and `speechbrain`.
+```python
+import os
+import numpy as np
+import torch
+import torchaudio
+from speechbrain.inference import EncoderClassifier
+def speaker(path):
+    """
+    Extracts a speaker embedding from an audio file.
+    Args:
+        path: file path
+    Returns:
+        speaker embeddings
+    """
+    model = "speechbrain/spkrec-xvect-voxceleb"
+    # Use the GPU when available, otherwise fall back to the CPU
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    encoder = EncoderClassifier.from_hparams(model,
+                                             savedir=os.path.join("/tmp", model),
+                                             run_opts={"device": device})
+    # Load the audio, take the first channel and normalize it for the encoder
+    samples, sr = torchaudio.load(path)
+    samples = encoder.audio_normalizer(samples[0], sr)
+    embedding = encoder.encode_batch(samples.unsqueeze(0))
+    # Add a leading batch dimension to the single embedding vector
+    return embedding[0,0].unsqueeze(0)
+embedding = speaker("reference.wav")
+np.save("speaker.npy", embedding.cpu().numpy(), allow_pickle=False)
+```
+Then load as shown below.
+```python
+speech, rate = tts("Say something here", speaker=np.load("speaker.npy"))
+```
+Speaker embeddings from the original SpeechT5 TTS training set are supported. See the [README](https://huggingface.co/microsoft/speecht5_tts#%F0%9F%A4%97-transformers-usage) for more.

added_tokens.json ADDED

	@@ -0,0 +1,4 @@

+{
+  "<ctc_blank>": 80,
+  "<mask>": 79
+}

decoder_model_merged.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7b54ec8a6de93d4033282e5655ebd3fd631bef5844f099ac79338aa237b4acea
+size 244485821

decoder_postnet_and_vocoder.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:77239e4c7a56859d43024325ce4c69f85910b8dab63225e6d73a7136707acad5
+size 55432027

encoder_model.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:688cc59481b71d4a15a68a56a366330bd4bfd054adcf19a0feddcf9d126e8d6b
+size 342803471
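
The three `.onnx` entries above are Git LFS pointer files: each records a spec version, a SHA-256 object id and a byte size rather than the model weights themselves. The following is a minimal sketch of parsing such a pointer and checking a downloaded file against it. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical and not part of this repository or of any Git LFS tooling.

```python
import hashlib


def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>", separated by the first space
        key, _, value = line.partition(" ")
        fields[key] = value

    # The oid has the form "<hash algorithm>:<hex digest>"
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Check a local file against the pointer's size and SHA-256 digest."""
    sha = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so large model files are not loaded into memory at once
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and sha.hexdigest() == pointer["digest"]


# Pointer text copied from the decoder_model_merged.onnx hunk in this commit
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:7b54ec8a6de93d4033282e5655ebd3fd631bef5844f099ac79338aa237b4acea
size 244485821
"""

pointer = parse_lfs_pointer(POINTER)
```

After downloading the real file, `matches_pointer("decoder_model_merged.onnx", pointer)` would be one way to confirm the download is complete and uncorrupted.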

preprocessor_config.json ADDED

	@@ -0,0 +1,19 @@

+{
+  "do_normalize": false,
+  "feature_extractor_type": "SpeechT5FeatureExtractor",
+  "feature_size": 1,
+  "fmax": 7600,
+  "fmin": 80,
+  "frame_signal_scale": 1.0,
+  "hop_length": 16,
+  "mel_floor": 1e-10,
+  "num_mel_bins": 80,
+  "padding_side": "right",
+  "padding_value": 0.0,
+  "processor_class": "SpeechT5Processor",
+  "reduction_factor": 2,
+  "return_attention_mask": true,
+  "sampling_rate": 16000,
+  "win_function": "hann_window",
+  "win_length": 64
+}
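
The `hop_length` and `win_length` values above are easy to misread as sample counts. As documented for the Transformers `SpeechT5FeatureExtractor`, they are given in milliseconds. The worked arithmetic below converts them to samples at the configured sampling rate. Treating `reduction_factor` as the number of mel frames produced per decoder step is an assumption made here for illustration.

```python
# Values copied from preprocessor_config.json in this commit
config = {
    "sampling_rate": 16000,
    "hop_length": 16,       # milliseconds between windows
    "win_length": 64,       # milliseconds per window
    "reduction_factor": 2,  # assumed: mel frames per decoder step
    "num_mel_bins": 80,
}

# Convert millisecond settings to sample counts
hop_samples = config["hop_length"] * config["sampling_rate"] // 1000
win_samples = config["win_length"] * config["sampling_rate"] // 1000

# Mel frames per second of audio
frames_per_second = config["sampling_rate"] / hop_samples

# Audio covered by one decoder step, under the reduction_factor assumption
samples_per_step = hop_samples * config["reduction_factor"]
ms_per_step = 1000 * samples_per_step / config["sampling_rate"]

print(hop_samples, win_samples, frames_per_second, samples_per_step, ms_per_step)
```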

speaker.npy ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:49691ea43c3bc17ffa7afa9ff68b3cfec92bbf316afa90a4c797e6f8fd88042b
+size 2176
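
The 2176-byte size of `speaker.npy` is consistent with 512 float32 values plus a 128-byte `.npy` header, although the pointer alone does not confirm the dtype or shape. The sketch below shows that arithmetic, along with a small standard-library reader for the `.npy` header that could be used to check the actual shape and dtype of a downloaded file. The function name `read_npy_header` is hypothetical. In practice `np.load("speaker.npy").shape` gives the same answer.

```python
import ast
import struct

MAGIC = b"\x93NUMPY"


def read_npy_header(data: bytes) -> dict:
    """Parse the header dictionary from the start of a .npy file."""
    if data[:6] != MAGIC:
        raise ValueError("not a .npy file")

    major = data[6]
    if major == 1:
        # Format 1.0 stores the header length as a 2-byte little-endian integer
        (length,) = struct.unpack("<H", data[8:10])
        start = 10
    else:
        # Formats 2.0 and 3.0 store it as a 4-byte little-endian integer
        (length,) = struct.unpack("<I", data[8:12])
        start = 12

    encoding = "utf-8" if major >= 3 else "latin1"
    header = ast.literal_eval(data[start:start + length].decode(encoding))

    # Offset where the raw array data begins
    header["data_offset"] = start + length
    return header


# Worked arithmetic for the size in the pointer file.
# The header size and item size are assumptions, not values read from the file.
FILE_SIZE = 2176
ASSUMED_HEADER_BYTES = 128
ASSUMED_ITEM_BYTES = 4  # float32

values = (FILE_SIZE - ASSUMED_HEADER_BYTES) // ASSUMED_ITEM_BYTES
print(values)
```

With the real file on disk, `read_npy_header(open("speaker.npy", "rb").read(256))` would report the stored `shape` and `descr`.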

special_tokens_map.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<pad>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

spm_char.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7fcc48f3e225f627b1641db410ceb0c8649bd2b0c982e150b03f8be3728ab560
+size 238473

tokenizer_config.json ADDED

	@@ -0,0 +1,63 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "79": {
+      "content": "<mask>",
+      "lstrip": true,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "80": {
+      "content": "<ctc_blank>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    }
+  },
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": true,
+  "eos_token": "</s>",
+  "mask_token": "<mask>",
+  "model_max_length": 600,
+  "normalize": false,
+  "pad_token": "<pad>",
+  "processor_class": "SpeechT5Processor",
+  "sp_model_kwargs": {},
+  "tokenizer_class": "SpeechT5Tokenizer",
+  "unk_token": "<unk>"
+}
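
Token ids appear in three of the files in this commit (`added_tokens.json`, `special_tokens_map.json` and `tokenizer_config.json`), so they need to agree with one another. The sketch below is a small consistency check over values copied by hand from the hunks above. It is an illustration only and is not how the tokenizer itself loads these files.

```python
# Copied from added_tokens.json
added_tokens = {"<ctc_blank>": 80, "<mask>": 79}

# Copied from tokenizer_config.json ("added_tokens_decoder", id -> content)
added_tokens_decoder = {
    "0": "<s>",
    "1": "<pad>",
    "2": "</s>",
    "3": "<unk>",
    "79": "<mask>",
    "80": "<ctc_blank>",
}

# Copied from special_tokens_map.json (role -> content)
special_tokens_map = {
    "bos_token": "<s>",
    "eos_token": "</s>",
    "mask_token": "<mask>",
    "pad_token": "<pad>",
    "unk_token": "<unk>",
}

# Invert the decoder table into content -> integer id
id_of = {content: int(index) for index, content in added_tokens_decoder.items()}

# Added tokens whose id differs between the two files
mismatched = {tok: idx for tok, idx in added_tokens.items() if id_of.get(tok) != idx}

# Special tokens that have no id in the decoder table
missing = [tok for tok in special_tokens_map.values() if tok not in id_of]

print(mismatched, missing)
```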