davidmezzetti committed on
Commit b4c1aec · 1 Parent(s): 401f3dd

Initial commit

README.md ADDED
---
tags:
- audio
- text-to-speech
- onnx
inference: false
language: en
license: apache-2.0
library_name: txtai
---

# SpeechT5 Text-to-Speech (TTS) Model for ONNX

Fine-tuned version of [SpeechT5 TTS](https://huggingface.co/microsoft/speecht5_tts) exported to ONNX using the [Optimum](https://github.com/huggingface/optimum) library.

## Usage with txtai

[txtai](https://github.com/neuml/txtai) has a built-in Text to Speech (TTS) pipeline that makes using this model easy.

_Note: the following example requires txtai >= 7.5_

```python
import numpy as np
import soundfile as sf

from txtai.pipeline import TextToSpeech

# Build pipeline
tts = TextToSpeech("NeuML/txtai-speecht5-onnx")

# Generate speech
speech, rate = tts("Say something here")

# Write to file
sf.write("out.wav", speech, rate)

# Generate speech with custom speaker
speech, rate = tts("Say something here", speaker=np.array(...))
```

## Model training

This model was fine-tuned using the code in this [Hugging Face article](https://huggingface.co/learn/audio-course/en/chapter6/fine-tuning) and a custom set of WAV files.

The ONNX export uses the following code, which requires installing `optimum`.

```python
import os

from optimum.exporters.onnx import main_export
from optimum.onnx import merge_decoders

# Params
model = "txtai-speecht5-tts"
output = "txtai-speecht5-onnx"

# ONNX Export
main_export(
    task="text-to-audio",
    model_name_or_path=model,
    model_kwargs={
        "vocoder": "microsoft/speecht5_hifigan"
    },
    output=output
)

# Merge into single decoder model
merge_decoders(
    f"{output}/decoder_model.onnx",
    f"{output}/decoder_with_past_model.onnx",
    save_path=f"{output}/decoder_model_merged.onnx",
    strict=False
)

# Remove unnecessary files
os.remove(f"{output}/decoder_model.onnx")
os.remove(f"{output}/decoder_with_past_model.onnx")
```
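
The export and merge steps above leave `encoder_model.onnx`, `decoder_model_merged.onnx` and `decoder_postnet_and_vocoder.onnx`, the ONNX files in this repository. As an optional sanity check, not part of the original export script, each graph can be opened with `onnxruntime` to confirm it loads and to list the inputs it expects.

```python
import onnxruntime as ort

output = "txtai-speecht5-onnx"

# Load each exported graph and print the input names it expects
for name in ["encoder_model", "decoder_model_merged", "decoder_postnet_and_vocoder"]:
    session = ort.InferenceSession(f"{output}/{name}.onnx", providers=["CPUExecutionProvider"])
    print(name, [x.name for x in session.get_inputs()])
```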

## Custom speaker embeddings

When no speaker argument is passed, the default speaker embeddings are used. The default speaker is David Mezzetti, the primary developer of txtai.

It's possible to build custom speaker embeddings as shown below. Fine-tuning the model with a new voice leads to the best results, but zero-shot speaker embeddings are acceptable in some cases.

The following code requires installing `torchaudio` and `speechbrain`.

```python
import os

import numpy as np
import torchaudio

from speechbrain.inference import EncoderClassifier

def speaker(path):
    """
    Extracts a speaker embedding from an audio file.

    Args:
        path: file path

    Returns:
        speaker embeddings
    """

    model = "speechbrain/spkrec-xvect-voxceleb"
    encoder = EncoderClassifier.from_hparams(model,
                                             savedir=os.path.join("/tmp", model),
                                             run_opts={"device": "cuda"})

    samples, sr = torchaudio.load(path)
    samples = encoder.audio_normalizer(samples[0], sr)
    embedding = encoder.encode_batch(samples.unsqueeze(0))

    return embedding[0, 0].to("cuda").unsqueeze(0)

embedding = speaker("reference.wav")
np.save("speaker.npy", embedding.cpu().numpy(), allow_pickle=False)
```

Then load the saved embedding as shown below.

```python
speech, rate = tts("Say something here", speaker=np.load("speaker.npy"))
```

Speaker embeddings from the original SpeechT5 TTS training set are supported. See the [README](https://huggingface.co/microsoft/speecht5_tts#%F0%9F%A4%97-transformers-usage) for more.
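
For example, a minimal sketch of using one of those embeddings is shown below. This is not part of the original model card: it assumes the `datasets` library is installed, reuses the `tts` pipeline from the usage example above, and uses the `Matthijs/cmu-arctic-xvectors` dataset and speaker index referenced in the upstream SpeechT5 TTS documentation.

```python
import numpy as np

from datasets import load_dataset

# Speaker embeddings computed from CMU ARCTIC recordings
embeddings = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")

# Index 7306 is the speaker used in the upstream SpeechT5 TTS example
# Reshape to (1, 512) to match the shape saved by the custom embedding code above
speaker = np.array(embeddings[7306]["xvector"], dtype=np.float32).reshape(1, -1)

# Generate speech with the selected speaker
speech, rate = tts("Say something here", speaker=speaker)
```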
added_tokens.json ADDED
{
  "<ctc_blank>": 80,
  "<mask>": 79
}
decoder_model_merged.onnx ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:7b54ec8a6de93d4033282e5655ebd3fd631bef5844f099ac79338aa237b4acea
size 244485821
decoder_postnet_and_vocoder.onnx ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:77239e4c7a56859d43024325ce4c69f85910b8dab63225e6d73a7136707acad5
size 55432027
encoder_model.onnx ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:688cc59481b71d4a15a68a56a366330bd4bfd054adcf19a0feddcf9d126e8d6b
size 342803471
preprocessor_config.json ADDED
{
  "do_normalize": false,
  "feature_extractor_type": "SpeechT5FeatureExtractor",
  "feature_size": 1,
  "fmax": 7600,
  "fmin": 80,
  "frame_signal_scale": 1.0,
  "hop_length": 16,
  "mel_floor": 1e-10,
  "num_mel_bins": 80,
  "padding_side": "right",
  "padding_value": 0.0,
  "processor_class": "SpeechT5Processor",
  "reduction_factor": 2,
  "return_attention_mask": true,
  "sampling_rate": 16000,
  "win_function": "hann_window",
  "win_length": 64
}
speaker.npy ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:49691ea43c3bc17ffa7afa9ff68b3cfec92bbf316afa90a4c797e6f8fd88042b
size 2176
special_tokens_map.json ADDED
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "<mask>",
    "lstrip": true,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
spm_char.model ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:7fcc48f3e225f627b1641db410ceb0c8649bd2b0c982e150b03f8be3728ab560
size 238473
tokenizer_config.json ADDED
{
  "added_tokens_decoder": {
    "0": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "79": {
      "content": "<mask>",
      "lstrip": true,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "80": {
      "content": "<ctc_blank>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    }
  },
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": true,
  "eos_token": "</s>",
  "mask_token": "<mask>",
  "model_max_length": 600,
  "normalize": false,
  "pad_token": "<pad>",
  "processor_class": "SpeechT5Processor",
  "sp_model_kwargs": {},
  "tokenizer_class": "SpeechT5Tokenizer",
  "unk_token": "<unk>"
}