microsoft/speecht5_tts

@@ -25,6 +25,30 @@ Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to
 Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.
 ## Intended Uses & Limitations
 You can use this model for speech synthesis. See the [model hub](https://huggingface.co/models?search=speecht5) to look for fine-tuned versions on a task that interests you.
@@ -45,28 +69,3 @@ Currently, both the feature extractor and model support PyTorch.
     pages={5723--5738},
 }
 ```
-## How to Get Started With the Model
-Use the code below to convert text into a mono 16 kHz speech waveform.
-```python
-from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
-processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
-model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
-vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
-inputs = processor(text="Hello, my dog is cute", return_tensors="pt")
-# load xvector containing speaker's voice characteristics from a file
-import numpy as np
-import torch
-speaker_embeddings = np.load("xvector_speaker_embedding.npy")
-speaker_embeddings = torch.tensor(speaker_embeddings).unsqueeze(0)
-speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
-import soundfile as sf
-sf.write("speech.wav", speech.numpy(), samplerate=16000)
-```

 Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.
+## How to Get Started With the Model
+Use the code below to convert text into a mono 16 kHz speech waveform.
+```python
+from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
+from datasets import load_dataset
+import torch
+import soundfile as sf
+processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
+model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
+vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
+inputs = processor(text="Hello, my dog is cute", return_tensors="pt")
+# load xvector containing speaker's voice characteristics from a dataset
+embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
+speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
+speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
+sf.write("speech.wav", speech.numpy(), samplerate=16000)
+```
 ## Intended Uses & Limitations
 You can use this model for speech synthesis. See the [model hub](https://huggingface.co/models?search=speecht5) to look for fine-tuned versions on a task that interests you.
     pages={5723--5738},
 }
 ```
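
The snippet in this change saves the result with `soundfile` at `samplerate=16000`. As a minimal sketch of what "mono 16 kHz waveform" means on disk, the helper below writes the same kind of data with Python's standard-library `wave` module and reads the header back. The names `write_mono_wav` and `read_wav_info` are hypothetical and not part of `transformers`. The sketch assumes the samples are floats in the range [-1.0, 1.0], which is an assumption about the vocoder output rather than something the model card states.

```python
import struct
import wave

SAMPLE_RATE = 16_000  # rate used by sf.write(...) in the model card snippet


def write_mono_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write float samples as 16-bit mono PCM and return the duration in seconds.

    Hypothetical helper: assumes samples lie roughly in [-1.0, 1.0].
    """
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)  # 16 kHz by default
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return len(pcm) / sample_rate


def read_wav_info(path):
    """Return (channels, sample_rate, frame_count) from a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnchannels(), wav_file.getframerate(), wav_file.getnframes()
```

With the `speech` tensor from the snippet, a call such as `write_mono_wav("speech.wav", speech.tolist())` would return the clip length in seconds, and `read_wav_info("speech.wav")` can confirm that the file is single-channel at 16000 Hz.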