sanchit-gandhi HF staff committed on
Commit
0e3d927
1 Parent(s): f94f039

Update README.md

Push model example higher and update to load speaker embeddings from dataset

Files changed (1)
  1. README.md +24 -25
README.md CHANGED
@@ -25,6 +25,30 @@ Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to
 
  Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.
 
+ ## How to Get Started With the Model
+
+ Use the code below to convert text into a mono 16 kHz speech waveform.
+
+ ```python
+ from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
+ import torch
+ import soundfile as sf
+
+ processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
+ model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
+ vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
+
+ inputs = processor(text="Hello, my dog is cute", return_tensors="pt")
+
+ # load xvector containing speaker's voice characteristics from a dataset
+ embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
+ speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
+
+ speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
+
+ sf.write("speech.wav", speech.numpy(), samplerate=16000)
+ ```
+
  ## Intended Uses & Limitations
 
  You can use this model for speech synthesis. See the [model hub](https://huggingface.co/models?search=speecht5) to look for fine-tuned versions on a task that interests you.
@@ -45,28 +69,3 @@ Currently, both the feature extractor and model support PyTorch.
  pages={5723--5738},
  }
  ```
-
- ## How to Get Started With the Model
-
- Use the code below to convert text into a mono 16 kHz speech waveform.
-
- ```python
- from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
-
- processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
- model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
- vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
-
- inputs = processor(text="Hello, my dog is cute", return_tensors="pt")
-
- # load xvector containing speaker's voice characteristics from a file
- import numpy as np
- import torch
- speaker_embeddings = np.load("xvector_speaker_embedding.npy")
- speaker_embeddings = torch.tensor(speaker_embeddings).unsqueeze(0)
-
- speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
-
- import soundfile as sf
- sf.write("speech.wav", speech.numpy(), samplerate=16000)
- ```
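
Note on the added snippet: as committed, it calls `load_dataset` without importing it, and it picks the speaker with a hard-coded row index (7306). The following is a minimal sketch of how the same steps could be wrapped so that both points are explicit. It assumes the `datasets` package provides `load_dataset` and that each row exposes an `xvector` field, as the diff implies. The names `get_xvector` and `synthesize` are hypothetical helpers introduced for illustration, not part of `transformers` or of this commit.

```python
from typing import Any, Mapping, Sequence


def get_xvector(rows: Sequence[Mapping[str, Any]], index: int) -> list:
    """Return the x-vector stored at `index`, failing early with a clear error.

    Hypothetical helper: works on any indexable collection of dict-like rows.
    """
    if not 0 <= index < len(rows):
        raise IndexError(f"speaker index {index} is outside 0..{len(rows) - 1}")
    row = rows[index]
    if "xvector" not in row:
        raise KeyError("row has no 'xvector' field")
    return list(row["xvector"])


def synthesize(text: str, speaker_index: int = 7306, out_path: str = "speech.wav") -> None:
    """Run the steps from the diff; downloads models and data on first use."""
    # Third-party imports are deferred so get_xvector stays usable without them.
    import soundfile as sf
    import torch
    from datasets import load_dataset  # assumed source of load_dataset in the diff
    from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

    inputs = processor(text=text, return_tensors="pt")

    # Load the speaker's x-vector from the dataset and add a batch dimension.
    embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
    xvector = get_xvector(embeddings_dataset, speaker_index)
    speaker_embeddings = torch.tensor(xvector).unsqueeze(0)

    speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
    sf.write(out_path, speech.numpy(), samplerate=16000)
```

Calling `synthesize("Hello, my dog is cute")` would then mirror the README example, while a different `speaker_index` selects another row of the embeddings dataset.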