Update README.md
#1
by
sanchit-gandhi
HF staff
- opened
README.md
CHANGED
@@ -25,6 +25,30 @@ Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to
|
|
25 |
|
26 |
Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.
|
27 |
|
28 |
## Intended Uses & Limitations
|
29 |
|
30 |
You can use this model for speech synthesis. See the [model hub](https://huggingface.co/models?search=speecht5) to look for fine-tuned versions on a task that interests you.
|
@@ -45,28 +69,3 @@ Currently, both the feature extractor and model support PyTorch.
|
|
45 |
pages={5723--5738},
|
46 |
}
|
47 |
```
|
48 |
-
|
49 |
-
## How to Get Started With the Model
|
50 |
-
|
51 |
-
Use the code below to convert text into a mono 16 kHz speech waveform.
|
52 |
-
|
53 |
-
```python
|
54 |
-
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
|
55 |
-
|
56 |
-
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
|
57 |
-
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
|
58 |
-
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
|
59 |
-
|
60 |
-
inputs = processor(text="Hello, my dog is cute", return_tensors="pt")
|
61 |
-
|
62 |
-
# load xvector containing speaker's voice characteristics from a file
|
63 |
-
import numpy as np
|
64 |
-
import torch
|
65 |
-
speaker_embeddings = np.load("xvector_speaker_embedding.npy")
|
66 |
-
speaker_embeddings = torch.tensor(speaker_embeddings).unsqueeze(0)
|
67 |
-
|
68 |
-
speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
|
69 |
-
|
70 |
-
import soundfile as sf
|
71 |
-
sf.write("speech.wav", speech.numpy(), samplerate=16000)
|
72 |
-
```
|
|
|
25 |
|
26 |
Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.
|
27 |
|
28 |
+
## How to Get Started With the Model
|
29 |
+
|
30 |
+
Use the code below to convert text into a mono 16 kHz speech waveform.
|
31 |
+
|
32 |
+
```python
|
33 |
+
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from datasets import load_dataset
|
34 |
+
import torch
|
35 |
+
import soundfile as sf
|
36 |
+
|
37 |
+
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
|
38 |
+
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
|
39 |
+
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
|
40 |
+
|
41 |
+
inputs = processor(text="Hello, my dog is cute", return_tensors="pt")
|
42 |
+
|
43 |
+
# load xvector containing speaker's voice characteristics from a dataset
|
44 |
+
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
|
45 |
+
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
|
46 |
+
|
47 |
+
speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
|
48 |
+
|
49 |
+
sf.write("speech.wav", speech.numpy(), samplerate=16000)
|
50 |
+
```
|
51 |
+
|
52 |
## Intended Uses & Limitations
|
53 |
|
54 |
You can use this model for speech synthesis. See the [model hub](https://huggingface.co/models?search=speecht5) to look for fine-tuned versions on a task that interests you.
|
|
|
69 |
pages={5723--5738},
|
70 |
}
|
71 |
```
|
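The README states that the snippet produces a mono 16 kHz waveform. As a minimal sketch of how that claim could be checked on a written file, the example below uses Python's standard-library `wave` module in place of `soundfile`, and a synthetic 440 Hz tone as a stand-in for the model output. The helper names `write_mono_wav` and `describe_wav` and the file name `check.wav` are hypothetical and are not part of the README or of `transformers`.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 16000  # the 16 kHz rate stated in the README


def write_mono_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write float samples in [-1, 1] as 16-bit mono PCM (hypothetical helper)."""
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(frames)


def describe_wav(path):
    """Return (channels, sample_rate, frame_count) read from the file header."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels(), wav.getframerate(), wav.getnframes()


# Assumption: one second of a 440 Hz tone stands in for the synthesized speech.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
wav_path = os.path.join(tempfile.gettempdir(), "check.wav")
write_mono_wav(wav_path, tone)
print(describe_wav(wav_path))
```

The same `describe_wav` check could be pointed at the `speech.wav` file written by the README snippet, assuming it was saved in a PCM format that the `wave` module can read.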