Update README.md
# UniSpeech-SAT-Base for Speaker Verification

[Microsoft's UniSpeech](https://www.microsoft.com/en-us/research/publication/unispeech-unified-speech-representation-learning-with-labeled-and-unlabeled-data/)

The model was pretrained on 16kHz sampled speech audio with utterance and speaker contrastive loss. When using the model, make sure that your speech input is also sampled at 16kHz.
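If your data is stored at a different sampling rate, resample it to 16kHz before feature extraction. A minimal sketch using the `datasets` library (`cast_column` with the `Audio` feature re-decodes the audio on access; the demo dataset here is the same one used in the usage example below):

```python
from datasets import Audio, load_dataset

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")

# re-decode the "audio" column at 16kHz on access, whatever rate the files were stored at
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
```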

The model was pre-trained on:

[…]

Authors: Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Li…
The original model can be found under https://github.com/microsoft/UniSpeech/tree/main/UniSpeech-SAT.

# Fine-tuning details

The model is fine-tuned on the [VoxCeleb1 dataset](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html) using an X-Vector head with an Additive Margin Softmax loss.

[X-Vectors: Robust DNN Embeddings for Speaker Recognition](https://www.danielpovey.com/files/2018_icassp_xvectors.pdf)
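The fine-tuning code itself is not part of this card; purely as an illustration of the objective named above, here is a minimal PyTorch sketch of an Additive Margin Softmax head over pooled speaker embeddings. The class name and the `scale` and `margin` values below are illustrative assumptions, not the exact settings used for this checkpoint:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxLoss(nn.Module):
    # illustrative sketch: cosine logits with a margin subtracted from the
    # target class, scaled before cross-entropy (Additive Margin Softmax)
    def __init__(self, embed_dim, num_speakers, scale=30.0, margin=0.2):
        super().__init__()
        self.scale = scale
        self.margin = margin
        self.weight = nn.Parameter(torch.randn(num_speakers, embed_dim))

    def forward(self, embeddings, labels):
        # cosine similarity between L2-normalized embeddings and class weights
        cos = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        one_hot = F.one_hot(labels, num_classes=cos.size(-1)).to(cos.dtype)
        logits = self.scale * (cos - self.margin * one_hot)
        return F.cross_entropy(logits, labels)
```

Subtracting the margin from the target-class cosine forces same-speaker embeddings to cluster more tightly than a plain softmax head would, which is what makes the embeddings useful for verification by cosine similarity.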

# Usage

## Speaker Verification

```python
from transformers import Wav2Vec2FeatureExtractor, UniSpeechSatForXVector
from datasets import load_dataset
import torch

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained('microsoft/unispeech-sat-base-sv')
model = UniSpeechSatForXVector.from_pretrained('microsoft/unispeech-sat-base-sv')

# audio files are decoded on the fly; gather the raw arrays for the first two samples
audio = [x["array"] for x in dataset[:2]["audio"]]
inputs = feature_extractor(audio, sampling_rate=16_000, padding=True, return_tensors="pt")
embeddings = model(**inputs).embeddings
embeddings = torch.nn.functional.normalize(embeddings, dim=-1).cpu()

# the resulting embeddings can be used for cosine similarity-based retrieval
cosine_sim = torch.nn.CosineSimilarity(dim=-1)
similarity = cosine_sim(embeddings[0], embeddings[1])
threshold = 0.86  # the optimal threshold is dataset-dependent
if similarity < threshold:
    print("Speakers are not the same!")
```
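The same pipeline works for local audio files instead of a `datasets` split. A sketch using `torchaudio` for loading and resampling; the file names are placeholders:

```python
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, UniSpeechSatForXVector

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/unispeech-sat-base-sv")
model = UniSpeechSatForXVector.from_pretrained("microsoft/unispeech-sat-base-sv")

def embed(path):
    waveform, sr = torchaudio.load(path)  # (channels, samples)
    waveform = waveform.mean(dim=0)       # downmix to mono
    if sr != 16_000:                      # the model expects 16kHz input
        waveform = torchaudio.functional.resample(waveform, sr, 16_000)
    inputs = feature_extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        return torch.nn.functional.normalize(model(**inputs).embeddings, dim=-1)[0]

# "enroll.wav" and "probe.wav" are placeholder paths
similarity = torch.nn.functional.cosine_similarity(embed("enroll.wav"), embed("probe.wav"), dim=-1)
```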

# License