anton-l (HF staff) committed on
Commit
35751ea
1 Parent(s): c3800f2

Update README.md

Files changed (1)
  1. README.md +24 -21
README.md CHANGED
@@ -7,13 +7,11 @@ tags:
  - speech
  ---

- # UniSpeech-SAT-Base

  [Microsoft's UniSpeech](https://www.microsoft.com/en-us/research/publication/unispeech-unified-speech-representation-learning-with-labeled-and-unlabeled-data/)

- The base model pretrained on 16kHz sampled speech audio with utterance and speaker contrastive loss. When using the model, make sure that your speech input is also sampled at 16kHz.
-
- **Note**: This model does not have a tokenizer as it was pretrained on audio alone. In order to use this model **speech recognition**, a tokenizer should be created and the model should be fine-tuned on labeled text data. Check out [this blog](https://huggingface.co/blog/fine-tune-wav2vec2-english) for more in-detail explanation of how to fine-tune the model.

  The model was pre-trained on:

@@ -29,33 +27,38 @@ Authors: Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Li

  The original model can be found under https://github.com/microsoft/UniSpeech/tree/main/UniSpeech-SAT.

- # Usage
-
- This is an English pre-trained speech model that has to be fine-tuned on a downstream task like speech recognition or audio classification before it can be
- used in inference. The model was pre-trained in English and should therefore perform well only in English. The model has been shown to work well on task such as speaker verification, speaker identification, and speaker diarization.

- **Note**: The model was pre-trained on phonemes rather than characters. This means that one should make sure that the input text is converted to a sequence
- of phonemes before fine-tuning.

- ## Speech Recognition

- To fine-tune the model for speech recognition, see [the official speech recognition example](https://github.com/huggingface/transformers/tree/master/examples/pytorch/speech-recognition).
-
- ## Speech Classification
-
- To fine-tune the model for speech classification, see [the official audio classification example](https://github.com/huggingface/transformers/tree/master/examples/pytorch/audio-classification).

  ## Speaker Verification

- TODO

- ## Speaker Diarization

- TODO

- # Contribution

- The model was contributed by [cywang](https://huggingface.co/cywang) and [patrickvonplaten](https://huggingface.co/patrickvonplaten).

  # License
 
  - speech
  ---

+ # UniSpeech-SAT-Base for Speaker Verification

  [Microsoft's UniSpeech](https://www.microsoft.com/en-us/research/publication/unispeech-unified-speech-representation-learning-with-labeled-and-unlabeled-data/)

+ The model was pretrained on 16kHz sampled speech audio with utterance and speaker contrastive loss. When using the model, make sure that your speech input is also sampled at 16kHz.

  The model was pre-trained on:
 

  The original model can be found under https://github.com/microsoft/UniSpeech/tree/main/UniSpeech-SAT.

+ # Fine-tuning details

+ The model is fine-tuned on the [VoxCeleb1 dataset](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html) using an X-Vector head with an Additive Margin Softmax loss

+ [X-Vectors: Robust DNN Embeddings for Speaker Recognition](https://www.danielpovey.com/files/2018_icassp_xvectors.pdf)
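
The Additive Margin Softmax loss named in the added "Fine-tuning details" section can be illustrated with a small, dependency-free sketch. This is not the training code behind this checkpoint: the function name `am_softmax_loss` is hypothetical, and the `scale` and `margin` defaults are illustrative assumptions rather than values taken from the model card. The idea is to subtract a fixed margin from the target-class cosine before scaling and applying a standard softmax cross-entropy.

```python
import math


def am_softmax_loss(cosines, target, scale=30.0, margin=0.4):
    """Additive Margin Softmax loss for a single example.

    cosines: cosine similarity between the embedding and each class weight
    target:  index of the true speaker class
    scale, margin: illustrative values (assumptions, not from the model card)
    """
    # Subtract the margin from the target-class cosine only, then scale every logit.
    logits = [
        scale * (c - margin if j == target else c)
        for j, c in enumerate(cosines)
    ]
    # Cross-entropy via a numerically stable log-sum-exp.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target]


# With margin=0 and scale=1 this reduces to ordinary softmax cross-entropy.
plain = am_softmax_loss([1.0, 0.0], target=0, scale=1.0, margin=0.0)
# The margin makes the same prediction "cost more", pushing classes further apart.
with_margin = am_softmax_loss([0.8, 0.1], target=0)
without_margin = am_softmax_loss([0.8, 0.1], target=0, margin=0.0)
```

Because the margin only penalizes the target class, the loss keeps pushing the target cosine up even after the example is already classified correctly, which is why it is a common choice for embedding-based speaker verification.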

+ # Usage

  ## Speaker Verification

+ ```python
+ from transformers import Wav2Vec2FeatureExtractor, UniSpeechSatForXVector
+ from datasets import load_dataset
+ import torch
+
+ dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
+
+ feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained('microsoft/unispeech-sat-base-sv')
+ model = UniSpeechSatForXVector.from_pretrained('microsoft/unispeech-sat-base-sv')
+
+ # audio files are decoded on the fly; dataset[:2]["audio"] is a list of dicts,
+ # so collect each decoded waveform before calling the feature extractor
+ audio = [sample["array"] for sample in dataset[:2]["audio"]]
+ # pad to a common length because the two clips differ in duration
+ inputs = feature_extractor(audio, sampling_rate=16000, padding=True, return_tensors="pt")
+ with torch.no_grad():  # inference only, no gradients needed
+     embeddings = model(**inputs).embeddings
+ embeddings = torch.nn.functional.normalize(embeddings, dim=-1).cpu()
+
+ # the resulting embeddings can be used for cosine similarity-based retrieval
+ cosine_sim = torch.nn.CosineSimilarity(dim=-1)
+ similarity = cosine_sim(embeddings[0], embeddings[1])
+ threshold = 0.86  # the optimal threshold is dataset-dependent
+ if similarity < threshold:
+     print("Speakers are not the same!")
+ ```
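
The snippet's comment says the optimal threshold is dataset-dependent. One common way to choose it, sketched here as an assumption rather than as the procedure used for the `0.86` value, is to score labeled same-speaker and different-speaker pairs and pick the threshold where the false-accept rate and false-reject rate are closest (an approximate equal-error-rate point). The helper names `far_frr` and `pick_threshold` and the toy scores are hypothetical.

```python
def far_frr(scores, labels, threshold):
    """Return (false-accept rate, false-reject rate) when a pair is accepted
    as "same speaker" if its score is >= threshold.

    scores: cosine similarities for trial pairs
    labels: True for same-speaker pairs, False for different-speaker pairs
    Both classes must be present, otherwise a rate would divide by zero.
    """
    genuine = [s for s, same in zip(scores, labels) if same]
    impostor = [s for s, same in zip(scores, labels) if not same]
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def pick_threshold(scores, labels):
    """Pick the observed score at which FAR and FRR are closest."""
    def gap(t):
        far, frr = far_frr(scores, labels, t)
        return abs(far - frr)

    return min(sorted(set(scores)), key=gap)


# Made-up toy trial scores, for illustration only.
toy_scores = [0.95, 0.91, 0.88, 0.80, 0.70, 0.60]
toy_labels = [True, True, True, False, False, False]
chosen = pick_threshold(toy_scores, toy_labels)
```

On real data the two score distributions overlap, so the chosen threshold trades false accepts against false rejects; it should be tuned on a held-out set drawn from the target domain.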

  # License