hbredin committed
Commit d9481a5
1 Parent(s): 5750a46

Upload folder using huggingface_hub

Files changed (4)
  1. .DS_Store +0 -0
  2. README.md +111 -0
  3. pytorch_model.bin +3 -0
  4. wespeaker.pt +3 -0
.DS_Store ADDED
Binary file (6.15 kB).
 
README.md ADDED
@@ -0,0 +1,111 @@
+ ---
+ tags:
+ - pyannote
+ - pyannote-audio
+ - pyannote-audio-model
+ - wespeaker
+ - audio
+ - voice
+ - speech
+ - speaker
+ - speaker-recognition
+ - speaker-verification
+ - speaker-identification
+ - speaker-embedding
+ datasets:
+ - voxceleb
+ license: cc-by-4.0
+ inference: false
+ ---
+
+ Using this open-source model in production?
+ Make the most of it thanks to our [consulting services](https://herve.niderb.fr/consulting.html).
+
+ # 🎹 `pyannote` wrapper around `wespeaker-voxceleb-resnet34-LM`
+
+ This model requires `pyannote.audio` version 3.1 or higher (currently in development).
+
+ This is a wrapper around the [WeSpeaker](https://github.com/wenet-e2e/wespeaker) `wespeaker-voxceleb-resnet34-LM` pretrained speaker embedding model, for use in `pyannote.audio`.
+
+ ## Basic usage
+
+ ```python
+ # instantiate pretrained model
+ from pyannote.audio import Model
+ model = Model.from_pretrained("pyannote/wespeaker-voxceleb-resnet34-LM")
+ ```
+
+ ```python
+ from pyannote.audio import Inference
+ inference = Inference(model, window="whole")
+ embedding1 = inference("speaker1.wav")
+ embedding2 = inference("speaker2.wav")
+ # `embeddingX` is a (1 x D) numpy array extracted from the file as a whole.
+
+ from scipy.spatial.distance import cdist
+ distance = cdist(embedding1, embedding2, metric="cosine")[0,0]
+ # `distance` is a `float` describing how dissimilar speakers 1 and 2 are.
+ ```
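
To make the `distance` value above easier to interpret, here is a minimal pure-Python sketch of the quantity a cosine distance measures, namely 1 - cos(u, v). The helper `cosine_distance` and the toy 3-dimensional vectors are hypothetical and only stand in for real embeddings; this is not code from `pyannote.audio` or SciPy.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)

# Toy vectors standing in for embeddings (assumption: real ones have D dimensions).
same_direction = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_distance([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_distance([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0])
```

The value lies between 0 (same direction) and 2 (opposite direction), so smaller means more similar. Any threshold for deciding "same speaker" is not given in this README and would need to be tuned on held-out data.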
+
+ ## Advanced usage
+
+ ### Running on GPU
+
+ ```python
+ import torch
+ # move inference to the GPU (requires a CUDA-capable device)
+ inference.to(torch.device("cuda"))
+ embedding = inference("audio.wav")
+ ```
+
+ ### Extract embedding from an excerpt
+
+ ```python
+ from pyannote.audio import Inference
+ from pyannote.core import Segment
+ inference = Inference(model, window="whole")
+ excerpt = Segment(13.37, 19.81)
+ embedding = inference.crop("audio.wav", excerpt)
+ # `embedding` is a (1 x D) numpy array extracted from the file excerpt.
+ ```
+
+ ### Extract embeddings using a sliding window
+
+ ```python
+ from pyannote.audio import Inference
+ inference = Inference(model, window="sliding",
+                       duration=3.0, step=1.0)
+ embeddings = inference("audio.wav")
+ # `embeddings` is a (N x D) pyannote.core.SlidingWindowFeature
+ # `embeddings[i]` is the embedding of the ith position of the
+ # sliding window, i.e. from [i * step, i * step + duration].
+ ```
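
The window arithmetic described in the comment above can be sketched in plain Python. `window_bounds` is a hypothetical helper, not part of `pyannote.audio`, and it assumes the first window starts at time 0, as the `[i * step, i * step + duration]` formula implies.

```python
def window_bounds(i, duration=3.0, step=1.0):
    """Return (start, end) in seconds of the i-th sliding-window position."""
    if i < 0:
        raise ValueError("window index must be non-negative")
    start = i * step
    return start, start + duration

# First three positions for duration=3.0 and step=1.0.
bounds = [window_bounds(i) for i in range(3)]
# Consecutive windows overlap by duration - step seconds.
overlap = 3.0 - 1.0
```

With these settings each embedding summarizes 3 seconds of audio and shares 2 seconds with its neighbor, so adjacent embeddings are not independent observations.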
+
+ ## License
+
+ According to [this page](https://github.com/wenet-e2e/wespeaker/blob/master/docs/pretrained.md):
+
+ > The pretrained model in WeNet follows the license of it's corresponding dataset. For example, the pretrained model on VoxCeleb follows Creative Commons Attribution 4.0 International License., since it is used as license of the VoxCeleb dataset, see https://mm.kaist.ac.kr/datasets/voxceleb/.
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{Wang2023,
+ title={Wespeaker: A research and production oriented speaker embedding learning toolkit},
+ author={Wang, Hongji and Liang, Chengdong and Wang, Shuai and Chen, Zhengyang and Zhang, Binbin and Xiang, Xu and Deng, Yanlei and Qian, Yanmin},
+ booktitle={ICASSP 2023, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+ pages={1--5},
+ year={2023},
+ organization={IEEE}
+ }
+ ```
+
+ ```bibtex
+ @inproceedings{Bredin23,
+ author={Hervé Bredin},
+ title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
+ year=2023,
+ booktitle={Proc. INTERSPEECH 2023},
+ pages={1983--1987},
+ doi={10.21437/Interspeech.2023-105}
+ }
+ ```
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1c13df4cd484da9c18852da17680d12848dbc1ab7b4f080546b5812c19550c0f
+ size 26644085
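
The three lines above are a Git LFS pointer rather than the model weights themselves: they record the spec version, the SHA-256 digest, and the byte size of the real file. As a rough sketch, a downloaded file could be checked against such a pointer as follows. `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers written for illustration, not part of `huggingface_hub` or Git LFS.

```python
import hashlib

def parse_lfs_pointer(text):
    """Parse the `key value` lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }

def matches_pointer(data, pointer):
    """Check that `data` (bytes) has the size and SHA-256 digest in `pointer`."""
    return (
        pointer["algorithm"] == "sha256"
        and len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )

# Pointer text copied from the pytorch_model.bin entry of this commit.
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:1c13df4cd484da9c18852da17680d12848dbc1ab7b4f080546b5812c19550c0f\n"
    "size 26644085\n"
)
```

For a file of this size (about 26.6 MB), reading it in chunks and updating the hash incrementally would avoid holding everything in memory; the in-memory version is kept here for brevity.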
wespeaker.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9872b375f2c6a3851ca471cbbf59e06efd23a627d78bf5872e1f0269fd298449
+ size 45053131