---
language: "en"
thumbnail: 
tags:
- speechbrain
- embeddings
- Speaker
- Verification
- Identification
- pytorch
- ECAPA-TDNN
license: "apache-2.0"
datasets:
- voxceleb
metrics:
- EER
- Accuracy
widget:
- example_title: VoxCeleb Speaker id10003
  src: https://cdn-media.huggingface.co/speech_samples/VoxCeleb1_00003.wav
- example_title: VoxCeleb Speaker id10004
  src: https://cdn-media.huggingface.co/speech_samples/VoxCeleb_00004.wav
---

# Speaker Identification with ECAPA-TDNN embeddings on VoxCeleb

This repository provides a pretrained ECAPA-TDNN model using SpeechBrain. The system can also be used to extract speaker embeddings. Since we could not find any SpeechBrain- or HuggingFace-compatible checkpoint trained only on the VoxCeleb2 development data, we decided to pre-train an ECAPA-TDNN system from scratch.

# Pipeline description

This system is composed of an ECAPA-TDNN model, which combines convolutional and residual blocks. The embeddings are extracted with attentive statistical pooling, and the system is trained with the Additive Margin Softmax (AM-Softmax) loss.
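For intuition, here is a minimal PyTorch sketch of an AM-Softmax classification head. The margin and scale values below are common defaults for illustration, not necessarily the values used to train this model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxLoss(nn.Module):
    """Additive Margin Softmax: subtract a margin m from the target-class
    cosine similarity, then scale the logits by s before cross-entropy."""

    def __init__(self, embedding_dim, n_classes, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, embedding_dim))
        self.margin = margin  # illustrative default, not the training value
        self.scale = scale    # illustrative default, not the training value

    def forward(self, embeddings, labels):
        # Cosine similarities between L2-normalised embeddings and class weights
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        # Apply the additive margin to the target class only
        one_hot = F.one_hot(labels, num_classes=cosine.size(1)).float()
        logits = self.scale * (cosine - self.margin * one_hot)
        return F.cross_entropy(logits, labels)
```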

We use FBank features (16kHz, 25ms frame length, 10ms hop length, 80 filter-bank channels) as input. The model was trained with an initial learning rate of 0.001 and a batch size of 512, using a cyclical learning rate (CLR) policy, for 10 epochs on 4 A100 GPUs. We apply additive noise and reverberation from the [MUSAN](http://www.openslr.org/17/) and [RIR](http://www.openslr.org/28/) datasets to augment the training data. Pre-training the ECAPA-TDNN model takes approximately seven days.
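SpeechBrain computes these features inside the pretrained pipeline; the torchaudio call below merely illustrates the same FBank configuration (the filename is a placeholder for any 16kHz, single-channel recording).

```python
import torchaudio

# Placeholder filename: any 16 kHz, single-channel recording
signal, fs = torchaudio.load("spk1_snt1.wav")

# 80-channel FBank features: 25 ms frames, 10 ms hop, 16 kHz input
fbank = torchaudio.compliance.kaldi.fbank(
    signal,
    num_mel_bins=80,
    frame_length=25.0,
    frame_shift=10.0,
    sample_frequency=16000.0,
)
print(fbank.shape)  # (num_frames, 80)
```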

# Performance

| Splits | Backend | S-norm | EER (%) | minDCF (0.01) |
|:-------------:|:--------------:|:--------------:|:--------------:|:--------------:|
| VoxCeleb1-O | cosine | no | 1.45 | 0.17 |
| VoxCeleb1-E | cosine | no | TBD | TBD  |
| VoxCeleb1-H | cosine | no | TBD | TBD  |

# Compute the speaker embeddings

The system is trained with recordings sampled at 16kHz (single channel).

```python
import torchaudio
from speechbrain.pretrained import EncoderClassifier  # speechbrain.inference on SpeechBrain >= 1.0

# Download the pretrained model from the Hugging Face Hub
classifier = EncoderClassifier.from_hparams(
    source="yangwang825/ecapa-tdnn-vox2"
)

# Load a 16 kHz, single-channel recording
signal, fs = torchaudio.load('spk1_snt1.wav')

# embeddings has shape [batch, 1, embedding_dim]
embeddings = classifier.encode_batch(signal)
```
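Once embeddings are extracted, speaker verification can be performed by comparing them with a cosine-similarity score (the backend reported in the performance table above). A minimal sketch, assuming two local 16kHz recordings; `spk2_snt1.wav` is a hypothetical second file:

```python
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="yangwang825/ecapa-tdnn-vox2"
)

# 'spk2_snt1.wav' is a hypothetical second recording
signal1, fs1 = torchaudio.load('spk1_snt1.wav')
signal2, fs2 = torchaudio.load('spk2_snt1.wav')

emb1 = classifier.encode_batch(signal1).squeeze()
emb2 = classifier.encode_batch(signal2).squeeze()

# Cosine similarity as the verification score; the accept/reject
# threshold is application-dependent
score = torch.nn.functional.cosine_similarity(emb1, emb2, dim=0)
print(float(score))
```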

You can find our training results (models, logs, etc.) here.