WavLMRawNetSVBase
WavLM Large + RawNet2.5 Speaker Verification Base: End-to-End Speaker Verification Architecture
This architecture combines WavLM Large and RawNet2.5 to learn both micro and macro features directly from raw waveforms. The goal is to obtain a fully end-to-end model, avoiding any manual feature extraction (e.g., MFCC, mel-spectrogram). Instead, the network itself discovers the most relevant frequency and temporal patterns for speaker verification.
Note: If you would like to contribute to this repository, please read the CONTRIBUTING guidelines first.
Table of Contents
- Introduction
- Architecture
- Reports
- Demo
- Prerequisites
- Installation
- Usage
- File Structure
- Version Control System
- Upcoming
- Documentations
- License
- Links
- Team
- Contact
- Citation
Introduction
Combining WavLM Large and RawNet2.5
WavLM Large (Transformer-based)
- Developed by Microsoft, WavLM relies on self-attention layers that capture fine-grained (frame-level) or “micro” acoustic features.
- It produces a 1024-dimensional embedding, focusing on localized, short-term variations in the speech signal.
RawNet2.5 (SincConv + Residual Stack)
- Uses SincConv and residual blocks to summarize the raw signal on a broader (macro) scale.
- The Attentive Stats Pooling layer aggregates mean + std across the entire time axis (with learnable attention), capturing global speaker characteristics.
- Outputs a 256-dimensional embedding, representing the overall, longer-term structure of the speech.
These two approaches complement each other: WavLM Large excels at fine-detailed temporal features, while RawNet2.5 captures a more global, statistical overview.
Architectural Flow
Raw Audio Input
- No manual preprocessing (like MFCC or mel-spectrogram).
- A minimal Transform and Segment step (mono conversion, resample, slice/pad) formats the data into shape (B, T), as sketched below.
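A minimal sketch of such a transform/segment step, assuming 16 kHz audio and a fixed 3-second (48,000-sample) segment; the function name and defaults here are illustrative, not the repository's actual configuration:

```python
import torch
import torchaudio

def transform_and_segment(path: str, target_sr: int = 16000, num_samples: int = 48000) -> torch.Tensor:
    """Load audio, convert to mono, resample, and slice/pad to a fixed length (hypothetical defaults)."""
    waveform, sr = torchaudio.load(path)            # (channels, T)
    waveform = waveform.mean(dim=0)                 # mono: (T,)
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    if waveform.numel() >= num_samples:             # slice ...
        waveform = waveform[:num_samples]
    else:                                           # ... or zero-pad
        waveform = torch.nn.functional.pad(waveform, (0, num_samples - waveform.numel()))
    return waveform                                 # (T,); torch.stack several clips to get (B, T)
```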
RawNet2.5 (Macro Features)
- SincConv: Learns band-pass filters in a frequency-focused manner, constrained by low/high cutoff frequencies.
- ResidualStack: A set of residual blocks (optionally with SEBlock) refines the representation.
- Attentive Stats Pooling: Aggregates time-domain information into mean and std with a learnable attention mechanism (sketched after this list).
- A final FC layer yields a 256-dimensional embedding.
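A minimal sketch of the Attentive Stats Pooling idea described above, assuming a (B, C, T) feature map coming out of the residual stack; the channel and attention sizes are illustrative:

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Pool a (B, C, T) feature map into a (B, 2*C) vector of attention-weighted mean and std."""
    def __init__(self, channels: int = 512, attention_dim: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(channels, attention_dim, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(attention_dim, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Per-channel attention weights over the time axis
        w = torch.softmax(self.attention(x), dim=-1)          # (B, C, T)
        mean = torch.sum(w * x, dim=-1)                       # (B, C)
        var = torch.sum(w * x ** 2, dim=-1) - mean ** 2       # (B, C)
        std = torch.sqrt(var.clamp(min=1e-8))                 # (B, C)
        return torch.cat([mean, std], dim=-1)                 # (B, 2*C)
```

A final fully connected layer then maps this pooled vector down to the 256-dimensional macro embedding.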
WavLM Large (Micro Features)
- Transformer layers operate at frame level, capturing fine-grained details.
- Produces a 1024-dimensional embedding after mean pooling across time (see the sketch below).
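A minimal sketch of this micro path using the Hugging Face transformers implementation of WavLM; the checkpoint name and the mean-pooling step are assumptions about how the 1024-dim embedding is obtained, not necessarily how this repository loads the model:

```python
import torch
from transformers import WavLMModel

# Hypothetical micro-feature path (checkpoint name is an assumption).
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large")
wavlm.eval()

waveform = torch.randn(2, 48000)                 # (B, T) raw 16 kHz audio
with torch.no_grad():
    hidden = wavlm(waveform).last_hidden_state   # (B, frames, 1024) frame-level features
embedding = hidden.mean(dim=1)                   # (B, 1024) after mean pooling across time
```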
Fusion Layer
- Concatenate the 256-dim RawNet2.5 embedding with the 1024-dim WavLM embedding, resulting in 1280 dimensions.
- A Linear(1280 → 256) + ReLU layer reduces it to a 256-dim Fusion Embedding, combining micro and macro insights (see the sketch below).
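A minimal sketch of the fusion step as described above; the embedding tensors here are random placeholders standing in for the two branch outputs:

```python
import torch
import torch.nn as nn

# Concatenate the 256-dim macro and 1024-dim micro embeddings, then project to 256 dims.
fusion = nn.Sequential(nn.Linear(1280, 256), nn.ReLU())

rawnet_emb = torch.randn(2, 256)    # macro embedding from RawNet2.5
wavlm_emb = torch.randn(2, 1024)    # micro embedding from WavLM Large
fused = fusion(torch.cat([rawnet_emb, wavlm_emb], dim=-1))   # (B, 256) fusion embedding
```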
AMSoftmax Loss
- During training, the 256-dim fusion embedding is passed to an AMSoftmax classifier (with margin + scale); a sketch follows below.
- Embeddings of the same speaker are pulled closer, while different speakers are pushed apart in the angular space.
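A minimal sketch of an AM-Softmax head over L2-normalized embeddings; the margin, scale, and speaker count (1,211 speakers in the VoxCeleb1 dev set) are illustrative and may differ from the repository's actual training configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxLoss(nn.Module):
    """Additive-margin softmax over L2-normalized embeddings and class weights (hypothetical defaults)."""
    def __init__(self, embed_dim: int = 256, num_classes: int = 1211,
                 margin: float = 0.2, scale: float = 30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, embed_dim))
        nn.init.xavier_normal_(self.weight)
        self.margin, self.scale = margin, scale

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between normalized embeddings and normalized class weights
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))  # (B, num_classes)
        # Subtract the margin only from the target-class cosine similarity
        one_hot = F.one_hot(labels, cosine.size(1)).to(cosine.dtype)
        logits = self.scale * (cosine - self.margin * one_hot)
        return F.cross_entropy(logits, labels)
```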
A Single End-to-End Learning Pipeline
- Fully Automatic: Raw waveforms go in, final speaker embeddings come out.
- No Manual Feature Extraction: We do not rely on handcrafted features like MFCC or mel-spectrogram.
- Data-Driven: The model itself figures out which frequency bands or time segments matter most.
- Enhanced Representation: WavLM delivers local detail, RawNet2.5 captures global stats, leading to a more robust speaker representation.
Why Avoid Preprocessing?
- Deep Learning Principle: The model should learn how to process raw signals rather than relying on human-defined feature pipelines.
- Better Generalization: Fewer hand-tuned hyperparameters; the model adapts better to various speakers, languages, and environments.
- Scientific Rigor: Manual feature engineering can introduce subjective design choices. Letting the network learn directly from data is more consistent with data-driven approaches.
Performance & Advantages
Micro + Macro Features Combined
- Captures both short-term acoustic nuances (WavLM) and holistic temporal stats (RawNet2.5).
Truly End-to-End
- Beyond minimal slicing/padding, all layers are trainable.
- No handcrafted feature extraction is involved.
VoxCeleb1 Test Results
- Achieved an EER of 4.67% on the VoxCeleb1 evaluation set.
Overall Benefits
- Potentially outperforms WavLM or RawNet2.5 used alone on standard metrics like EER and minDCF.
- Combining both scales of analysis yields a richer speaker representation.
In essence, WavLM Large + RawNet2.5 merges two scales of speaker representation to produce a unified 256-dim embedding. By staying fully end-to-end, the architecture remains flexible and can leverage large amounts of data for improved speaker verification results.
Architecture
Reports
Benchmark
Speaker Verification Benchmark on VoxCeleb1 Dataset
| Model | EER (%) |
|---|---|
| ReDimNet-B6-SF2-LM-ASNorm | 0.37 |
| WavLM+ECAPA-TDNN | 0.39 |
| ... | ... |
| TitaNet-L | 0.68 |
| ... | ... |
| SpeechNAS | 1.02 |
| ... | ... |
| Multi Task SSL | 1.98 |
| ... | ... |
| WavLMRawNetSVBase | 4.67 |
Note
- For a detailed notebook showing how the test for WavLMRawNetSVBase was performed, please see notebook/test.ipynb.
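For reference, a minimal sketch of how an EER value like the one above can be computed from verification trial scores and labels; this is a generic computation, not necessarily the exact procedure used in notebook/test.ipynb:

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """Equal error rate: the operating point where false-accept and false-reject rates are equal."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2)

# Example: cosine scores for trial pairs and 0/1 target labels
scores = np.array([0.82, 0.10, 0.65, 0.30])
labels = np.array([1, 0, 1, 0])
print(f"EER: {compute_eer(scores, labels):.2%}")
```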
Prerequisites
Inference
- Python 3.11 (or above)
For training from scratch
- 10GB Disk Space (for VoxCeleb1 Dataset)
- 12GB VRAM GPU (or above)
Installation
Linux/Ubuntu
sudo apt update -y && sudo apt upgrade -y
sudo apt install -y ffmpeg
git clone https://github.com/bunyaminergen/WavLMRawNetSVBase
cd WavLMRawNetSVBase
conda env create -f environment.yaml
conda activate WavLMRawNetSVBase
Dataset Download (if training from scratch)
Please go to the URL and register: KAIST MM
After receiving the e-mail, you can download the dataset directly by clicking the link in it, or use the following commands.
Note: To download from the command line, copy the key parameter from the link in the e-mail and substitute it for <YOUR_KEY> in the commands below.
To download the List of trial pairs - VoxCeleb1 (cleaned), please go to the URL: VoxCeleb
VoxCeleb1
Dev A
wget -c --no-check-certificate -O vox1_dev_wav_partaa "https://cn01.mmai.io/download/voxceleb?key=<YOUR_KEY>&file=vox1_dev_wav_partaa"
Dev B
wget -c --no-check-certificate -O vox1_dev_wav_partab "https://cn01.mmai.io/download/voxceleb?key=<YOUR_KEY>&file=vox1_dev_wav_partab"
Dev C
wget -c --no-check-certificate -O vox1_dev_wav_partac "https://cn01.mmai.io/download/voxceleb?key=<YOUR_KEY>&file=vox1_dev_wav_partac"
Dev D
wget -c --no-check-certificate -O vox1_dev_wav_partad "https://cn01.mmai.io/download/voxceleb?key=<YOUR_KEY>&file=vox1_dev_wav_partad"
Concatenate
cat vox1_dev* > vox1_dev_wav.zip
Test
wget -c --no-check-certificate -O vox1_test_wav.zip "https://cn01.mmai.io/download/voxceleb?key=<YOUR_KEY>&file=vox1_test_wav.zip"
List of trial pairs - VoxCeleb1 (cleaned)
wget https://mm.kaist.ac.kr/datasets/voxceleb/meta/veri_test2.txt
Version Control System
Releases
Branches