WavLMRawNetSVBase

WavLM Large + RawNet2.5 Speaker Verification Base: End-to-End Speaker Verification Architecture

This architecture combines WavLM Large and RawNet2.5 to learn both micro and macro features directly from raw waveforms. The goal is to obtain a fully end-to-end model, avoiding any manual feature extraction (e.g., MFCC, mel-spectrogram). Instead, the network itself discovers the most relevant frequency and temporal patterns for speaker verification.

Note: If you would like to contribute to this repository, please read the CONTRIBUTING guidelines first.




Introduction

Combine WavLM Large and RawNet2.5
  • WavLM Large (Transformer-based)

    • Developed by Microsoft, WavLM relies on self-attention layers that capture fine-grained (frame-level) or “micro” acoustic features.
    • It produces a 1024-dimensional embedding, focusing on localized, short-term variations in the speech signal.
  • RawNet2.5 (SincConv + Residual Stack)

    • Uses SincConv and residual blocks to summarize the raw signal on a broader (macro) scale.
    • The Attentive Stats Pooling layer aggregates mean + std across the entire time axis (with learnable attention), capturing global speaker characteristics.
    • Outputs a 256-dimensional embedding, representing the overall, longer-term structure of the speech.

These two approaches complement each other: WavLM Large excels at fine-grained temporal detail, while RawNet2.5 captures a more global, statistical overview.
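
For illustration, the 1024-dim micro embedding can be obtained from the Hugging Face Transformers implementation of WavLM Large by mean-pooling its frame-level outputs. This is a minimal sketch, not the repository's exact wrapper; the checkpoint name microsoft/wavlm-large and the dummy 3-second waveform are assumptions made for the example.

import torch
from transformers import WavLMModel

# Load the pretrained WavLM Large encoder (hidden size 1024).
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large")
wavlm.eval()

# (B, T) raw waveform; a random 3-second, 16 kHz signal stands in for real audio.
waveform = torch.randn(1, 16000 * 3)

with torch.no_grad():
    frames = wavlm(waveform).last_hidden_state   # (B, T_frames, 1024) frame-level "micro" features
embedding = frames.mean(dim=1)                   # mean pooling over time -> (B, 1024)
print(embedding.shape)                           # torch.Size([1, 1024])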

Architectural Flow
  • Raw Audio Input

    • No manual preprocessing (like MFCC or mel-spectrogram).
    • A minimal Transform and Segment step (mono conversion, resample, slice/pad) formats the data into shape (B, T).
  • RawNet2.5 (Macro Features)

    • SincConv: Learns band-pass filters in a frequency-focused manner, constrained by low/high cutoff frequencies.
    • ResidualStack: A set of residual blocks (optionally with SEBlock) refines the representation.
    • Attentive Stats Pooling: Aggregates time-domain information into mean and std with a learnable attention mechanism.
    • A final FC layer yields a 256-dimensional embedding.
  • WavLM Large (Micro Features)

    • Transformer layers operate at frame-level, capturing fine-grained details.
    • Produces a 1024-dimensional embedding after mean pooling across time.
  • Fusion Layer

    • Concatenate the 256-dim RawNet2.5 embedding with the 1024-dim WavLM embedding, resulting in 1280 dimensions.
    • A Linear(1280 → 256) + ReLU layer reduces it to a 256-dim Fusion Embedding, combining micro and macro insights (see the PyTorch sketch after this list).
  • AMSoftmax Loss

    • During training, the 256-dim fusion embedding is passed to an AMSoftmax classifier (with margin + scale).
    • Embeddings of the same speaker are pulled closer, while different speakers are pushed apart in the angular space.
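
The following is a minimal PyTorch sketch of the attentive stats pooling, fusion, and AMSoftmax pieces described above. It is illustrative only and does not reproduce the repository's exact layer definitions; the attention bottleneck width (128), channel count (128), speaker count (1211, the size of the VoxCeleb1 dev set), and the margin/scale values (0.2 / 30.0) are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentiveStatsPooling(nn.Module):
    # Attention-weighted mean and std over the time axis (the "macro" statistics).
    def __init__(self, channels, hidden=128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(hidden, channels, kernel_size=1),
        )

    def forward(self, x):  # x: (B, C, T) feature map from the residual stack
        w = torch.softmax(self.attention(x), dim=-1)   # attention weights over time
        mean = torch.sum(w * x, dim=-1)                # weighted mean, (B, C)
        var = torch.sum(w * x ** 2, dim=-1) - mean ** 2
        std = torch.sqrt(var.clamp(min=1e-9))          # weighted std, (B, C)
        return torch.cat([mean, std], dim=-1)          # (B, 2C)


class FusionHead(nn.Module):
    # Concatenates the 256-dim macro and 1024-dim micro embeddings, then projects to 256 dims.
    def __init__(self, macro_dim=256, micro_dim=1024, out_dim=256):
        super().__init__()
        self.fusion = nn.Sequential(nn.Linear(macro_dim + micro_dim, out_dim), nn.ReLU())

    def forward(self, macro_emb, micro_emb):
        return self.fusion(torch.cat([macro_emb, micro_emb], dim=-1))  # (B, 256)


class AMSoftmaxLoss(nn.Module):
    # Additive-margin softmax: cosine logits, margin m subtracted from the target class, scaled by s.
    # num_speakers, m and s are illustrative assumptions, not the repository's settings.
    def __init__(self, emb_dim=256, num_speakers=1211, m=0.2, s=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_speakers, emb_dim))
        self.m, self.s = m, s

    def forward(self, emb, labels):
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))     # (B, num_speakers)
        margin = torch.zeros_like(cos).scatter_(1, labels.unsqueeze(1), self.m)
        return F.cross_entropy(self.s * (cos - margin), labels)


# Toy forward pass with random tensors standing in for real branch outputs.
feature_map = torch.randn(4, 128, 200)                     # (B, C, T) from the SincConv + residual stack
stats = AttentiveStatsPooling(channels=128)(feature_map)   # (4, 256): weighted mean + std
macro = nn.Linear(256, 256)(stats)                         # final FC layer -> 256-dim macro embedding
micro = torch.randn(4, 1024)                               # WavLM Large embedding (mean-pooled over frames)
fused = FusionHead()(macro, micro)                         # (4, 256) fusion embedding
loss = AMSoftmaxLoss()(fused, torch.tensor([0, 1, 2, 3]))  # labels are speaker indices
print(fused.shape, loss.item())

Subtracting the margin m from the target-class cosine logit before scaling is what pulls same-speaker embeddings together and pushes different speakers apart in the angular space.
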
A Single End-to-End Learning Pipeline
  • Fully Automatic: Raw waveforms go in, final speaker embeddings come out.
  • No Manual Feature Extraction: We do not rely on handcrafted features like MFCC or mel-spectrogram.
  • Data-Driven: The model itself figures out which frequency bands or time segments matter most.
  • Enhanced Representation: WavLM delivers local detail, RawNet2.5 captures global stats, leading to a more robust speaker representation.
Why Avoid Preprocessing?
  • Deep Learning Principle: The model should learn how to process raw signals rather than relying on human-defined feature pipelines.
  • Better Generalization: Fewer hand-tuned hyperparameters; the model adapts better to various speakers, languages, and environments.
  • Scientific Rigor: Manual feature engineering can introduce subjective design choices. Letting the network learn directly from data is more consistent with data-driven approaches.
Performance & Advantages
  • Micro + Macro Features Combined

    • Captures both short-term acoustic nuances (WavLM) and holistic temporal stats (RawNet2.5).
  • Truly End-to-End

    • Beyond minimal slicing/padding, all layers are trainable.
    • No handcrafted feature extraction is involved.
  • VoxCeleb1 Test Results

    • Achieved an EER of 4.67% on the VoxCeleb1 evaluation set.
  • Overall Benefits

    • Potentially outperforms WavLM or RawNet2.5 used alone on standard metrics such as EER and minDCF.
    • Combining both scales of analysis yields a richer speaker representation.

In essence, WavLM Large + RawNet2.5 merges two scales of speaker representation to produce a unified 256-dim embedding. By staying fully end-to-end, the architecture remains flexible and can leverage large amounts of data for improved speaker verification results.


Architecture

(Architecture diagram)


Reports

Benchmark

Speaker Verification Benchmark on VoxCeleb1 Dataset

Model                        EER (%)
ReDimNet-B6-SF2-LM-ASNorm    0.37
WavLM+ECAPA-TDNN             0.39
...                          ...
TitaNet-L                    0.68
...                          ...
SpeechNAS                    1.02
...                          ...
Multi Task SSL               1.98
...                          ...
WavLMRawNetSVBase            4.67

Note

  • For a detailed notebook showing how the test for WavLMRawNetSVBase was performed, please see notebook/test.ipynb.
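
The evaluation itself lives in the notebook; purely for illustration, here is a minimal sketch of how an EER can be computed from cosine-similarity scores over trial pairs such as those in veri_test2.txt. The scoring pipeline and the synthetic toy data below are assumptions, not the repository's actual test code.

import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    # labels: 1 = same speaker, 0 = different speaker; scores: cosine similarities.
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))        # operating point where FPR ~= FNR
    return (fpr[idx] + fnr[idx]) / 2.0

# Toy example with synthetic scores; in practice the scores come from fusion-embedding pairs.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
scores = labels + 0.8 * rng.normal(size=1000)    # same-speaker pairs score higher on average
print(f"EER: {equal_error_rate(labels, scores):.4f}")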

Prerequisites

Inference
  • Python 3.11 (or above)
For training from scratch
  • 10 GB of disk space (for the VoxCeleb1 dataset)
  • GPU with at least 12 GB of VRAM

Installation

Linux/Ubuntu
sudo apt update -y && sudo apt upgrade -y
sudo apt install -y ffmpeg
git clone https://github.com/bunyaminergen/WavLMRawNetSVBase
cd WavLMRawNetSVBase
conda env create -f environment.yaml
conda activate WavLMRawNetSVBase
Dataset Download (if training from scratch)
  1. Please go to the following URL and register: KAIST MM

  2. After receiving the e-mail, you can download the dataset either by clicking the link in the e-mail or by using the commands below.

    Note: To download from the command line, copy the key parameter from the link in the e-mail and substitute it for <YOUR_KEY> in the commands below.

  3. To download the List of trial pairs - VoxCeleb1 (cleaned), please go to the URL: VoxCeleb

VoxCeleb1

Dev A

wget -c --no-check-certificate -O vox1_dev_wav_partaa "https://cn01.mmai.io/download/voxceleb?key=<YOUR_KEY>&file=vox1_dev_wav_partaa"

Dev B

wget -c --no-check-certificate -O vox1_dev_wav_partab "https://cn01.mmai.io/download/voxceleb?key=<YOUR_KEY>&file=vox1_dev_wav_partab"

Dev C

wget -c --no-check-certificate -O vox1_dev_wav_partac "https://cn01.mmai.io/download/voxceleb?key=<YOUR_KEY>&file=vox1_dev_wav_partac"

Dev D

wget -c --no-check-certificate -O vox1_dev_wav_partad "https://cn01.mmai.io/download/voxceleb?key=<YOUR_KEY>&file=vox1_dev_wav_partad"

Concatenate

cat vox1_dev* > vox1_dev_wav.zip

Test

wget -c --no-check-certificate -O vox1_test_wav.zip "https://cn01.mmai.io/download/voxceleb?key=<YOUR_KEY>&file=vox1_test_wav.zip"

List of trial pairs - VoxCeleb1 (cleaned)

wget https://mm.kaist.ac.kr/datasets/voxceleb/meta/veri_test2.txt

Version Control System

Releases
Branches
