---
license: gpl-3.0
metrics:
  - Equal Error Rate (EER)
language:
  - en
pipeline_tag: feature-extraction
tags:
  - speech
  - speakerverification
  - voxceleb
  - wavlm
  - rawnet
---
# WavLMRawNetSVBase

### `WavLM Large + RawNet2.5 Speaker Verification Base: End-to-End Speaker Verification Architecture`

This architecture combines **WavLM Large** and **RawNet2.5** to learn both **micro** and **macro** features directly from raw waveforms. The goal is to obtain a **fully end-to-end** model, avoiding any manual feature extraction (e.g., MFCC, mel-spectrogram). Instead, the network itself discovers the most relevant frequency and temporal patterns for speaker verification.

**Note**: _If you would like to contribute to this repository, please read the [CONTRIBUTING](.docs/documentation/CONTRIBUTING.md) first._

![License](https://img.shields.io/github/license/bunyaminergen/WavLMRawNetSVBase)
![GitHub release (latest by date)](https://img.shields.io/github/v/release/bunyaminergen/WavLMRawNetSVBase)
![GitHub Discussions](https://img.shields.io/github/discussions/bunyaminergen/WavLMRawNetSVBase)
![GitHub Issues](https://img.shields.io/github/issues/bunyaminergen/WavLMRawNetSVBase)
[![LinkedIn](https://img.shields.io/badge/LinkedIn-Profile-blue?logo=linkedin)](https://linkedin.com/in/bunyaminergen)
---

### Table of Contents

- [Introduction](#introduction)
- [Architecture](#architecture)
- [Reports](#reports)
- [Demo](#demo)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Usage](#usage)
- [File Structure](#file-structure)
- [Version Control System](#version-control-system)
- [Upcoming](#upcoming)
- [Documentations](#documentations)
- [License](#licence)
- [Links](#links)
- [Team](#team)
- [Contact](#contact)
- [Citation](#citation)

---

### Introduction

##### Combine WavLM Large and RawNet2.5

- WavLM Large (Transformer-based)
    - Developed by Microsoft, WavLM relies on self-attention layers that capture fine-grained (`frame-level`), or “micro”, acoustic features.
    - It produces a **1024-dimensional** embedding, focusing on localized, short-term variations in the speech signal.
- RawNet2.5 (SincConv + Residual Stack)
    - Uses SincConv and residual blocks to summarize the raw signal on a broader (macro) scale.
    - The **Attentive Stats Pooling** layer aggregates mean + std across the entire time axis (with learnable attention), capturing global speaker characteristics.
    - Outputs a **256-dimensional** embedding, representing the overall, longer-term structure of the speech.

These two approaches complement each other: WavLM Large excels at fine-grained temporal features, while RawNet2.5 captures a more global, statistical overview.

##### Architectural Flow

- Raw Audio Input
    - **No manual preprocessing** (such as MFCC or mel-spectrogram).
    - A minimal **Transform** and **Segment** step (mono conversion, resample, slice/pad) formats the data into shape `(B, T)`.
- RawNet2.5 (Macro Features)
    - SincConv: Learns band-pass filters in a frequency-focused manner, constrained by low/high cutoff frequencies.
    - ResidualStack: A set of residual blocks (optionally with SEBlock) refines the representation.
    - Attentive Stats Pooling: Aggregates time-domain information into mean and std with a learnable attention mechanism.
    - A final **FC** layer yields a 256-dimensional embedding.
- WavLM Large (Micro Features)
    - Transformer layers operate at `frame-level`, capturing fine-grained details.
    - Produces a **1024-dimensional** embedding after mean pooling across time.
- Fusion Layer
    - Concatenates the **256-dim** RawNet2.5 embedding with the **1024-dim** WavLM embedding, resulting in **1280** dimensions.
    - A **Linear(1280 → 256) + ReLU** layer reduces it to a **256-dim Fusion Embedding**, combining micro and macro insights.
- AMSoftmax Loss
    - During training, the 256-dim fusion embedding is passed to an AMSoftmax classifier (with margin + scale).
    - Embeddings of the same speaker are pulled closer, while different speakers are pushed apart in angular space (see the sketch after this list).
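The snippet below is a minimal, illustrative PyTorch sketch of the pooling, fusion, and AMSoftmax steps described above. The class names (`AttentiveStatsPooling`, `FusionHead`, `AMSoftmaxLoss`), the attention bottleneck size, and the default margin/scale/speaker-count values are assumptions made for this example; only the 1024-dim / 256-dim / 1280-dim flow comes from this README, so treat it as a conceptual outline rather than the repository's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentiveStatsPooling(nn.Module):
    """Attention-weighted mean + std over the time axis (macro branch)."""

    def __init__(self, channels: int):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(channels, max(channels // 8, 1), kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(max(channels // 8, 1), channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) feature map coming out of the residual stack
        w = torch.softmax(self.attention(x), dim=-1)          # attention weights over time
        mean = torch.sum(x * w, dim=-1)                       # weighted mean, (B, C)
        var = torch.sum((x ** 2) * w, dim=-1) - mean ** 2
        std = torch.sqrt(var.clamp(min=1e-9))                 # weighted std, (B, C)
        return torch.cat([mean, std], dim=-1)                 # (B, 2C), projected to 256-dim by a final FC


class FusionHead(nn.Module):
    """Concatenate micro (WavLM, 1024-dim) and macro (RawNet2.5, 256-dim) embeddings."""

    def __init__(self, wavlm_dim: int = 1024, rawnet_dim: int = 256, out_dim: int = 256):
        super().__init__()
        self.fusion = nn.Sequential(nn.Linear(wavlm_dim + rawnet_dim, out_dim), nn.ReLU())

    def forward(self, wavlm_emb: torch.Tensor, rawnet_emb: torch.Tensor) -> torch.Tensor:
        return self.fusion(torch.cat([wavlm_emb, rawnet_emb], dim=-1))  # (B, 1280) -> (B, 256)


class AMSoftmaxLoss(nn.Module):
    """Additive-margin softmax over the 256-dim fusion embedding (training only)."""

    def __init__(self, embed_dim: int = 256, num_speakers: int = 1211,  # e.g., VoxCeleb1 dev speakers
                 margin: float = 0.2, scale: float = 30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_speakers, embed_dim))
        self.margin, self.scale = margin, scale

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarities between normalized embeddings and class weights
        cosine = F.linear(F.normalize(emb), F.normalize(self.weight))   # (B, num_speakers)
        # Subtract the margin only from the target-class logit, then scale
        target = torch.zeros_like(cosine).scatter_(1, labels.unsqueeze(1), self.margin)
        return F.cross_entropy(self.scale * (cosine - target), labels)
```

At inference time only the 256-dim fusion embedding is kept; the AMSoftmax classifier is used solely to shape the embedding space during training.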
##### A Single End-to-End Learning Pipeline

- Fully Automatic: Raw waveforms go in, final speaker embeddings come out.
- No Manual Feature Extraction: We do not rely on handcrafted features like MFCC or mel-spectrogram.
- Data-Driven: The model itself figures out which frequency bands or time segments matter most.
- Enhanced Representation: WavLM delivers local detail, RawNet2.5 captures global stats, leading to a more robust speaker representation.

##### Why Avoid Preprocessing?

- Deep Learning Principle: The model should learn how to process raw signals rather than relying on human-defined feature pipelines.
- Better Generalization: Fewer hand-tuned hyperparameters; the model adapts better to various speakers, languages, and environments.
- Scientific Rigor: Manual feature engineering can introduce subjective design choices. Letting the network learn directly from data is more consistent with data-driven approaches.

##### Performance & Advantages

- Micro + Macro Features Combined
    - Captures both short-term acoustic nuances (WavLM) and holistic temporal stats (RawNet2.5).
- Truly End-to-End
    - Beyond minimal slicing/padding, all layers are trainable.
    - No handcrafted feature extraction is involved.
- VoxCeleb1 Test Results
    - Achieved an **EER of 4.67%** on the VoxCeleb1 evaluation set.
- Overall Benefits
    - Potentially outperforms using WavLM or RawNet2.5 alone on standard metrics like EER and minDCF.
    - Combining both scales of analysis yields a richer speaker representation.

In essence, **WavLM Large + RawNet2.5** merges two scales of speaker representation to produce a **unified 256-dim embedding**. By staying fully end-to-end, the architecture remains flexible and can leverage large amounts of data for improved speaker verification results.

---

### Architecture

![Architecture](.docs/img/architecture/WavLMRawNetSVBase.gif)

---

### Reports

##### Benchmark

*Speaker Verification Benchmark on VoxCeleb1 Dataset*

| Model                         | EER (%) |
|-------------------------------|---------|
| **ReDimNet-B6-SF2-LM-ASNorm** | 0.37    |
| **WavLM+ECAPA-TDNN**          | 0.39    |
| ...                           | ...     |
| **TitanNet-L**                | 0.68    |
| ...                           | ...     |
| **SpeechNAS**                 | 1.02    |
| ...                           | ...     |
| **Multi Task SSL**            | 1.98    |
| ...                           | ...     |
| **WavLMRawNetSVBase**         | 4.67    |

> **Note**
> - For a detailed notebook showing how the test for `WavLMRawNetSVBase` was performed, please see
>   [`notebook/test.ipynb`](notebook/test.ipynb).
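As a complement to the benchmark table, the sketch below illustrates how an equal error rate can be computed from cosine-similarity trial scores. It is not taken from [`notebook/test.ipynb`](notebook/test.ipynb); the helper names and the use of `numpy`/`scikit-learn` are assumptions made for this example.

```python
import numpy as np
from sklearn.metrics import roc_curve  # assumed to be available in the environment


def cosine_score(emb1: np.ndarray, emb2: np.ndarray) -> float:
    """Cosine similarity between two 256-dim fusion embeddings."""
    return float(np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2) + 1e-9))


def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    """EER: the operating point where false acceptance and false rejection rates are equal."""
    fpr, tpr, _ = roc_curve(labels, scores)   # labels: 1 = same speaker, 0 = different speaker
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))     # threshold where FAR ≈ FRR
    return float((fpr[idx] + fnr[idx]) / 2.0)


# Toy trial scores; a real evaluation scores the veri_test2.txt trial pairs with the fusion embeddings.
labels = np.array([1, 1, 0, 0, 1, 0])
scores = np.array([0.82, 0.75, 0.40, 0.55, 0.68, 0.30])
print(f"EER: {equal_error_rate(labels, scores) * 100:.2f}%")
```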
---

### Prerequisites

##### Inference

- `Python3.11` _(or above)_

##### For training from scratch

- `10GB Disk Space` _(for VoxCeleb1 Dataset)_
- `12GB VRAM GPU` _(or above)_

---

### Installation

##### Linux/Ubuntu

```bash
sudo apt update -y && sudo apt upgrade -y
```

```bash
sudo apt install -y ffmpeg
```

```bash
git clone https://github.com/bunyaminergen/WavLMRawNetSVBase
```

```bash
cd WavLMRawNetSVBase
```

```bash
conda env create -f environment.yaml
```

```bash
conda activate WavLMRawNetSVBase
```

##### Dataset Download (if training from scratch)

1. Please go to the URL and register: [KAIST MM](https://cn01.mmai.io/keyreq/voxceleb)
2. After receiving the e-mail, you can either download the dataset directly by clicking the link in the e-mail or use the following commands.
   **Note**: *To download from the command line, take the `key` parameter from the link in the e-mail and place it in the corresponding position in the commands below.*
3. To download the `List of trial pairs - VoxCeleb1 (cleaned)`, please go to the URL: [VoxCeleb](https://mm.kaist.ac.kr/datasets/voxceleb/)

**VoxCeleb1**

Dev A

```bash
wget -c --no-check-certificate -O vox1_dev_wav_partaa "https://cn01.mmai.io/download/voxceleb?key=&file=vox1_dev_wav_partaa"
```

Dev B

```bash
wget -c --no-check-certificate -O vox1_dev_wav_partab "https://cn01.mmai.io/download/voxceleb?key=&file=vox1_dev_wav_partab"
```

Dev C

```bash
wget -c --no-check-certificate -O vox1_dev_wav_partac "https://cn01.mmai.io/download/voxceleb?key=&file=vox1_dev_wav_partac"
```

Dev D

```bash
wget -c --no-check-certificate -O vox1_dev_wav_partad "https://cn01.mmai.io/download/voxceleb?key=&file=vox1_dev_wav_partad"
```

Concatenate

```bash
cat vox1_dev* > vox1_dev_wav.zip
```

Test

```bash
wget -c --no-check-certificate -O vox1_test_wav.zip "https://cn01.mmai.io/download/voxceleb?key=&file=vox1_test_wav.zip"
```

List of trial pairs - VoxCeleb1 (cleaned)

```bash
wget https://mm.kaist.ac.kr/datasets/voxceleb/meta/veri_test2.txt
```

---

### Version Control System

##### Releases

- [v1.0.0](https://github.com/bunyaminergen/WavLMRawNetSVBase/archive/refs/tags/v1.0.0.zip) _.zip_
- [v1.0.0](https://github.com/bunyaminergen/WavLMRawNetSVBase/archive/refs/tags/v1.0.0.tar.gz) _.tar.gz_

##### Branches

- [main](https://github.com/bunyaminergen/WavLMRawNetSVBase/main/)
- [develop](https://github.com/bunyaminergen/WavLMRawNetSVBase/develop/)

---