---
license: gpl-3.0
metrics:
  - Equal Error Rate (EER)
language:
  - en
pipeline_tag: feature-extraction
tags:
  - speech
  - speakerverification
  - voxceleb
  - wavlm
  - rawnet
---
# WavLMRawNetSVBase

### `WavLM Large + RawNet2.5 Speaker Verification Base: End-to-End Speaker Verification Architecture`

This architecture combines **WavLM Large** and **RawNet2.5** to learn both **micro** and **macro** features directly from raw waveforms. The goal is to obtain a **fully end-to-end** model, avoiding any manual feature extraction (e.g., MFCC, mel-spectrogram). Instead, the network itself discovers the most relevant frequency and temporal patterns for speaker verification.

**Note**: _If you would like to contribute to this repository, please read the [CONTRIBUTING](.docs/documentation/CONTRIBUTING.md) first._

![License](https://img.shields.io/github/license/bunyaminergen/WavLMRawNetSVBase)
![GitHub release (latest by date)](https://img.shields.io/github/v/release/bunyaminergen/WavLMRawNetSVBase)
![GitHub Discussions](https://img.shields.io/github/discussions/bunyaminergen/WavLMRawNetSVBase)
![GitHub Issues](https://img.shields.io/github/issues/bunyaminergen/WavLMRawNetSVBase)
[![LinkedIn](https://img.shields.io/badge/LinkedIn-Profile-blue?logo=linkedin)](https://linkedin.com/in/bunyaminergen)
---

### Table of Contents

- [Introduction](#introduction)
- [Architecture](#architecture)
- [Reports](#reports)
- [Demo](#demo)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Usage](#usage)
- [File Structure](#file-structure)
- [Version Control System](#version-control-system)
- [Upcoming](#upcoming)
- [Documentations](#documentations)
- [License](#licence)
- [Links](#links)
- [Team](#team)
- [Contact](#contact)
- [Citation](#citation)

---

### Introduction

##### Combine WavLM Large and RawNet2.5

- WavLM Large (Transformer-based)
    - Developed by Microsoft, WavLM relies on self-attention layers that capture fine-grained (`frame-level`), or “micro”, acoustic features.
    - It produces a **1024-dimensional** embedding, focusing on localized, short-term variations in the speech signal.
- RawNet2.5 (SincConv + Residual Stack)
    - Uses SincConv and residual blocks to summarize the raw signal on a broader (macro) scale.
    - The **Attentive Stats Pooling** layer aggregates mean + std across the entire time axis (with learnable attention), capturing global speaker characteristics.
    - Outputs a **256-dimensional** embedding, representing the overall, longer-term structure of the speech.

These two approaches complement each other: WavLM Large excels at fine-grained temporal features, while RawNet2.5 captures a more global, statistical overview.

##### Architectural Flow

- Raw Audio Input
    - **No manual preprocessing** (such as MFCC or mel-spectrogram).
    - A minimal **Transform** and **Segment** step (mono conversion, resample, slice/pad) formats the data into shape `(B, T)`.
- RawNet2.5 (Macro Features)
    - SincConv: Learns band-pass filters in a frequency-focused manner, constrained by low/high cutoff frequencies.
    - ResidualStack: A set of residual blocks (optionally with SEBlock) refines the representation.
    - Attentive Stats Pooling: Aggregates time-domain information into mean and std with a learnable attention mechanism.
    - A final **FC** layer yields a 256-dimensional embedding.
- WavLM Large (Micro Features)
    - Transformer layers operate at `frame-level`, capturing fine-grained details.
    - Produces a **1024-dimensional** embedding after mean pooling across time.
- Fusion Layer
    - Concatenates the **256-dim** RawNet2.5 embedding with the **1024-dim** WavLM embedding, resulting in **1280** dimensions.
    - A **Linear(1280 → 256) + ReLU** layer reduces it to a **256-dim Fusion Embedding**, combining micro and macro insights.
- AMSoftmax Loss
    - During training, the 256-dim fusion embedding is passed to an AMSoftmax classifier (with margin + scale).
    - Embeddings of the same speaker are pulled closer, while different speakers are pushed apart in angular space (see the sketch after this list).
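The snippet below is a minimal, illustrative PyTorch sketch of the pooling, fusion, and AMSoftmax steps described above. The class names (`AttentiveStatsPooling`, `FusionHead`, `AMSoftmaxLoss`), the attention bottleneck size, and the default margin/scale/speaker-count values are assumptions made for this example; only the 1024-dim / 256-dim / 1280-dim flow comes from this README, so treat it as a conceptual outline rather than the repository's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentiveStatsPooling(nn.Module):
    """Attention-weighted mean + std over the time axis (macro branch)."""

    def __init__(self, channels: int):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(channels, max(channels // 8, 1), kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(max(channels // 8, 1), channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) feature map coming out of the residual stack
        w = torch.softmax(self.attention(x), dim=-1)          # attention weights over time
        mean = torch.sum(x * w, dim=-1)                       # weighted mean, (B, C)
        var = torch.sum((x ** 2) * w, dim=-1) - mean ** 2
        std = torch.sqrt(var.clamp(min=1e-9))                 # weighted std, (B, C)
        return torch.cat([mean, std], dim=-1)                 # (B, 2C), projected to 256-dim by a final FC


class FusionHead(nn.Module):
    """Concatenate micro (WavLM, 1024-dim) and macro (RawNet2.5, 256-dim) embeddings."""

    def __init__(self, wavlm_dim: int = 1024, rawnet_dim: int = 256, out_dim: int = 256):
        super().__init__()
        self.fusion = nn.Sequential(nn.Linear(wavlm_dim + rawnet_dim, out_dim), nn.ReLU())

    def forward(self, wavlm_emb: torch.Tensor, rawnet_emb: torch.Tensor) -> torch.Tensor:
        return self.fusion(torch.cat([wavlm_emb, rawnet_emb], dim=-1))  # (B, 1280) -> (B, 256)


class AMSoftmaxLoss(nn.Module):
    """Additive-margin softmax over the 256-dim fusion embedding (training only)."""

    def __init__(self, embed_dim: int = 256, num_speakers: int = 1211,  # e.g., VoxCeleb1 dev speakers
                 margin: float = 0.2, scale: float = 30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_speakers, embed_dim))
        self.margin, self.scale = margin, scale

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarities between normalized embeddings and class weights
        cosine = F.linear(F.normalize(emb), F.normalize(self.weight))   # (B, num_speakers)
        # Subtract the margin only from the target-class logit, then scale
        target = torch.zeros_like(cosine).scatter_(1, labels.unsqueeze(1), self.margin)
        return F.cross_entropy(self.scale * (cosine - target), labels)
```

At inference time only the 256-dim fusion embedding is kept; the AMSoftmax classifier is used solely to shape the embedding space during training.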
##### A Single End-to-End Learning Pipeline

- Fully Automatic: Raw waveforms go in, final speaker embeddings come out.
- No Manual Feature Extraction: We do not rely on handcrafted features like MFCC or mel-spectrogram.
- Data-Driven: The model itself figures out which frequency bands or time segments matter most.
- Enhanced Representation: WavLM delivers local detail, RawNet2.5 captures global stats, leading to a more robust speaker representation.

##### Why Avoid Preprocessing?

- Deep Learning Principle: The model should learn how to process raw signals rather than relying on human-defined feature pipelines.
- Better Generalization: Fewer hand-tuned hyperparameters; the model adapts better to various speakers, languages, and environments.
- Scientific Rigor: Manual feature engineering can introduce subjective design choices. Letting the network learn directly from data is more consistent with data-driven approaches.

##### Performance & Advantages

- Micro + Macro Features Combined
    - Captures both short-term acoustic nuances (WavLM) and holistic temporal stats (RawNet2.5).
- Truly End-to-End
    - Beyond minimal slicing/padding, all layers are trainable.
    - No handcrafted feature extraction is involved.
- VoxCeleb1 Test Results
    - Achieved an **EER of 4.67%** on the VoxCeleb1 evaluation set.
- Overall Benefits
    - Potentially outperforms using WavLM or RawNet2.5 alone on standard metrics like EER and minDCF.
    - Combining both scales of analysis yields a richer speaker representation.

In essence, **WavLM Large + RawNet2.5** merges two scales of speaker representation to produce a **unified 256-dim embedding**. By staying fully end-to-end, the architecture remains flexible and can leverage large amounts of data for improved speaker verification results.

---

### Architecture

![Architecture](.docs/img/architecture/WavLMRawNetSVBase.gif)

---

### Reports

##### Benchmark

*Speaker Verification Benchmark on VoxCeleb1 Dataset*

| Model                         | EER (%) |
|-------------------------------|---------|
| **ReDimNet-B6-SF2-LM-ASNorm** | 0.37    |
| **WavLM+ECAPA-TDNN**          | 0.39    |
| ...                           | ...     |
| **TitanNet-L**                | 0.68    |
| ...                           | ...     |
| **SpeechNAS**                 | 1.02    |
| ...                           | ...     |
| **Multi Task SSL**            | 1.98    |
| ...                           | ...     |
| **WavLMRawNetSVBase**         | 4.67    |

> **Note**
> - For a detailed notebook showing how the test for `WavLMRawNetSVBase` was performed, please see
>   [`notebook/test.ipynb`](notebook/test.ipynb).
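As a complement to the benchmark table, the sketch below illustrates how an equal error rate can be computed from cosine-similarity trial scores. It is not taken from [`notebook/test.ipynb`](notebook/test.ipynb); the helper names and the use of `numpy`/`scikit-learn` are assumptions made for this example.

```python
import numpy as np
from sklearn.metrics import roc_curve  # assumed to be available in the environment


def cosine_score(emb1: np.ndarray, emb2: np.ndarray) -> float:
    """Cosine similarity between two 256-dim fusion embeddings."""
    return float(np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2) + 1e-9))


def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    """EER: the operating point where false acceptance and false rejection rates are equal."""
    fpr, tpr, _ = roc_curve(labels, scores)   # labels: 1 = same speaker, 0 = different speaker
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))     # threshold where FAR ≈ FRR
    return float((fpr[idx] + fnr[idx]) / 2.0)


# Toy trial scores; a real evaluation scores the veri_test2.txt trial pairs with the fusion embeddings.
labels = np.array([1, 1, 0, 0, 1, 0])
scores = np.array([0.82, 0.75, 0.40, 0.55, 0.68, 0.30])
print(f"EER: {equal_error_rate(labels, scores) * 100:.2f}%")
```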
---

### Prerequisites

##### Inference

- `Python3.11` _(or above)_

##### For training from scratch

- `10GB Disk Space` _(for VoxCeleb1 Dataset)_
- `12GB VRAM GPU` _(or above)_

---

### Installation

##### Linux/Ubuntu

```bash
sudo apt update -y && sudo apt upgrade -y
```

```bash
sudo apt install -y ffmpeg
```

```bash
git clone https://github.com/bunyaminergen/WavLMRawNetSVBase
```

```bash
cd WavLMRawNetSVBase
```

```bash
conda env create -f environment.yaml
```

```bash
conda activate WavLMRawNetSVBase
```

##### Dataset Download (if training from scratch)

1. Please go to the URL and register: [KAIST MM](https://cn01.mmai.io/keyreq/voxceleb)
2. After receiving the e-mail, you can either download the dataset directly by clicking the link in the e-mail or use the following commands.
   **Note**: *To download from the command line, take the `key` parameter from the link in the e-mail and place it in the corresponding position in the commands below.*
3. To download the `List of trial pairs - VoxCeleb1 (cleaned)`, please go to the URL: [VoxCeleb](https://mm.kaist.ac.kr/datasets/voxceleb/)

**VoxCeleb1**

Dev A

```bash
wget -c --no-check-certificate -O vox1_dev_wav_partaa "https://cn01.mmai.io/download/voxceleb?key=&file=vox1_dev_wav_partaa"
```

Dev B

```bash
wget -c --no-check-certificate -O vox1_dev_wav_partab "https://cn01.mmai.io/download/voxceleb?key=&file=vox1_dev_wav_partab"
```

Dev C

```bash
wget -c --no-check-certificate -O vox1_dev_wav_partac "https://cn01.mmai.io/download/voxceleb?key=&file=vox1_dev_wav_partac"
```

Dev D

```bash
wget -c --no-check-certificate -O vox1_dev_wav_partad "https://cn01.mmai.io/download/voxceleb?key=&file=vox1_dev_wav_partad"
```

Concatenate

```bash
cat vox1_dev* > vox1_dev_wav.zip
```

Test

```bash
wget -c --no-check-certificate -O vox1_test_wav.zip "https://cn01.mmai.io/download/voxceleb?key=&file=vox1_test_wav.zip"
```

List of trial pairs - VoxCeleb1 (cleaned)

```bash
wget https://mm.kaist.ac.kr/datasets/voxceleb/meta/veri_test2.txt
```

---

### Version Control System

##### Releases

- [v1.0.0](https://github.com/bunyaminergen/WavLMRawNetSVBase/archive/refs/tags/v1.0.0.zip) _.zip_
- [v1.0.0](https://github.com/bunyaminergen/WavLMRawNetSVBase/archive/refs/tags/v1.0.0.tar.gz) _.tar.gz_

##### Branches

- [main](https://github.com/bunyaminergen/WavLMRawNetSVBase/main/)
- [develop](https://github.com/bunyaminergen/WavLMRawNetSVBase/develop/)

---