wav2vec2-emotion-recognition

This model is a fine-tuned Wav2Vec2 model for speech emotion recognition. It classifies speech into 8 emotions and returns a confidence score for each.

Model Description

  • Model Architecture: Wav2Vec2 with sequence classification head
  • Language: English
  • Task: Speech Emotion Recognition
  • Fine-tuned from: facebook/wav2vec2-base
  • Datasets: combined from four emotion corpora:
      • TESS
      • CREMA-D
      • SAVEE
      • RAVDESS

Performance Metrics

  • Accuracy: 79.57%
  • F1 Score: 79.43%
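
These figures come from the model card itself; as a hedged sketch, comparable metrics can be computed on your own held-out set with scikit-learn. The y_true/y_pred arrays below are illustrative placeholders, and the weighted F1 averaging is an assumption (the card does not state which averaging was used):

from sklearn.metrics import accuracy_score, f1_score

# Placeholder label ids; replace with real ground truth and model predictions
y_true = [0, 1, 4, 6, 2]
y_pred = [0, 1, 4, 5, 2]

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average="weighted")  # averaging choice is an assumption
print(f"Accuracy: {accuracy:.2%}, F1: {f1:.2%}")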

Supported Emotions

  • 😠 Angry
  • 😌 Calm
  • 🤢 Disgust
  • 😨 Fearful
  • 😊 Happy
  • 😐 Neutral
  • 😒 Sad
  • 😲 Surprised
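
The label indices follow the order above, matching the order of the model's output logits as used in the Usage section below. A minimal mapping sketch (the order is taken from that section, not read from the model config):

emotion_labels = ["angry", "calm", "disgust", "fearful", "happy", "neutral", "sad", "surprised"]
id2label = {i: label for i, label in enumerate(emotion_labels)}
print(id2label[4])  # -> happy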

Training Details

The model was trained with the following configuration:

  • Epochs: 15
  • Batch Size: 16
  • Learning Rate: 5e-5
  • Optimizer: AdamW
  • Weight Decay: 0.03
  • Gradient Accumulation Steps: 2
  • Mixed Precision: fp16
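
As a rough reconstruction (not the exact training script), the configuration above maps onto Hugging Face TrainingArguments as follows; output_dir is a placeholder, and AdamW needs no explicit flag because it is the Trainer's default optimizer:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-emotion-recognition",  # placeholder path
    num_train_epochs=15,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    weight_decay=0.03,
    gradient_accumulation_steps=2,
    fp16=True,  # mixed precision, as listed above
)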

For the detailed training process, see the Fine-tuning Notebook.

Limitations

Audio Requirements:

  • Sampling rate: 16 kHz (other sampling rates are automatically resampled)
  • Maximum duration: 1 minute
  • Clear speech with minimal background noise recommended
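
A small sketch for enforcing the 1-minute cap by truncating the waveform before inference ("path_to_audio.wav" is a placeholder, as in the Usage section below):

import torchaudio

speech_array, sampling_rate = torchaudio.load("path_to_audio.wav")  # placeholder path
max_samples = sampling_rate * 60  # keep at most 1 minute of audio
speech_array = speech_array[..., :max_samples]  # drop anything beyond the cap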

Performance Considerations:

  • Best results with clear speech audio
  • Performance may vary with different accents
  • Background noise can affect accuracy

Demo

https://huggingface.co/spaces/Dpngtm/Audio-Emotion-Recognition

Contact

For issues and questions, feel free to:

  1. Open an issue on the Model Repository
  2. Comment on the Demo Space

Usage

from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor
import torch
import torchaudio

# Load model and processor
model = Wav2Vec2ForSequenceClassification.from_pretrained("Dpngtm/wav2vec2-emotion-recognition")
processor = Wav2Vec2Processor.from_pretrained("Dpngtm/wav2vec2-emotion-recognition")

# Load and preprocess audio
speech_array, sampling_rate = torchaudio.load("path_to_audio.wav")
if sampling_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sampling_rate, new_freq=16000)
    speech_array = resampler(speech_array)

# Convert to mono if stereo
if speech_array.shape[0] > 1:
    speech_array = torch.mean(speech_array, dim=0, keepdim=True)

speech_array = speech_array.squeeze().numpy()

# Process through model
inputs = processor(speech_array, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Get predicted emotion
emotion_labels = ["angry", "calm", "disgust", "fearful", "happy", "neutral", "sad", "surprised"]
predicted_emotion = emotion_labels[predictions.argmax().item()]
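
Because the model returns a confidence score per emotion, you can also rank all eight classes. This snippet continues from the code above:

# Rank all emotions by confidence (continues from the snippet above)
probs = predictions.squeeze().tolist()
for label, prob in sorted(zip(emotion_labels, probs), key=lambda x: x[1], reverse=True):
    print(f"{label}: {prob:.2%}")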