---
language:
- en
- ta
- fr
- ml
- hi
pipeline_tag: audio-classification
base_model: facebook/wav2vec2-base
---
# Model Card for Emotion Classification from Voice
This model performs emotion classification from voice data using a fine-tuned `Wav2Vec2Model` from Facebook. It predicts one of seven emotion labels: Angry, Disgust, Fear, Happy, Neutral, Sad, and Surprise.
## Model Details
- **Developed by:** [Lingesh]
- **Model type:** Fine-tuned Wav2Vec2Model
- **Language(s):** English (en), Tamil (ta), French (fr), Malayalam (ml), Hindi (hi)
- **License:** [Choose a license]
- **Finetuned from model:** [facebook/wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base)
### Model Sources
- **Repository:** [Link to your repository]
- **Demo:** [Gradio Demo Link if Available]
## Uses
### Direct Use
The model can be used directly to detect emotion in speech audio files, with possible applications in call centers, virtual assistants, and mental health monitoring.
### Out-of-Scope Use
The model is not intended for general speech recognition or other NLP tasks outside emotion classification.
## Datasets Used
The model has been trained on a combination of the following datasets:
- **CREMA-D:** 7,442 clips of actors speaking with various emotions
- **Torrento:** Emotional speech in Spanish, captured from various environments
- **RAVDESS:** 24 professional actors, 7 emotions
- **Emo-DB:** 535 utterances, covering 7 emotions
Combining these datasets is intended to help the model generalize across multiple languages and accents.
## Bias, Risks, and Limitations
- **Bias:** The model might underperform on speech data with accents or languages not present in the training data.
- **Limitations:** The model is trained specifically for emotion detection and might not generalize well for other speech tasks.
## How to Get Started with the Model
```python
import numpy as np
import torch
from torchaudio.transforms import Resample
from transformers import Wav2Vec2Model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
wav2vec2_model = Wav2Vec2Model.from_pretrained(
    "facebook/wav2vec2-base", output_hidden_states=True
).to(device)


class FineTunedWav2Vec2Model(torch.nn.Module):
    """Wav2Vec2 encoder followed by a linear classification head."""

    def __init__(self, wav2vec2_model, output_size):
        super().__init__()
        self.wav2vec2 = wav2vec2_model
        self.fc = torch.nn.Linear(self.wav2vec2.config.hidden_size, output_size)

    def forward(self, x):
        # The encoder and the head both run in float64
        self.wav2vec2 = self.wav2vec2.double()
        self.fc = self.fc.double()
        outputs = self.wav2vec2(x.double())
        out = outputs.hidden_states[-1]
        # Classify from the first time step of the last hidden layer
        out = self.fc(out[:, 0, :])
        return out


def preprocess_audio(audio):
    """Convert a (sample_rate, waveform) tuple to a mono float32 tensor at 16 kHz."""
    sample_rate, waveform = audio
    if isinstance(waveform, np.ndarray):
        waveform = torch.from_numpy(waveform)
    # Normalize integer PCM to [-1, 1]; torch.iinfo only accepts integer dtypes
    if not waveform.is_floating_point():
        max_value = torch.iinfo(waveform.dtype).max
        waveform = waveform.float() / max_value
    waveform = waveform.float()
    # Downmix to mono (mean() needs a float tensor, so this follows normalization).
    # Assumes a channels-first layout (channels, samples); transpose first if
    # your audio is (samples, channels).
    if waveform.dim() == 2:
        waveform = waveform.mean(dim=0)
    # Resample to 16kHz
    if sample_rate != 16000:
        resampler = Resample(orig_freq=sample_rate, new_freq=16000)
        waveform = resampler(waveform)
    return waveform


def predict(audio):
    model_path = "model.pth"  # Path to your fine-tuned model
    model = FineTunedWav2Vec2Model(wav2vec2_model, 7).to(device)
    model.load_state_dict(torch.load(model_path, map_location=device))
    model.eval()
    waveform = preprocess_audio(audio)
    waveform = waveform.unsqueeze(0).to(device)  # add a batch dimension
    with torch.no_grad():
        output = model(waveform)
    predicted_label = torch.argmax(output, dim=1).item()
    emotion_labels = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]
    return emotion_labels[predicted_label]


# Example usage
# `sample_rate` and `waveform` are placeholders: replace them with your actual audio data
audio_data = (sample_rate, waveform)
emotion = predict(audio_data)
print(f"Predicted Emotion: {emotion}")
```
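The `predict` function returns only the top label. If a confidence score per emotion is useful, a softmax over the seven logits gives probabilities. The sketch below is written in plain Python so the arithmetic is visible. The logits are hypothetical values, and `softmax` and `rank_emotions` are illustrative helpers that are not part of the released code.

```python
import math

EMOTION_LABELS = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def rank_emotions(logits, labels=EMOTION_LABELS):
    """Return (label, probability) pairs sorted from most to least likely."""
    probabilities = softmax(logits)
    return sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)


# Hypothetical logits for one clip; real values would come from `model(waveform)`
example_logits = [0.2, -1.3, 0.1, 2.4, 1.1, -0.5, 0.7]
ranking = rank_emotions(example_logits)
for label, probability in ranking[:3]:
    print(f"{label}: {probability:.3f}")
```

Inside `predict`, `torch.softmax(output, dim=1)` would compute the same probabilities directly on the model output.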
## Training Procedure
- **Preprocessing:** All audio was resampled to 16 kHz.
- **Training:** Fine-tuned `facebook/wav2vec2-base` with emotion labels.
- **Hyperparameters:** Batch size: 16, Learning rate: 5e-5, Epochs: 50
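The hyperparameters above could be wired into a training loop roughly as follows. This is a minimal sketch, not the actual training script. The optimizer (AdamW) and loss (cross-entropy) are assumptions, since the card does not state them. `TinyClassifier` is a hypothetical stand-in for the Wav2Vec2 encoder so the example runs without downloading weights, and the data is synthetic.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

NUM_CLASSES = 7
BATCH_SIZE = 16       # from the card
LEARNING_RATE = 5e-5  # from the card
EPOCHS = 2            # the card reports 50; shortened so the sketch runs quickly


class TinyClassifier(nn.Module):
    """Hypothetical stand-in for the Wav2Vec2 encoder plus linear head."""

    def __init__(self, hidden_size=32, output_size=NUM_CLASSES):
        super().__init__()
        # One strided convolution plays the role of the feature encoder
        self.encoder = nn.Conv1d(1, hidden_size, kernel_size=400, stride=320)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        frames = self.encoder(x.unsqueeze(1))  # (batch, hidden, time)
        return self.fc(frames[:, :, 0])        # classify from the first frame


# Synthetic data: 64 one-second clips at 16 kHz with random labels
waveforms = torch.randn(64, 16000)
labels = torch.randint(0, NUM_CLASSES, (64,))
loader = DataLoader(TensorDataset(waveforms, labels), batch_size=BATCH_SIZE, shuffle=True)

model = TinyClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)  # assumed optimizer
criterion = nn.CrossEntropyLoss()                                    # assumed loss

epoch_losses = []
for epoch in range(EPOCHS):
    model.train()
    running_loss = 0.0
    for batch_waveforms, batch_labels in loader:
        optimizer.zero_grad()
        logits = model(batch_waveforms)
        loss = criterion(logits, batch_labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    epoch_losses.append(running_loss / len(loader))
    print(f"epoch {epoch + 1}: mean loss {epoch_losses[-1]:.4f}")
```

To adapt it, swap `TinyClassifier` for the `FineTunedWav2Vec2Model` wrapper and replace the synthetic tensors with a dataset of preprocessed clips.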
## Evaluation
### Testing Data
Evaluation was performed on a held-out test set from the CREMA-D and RAVDESS datasets.
### Metrics
- **Accuracy:** 85%
- **F1-score:** 82% (weighted average across all classes)
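For reference, the two metrics can be computed as sketched below. A weighted F1 averages the per-class F1 scores, weighting each class by its number of true examples. The labels here are hypothetical toy data, not the evaluation set, and the helper names are illustrative.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * count
    return total / len(y_true)


# Hypothetical toy labels for illustration only
y_true = ["Happy", "Happy", "Sad", "Sad", "Angry", "Angry", "Neutral", "Neutral"]
y_pred = ["Happy", "Sad", "Sad", "Sad", "Angry", "Happy", "Neutral", "Neutral"]

print(f"accuracy: {accuracy(y_true, y_pred):.3f}")
print(f"weighted F1: {weighted_f1(y_true, y_pred):.3f}")
```

Because the weights follow class frequency, a weighted F1 can look strong even when a rare emotion is classified poorly, so per-class scores are worth checking alongside it.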