---
language:
- en
- ta
- fr
- ml
- hi
pipeline_tag: audio-classification
base_model: facebook/wav2vec2-base
---

# Model Card for Emotion Classification from Voice

This model classifies emotion in speech audio using a fine-tuned `Wav2Vec2Model` from Facebook. It predicts one of seven emotion labels: Angry, Disgust, Fear, Happy, Neutral, Sad, and Surprise.

## Model Details

- **Developed by:** [Lingesh]
- **Model type:** Fine-tuned Wav2Vec2Model
- **Language(s):** English (en), Tamil (ta), French (fr), Malayalam (ml), Hindi (hi)
- **License:** [Choose a license]
- **Finetuned from model:** [facebook/wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base)

### Model Sources

- **Repository:** [Link to your repository]
- **Demo:** [Gradio Demo Link if Available]

## Uses

### Direct Use

This model can be directly used for emotion detection in speech audio files, which can have applications in call centers, virtual assistants, and mental health monitoring.

### Out-of-Scope Use

The model is not intended for general speech recognition or other NLP tasks outside emotion classification.

## Datasets Used

The model has been trained on a combination of the following datasets:

- **CREMA-D:** 7,442 clips of actors speaking with various emotions
- **Torrento:** Emotional speech in Spanish, captured from various environments
- **RAVDESS:** 24 professional actors, 7 emotions
- **Emo-DB:** 535 utterances, covering 7 emotions

Combining these datasets is intended to help the model generalize across multiple languages and accents.

## Bias, Risks, and Limitations

- **Bias:** The model might underperform on speech data with accents or languages not present in the training data.
- **Limitations:** The model is trained specifically for emotion detection and might not generalize well for other speech tasks.

## How to Get Started with the Model

```python
import torch
import numpy as np
from transformers import Wav2Vec2Model
from torchaudio.transforms import Resample

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
wav2vec2_model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base", output_hidden_states=True).to(device)

class FineTunedWav2Vec2Model(torch.nn.Module):
    """Wav2Vec2 encoder followed by a linear layer that scores each emotion."""

    def __init__(self, wav2vec2_model, output_size):
        super().__init__()
        # Cast to float64 once here rather than on every forward pass
        self.wav2vec2 = wav2vec2_model.double()
        self.fc = torch.nn.Linear(self.wav2vec2.config.hidden_size, output_size).double()

    def forward(self, x):
        outputs = self.wav2vec2(x.double())
        # Last hidden layer, shape (batch, time, hidden_size)
        out = outputs.hidden_states[-1]
        # Classify from the first time step only
        out = self.fc(out[:, 0, :])
        return out

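def preprocess_audio(audio):
    # `audio` is a (sample_rate, waveform) tuple
    sample_rate, waveform = audio
    if isinstance(waveform, np.ndarray):
        waveform = torch.from_numpy(waveform)

    # Normalize audio: scale integer PCM to [-1, 1] before averaging,
    # because mean() and Resample require floating-point input
    if not waveform.is_floating_point():
        waveform = waveform.float() / torch.iinfo(waveform.dtype).max
    waveform = waveform.float()

    # Downmix to mono, assuming the channel axis is the shorter one
    if waveform.dim() == 2:
        channel_dim = 0 if waveform.shape[0] <= waveform.shape[1] else 1
        waveform = waveform.mean(dim=channel_dim)

    # Resample to 16kHz
    if sample_rate != 16000:
        resampler = Resample(orig_freq=sample_rate, new_freq=16000)
        waveform = resampler(waveform)
    return waveform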

def predict(audio):
    model_path = "model.pth"  # Path to your fine-tuned model
    model = FineTunedWav2Vec2Model(wav2vec2_model, 7).to(device)
    model.load_state_dict(torch.load(model_path, map_location=device))
    model.eval()

    waveform = preprocess_audio(audio)
    waveform = waveform.unsqueeze(0).to(device)

    with torch.no_grad():
        output = model(waveform)

    predicted_label = torch.argmax(output, dim=1).item()
    emotion_labels = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]
    return emotion_labels[predicted_label]

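# Example usage (placeholder input: one second of silence; replace with your actual audio data)
sample_rate = 16000
waveform = np.zeros(sample_rate, dtype=np.float32)
audio_data = (sample_rate, waveform)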
emotion = predict(audio_data)
print(f"Predicted Emotion: {emotion}")
```
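
The `predict` function above returns only the top label. If a score for every emotion is useful, the raw outputs can be passed through a softmax. The sketch below does this in plain Python to make the arithmetic explicit. It assumes the classifier outputs unnormalized logits (the card does not state the training loss), and `rank_emotions` and `example_logits` are hypothetical names with made-up values, not part of the released code.

```python
import math

EMOTION_LABELS = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rank_emotions(logits, labels=EMOTION_LABELS):
    """Return (label, probability) pairs sorted from most to least likely."""
    probs = softmax(logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)

# Made-up scores for illustration; these are not real model output
example_logits = [0.2, -1.0, 0.1, 2.5, 0.3, -0.4, 1.1]
ranking = rank_emotions(example_logits)
for label, prob in ranking:
    print(f"{label}: {prob:.3f}")
```

With the model above, the list of scores for a single clip could be obtained with `output.squeeze(0).tolist()` inside `predict`.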

## Training Procedure

- **Preprocessing:** All audio was resampled to 16 kHz.
- **Training:** `facebook/wav2vec2-base` was fine-tuned with emotion labels.
- **Hyperparameters:** batch size 16, learning rate 5e-5, 50 epochs
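
A fine-tuning loop with the hyperparameters listed above might look like the following sketch. The optimizer (AdamW) and the loss (cross-entropy) are assumptions, since the card names neither. `fine_tune` is a hypothetical helper, and the toy linear model and random tensors only stand in for the real network and audio so that the loop runs on its own.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hyperparameters reported in this card
BATCH_SIZE = 16
LEARNING_RATE = 5e-5
EPOCHS = 50

def fine_tune(model, loader, epochs=EPOCHS, lr=LEARNING_RATE, device="cpu"):
    """Train `model` on (input, label) batches; return the mean loss per epoch.

    AdamW and cross-entropy are assumptions, not confirmed by the card.
    """
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    history = []
    for _ in range(epochs):
        model.train()
        total, batches = 0.0, 0
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)
            loss.backward()
            optimizer.step()
            total += loss.item()
            batches += 1
        history.append(total / max(batches, 1))
    return history

# Toy stand-in data: 32 random feature vectors with labels in 0..6
toy_inputs = torch.randn(32, 8)
toy_labels = torch.randint(0, 7, (32,))
toy_loader = DataLoader(TensorDataset(toy_inputs, toy_labels),
                        batch_size=BATCH_SIZE, shuffle=True)
toy_model = torch.nn.Linear(8, 7)

# Two epochs only, to keep the demonstration fast
losses = fine_tune(toy_model, toy_loader, epochs=2)
print(losses)
```

To train the actual classifier, `toy_model` would be replaced by the fine-tuned Wav2Vec2 module and `toy_loader` by a loader that yields 16 kHz waveforms with integer emotion labels.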

## Evaluation

### Testing Data

Evaluation was performed on a held-out test set from the CREMA-D and RAVDESS datasets.

### Metrics

- **Accuracy:** 85%
- **F1-score:** 82% (weighted average across all classes)
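
For reference, accuracy and a support-weighted F1-score can be computed from true and predicted labels as in the sketch below. It is written in plain Python to show the definitions and is not the evaluation script used for the numbers above. The function names and the four-item example are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's true count."""
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    total = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * count
    return total / len(y_true)

# Made-up labels for illustration only
y_true = ["Happy", "Happy", "Sad", "Angry"]
y_pred = ["Happy", "Sad", "Sad", "Angry"]
print(f"Accuracy: {accuracy(y_true, y_pred):.2f}")
print(f"Weighted F1: {weighted_f1(y_true, y_pred):.2f}")
```

Weighting by class support means frequent emotions contribute more to the score than rare ones, which matters when the test set is imbalanced.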