File size: 5,539 Bytes
c76626b
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
---
license: other
license_name: model-license
license_link: https://github.com/alibaba-damo-academy/FunASR
frameworks:
- Pytorch
tasks:
- emotion-recognition
widgets:
  - enable: true
    version: 1
    task: emotion-recognition
    examples:
      - inputs:
          - data: git://example/test.wav
    inputs:
      - type: audio
        displayType: AudioUploader
        validator:
          max_size: 10M
        name: input
    output:
      displayType: Prediction
      displayValueMapping:
        labels: labels
        scores: scores
    inferencespec:
      cpu: 8
      gpu: 0
      gpu_memory: 0
      memory: 4096
    model_revision: master
    extendsParameters:
      extract_embedding: false
---

<div align="center">
    <h1>
    EMOTION2VEC+
    </h1>
    <p>
     emotion2vec+: speech emotion recognition foundation model <br>
    <b>emotion2vec+ base model</b>
    </p>
    <p>
    <img src="logo.png" style="width: 200px; height: 200px;">
    </p>
    <p>
    </p>
</div>


# Guides
emotion2vec+ is a series of foundational models for speech emotion recognition (SER). We aim to train a "whisper" in the field of speech emotion recognition, overcoming the effects of language and recording environments through data-driven methods to achieve universal, robust emotion recognition capabilities. The performance of emotion2vec+ significantly exceeds other highly downloaded open-source models on Hugging Face.

![](emotion2vec+radar.png)

This version (emotion2vec_plus_base) uses a large-scale pseudo-labeled data for finetuning to obtain a base size model (~90M),  and currently supports the following categories:
    0: angry
    1: happy
    2: neutral
    3: sad
    4: unknown

# Model Card
GitHub Repo: [emotion2vec](https://github.com/ddlBoJack/emotion2vec)
|Model|⭐Model Scope|🤗Hugging Face|Fine-tuning Data (Hours)|
|:---:|:-------------:|:-----------:|:-------------:|
|emotion2vec|[Link](https://www.modelscope.cn/models/iic/emotion2vec_base/summary)|[Link](https://huggingface.co/emotion2vec/emotion2vec)|/|
emotion2vec+ seed|[Link](https://modelscope.cn/models/iic/emotion2vec_plus_seed/summary)|[Link](https://huggingface.co/emotion2vec/emotion2vec_plus_seed)|201|
emotion2vec+ base|[Link](https://modelscope.cn/models/iic/emotion2vec_plus_base/summary)|[Link](https://huggingface.co/emotion2vec/emotion2vec_plus_base)|4788|
emotion2vec+ large|[Link](https://modelscope.cn/models/iic/emotion2vec_plus_large/summary)|[Link](https://huggingface.co/emotion2vec/emotion2vec_plus_large)|42526|


# Data Iteration

We offer 3 versions of emotion2vec+, each derived from the data of its predecessor. If you need a model focusing on spech emotion representation, refer to [emotion2vec: universal speech emotion representation model](https://huggingface.co/emotion2vec/emotion2vec).

- emotion2vec+ seed: Fine-tuned with academic speech emotion data
- emotion2vec+ base: Fine-tuned with filtered large-scale pseudo-labeled data to obtain the base size model (~90M)
- emotion2vec+ large: Fine-tuned with filtered large-scale pseudo-labeled data to obtain the large size model (~300M)

The iteration process is illustrated below, culminating in the training of the emotion2vec+ large model with 40k out of 160k hours of speech emotion data. Details of data engineering will be announced later. 

![](emotion2vec+data.png)

# Installation

`pip install -U funasr modelscope`

# Usage

input: 16k Hz speech recording

granularity:
- "utterance": Extract features from the entire utterance
- "frame": Extract frame-level features (50 Hz)

extract_embedding: Whether to extract features; set to False if using only the classification model

## Inference based on ModelScope

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_pipeline = pipeline(
    task=Tasks.emotion_recognition,
    model="iic/emotion2vec_plus_base")

rec_result = inference_pipeline('https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav', granularity="utterance", extract_embedding=False)
print(rec_result)
```


## Inference based on FunASR

```python
from funasr import AutoModel

model = AutoModel(model="iic/emotion2vec_plus_base")

wav_file = f"{model.model_path}/example/test.wav"
res = model.generate(wav_file, output_dir="./outputs", granularity="utterance", extract_embedding=False)
print(res)
```
Note: The model will automatically download.


Supports input file list, wav.scp (Kaldi style):
```cat wav.scp
wav_name1 wav_path1.wav
wav_name2 wav_path2.wav
...
```

Outputs are emotion representation, saved in the output_dir in numpy format (can be loaded with np.load())

# Note

This repository is the Huggingface version of emotion2vec, with identical model parameters as the original model and Model Scope version. 

Original repository: [https://github.com/ddlBoJack/emotion2vec](https://github.com/ddlBoJack/emotion2vec)

Model Scope repository:[https://github.com/alibaba-damo-academy/FunASR](https://github.com/alibaba-damo-academy/FunASR/tree/funasr1.0/examples/industrial_data_pretraining/emotion2vec)

Hugging Face repository:[https://huggingface.co/emotion2vec](https://huggingface.co/emotion2vec)

# Citation
```BibTeX
@article{ma2023emotion2vec,
  title={emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation},
  author={Ma, Ziyang and Zheng, Zhisheng and Ye, Jiaxin and Li, Jinchao and Gao, Zhifu and Zhang, Shiliang and Chen, Xie},
  journal={arXiv preprint arXiv:2312.15185},
  year={2023}
}
```