---
license: mit
inference: false
---

# Introduction

**Music2Vec** was accepted as a 2-page abstract in the Late-Breaking Demo (LBD) session at ISMIR 2022.
It is a completely unsupervised model trained on 1,000 hours of music audio.
Our model performs comparably to the state of the art on multiple MIR tasks, even under probing settings, while remaining fine-tunable on a single 2080Ti.
Larger models trained on more data are on the way!

# Model Usage

## Huggingface Loading

```python
from transformers import Wav2Vec2Processor, Data2VecAudioModel
import torch
from torch import nn
from datasets import load_dataset

# load demo audio and set processor
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate
processor = Wav2Vec2Processor.from_pretrained("facebook/data2vec-audio-base-960h")

# loading our model weights
model = Data2VecAudioModel.from_pretrained("m-a-p/music2vec-v1")


# audio file is decoded on the fly
inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# take a look at the output shape; there are 13 layers of representations
# each layer performs differently on different downstream tasks, so choose the layer empirically
all_layer_hidden_states = torch.stack(outputs.hidden_states).squeeze()
print(all_layer_hidden_states.shape) # [13 layer, 292 timestep, 768 feature_dim]

# for utterance-level classification tasks, you can simply average the representation over time
time_reduced_hidden_states = all_layer_hidden_states.mean(-2)
print(time_reduced_hidden_states.shape) # [13, 768]

# you can even use a learnable weighted average over the layers (see the probing sketch below)
aggregator = nn.Conv1d(in_channels=13, out_channels=1, kernel_size=1)
weighted_avg_hidden_states = aggregator(time_reduced_hidden_states).squeeze()
print(weighted_avg_hidden_states.shape) # [768]
```
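The demo above uses a 16 kHz speech clip from LibriSpeech. To extract representations from your own music, the audio should be mono and resampled to 16 kHz before it goes through the processor. Below is a minimal sketch assuming `torchaudio` is installed; the file name `your_song.wav` is just a placeholder.

```python
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Data2VecAudioModel

processor = Wav2Vec2Processor.from_pretrained("facebook/data2vec-audio-base-960h")
model = Data2VecAudioModel.from_pretrained("m-a-p/music2vec-v1")

# load a local music file (placeholder path) and mix it down to mono
waveform, sr = torchaudio.load("your_song.wav")
waveform = waveform.mean(dim=0)

# the processor is configured for 16 kHz audio, so resample if needed
target_sr = 16000
if sr != target_sr:
    waveform = torchaudio.functional.resample(waveform, sr, target_sr)

inputs = processor(waveform.numpy(), sampling_rate=target_sr, return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs, output_hidden_states=True).hidden_states

all_layer_hidden_states = torch.stack(hidden_states).squeeze()
print(all_layer_hidden_states.shape)  # [13, num_timesteps, 768]
```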

Our model is based on the [data2vec audio model](https://huggingface.co/docs/transformers/model_doc/data2vec#transformers.Data2VecAudioModel).
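For probing-style evaluation, the extracted features can feed a small trainable head while the Music2Vec weights stay frozen. The sketch below is illustrative only: it combines the learnable layer aggregator from the example above with a linear classifier, and the dummy feature batch and number of classes are placeholders rather than the setup used in the paper.

```python
import torch
from torch import nn

class Music2VecProbe(nn.Module):
    """Learnable weighted average over the 13 layers, followed by a linear classifier."""
    def __init__(self, num_layers=13, feature_dim=768, num_classes=10):
        super().__init__()
        self.aggregator = nn.Conv1d(in_channels=num_layers, out_channels=1, kernel_size=1)
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, features):
        # features: [batch, 13, 768] time-reduced hidden states
        weighted = self.aggregator(features).squeeze(1)  # [batch, 768]
        return self.classifier(weighted)

# dummy batch standing in for the time-reduced features of 4 audio clips
probe = Music2VecProbe(num_classes=10)  # num_classes depends on your task
logits = probe(torch.randn(4, 13, 768))
print(logits.shape)  # [4, 10]
```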

# Citation

The paper can be found at [ISMIR](https://ismir2022program.ismir.net/lbd_410.html).

```bibtex
@article{li2022map,
  title={MAP-Music2Vec: A Simple and Effective Baseline for Self-Supervised Music Audio Representation Learning},
  author={Li, Yizhi and Yuan, Ruibin and Zhang, Ge and Ma, Yinghao and Lin, Chenghua and Chen, Xingran and Ragni, Anton and Yin, Hanzhi and Hu, Zhijie and He, Haoyu and others},
  journal={arXiv preprint arXiv:2212.02508},
  year={2022}
}
```