---
license: mit
inference: false
---

# Introduction

**Music2Vec** was accepted as a 2-page abstract in the Late Breaking Demos (LBD) at ISMIR 2022. It is a completely unsupervised model trained on 1,000 hours of music audio. Our model achieves results comparable to the state of the art (SOTA) on multiple MIR tasks, even under probing settings, while remaining fine-tunable on a single 2080Ti.

# Model Usage

## Huggingface Loading

```python
from transformers import Wav2Vec2Processor, Data2VecAudioModel
import torch
from datasets import load_dataset

# load demo audio and set processor
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate
processor = Wav2Vec2Processor.from_pretrained("facebook/data2vec-audio-base-960h")

# loading our model weights
model = Data2VecAudioModel.from_pretrained("m-a-p/music2vec-v1")

# audio file is decoded on the fly
inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# take a look at the output shape
last_hidden_states = outputs.last_hidden_state
print(list(last_hidden_states.shape))  # [1, 292, 768]
```

Our model is based on the [data2vec audio model](https://huggingface.co/docs/transformers/model_doc/data2vec#transformers.Data2VecAudioModel).

# Citation

The paper can be found at [ISMIR](https://ismir2022program.ismir.net/lbd_410.html).

```bibtex
@article{li2022map,
  title={MAP-Music2Vec: A Simple and Effective Baseline for Self-Supervised Music Audio Representation Learning},
  author={Li, Yizhi and Yuan, Ruibin and Zhang, Ge and Ma, Yinghao and Lin, Chenghua and Chen, Xingran and Ragni, Anton and Yin, Hanzhi and Hu, Zhijie and He, Haoyu and others},
  journal={arXiv preprint arXiv:2212.02508},
  year={2022}
}
```
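
As a follow-up to the usage snippet above: for probing-style evaluation it is common to collapse the frame-level outputs into a single clip-level vector. The sketch below mean-pools `outputs.last_hidden_state` over the time axis; this pooling strategy is an assumption for illustration only, not a protocol prescribed by this model card.

```python
import torch

# Continuing from the usage snippet above:
# outputs.last_hidden_state has shape [batch, time_frames, 768].
# Mean-pooling over the time axis (an assumed, illustrative choice)
# yields one 768-dimensional embedding per clip, which can then be fed
# to a lightweight probe such as a linear classifier.
clip_embedding = outputs.last_hidden_state.mean(dim=1)
print(list(clip_embedding.shape))  # [1, 768]
```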