---
license: mit
inference: false
---
# Introduction

Music2Vec was accepted as a 2-page abstract at the Late-Breaking Demo (LBD) session of ISMIR 2022. It is a completely self-supervised model trained on 1,000 hours of music audio. The model performs comparably to the state of the art on multiple MIR tasks even under probing settings, while remaining fine-tunable on a single 2080Ti GPU.
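As one illustration of the fine-tuning claim, the checkpoint can be loaded into the standard `transformers` sequence-classification head. A minimal sketch, where the 10-label task is an assumed placeholder rather than anything from the paper:

```python
from transformers import Data2VecAudioForSequenceClassification

# attach a freshly initialized classification head to the pretrained backbone;
# num_labels=10 is an illustrative placeholder for your downstream MIR task
model = Data2VecAudioForSequenceClassification.from_pretrained(
    "m-a-p/music2vec-v1", num_labels=10
)
# the resulting model can be fine-tuned end to end, e.g. with transformers.Trainer
```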
# Model Usage

## Huggingface Loading
```python
from transformers import Wav2Vec2Processor, Data2VecAudioModel
import torch
from datasets import load_dataset

# load demo audio and set processor
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate
processor = Wav2Vec2Processor.from_pretrained("facebook/data2vec-audio-base-960h")

# load our model weights
model = Data2VecAudioModel.from_pretrained("m-a-p/music2vec-v1")

# audio file is decoded on the fly
inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# take a look at the output shape
last_hidden_states = outputs.last_hidden_state
print(list(last_hidden_states.shape))  # [1, 292, 768]
```
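The demo above uses a speech clip from LibriSpeech; for your own music, the processor expects 16 kHz mono audio. A minimal sketch of preparing a local file with torchaudio and mean-pooling the output into a clip-level embedding (the path `song.wav` is a placeholder, and mean pooling is one simple choice, not the paper's exact protocol):

```python
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Data2VecAudioModel

processor = Wav2Vec2Processor.from_pretrained("facebook/data2vec-audio-base-960h")
model = Data2VecAudioModel.from_pretrained("m-a-p/music2vec-v1")

# "song.wav" is a placeholder path; downmix to mono and
# resample to the 16 kHz rate the processor expects
waveform, sr = torchaudio.load("song.wav")
waveform = waveform.mean(dim=0)
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)

inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# mean-pool over the time axis to obtain a single clip-level embedding
clip_embedding = outputs.last_hidden_state.mean(dim=1)  # shape: [1, 768]
```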
Our model is based on the data2vec audio model.
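For probing, it is common with this model family to draw features from intermediate Transformer layers rather than only the last one. A sketch of collecting all hidden states via the standard `output_hidden_states` flag of `transformers`, reusing `model` and `inputs` from above (averaging across layers is an illustrative choice, not necessarily the paper's protocol):

```python
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# for the base model: a tuple of 13 tensors
# (CNN feature projection + 12 Transformer layers)
all_layers = torch.stack(outputs.hidden_states)  # [13, 1, time, 768]

# one simple aggregate (an assumption, not the paper's exact setup):
# average over layers and time
features = all_layers.mean(dim=(0, 2))  # [1, 768]
```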
# Citation

The paper can be found on [arXiv](https://arxiv.org/abs/2212.02508).
```bibtex
@misc{https://doi.org/10.48550/arxiv.2212.02508,
  doi = {10.48550/ARXIV.2212.02508},
  url = {https://arxiv.org/abs/2212.02508},
  author = {Li, Yizhi and Yuan, Ruibin and Zhang, Ge and Ma, Yinghao and Lin, Chenghua and Chen, Xingran and Ragni, Anton and Yin, Hanzhi and Hu, Zhijie and He, Haoyu and Benetos, Emmanouil and Gyenge, Norbert and Liu, Ruibo and Fu, Jie},
  keywords = {Sound (cs.SD), Artificial Intelligence (cs.AI), Machine Learning (cs.LG), Multimedia (cs.MM), Audio and Speech Processing (eess.AS), FOS: Computer and information sciences, FOS: Electrical engineering, electronic engineering, information engineering},
  title = {MAP-Music2Vec: A Simple and Effective Baseline for Self-Supervised Music Audio Representation Learning},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```