---
license: mit
inference: false
---

# Introduction

**Music2Vec** was accepted as a 2-page abstract in the Late-Breaking Demos (LBD) session at ISMIR 2022.
It is a completely unsupervised model trained on 1,000 hours of music audio.
The model is comparable to the state of the art (SOTA) on multiple MIR tasks, even under probing settings, while remaining fine-tunable on a single 2080Ti.

# Model Usage

## Huggingface Loading

```python
from transformers import Wav2Vec2Processor, Data2VecAudioModel
import torch
from datasets import load_dataset

# load demo audio and set processor
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

processor = Wav2Vec2Processor.from_pretrained("facebook/data2vec-audio-base-960h")

# loading our model weights
model = Data2VecAudioModel.from_pretrained("m-a-p/music2vec-v1")

# audio file is decoded on the fly
inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# take a look at the output shape
last_hidden_states = outputs.last_hidden_state
print(list(last_hidden_states.shape))  # [1, 292, 768]
```

Our model is based on the [data2vec audio model](https://huggingface.co/docs/transformers/model_doc/data2vec#transformers.Data2VecAudioModel).

# Citation

The paper can be found at [ISMIR](https://ismir2022program.ismir.net/lbd_410.html).

```bibtex
@misc{https://doi.org/10.48550/arxiv.2212.02508,
  doi = {10.48550/ARXIV.2212.02508},
  url = {https://arxiv.org/abs/2212.02508},
  author = {Li, Yizhi and Yuan, Ruibin and Zhang, Ge and Ma, Yinghao and Lin, Chenghua and Chen, Xingran and Ragni, Anton and Yin, Hanzhi and Hu, Zhijie and He, Haoyu and Benetos, Emmanouil and Gyenge, Norbert and Liu, Ruibo and Fu, Jie},
  keywords = {Sound (cs.SD), Artificial Intelligence (cs.AI), Machine Learning (cs.LG), Multimedia (cs.MM), Audio and Speech Processing (eess.AS), FOS: Computer and information sciences, FOS: Electrical engineering, electronic engineering, information engineering},
  title = {MAP-Music2Vec: A Simple and Effective Baseline for Self-Supervised Music Audio Representation Learning},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```
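
# Output Shape Sketch

The `[1, 292, 768]` shape printed in the Huggingface Loading example has one row per audio frame, so its middle dimension depends on the clip length. The sketch below estimates that frame count from the number of input samples. It assumes the checkpoint uses the default data2vec audio convolutional feature encoder (kernel sizes and strides as listed), which this card does not state; the values can be checked against `model.config.conv_kernel` and `model.config.conv_stride` on the loaded model. The helper name `num_output_frames` and the clip lengths are hypothetical, and a 16 kHz sampling rate is assumed.

```python
# Assumed defaults of the data2vec audio feature encoder; verify them against
# model.config.conv_kernel and model.config.conv_stride for the loaded checkpoint.
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples, kernels=CONV_KERNEL, strides=CONV_STRIDE):
    """Return the frame count left after a stack of unpadded 1-D convolutions."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        # An unpadded convolution yields floor((length - kernel) / stride) + 1 outputs.
        length = (length - kernel) // stride + 1
    return length


# Hypothetical clip lengths in samples (16 kHz assumed).
print(num_output_frames(16_000))  # 49
print(num_output_frames(93_680))  # 292
```

Under these assumptions, one second of audio maps to roughly 49 frames, which is useful when aligning frame-level features with time-stamped labels in probing experiments.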