a43992899 committed on
Commit 2e154fb
1 Parent(s): 573f43e

Update README.md

Files changed (1)
  1. README.md +14 -4
README.md CHANGED
@@ -16,6 +16,7 @@ Our model is SOTA-comparable on multiple MIR tasks even under probing settings,
```python
from transformers import Wav2Vec2Processor, Data2VecAudioModel
import torch
+ from torch import nn
from datasets import load_dataset

# load demo audio and set processor
@@ -31,11 +32,20 @@ model = Data2VecAudioModel.from_pretrained("m-a-p/music2vec-v1")
# audio file is decoded on the fly
inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
with torch.no_grad():
-     outputs = model(**inputs)
+     outputs = model(**inputs, output_hidden_states=True)

- # take a look at the output shape
- last_hidden_states = outputs.last_hidden_state
- print(list(last_hidden_states.shape)) # [1, 292, 768]
+ # take a look at the output shape; there are 13 layers of representations
+ # each layer performs differently on different downstream tasks, so choose the layer empirically
+ all_layer_hidden_states = torch.stack(outputs.hidden_states).squeeze()
+ print(all_layer_hidden_states.shape) # [13 layers, 292 timesteps, 768 feature_dim]
+
+ # for utterance-level classification tasks, you can simply average the representation over time
+ time_reduced_hidden_states = all_layer_hidden_states.mean(-2)
+ print(time_reduced_hidden_states.shape) # [13, 768]
+
+ # you can even use a learnable weighted-average representation (one input channel per layer)
+ aggregator = nn.Conv1d(in_channels=13, out_channels=1, kernel_size=1)
+ weighted_avg_hidden_states = aggregator(time_reduced_hidden_states).squeeze()
```

  Our model is based on the [data2vec audio model](https://huggingface.co/docs/transformers/model_doc/data2vec#transformers.Data2VecAudioModel).
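
A brief usage note on the layer-selection advice introduced in this commit: the sketch below is a minimal, hypothetical probing example, not part of the model card. It assumes the updated snippet above has already run (so `time_reduced_hidden_states` is a [13, 768] tensor); the layer index and the 10-class linear probe are illustrative placeholders.

```python
from torch import nn

# hypothetical probing sketch (assumes `time_reduced_hidden_states` from the snippet above)
layer_index = 7                                             # placeholder; in practice sweep layers 0..12 on a validation set
utterance_vector = time_reduced_hidden_states[layer_index]  # [768] utterance-level feature from one layer

probe = nn.Linear(768, 10)        # tiny linear probe; 10 is a placeholder number of classes
logits = probe(utterance_vector)  # [10]
print(logits.shape)               # torch.Size([10])
```

In practice such a probe (or the learnable Conv1d aggregator from the diff) would be trained on labeled downstream data while the Music2Vec features stay frozen.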