ConvNeXt-Tiny-AT is an audio tagging CNN model, trained on AudioSet (balanced+unbalanced subsets). It reached 0.471 mAP on the test set (Paper).
The model was trained on 10-second audio recordings sampled at 32 kHz, but you can provide any audio file: the code snippet below includes resampling and padding/cropping.
The model provides logits and probabilities for the 527 audio event tags of AudioSet (see http://research.google.com/audioset/index.html).
Two additional methods return scene embeddings (a single vector per file) and frame-level embeddings; see below. The scene embedding is obtained from the frame-level embeddings by mean pooling over the frequency dimension, followed by mean pooling + max pooling over the time dimension.
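The pooling described above can be sketched in a few lines of PyTorch. This is only an illustration, not the repository's code: the helper name `pool_scene_embedding` is hypothetical, the tensor layout is assumed to be (batch, channels, time, frequency), matching the frame-level shape [1, 768, 31, 7] reported further down, and "mean pooling + max pooling" is read as the element-wise sum of the two pooled vectors.

```python
import torch


def pool_scene_embedding(frame_embeddings: torch.Tensor) -> torch.Tensor:
    """Pool frame-level embeddings into one vector per clip.

    Assumption: the input layout is (batch, channels, time, frequency).
    """
    # Mean pooling over the (assumed) frequency dimension -> (batch, channels, time)
    x = frame_embeddings.mean(dim=3)
    # Mean pooling and max pooling over the (assumed) time dimension
    x_mean = x.mean(dim=2)
    x_max = x.max(dim=2).values
    # Assumption: the two pooled vectors are summed element-wise
    return x_mean + x_max


# Random stand-in for frame-level embeddings of one 10-second clip
frames = torch.randn(1, 768, 31, 7)
scene = pool_scene_embedding(frames)
print(scene.shape)  # torch.Size([1, 768])
```

With the model loaded as in the Usage section, the result of `pool_scene_embedding(model.forward_frame_embeddings(waveform))` could be compared against `model.forward_scene_embeddings(waveform)`; a mismatch would mean that one of the assumptions above does not hold.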
Install
This code is based on our repo: https://github.com/topel/audioset-convnext-inf
You can pip install it:
pip install git+https://github.com/topel/audioset-convnext-inf@pip-install
Usage
Below is an example of how to instantiate the model, make tag predictions on an audio sample, and get embeddings (scene and frame levels).
import os
import numpy as np
import torch
from torch.nn import functional as TF
import torchaudio
import torchaudio.functional as TAF
from audioset_convnext_inf.pytorch.convnext import ConvNeXt
from audioset_convnext_inf.utils.utilities import read_audioset_label_tags
model = ConvNeXt.from_pretrained("topel/ConvNeXt-Tiny-AT", map_location='cpu')
print(
    "# params:",
    sum(param.numel() for param in model.parameters() if param.requires_grad),
)
# Use the GPU when one is available, otherwise stay on the CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
if "cuda" in str(device):
    model = model.to(device)
Output:
# params: 28222767
Inference: get logits and probabilities
To run the following, first download 254906__tpellegrini__cavaco1.wav and class_labels_indices.csv from this repository.
sample_rate = 32000
audio_target_length = 10 * sample_rate # 10 s
# AUDIO_FNAME = "f62-S-v2swA_200000_210000.wav"
AUDIO_FNAME = "254906__tpellegrini__cavaco1.wav"
current_dir=os.getcwd()
AUDIO_FPATH = os.path.join(current_dir, AUDIO_FNAME)
waveform, sample_rate_ = torchaudio.load(AUDIO_FPATH)
# Resample to the 32 kHz rate the model was trained on.
if sample_rate_ != sample_rate:
    print("Resampling from %d to 32000 Hz"%sample_rate_)
    waveform = TAF.resample(
        waveform,
        sample_rate_,
        sample_rate,
    )
# Zero-pad or crop so that the waveform is exactly 10 s long.
if waveform.shape[-1] < audio_target_length:
    print("Padding waveform")
    missing = max(audio_target_length - waveform.shape[-1], 0)
    waveform = TF.pad(waveform, (0,missing), mode="constant", value=0.0)
elif waveform.shape[-1] > audio_target_length:
    print("Cropping waveform")
    waveform = waveform[:, :audio_target_length]
waveform = waveform.contiguous()
waveform = waveform.to(device)
print("\nInference on " + AUDIO_FNAME + "\n")
with torch.no_grad():
    model.eval()
    output = model(waveform)
logits = output["clipwise_logits"]
print("logits size:", logits.size())
probs = output["clipwise_output"]
# Equivalent: probs = torch.sigmoid(logits)
print("probs size:", probs.size())
lb_to_ix, ix_to_lb, id_to_ix, ix_to_id = read_audioset_label_tags(os.path.join(current_dir, "class_labels_indices.csv"))
threshold = 0.25
sample_labels = np.where(probs[0].clone().detach().cpu() > threshold)[0]
print("\nPredicted labels using activity threshold 0.25:\n")
# print(sample_labels)
for l in sample_labels:
    print("%s: %.3f"%(ix_to_lb[l], probs[0,l]))
Output:
Resampling from 44100 to 32000 Hz
Padding waveform
Inference on 254906__tpellegrini__cavaco1.wav
logits size: torch.Size([1, 527])
probs size: torch.Size([1, 527])
Predicted labels using activity threshold 0.25:
Music: 0.896
Musical instrument: 0.686
Plucked string instrument: 0.608
Guitar: 0.369
Mandolin: 0.710
Ukulele: 0.268
Technically speaking, the instrument is neither a Mandolin nor a Ukulele, but a Brazilian cousin, the cavaquinho!
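A fixed threshold such as 0.25 returns a variable number of tags. As an alternative, the highest-scoring tags can be listed with `torch.topk`. The sketch below is an illustration under assumptions: the helper `top_k_tags` is hypothetical, and the toy probabilities and label mapping simply reuse the six values printed above rather than a real 527-dimensional output.

```python
import torch


def top_k_tags(probs, ix_to_lb, k=5):
    """Return the k highest-probability (label, probability) pairs.

    `probs` is a 1-D tensor of tag probabilities and `ix_to_lb` maps an
    integer class index to its label (hypothetical helper, not part of the repo).
    """
    values, indices = torch.topk(probs, k=min(k, probs.numel()))
    return [(ix_to_lb[i], v) for i, v in zip(indices.tolist(), values.tolist())]


# Toy stand-ins built from the probabilities printed above
toy_probs = torch.tensor([0.896, 0.686, 0.608, 0.369, 0.710, 0.268])
toy_labels = {
    0: "Music",
    1: "Musical instrument",
    2: "Plucked string instrument",
    3: "Guitar",
    4: "Mandolin",
    5: "Ukulele",
}

for label, p in top_k_tags(toy_probs, toy_labels, k=3):
    print("%s: %.3f" % (label, p))
```

With the variables from the inference snippet, the call would presumably be `top_k_tags(probs[0].cpu(), ix_to_lb, k=5)`.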
Get audio scene embeddings
with torch.no_grad():
    model.eval()
    output = model.forward_scene_embeddings(waveform)
print("\nScene embedding, shape:", output.size())
Output:
Scene embedding, shape: torch.Size([1, 768])
Get frame-level embeddings
with torch.no_grad():
    model.eval()
    output = model.forward_frame_embeddings(waveform)
print("\nFrame-level embeddings, shape:", output.size())
Output:
Frame-level embeddings, shape: torch.Size([1, 768, 31, 7])
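To relate these embeddings to positions in the audio, a rough time span can be attached to each frame. The sketch below assumes that the dimension of size 31 is the time axis and that the frames tile the 10-second clip uniformly; both are assumptions, the receptive field of the network is ignored, and `frame_times` is a hypothetical helper.

```python
def frame_times(num_frames, clip_duration_s):
    """Approximate (start, end) times in seconds for each embedding frame.

    Assumption: frames split the clip into equal, non-overlapping spans.
    """
    hop = clip_duration_s / num_frames
    return [(i * hop, (i + 1) * hop) for i in range(num_frames)]


# 31 frames for a 10-second clip, as in the shape printed above
spans = frame_times(31, 10.0)
print("%.3f s per frame" % (spans[0][1] - spans[0][0]))  # 0.323 s per frame
```

Under these assumptions, each of the 31 frames covers roughly a third of a second of audio.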
Zenodo
The checkpoint is also available on Zenodo: https://zenodo.org/record/8020843/files/convnext_tiny_471mAP.pth?download=1
Citation
Cite as: Pellegrini, T., Khalfaoui-Hassani, I., Labbé, E., Masquelier, T. (2023) Adapting a ConvNeXt Model to Audio Classification on AudioSet. Proc. INTERSPEECH 2023, 4169-4173, doi: 10.21437/Interspeech.2023-1564
@inproceedings{pellegrini23_interspeech,
author={Thomas Pellegrini and Ismail Khalfaoui-Hassani and Etienne Labb\'e and Timoth\'ee Masquelier},
title={{Adapting a ConvNeXt Model to Audio Classification on AudioSet}},
year=2023,
booktitle={Proc. INTERSPEECH 2023},
pages={4169--4173},
doi={10.21437/Interspeech.2023-1564}
}