Model Card: Pre-trained Audio Representation Models on AudioSet

Overview

This model card describes the pre-trained audio representation models released by ALM. All models are pre-trained on the full AudioSet dataset and are intended for general-purpose Audio Representation Learning (ARL) tasks.

Models

1. ALM/hubert-base-audioset

Architecture: HuBERT (Hubert-Base) transformer-based model
Description: This model is based on the HuBERT architecture, pre-trained on the full AudioSet dataset.

2. ALM/hubert-large-audioset

Architecture: HuBERT (Hubert-Large) transformer-based model
Description: Similar to the hubert-base-audioset model, this variant is larger in size, providing increased capacity for capturing audio representations from the full AudioSet dataset.

3. ALM/wav2vec2-base-audioset

Architecture: Wav2Vec 2.0 (Wav2Vec2-Base) transformer-based model
Description: This model is based on the Wav2Vec 2.0 architecture, pre-trained on the full AudioSet dataset using self-supervised learning (SSL) with a contrastive predictive coding (CPC) objective. It offers a different approach to audio representation learning compared to the HuBERT models.

4. ALM/wav2vec2-large-audioset

Architecture: Wav2Vec 2.0 (Wav2Vec2-Large) transformer-based model
Description: Similar to the wav2vec2-base-audioset model, this variant is larger in size, providing enhanced capacity for learning audio representations from the full AudioSet dataset.

Intended Use

These pre-trained models are intended for a wide range of ARL tasks, including but not limited to speech recognition, music classification, and acoustic event detection. They can be used as feature extractors or fine-tuned on task-specific datasets for downstream applications. Although the models are versatile across audio domains, their performance on speech-related tasks may be lower than that of specialized models such as the original Wav2Vec and HuBERT models. This is because AudioSet, the dataset used for pre-training, covers a wide range of audio sources beyond speech.
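A minimal sketch of the feature-extraction use case follows. It assumes the checkpoints load through the Hugging Face transformers library under the names listed above, that they expect 16 kHz mono waveforms, and that zero-mean, unit-variance input normalization is appropriate. None of these details is stated in this card. The helper names load_encoder and clip_embedding, and the mean pooling over time, are illustrative choices rather than the authors' method.

```python
import numpy as np
import torch
from transformers import AutoModel, Wav2Vec2FeatureExtractor


def load_encoder(model_id: str = "ALM/hubert-base-audioset"):
    """Download a checkpoint from the Hub and build a feature extractor for it."""
    model = AutoModel.from_pretrained(model_id)
    model.eval()  # disable dropout and training-time masking
    # Assumption: raw 16 kHz mono audio, normalized per utterance.
    feature_extractor = Wav2Vec2FeatureExtractor(
        feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True
    )
    return model, feature_extractor


def clip_embedding(waveform: np.ndarray, model, feature_extractor,
                   sampling_rate: int = 16000) -> torch.Tensor:
    """Return one fixed-size vector for a 1-D waveform by averaging frame features."""
    inputs = feature_extractor(
        waveform, sampling_rate=sampling_rate, return_tensors="pt"
    )
    with torch.no_grad():
        outputs = model(inputs.input_values)
    # last_hidden_state has shape (batch, frames, hidden); average over frames.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)
```

For example, calling load_encoder("ALM/hubert-base-audioset") and passing the result together with a waveform to clip_embedding would yield one vector per clip, with length equal to the encoder's hidden size. Such vectors can then feed a lightweight classifier, while fine-tuning updates the encoder weights as well.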

Limitations and Considerations

The models are pre-trained on the full AudioSet dataset, which may not cover all possible audio domains comprehensively.
Fine-tuning on domain-specific data may be necessary to achieve optimal performance for certain tasks.
Deploying and fine-tuning these models, especially the larger variants, may require substantial computational resources.
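To gauge the computational cost of a clip, it helps to estimate how many frames the transformer processes, since self-attention cost grows with sequence length. The sketch below assumes the standard convolutional feature encoder of Wav2Vec 2.0 and HuBERT (kernel sizes 10, 3, 3, 3, 3, 2, 2 with strides 5, 2, 2, 2, 2, 2, 2) and 16 kHz input. This card does not confirm that the ALM checkpoints keep these settings, and the function name num_frames is illustrative.

```python
# Assumed feature-encoder layout (reference Wav2Vec 2.0 / HuBERT configuration).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples: int) -> int:
    """Number of frame-level vectors produced for a waveform of num_samples samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        if length < kernel:
            return 0  # clip too short to pass through this layer
        # Unpadded 1-D convolution: floor((L - k) / s) + 1 output steps.
        length = (length - kernel) // stride + 1
    return length


# A 10-second clip at the assumed 16 kHz rate.
print(num_frames(160_000))  # 499
```

Under these assumptions the encoder emits roughly one frame every 20 ms, so longer clips and larger batches increase memory use accordingly.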

Citation

If you use these pre-trained models in your work, please cite the following paper:

@INPROCEEDINGS{ARCH,
  author={La Quatra, Moreno and Koudounas, Alkis and Vaiani, Lorenzo and Baralis, Elena and Cagliero, Luca and Garza, Paolo and Siniscalchi, Sabato Marco},
  booktitle={2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)}, 
  title={Benchmarking Representations for Speech, Music, and Acoustic Events}, 
  year={2024},
  pages={505-509},
  keywords={Representation learning; Systematics; Conferences; Benchmark testing; Signal processing; Acoustics; Data models; Audio Representation Learning; Benchmark; Pre-trained Models; Self-Supervised Learning},
  doi={10.1109/ICASSPW62465.2024.10625960}
}

arXiv version: arxiv.org/abs/2405.00934
