Whisper Multitask Analyzer
A transformer encoder-decoder model for automatic audio captioning. Unlike speech-to-text, captioning describes the content and features of an audio clip rather than transcribing the words spoken in it.
- Model, codebase & card adapted from: MU-NLPC/whisper-small-audio-captioning
- Model type: Whisper encoder-decoder transformer
- Language(s) (NLP): en
- License: cc-by-4.0
- Parent Model: openai/whisper-small
Usage
The model takes an audio clip (up to 30 s) as input to the encoder and a forced prefix to the decoder that selects the caption style. The forced prefix begins with an integer task ID; the ID-to-task mapping is defined in the model config and is exposed through the model's task_mapping and named_task_mapping attributes (a short sketch follows the table below).
The task mapping of the current model is:
| Task | ID | Description |
|---|---|---|
| tags | 0 | General descriptions; can include genres and features. |
| genre | 1 | Estimated musical genres. |
| mood | 2 | Estimated emotional feeling. |
| movement | 3 | Estimated audio pace and expression. |
| theme | 4 | Estimated audio usage (not very accurate). |
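As a quick illustration of the prefix format, here is a minimal sketch using the mapping shown above. The dictionaries are written out literally for illustration; the full example below reads them from the model itself.
# The task mapping above, written out literally for illustration.
task_mapping = {0: "tags", 1: "genre", 2: "mood", 3: "movement", 4: "theme"}
named_task_mapping = {name: idx for idx, name in task_mapping.items()}
# The forced decoder prefix is simply the task ID followed by ": ",
# e.g. "1: " requests a genre caption.
style_prefix = f"{named_task_mapping['genre']}: "
print(style_prefix)  # -> "1: "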
Minimal example:
import librosa
import transformers
# WhisperForAudioCaptioning is provided in the model repository (see the note below)
# Load model
checkpoint = "DionTimmer/whisper-small-multitask-analyzer"
model = WhisperForAudioCaptioning.from_pretrained(checkpoint)
tokenizer = transformers.WhisperTokenizer.from_pretrained(checkpoint, language="en", task="transcribe")
feature_extractor = transformers.WhisperFeatureExtractor.from_pretrained(checkpoint)
# Load and preprocess audio
input_file = "..."
audio, sampling_rate = librosa.load(input_file, sr=feature_extractor.sampling_rate)
features = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt").input_features
# Mappings by ID
print(model.task_mapping) # {0: 'tags', 1: 'genre', 2: 'mood', 3: 'movement', 4: 'theme'}
# Inverted
print(model.named_task_mapping) # {'tags': 0, 'genre': 1, 'mood': 2, 'movement': 3, 'theme': 4}
# Prepare caption style
style_prefix = f"{model.named_task_mapping['tags']}: "
style_prefix_tokens = tokenizer("", text_target=style_prefix, return_tensors="pt", add_special_tokens=False).labels
# Generate caption
model.eval()
outputs = model.generate(
inputs=features.to(model.device),
forced_ac_decoder_ids=style_prefix_tokens,
max_length=100,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
Example output: 0: advertising, beautiful, beauty, bright, cinematic, commercial, corporate, emotional, epic, film, heroic, hopeful, inspiration, inspirational, inspiring, love, love story, movie, orchestra, orchestral, piano, positive, presentation, romantic, sentimental
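Building on the example above, the same objects (model, tokenizer, features) can be reused to generate every caption type for a clip by iterating over the task mapping. This is a sketch, not part of the original example:
# Generate one caption per task: tags, genre, mood, movement, theme
for task_id, task_name in model.task_mapping.items():
    style_prefix = f"{task_id}: "
    style_prefix_tokens = tokenizer(
        "", text_target=style_prefix, return_tensors="pt", add_special_tokens=False
    ).labels
    outputs = model.generate(
        inputs=features.to(model.device),
        forced_ac_decoder_ids=style_prefix_tokens,
        max_length=100,
    )
    print(task_name, "->", tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])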
Note that WhisperTokenizer must be initialized with language="en" and task="transcribe".
The model class WhisperForAudioCaptioning can be found in the git repository or here on the HuggingFace Hub in the model repository. The class overrides the default Whisper generate method to support forcing a decoder prefix.
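If you prefer not to clone the git repository, one possible way to load the class is to download the modeling file from the Hub and import it directly. This is only a sketch: the file name modeling_whisper.py is an assumption, so check the model repository for the actual file name.
import importlib.util
from huggingface_hub import hf_hub_download
# Assumption: the modeling file in the repo is named "modeling_whisper.py";
# adjust the filename to whatever the repository actually contains.
path = hf_hub_download("DionTimmer/whisper-small-multitask-analyzer", "modeling_whisper.py")
spec = importlib.util.spec_from_file_location("whisper_ac", path)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
WhisperForAudioCaptioning = module.WhisperForAudioCaptioning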
License
The model weights are published under the non-commercial CC BY-NC 4.0 license, as the model was fine-tuned on a dataset intended for non-commercial use.