---
license: apache-2.0
datasets:
- jp1924/AudioCaps
language:
- en
---
[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/wsntxxn/efficient_audio_captioning)
[![arXiv](https://img.shields.io/badge/arXiv-2407.14329-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2407.14329)

# Model Details
This is a lightweight audio captioning model with an EfficientNet-B2 encoder and a two-layer Transformer decoder. The model is trained on AudioCaps and unlabeled AudioSet.

# Dependencies
Install the dependencies required to run the model:
```bash
pip install numpy torch torchaudio einops transformers efficientnet_pytorch
```
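To confirm the installation before loading the model, a quick check along these lines may help. It is a minimal sketch, not part of the model's code: the helper name `find_missing` is hypothetical, and it assumes each package is imported under the same name used in the `pip` command above.

```python
from importlib.util import find_spec

# Import names of the packages listed in the pip command above (assumed to
# match the pip package names).
REQUIRED = [
    "numpy",
    "torch",
    "torchaudio",
    "einops",
    "transformers",
    "efficientnet_pytorch",
]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if find_spec(name) is None]


missing = find_missing(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All dependencies are importable.")
```

The check only looks the modules up without importing them, so it runs quickly even when `torch` is installed.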

# Usage
```python
import torch
from transformers import AutoModel, PreTrainedTokenizerFast
import torchaudio


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# use the model trained on AudioCaps
model = AutoModel.from_pretrained(
    "wsntxxn/effb2-trm-audiocaps-captioning",
    trust_remote_code=True
).to(device)
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "wsntxxn/audiocaps-simple-tokenizer"
)

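# inference on a single audio clip
wav, sr = torchaudio.load("/path/to/file.wav")
# resample to the sample rate the model expects
wav = torchaudio.functional.resample(wav, sr, model.config.sample_rate)
# downmix multi-channel audio to mono, keeping the shape (1, num_samples)
if wav.size(0) > 1:
    wav = wav.mean(0).unsqueeze(0)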
with torch.no_grad():
    word_idxs = model(
        audio=wav.to(device),
        audio_length=[wav.size(1)],
    )
caption = tokenizer.decode(word_idxs[0], skip_special_tokens=True)
print(caption)

# inference on a batch
wav1, sr1 = torchaudio.load("/path/to/file1.wav")
wav1 = torchaudio.functional.resample(wav1, sr1, model.config.sample_rate)
wav1 = wav1.mean(0) if wav1.size(0) > 1 else wav1[0]

wav2, sr2 = torchaudio.load("/path/to/file2.wav")
wav2 = torchaudio.functional.resample(wav2, sr2, model.config.sample_rate)
wav2 = wav2.mean(0) if wav2.size(0) > 1 else wav2[0]

wav_batch = torch.nn.utils.rnn.pad_sequence([wav1, wav2], batch_first=True)

with torch.no_grad():
    word_idxs = model(
        audio=wav_batch.to(device),
        audio_length=[wav1.size(0), wav2.size(0)],
    )
captions = tokenizer.batch_decode(word_idxs, skip_special_tokens=True)
print(captions)
```
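The batch example above repeats the same downmix and padding steps for each file. For more than two clips, those steps could be factored into small helpers, as in the sketch below. The names `to_mono` and `make_batch` are hypothetical and not part of the model's API, and the random tensors stand in for clips that have already been loaded and resampled.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def to_mono(wav: torch.Tensor) -> torch.Tensor:
    """Collapse a (channels, samples) waveform to a 1-D (samples,) tensor."""
    return wav.mean(0) if wav.size(0) > 1 else wav[0]


def make_batch(wavs):
    """Zero-pad 1-D waveforms to a common length and record the true lengths."""
    lengths = [w.size(0) for w in wavs]
    batch = pad_sequence(wavs, batch_first=True)  # shape: (batch, max_samples)
    return batch, lengths


# Synthetic stand-ins for two loaded, already resampled clips (not real audio).
stereo = torch.randn(2, 16000)
mono = torch.randn(1, 12000)

batch, lengths = make_batch([to_mono(stereo), to_mono(mono)])
print(batch.shape, lengths)
```

The returned `batch` and `lengths` correspond to the `audio` and `audio_length` arguments in the usage snippet. Passing the unpadded lengths tells the model how much of each padded row is real audio.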