---
language:
- en
pipeline_tag: text-to-audio
tags:
- text-to-audio
license: other
---
# Improving Text-To-Audio Models with Synthetic Captions

🎵 We propose an audio captioning pipeline that uses an audio language model to synthesize accurate and diverse captions for audio at scale. We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named AF-AudioSet. We then pre-train our Tango family of text-to-audio models on these synthetic captions. 🎶

**This checkpoint was fine-tuned on the AudioCaps dataset.**

[Read the paper](https://arxiv.org/pdf/2406.15487)

## Code

Our code is released here: [https://github.com/declare-lab/tango](https://github.com/declare-lab/tango)


Please follow the instructions in the repository for installation, usage and experiments.

## Quickstart Guide

Download the model and generate audio from a text prompt:

```python
import IPython
import soundfile as sf
from tango import Tango

tango = Tango("declare-lab/tango-af-ac-ft-ac")

prompt = "An audience cheering and clapping"
audio = tango.generate(prompt)
sf.write(f"{prompt}.wav", audio, samplerate=16000)
IPython.display.Audio(data=audio, rate=16000)
```
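
The snippet above uses the raw prompt as the file name, which can fail when a prompt contains characters such as `/`, `:` or `?`. As a minimal sketch, a small helper can derive a conservative file name first. The name `prompt_to_filename` and its parameters are hypothetical and not part of the Tango API:

```python
import re


def prompt_to_filename(prompt: str, ext: str = ".wav", max_len: int = 80) -> str:
    """Turn a free-form prompt into a conservative, portable file name.

    Hypothetical helper for illustration; it is not part of the Tango package.
    """
    # Lowercase, then collapse every run of non-alphanumeric characters to "_".
    slug = re.sub(r"[^a-z0-9]+", "_", prompt.lower()).strip("_")
    # Truncate long prompts and fall back to a fixed name if nothing is left.
    slug = slug[:max_len].rstrip("_") or "audio"
    return slug + ext


print(prompt_to_filename("An audience cheering and clapping"))
# an_audience_cheering_and_clapping.wav
```

The result can then be passed to `sf.write` in place of `f"{prompt}.wav"`. Note that this sketch keeps only ASCII letters and digits, so a prompt written entirely in another script would fall back to `audio.wav`.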

The model is downloaded automatically and cached. Subsequent runs load it directly from the cache.

The `generate` function samples from the latent diffusion model with 100 steps by default. We recommend 200 steps for higher-quality audio, at the cost of a longer run time.

```python
prompt = "Rolling thunder with lightning strikes"
audio = tango.generate(prompt, steps=200)
IPython.display.Audio(data=audio, rate=16000)
```
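
To see how much extra run time the additional steps cost on your hardware, you can time the calls yourself. The sketch below is only an illustration: `timed` is a hypothetical helper, and `fake_generate` is a stand-in so the snippet runs without the model. In practice you would pass `tango.generate` instead.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_generate(prompt: str, steps: int = 100) -> list:
    """Stand-in for tango.generate (assumption): sleeps in proportion to steps."""
    time.sleep(steps * 1e-4)
    return [0.0] * 4


for steps in (100, 200):
    _, seconds = timed(fake_generate, "Rolling thunder with lightning strikes", steps=steps)
    print(f"steps={steps}: {seconds:.3f}s")
```

With the real model, replace `fake_generate` with `tango.generate`. Absolute timings depend on the device.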


Use the `generate_for_batch` function to generate multiple audio samples for a batch of text prompts:

```python
prompts = [
    "A car engine revving",
    "A dog barks and rustles with some clicking",
    "Water flowing and trickling"
]
audios = tango.generate_for_batch(prompts, samples=2)
```
This will generate two samples for each of the three text prompts.
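
A minimal sketch for walking over such a batch result follows. It assumes that, with `samples` greater than 1, the return value is a list with one entry per prompt and that each entry is itself a list of waveforms. Please verify this structure against your installed version. `iter_batch_outputs` is a hypothetical helper, and the strings stand in for real waveforms.

```python
from typing import Any, Iterator, Sequence, Tuple


def iter_batch_outputs(
    prompts: Sequence[str], audios: Sequence[Sequence[Any]]
) -> Iterator[Tuple[str, int, Any]]:
    """Yield (prompt, sample_index, waveform) for a nested batch result.

    Assumes audios[i] holds all samples generated for prompts[i].
    """
    if len(prompts) != len(audios):
        raise ValueError("expected one group of samples per prompt")
    for prompt, waves in zip(prompts, audios):
        for k, wave in enumerate(waves):
            yield prompt, k, wave


# Placeholder data so the sketch runs without the model.
demo_prompts = ["A car engine revving", "Water flowing and trickling"]
demo_audios = [["wave_0a", "wave_0b"], ["wave_1a", "wave_1b"]]

for prompt, k, wave in iter_batch_outputs(demo_prompts, demo_audios):
    print(f"{prompt!r} sample {k}: {wave}")
```

Inside the loop, each real waveform could be written with `sf.write(..., wave, samplerate=16000)`, as in the single-prompt example.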

## Citation
Please consider citing the following article if you find our work useful:
```
@article{kong2024improving,
  title={Improving Text-To-Audio Models with Synthetic Captions},
  author={Kong, Zhifeng and Lee, Sang-gil and Ghosal, Deepanway and Majumder, Navonil and Mehrish, Ambuj and Valle, Rafael and Poria, Soujanya and Catanzaro, Bryan},
  journal={arXiv preprint arXiv:2406.15487},
  year={2024}
}
```