---
license: cc-by-nc-sa-4.0
language:
  - en
tags:
  - audio
---

Auffusion is a latent diffusion model (LDM) for text-to-audio (TTA) generation. Auffusion can generate realistic audio, including human sounds, animal sounds, natural and artificial sounds, and sound effects, from textual prompts. We introduce Auffusion, a TTA system that adapts text-to-image (T2I) diffusion frameworks to the TTA task, effectively leveraging their inherent generative strengths and precise cross-modal alignment. Our objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches while using limited data and computational resources. We release our model, inference code, and pre-trained checkpoints for the research community.

📣 We are releasing Auffusion-Full-no-adapter, which was pre-trained on all datasets described in the paper and is designed for easy audio manipulation.

📣 We are releasing Auffusion-Full, which was pre-trained on all datasets described in the paper.

📣 We are releasing Auffusion, which was pre-trained on AudioCaps.

Auffusion Model Family

- Auffusion: pre-trained on AudioCaps
- Auffusion-Full: pre-trained on all datasets described in the paper
- Auffusion-Full-no-adapter: pre-trained on all datasets described in the paper, without the adapter module, for compatibility with text-to-image pipelines

Code

Our code is released here: https://github.com/happylittlecat2333/Auffusion

We uploaded several Auffusion-generated samples here: https://auffusion.github.io

Please follow the instructions in the repository for installation, usage, and experiments.

Quickstart Guide

We made Auffusion-Full-no-adapter compatible with the text-to-image pipeline interface, so diffusers pipelines such as StableDiffusionPipeline, StableDiffusionImg2ImgPipeline, and StableDiffusionInpaintPipeline can be adapted for audio manipulation, as sketched below. Other audio manipulation examples can be seen in https://github.com/happylittlecat2333/Auffusion/notebooks; we only show the default text-to-audio example in full here.
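
As an illustration, here is a minimal sketch of audio-to-audio editing with the img2img pipeline; the placeholder input tensor, prompt, and strength value are assumptions for illustration, not taken from the repository:

import torch
from diffusers import StableDiffusionImg2ImgPipeline

# Reuse the Auffusion checkpoint with the standard img2img pipeline.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "auffusion/auffusion-full-no-adapter", torch_dtype=torch.float16
).to("cuda")

# Placeholder input: in practice this would be a 3x256x1024 spectrogram
# tensor produced the same way as output_spec in the quickstart below.
init_spec = torch.rand(3, 256, 1024)

edited_spec = pipe(
    prompt="A kitten mewing for attention",  # target description
    image=init_spec,                         # source spectrogram
    strength=0.6,                            # illustrative: how far to drift from the source
    num_inference_steps=100,
    output_type="pt",                        # keep a tensor for the vocoder
).images[0]

The edited spectrogram can then be passed through denormalize_spectrogram and the vocoder exactly as in the quickstart below.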

First, clone the repository and install the requirements:

git clone https://github.com/happylittlecat2333/Auffusion/
cd Auffusion
pip install -r requirements.txt
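
As an optional sanity check (not part of the original instructions), you can confirm the core dependencies import cleanly:

import torch, diffusers, soundfile
print("torch", torch.__version__, "| diffusers", diffusers.__version__)
print("CUDA available:", torch.cuda.is_available())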

Then, download the Auffusion-Full-no-adapter model and generate audio from a text prompt:

import IPython, torch, os
import soundfile as sf
from diffusers import StableDiffusionPipeline
from huggingface_hub import snapshot_download
from converter import Generator, denormalize_spectrogram  # shipped with the Auffusion repo

# Select the compute device; the pipeline below assumes CUDA for float16
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16

prompt = "A kitten mewing for attention"
seed = 42

# Resolve a local checkpoint directory, downloading from the Hugging Face Hub if needed
pretrained_model_name_or_path = "auffusion/auffusion-full-no-adapter"
if not os.path.isdir(pretrained_model_name_or_path):
    pretrained_model_name_or_path = snapshot_download(pretrained_model_name_or_path)

# Load the vocoder that turns generated spectrograms back into waveforms
vocoder = Generator.from_pretrained(pretrained_model_name_or_path, subfolder="vocoder")
vocoder = vocoder.to(device=device, dtype=dtype)

# The text-to-audio model loads as a regular Stable Diffusion pipeline
pipe = StableDiffusionPipeline.from_pretrained(pretrained_model_name_or_path, torch_dtype=dtype)
pipe = pipe.to(device)

# Seed the generator for reproducible outputs
generator = torch.Generator(device=device).manual_seed(seed)

with torch.autocast("cuda"):
    # output_type="pt" returns a torch tensor (required by the vocoder), and
    # height=256 with width=1024 matches the expected spectrogram shape
    output_spec = pipe(
        prompt=prompt, num_inference_steps=100, generator=generator, height=256, width=1024, output_type="pt"
    ).images[0]


# Convert the generated spectrogram back to a waveform
denorm_spec = denormalize_spectrogram(output_spec)
audio = vocoder.inference(denorm_spec)

# Save to disk and play back in a notebook
sf.write(f"{prompt}.wav", audio, samplerate=16000)
IPython.display.Audio(data=audio, rate=16000)

The Auffusion model is downloaded automatically from the Hugging Face Hub and saved in the local cache; subsequent runs load it directly from the cache.
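
If you prefer an explicit location over the default cache, here is a small sketch using the standard huggingface_hub API (the directory name is illustrative):

from huggingface_hub import snapshot_download

# Download (or reuse) the checkpoint under an explicit directory instead of
# the default ~/.cache/huggingface location.
local_dir = snapshot_download(
    "auffusion/auffusion-full-no-adapter",
    cache_dir="./auffusion_cache",  # illustrative path
)
print(local_dir)  # can be passed as pretrained_model_name_or_path above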

Citation

Please consider citing the following article if you find our work useful:

@article{xue2024auffusion,
  title={Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation}, 
  author={Jinlong Xue and Yayue Deng and Yingming Gao and Ya Li},
  journal={arXiv preprint arXiv:2401.01044},
  year={2024}
}