	---
	license: cc-by-nc-sa-4.0
	---

	# AudioLDM

	AudioLDM is a latent text-to-audio diffusion model capable of generating realistic audio samples given any text input. It is available in the 🧨 Diffusers library from v0.15.0 onwards.

	# Model Details

	AudioLDM was proposed in the paper [AudioLDM: Text-to-Audio Generation with Latent Diffusion Models](https://arxiv.org/abs/2301.12503) by Haohe Liu et al.

	Inspired by [Stable Diffusion](https://huggingface.co/CompVis/stable-diffusion-v1-4), AudioLDM
	is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/laion/clap-htsat-unfused)
	latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional
	sound effects, human speech and music.

	# Checkpoint Details

	This is the large version of the AudioLDM model, with twice as many UNet channels and head channels as the small checkpoints. The four AudioLDM checkpoints are summarised in the table below:

	Table 1: Summary of the AudioLDM checkpoints.

	| Checkpoint                                                            | Training Steps | Audio conditioning | CLAP audio dim | UNet dim | Params |
	|-----------------------------------------------------------------------|----------------|--------------------|----------------|----------|--------|
	| [audioldm-s-full](https://huggingface.co/cvssp/audioldm)              | 1.5M           | No                 | 768            | 128      | 421M   |
	| [audioldm-s-full-v2](https://huggingface.co/cvssp/audioldm-s-full-v2) | > 1.5M         | No                 | 768            | 128      | 421M   |
	| [audioldm-m-full](https://huggingface.co/cvssp/audioldm-m-full)       | 1.5M           | Yes                | 1024           | 192      | 652M   |
	| [audioldm-l-full](https://huggingface.co/cvssp/audioldm-l-full)       | 1.5M           | No                 | 768            | 256      | 975M   |
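
As a rough illustration, the table above can be written as plain data for choosing a checkpoint programmatically. This is a sketch, not part of Diffusers: the `CHECKPOINTS` dict and the `largest_checkpoint_within` helper are hypothetical names, the repository ids are taken from the table links, and parameter counts are in millions.

```python
# Values copied from Table 1; the dict and helper names are hypothetical.
CHECKPOINTS = {
    "audioldm-s-full": {"repo_id": "cvssp/audioldm", "unet_dim": 128, "params_m": 421},
    "audioldm-s-full-v2": {"repo_id": "cvssp/audioldm-s-full-v2", "unet_dim": 128, "params_m": 421},
    "audioldm-m-full": {"repo_id": "cvssp/audioldm-m-full", "unet_dim": 192, "params_m": 652},
    "audioldm-l-full": {"repo_id": "cvssp/audioldm-l-full", "unet_dim": 256, "params_m": 975},
}


def largest_checkpoint_within(max_params_m):
    """Return the name of the largest checkpoint that fits the budget, or None."""
    fitting = {
        name: info
        for name, info in CHECKPOINTS.items()
        if info["params_m"] <= max_params_m
    }
    if not fitting:
        return None
    return max(fitting, key=lambda name: fitting[name]["params_m"])


print(largest_checkpoint_within(700))  # audioldm-m-full

# The large checkpoint has twice the UNet dim of the small ones.
large = CHECKPOINTS["audioldm-l-full"]["unet_dim"]
small = CHECKPOINTS["audioldm-s-full"]["unet_dim"]
print(large / small)  # 2.0
```

The selected entry's `repo_id` can then be passed to `AudioLDMPipeline.from_pretrained`.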


	## Model Sources

	- [Original Repository](https://github.com/haoheliu/AudioLDM)
	- [🧨 Diffusers Pipeline](https://huggingface.co/docs/diffusers/api/pipelines/audioldm)
	- [Paper](https://arxiv.org/abs/2301.12503)
	- [Demo](https://huggingface.co/spaces/haoheliu/audioldm-text-to-audio-generation)

	# Usage

	First, install the required packages:

	```
	pip install --upgrade diffusers transformers accelerate
	```

	## Text-to-Audio

	For text-to-audio generation, the [AudioLDMPipeline](https://huggingface.co/docs/diffusers/api/pipelines/audioldm) can be
	used to load pre-trained weights and generate text-conditional audio outputs:

	```python
	from diffusers import AudioLDMPipeline
	import torch

	repo_id = "cvssp/audioldm-l-full"
	pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
	pipe = pipe.to("cuda")

	prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
	audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]
	```

	The resulting audio output can be saved as a .wav file:
	```python
	from scipy.io import wavfile

	# AudioLDM generates audio at a 16 kHz sampling rate
	wavfile.write("techno.wav", rate=16000, data=audio)
	```
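
When `audio` is a float array, SciPy stores the samples as floating-point WAV data, which some players and tools may not open. The sketch below shows one possible alternative that converts the samples to 16-bit PCM using only the Python standard library. The helper name `write_pcm16_wav` is hypothetical, and it assumes mono samples in the range [-1, 1] at the 16 kHz rate used above.

```python
import array
import sys
import wave


def write_pcm16_wav(path, samples, sampling_rate=16000):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file.

    Returns the number of samples written.
    """
    pcm = array.array("h")  # signed 16-bit integers
    for sample in samples:
        # Clip first so out-of-range values cannot overflow the 16-bit range
        clipped = max(-1.0, min(1.0, float(sample)))
        pcm.append(int(round(clipped * 32767)))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16 bits
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm)
```

Calling `write_pcm16_wav("techno.wav", audio)` could then be used in place of the SciPy call above. For long clips, a vectorised NumPy conversion would be faster than this per-sample loop.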

	Or displayed in a Jupyter Notebook / Google Colab:
	```python
	from IPython.display import Audio

	Audio(audio, rate=16000)
	```
	<audio controls>
	<source src="https://huggingface.co/datasets/sanchit-gandhi/audioldm-readme-samples/resolve/main/audioldm-l-full-techno.wav" type="audio/wav">
	Your browser does not support the audio element.
	</audio>


	## Tips

	Prompts:
	* Descriptive prompts work best: use adjectives to describe the sound (e.g., "high quality" or "clear") and make the prompt context-specific (e.g., "water stream in a forest" instead of "stream").
	* It's best to use general terms like 'cat' or 'dog' instead of specific names or abstract objects that the model may not be familiar with.

	Inference:
	* The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument: more steps generally give higher-quality audio at the cost of slower inference.
	* The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument.
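
To make the two inference tips concrete, the sketch below generates the same prompt at several step counts with a fixed seed and records how long each run takes. `expected_num_samples` and `compare_step_counts` are hypothetical helper names. `pipe` is assumed to be an `AudioLDMPipeline` loaded as in the Text-to-Audio example, and the sample count assumes the 16 kHz output rate used in the snippets above.

```python
import time


def expected_num_samples(audio_length_in_s, sampling_rate=16000):
    """Number of waveform samples expected for a clip of the given duration."""
    return int(audio_length_in_s * sampling_rate)


def compare_step_counts(pipe, prompt, step_counts=(10, 50), audio_length_in_s=5.0, seed=0):
    """Generate `prompt` once per step count; return {steps: (audio, seconds_taken)}."""
    import torch  # imported lazily so the helper above works without PyTorch

    results = {}
    for steps in step_counts:
        # Re-seed for every run so that only the number of steps differs
        generator = torch.Generator(device="cpu").manual_seed(seed)
        start = time.perf_counter()
        audio = pipe(
            prompt,
            num_inference_steps=steps,
            audio_length_in_s=audio_length_in_s,
            generator=generator,
        ).audios[0]
        results[steps] = (audio, time.perf_counter() - start)
    return results


print(expected_num_samples(5.0))  # 80000
```

Calling `compare_step_counts(pipe, prompt)` and listening to the resulting clips side by side gives a feel for the trade-off. The timings depend on the hardware.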

	# Citation

	BibTeX:
	```
	@article{liu2023audioldm,
	title={AudioLDM: Text-to-Audio Generation with Latent Diffusion Models},
	author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
	journal={arXiv preprint arXiv:2301.12503},
	year={2023}
	}
	```