---
license: cc-by-nc-nd-4.0
---

# ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation

This page shares the official model checkpoints of the paper \
"Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation" \
from Microsoft Applied Science Group and UC Berkeley, \
by [Yatong Bai](https://bai-yt.github.io), [Trung Dang](https://www.microsoft.com/applied-sciences/people/trung-dang), [Dung Tran](https://www.microsoft.com/applied-sciences/people/dung-tran), [Kazuhito Koishida](https://www.microsoft.com/applied-sciences/people/kazuhito-koishida), and [Somayeh Sojoudi](https://people.eecs.berkeley.edu/~sojoudi/).

**[[Preprint Paper](https://arxiv.org/abs/2309.10740)]**
**[[Project Homepage](https://consistency-tta.github.io)]**
**[[Code](https://github.com/Bai-YT/ConsistencyTTA)]**
**[[Model Checkpoints](https://huggingface.co/Bai-YT/ConsistencyTTA)]**
**[[Generation Examples](https://consistency-tta.github.io/demo.html)]**

## Description

This work proposes a *consistency distillation* framework to train text-to-audio (TTA) generation models that require only a single neural network query, reducing the computation of the core step of diffusion-based TTA models by a factor of 400. By incorporating *classifier-free guidance* into the distillation framework, our models retain diffusion models' impressive generation quality and diversity. Furthermore, the non-recurrent, differentiable structure of the consistency model allows for end-to-end fine-tuning with novel loss functions such as the CLAP score, further boosting performance.
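
For intuition only, below is a minimal PyTorch sketch of what one guidance-aware consistency distillation step could look like: the teacher's classifier-free-guided noise estimate drives a single deterministic solver step, and the student is trained to match an EMA copy of itself evaluated at that earlier step, so that inference needs just one network query. All names and signatures here (`student`, `ema_student`, `teacher`, the DDIM-style update, the guidance-weight argument `w`) are hypothetical placeholders and do not reflect the released code's API; see the [Code](https://github.com/Bai-YT/ConsistencyTTA) repository for the actual training implementation.

```python
import torch
import torch.nn.functional as F

def cfg_consistency_distillation_loss(
    student, ema_student, teacher,   # callables; signatures are illustrative
    z_t, t, t_prev,                  # noisy latent and adjacent integer timesteps
    alpha_bar,                       # cumulative noise schedule, shape [num_steps]
    text_emb, w,                     # text condition and CFG guidance weight
):
    """One guidance-aware consistency distillation step (conceptual sketch)."""
    with torch.no_grad():
        # Teacher's classifier-free-guided noise estimate at step t.
        eps_cond = teacher(z_t, t, text_emb)
        eps_uncond = teacher(z_t, t, None)
        eps = eps_uncond + w * (eps_cond - eps_uncond)

        # One deterministic DDIM-style solver step from t to t_prev.
        a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
        z0_hat = (z_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        z_prev = a_prev.sqrt() * z0_hat + (1 - a_prev).sqrt() * eps

        # Distillation target: the EMA student evaluated one solver step earlier.
        target = ema_student(z_prev, t_prev, text_emb, w)

    # The student maps any (z_t, t) directly to a clean-latent estimate,
    # so generation at inference time needs only a single network query.
    pred = student(z_t, t, text_emb, w)
    return F.mse_loss(pred, target)
```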