---
license: cc-by-nc-nd-4.0
---

# ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation

This page shares the official model checkpoints of the paper \
"Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation" \
from Microsoft Applied Science Group and UC Berkeley, \
by [Yatong Bai](https://bai-yt.github.io), [Trung Dang](https://www.microsoft.com/applied-sciences/people/trung-dang), [Dung Tran](https://www.microsoft.com/applied-sciences/people/dung-tran), [Kazuhito Koishida](https://www.microsoft.com/applied-sciences/people/kazuhito-koishida), and [Somayeh Sojoudi](https://people.eecs.berkeley.edu/~sojoudi/).

**[[Preprint Paper](https://arxiv.org/abs/2309.10740)]**
**[[Project Homepage](https://consistency-tta.github.io)]**
**[[Code](https://github.com/Bai-YT/ConsistencyTTA)]**
**[[Model Checkpoints](https://huggingface.co/Bai-YT/ConsistencyTTA)]**
**[[Generation Examples](https://consistency-tta.github.io/demo.html)]**

## Description

This work proposes a *consistency distillation* framework to train text-to-audio (TTA) generation models that require only a single neural network query, reducing the computation of the core step of diffusion-based TTA models by a factor of 400. By incorporating *classifier-free guidance* into the distillation framework, our models retain diffusion models' impressive generation quality and diversity. Furthermore, the non-recurrent, differentiable structure of the consistency model allows for end-to-end fine-tuning with novel loss functions such as the CLAP score, further boosting performance.
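
For intuition only, below is a minimal PyTorch sketch of what one guidance-aware consistency distillation step could look like: the teacher's classifier-free-guided noise estimate drives a single deterministic solver step, and the student is trained to match an EMA copy of itself evaluated at that earlier step, so that inference needs just one network query. All names and signatures here (`student`, `ema_student`, `teacher`, the DDIM-style update, the guidance-weight argument `w`) are hypothetical placeholders and do not reflect the released code's API; see the [Code](https://github.com/Bai-YT/ConsistencyTTA) repository for the actual training implementation.

```python
import torch
import torch.nn.functional as F

def cfg_consistency_distillation_loss(
    student, ema_student, teacher,   # callables; signatures are illustrative
    z_t, t, t_prev,                  # noisy latent and adjacent integer timesteps
    alpha_bar,                       # cumulative noise schedule, shape [num_steps]
    text_emb, w,                     # text condition and CFG guidance weight
):
    """One guidance-aware consistency distillation step (conceptual sketch)."""
    with torch.no_grad():
        # Teacher's classifier-free-guided noise estimate at step t.
        eps_cond = teacher(z_t, t, text_emb)
        eps_uncond = teacher(z_t, t, None)
        eps = eps_uncond + w * (eps_cond - eps_uncond)

        # One deterministic DDIM-style solver step from t to t_prev.
        a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
        z0_hat = (z_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        z_prev = a_prev.sqrt() * z0_hat + (1 - a_prev).sqrt() * eps

        # Distillation target: the EMA student evaluated one solver step earlier.
        target = ema_student(z_prev, t_prev, text_emb, w)

    # The student maps any (z_t, t) directly to a clean-latent estimate,
    # so generation at inference time needs only a single network query.
    pred = student(z_t, t, text_emb, w)
    return F.mse_loss(pred, target)
```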