Similarities and distinctions from the related work "ConsistencyTTA"
Thank you for the awesome work! Accelerating text-to-audio generation is an important goal, and AudioLCM's contributions to this area are greatly appreciated.
We would like to bring to your attention our paper from September 2023, titled ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation, which explored a similar idea. ConsistencyTTA's code and model checkpoints are available here and here.
After a discussion with @liuhuadai, we agree that while ConsistencyTTA and AudioLCM share numerous similarities, they also have distinct differences.
The main similarities include:
- The latent-space consistency model and its general single-stage distillation and inference procedures (Section 3.2 of ConsistencyTTA and Section 3.5 of AudioLCM); a schematic sketch follows this list.
- Guided Distillation (Section 3.3 of ConsistencyTTA and Section 3.3 of AudioLCM).
- The use of AudioCaps as an evaluation benchmark for text-to-audio generation, together with the capability of fast, high-quality generation: both methods achieve hundreds-fold acceleration over diffusion baselines.
- A much coarser discretization scheme for the diffusion trajectory during consistency distillation than during the training of the teacher diffusion model (Section 3.2 of ConsistencyTTA and Section 3.4 of AudioLCM).
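For concreteness, here is a minimal PyTorch-style sketch of the single-stage latent consistency distillation step that both papers share. All identifiers (`student`, `ema_student`, `teacher_solver`, etc.) are illustrative placeholders, not code from either repository:

```python
import torch

def consistency_distillation_loss(student, ema_student, teacher_solver,
                                  z0, text_emb, t, t_prev):
    """One schematic distillation update. `teacher_solver` runs the frozen
    teacher's ODE solver from noise level t down to t_prev; all names here
    are hypothetical."""
    noise = torch.randn_like(z0)
    z_t = z0 + t * noise                                   # noise the clean latent to level t
    with torch.no_grad():
        z_prev = teacher_solver(z_t, t, t_prev, text_emb)  # one hop toward the data
        target = ema_student(z_prev, t_prev, text_emb)     # self-consistency target
    pred = student(z_t, t, text_emb)
    return torch.mean((pred - target) ** 2)                # enforce agreement across levels
```

The student is trained so that its outputs at adjacent noise levels agree, which is what allows it to map noise to data in very few steps at inference time.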
The main differences include:
- ConsistencyTTA additionally proposes to further fine-tune the consistency model by directly optimizing the CLAP score.
- AudioLCM additionally considers text-to-music generation.
- ConsistencyTTA emphasizes single-step generation, whereas AudioLCM emphasizes the few-step regime. In particular, ConsistencyTTA’s single-step performance ($\mathrm{FAD} = 2.4$, Table 1) seems stronger than AudioLCM’s single-step generation ($\mathrm{FAD} \approx 4$, Table 2b), but weaker than AudioLCM’s two-step generation ($\mathrm{FAD} = 1.67$, Table 1).
- ConsistencyTTA uses TANGO as the diffusion teacher model, whereas AudioLCM uses Make-An-Audio 2. As a result, the model architectures also differ: ConsistencyTTA uses a UNet, whereas AudioLCM uses an improved diffusion Transformer.
- ConsistencyTTA uses a single solver step to "jump" between the coarse discretization steps, whereas AudioLCM further divides these coarse intervals and performs multi-step ODE solving to "walk" between them (see the sketch after this list). Intuitively, AudioLCM’s approach incurs a smaller solver error (assuming the same solver), but takes more teacher queries for each training iteration.
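To make the "jump" versus "walk" distinction concrete, here is a hedged sketch. `teacher_ode_step` stands in for a single step of whichever numerical solver the teacher uses; it is not a function from either codebase:

```python
import torch

@torch.no_grad()
def jump(teacher_ode_step, z_t, t_hi, t_lo, cond):
    # ConsistencyTTA-style: one solver step across the whole coarse interval.
    return teacher_ode_step(z_t, t_hi, t_lo, cond)

@torch.no_grad()
def walk(teacher_ode_step, z_t, t_hi, t_lo, cond, n_sub=20):
    # AudioLCM-style (as we read Section 3.4): subdivide the coarse interval
    # and take several smaller solver steps. This gives a smaller per-interval
    # solver error, but costs n_sub teacher queries instead of one.
    ts = torch.linspace(t_hi, t_lo, n_sub + 1)
    z = z_t
    for a, b in zip(ts[:-1], ts[1:]):
        z = teacher_ode_step(z, a.item(), b.item(), cond)
    return z
```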
We therefore believe that AudioLCM is a valuable complement to ConsistencyTTA, providing important insights into consistency-model-powered text-to-audio generation. Shout out to @liuhuadai for the constructive discussion. The AudioLCM paper will be revised shortly to include this comparison.
Thank you for your attention to and effort in this discussion. We acknowledge that ConsistencyTTA applied consistency models to text-to-sound tasks earlier to accelerate generation. However, I would like to address the mentioned similarities one by one.
- Beyond text-to-sound, the latent consistency model (itself built on the Consistency Model) has also been applied to video generation (VideoLCM) and image generation (Trajectory Consistency Distillation). It can therefore be said that we have done further research on top of the latent consistency model.
- Guided Distillation essentially applies classifier-free guidance to the distillation process, a strategy commonly used in generation tasks such as AudioLDM and Make-An-Audio. The difference is that, following the latent consistency model, we feed the guidance scale into our backbone as a condition (see the sketch after this list), so we did not regard it as one of the core contributions of our paper.
- AudioCaps is a general text-to-sound benchmark used by many text-to-sound studies.
- As proposed in Section 3.4 of the AudioLCM paper, we use a multi-step ODE solver to accelerate the distillation process, which reduces the number of discretization steps from 1000 to 50 while better balancing performance and efficiency, as shown in the results in Section 4.2 of the paper. The ConsistencyTTA paper uses a one-step ODE solver in Section 3.2, so the two are different.
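To illustrate the guided-distillation point above, the following hedged sketch shows standard classifier-free guidance applied on the teacher side, with the guidance scale `w` then passed to the student as an extra condition. All identifiers are hypothetical:

```python
import torch

@torch.no_grad()
def cfg_teacher_target(teacher, z_t, t, text_emb, null_emb, w):
    """Classifier-free-guided teacher prediction used as the distillation
    target. Function and argument names are illustrative, not actual code
    from either repository."""
    cond = teacher(z_t, t, text_emb)       # text-conditional prediction
    uncond = teacher(z_t, t, null_emb)     # unconditional prediction
    return uncond + w * (cond - uncond)    # standard CFG mixing

# The LCM-style twist: the student also consumes w as a conditioning input,
# e.g. student(z_t, t, text_emb, w), so one distilled model covers a range
# of guidance strengths at inference time.
```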
In summary, the similarities between ConsistencyTTA and AudioLCM mostly lie in commonly used methods and strategies. In addition, AudioLCM has many obvious differences from ConsistencyTTA:
- Task Scope: We address two tasks, text-to-sound and text-to-music generation, while ConsistencyTTA targets text-to-sound only.
- Generation Focus: AudioLCM focuses on few-step sampling, while ConsistencyTTA targets one-step sampling. They use different sampling algorithms, as shown in Algorithm 2 of our paper (a schematic version follows this list).
- Main Contributions: We use a multi-step ODE solver to accelerate the distillation process and incorporate the LLaMA design into our backbone to achieve training stability. We further explore the balance between efficiency and quality through experimental analysis, concluding that 20 steps strike a good balance between efficiency and quality and that the model becomes stable after 10,000 training steps. These are the main contributions stated in our paper.
- Framework Foundation: Our work is built upon Make-An-Audio 2, which relies on an improved Diffusion Transformer structure as detailed in Section 3.2 of AudioLCM, whereas ConsistencyTTA is based on a UNet architecture. Because of this, AudioLCM and ConsistencyTTA have completely different components and codebases.
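For reference, here is a generic sketch of the few-step consistency sampling loop, the alternating denoise/re-noise pattern that both the one-step and few-step regimes specialize. It is only schematic with hypothetical names, not Algorithm 2 verbatim:

```python
import torch

@torch.no_grad()
def few_step_sample(model, text_emb, shape, sigmas):
    """Schematic multi-step consistency sampling. `sigmas` is a decreasing
    list of noise levels: a single entry recovers the one-step regime that
    ConsistencyTTA emphasizes, while more entries give the few-step regime
    AudioLCM focuses on. All names are hypothetical."""
    z = torch.randn(shape) * sigmas[0]
    x = model(z, sigmas[0], text_emb)       # one-shot estimate of the clean latent
    for s in sigmas[1:]:
        z = x + s * torch.randn_like(x)     # partially re-noise at level s
        x = model(z, s, text_emb)           # refine the estimate
    return x
```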
Finally, I do not think that AudioLCM is a supplement to ConsistencyTTA, because our work was not carried out on the basis of ConsistencyTTA, as can be seen from the public code and papers. In fact, our work is better described as an improvement and acceleration of Make-An-Audio 2. That said, we agree to add comparative experiments against ConsistencyTTA in our public paper, because this will help achieve a more comprehensive and fair comparison on text-to-sound tasks.