arxiv:2502.09935

Precise Parameter Localization for Textual Generation in Diffusion Models

Published on Feb 14
· Submitted by lukasz-staniszewski on Feb 17
Abstract

Novel diffusion models can synthesize photo-realistic images with integrated high-quality text. Surprisingly, we demonstrate through attention activation patching that less than 1% of diffusion models' parameters, all contained in attention layers, influence the generation of textual content within the images. Building on this observation, we improve textual generation efficiency and performance by targeting the cross and joint attention layers of diffusion models. We introduce several applications that benefit from localizing the layers responsible for textual content generation. We first show that LoRA-based fine-tuning of only the localized layers further enhances the general text-generation capabilities of large diffusion models while preserving the quality and diversity of their generations. Then, we demonstrate how the localized layers can be used to edit textual content in generated images. Finally, we extend this idea to the practical use case of preventing the generation of toxic text at no additional cost. In contrast to prior work, our localization approach is broadly applicable across various diffusion model architectures, including U-Net (e.g., LDM and SDXL) and transformer-based (e.g., DeepFloyd IF and Stable Diffusion 3), utilizing diverse text encoders (e.g., from CLIP to large language models such as T5). Project page available at https://t2i-text-loc.github.io/.
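As a rough sanity check of the "less than 1%" figure, one can enumerate the cross-attention layers of an off-the-shelf checkpoint and compare their parameter count with the full denoiser. The sketch below is not the paper's code; it assumes the diffusers library, the public SDXL base checkpoint, and the standard attn2 naming of cross-attention blocks.

```python
# Minimal sketch (not from the paper): count cross-attention parameters in SDXL's U-Net
# and compare them against the full denoiser. Assumes the diffusers library and the
# public SDXL base checkpoint.
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    subfolder="unet",
    torch_dtype=torch.float16,
)

total = sum(p.numel() for p in unet.parameters())

# In diffusers U-Nets, cross-attention (text-conditioned) blocks are named "attn2".
cross_attn = sum(
    p.numel()
    for name, module in unet.named_modules()
    if name.endswith("attn2")
    for p in module.parameters()
)

print(f"U-Net parameters:           {total / 1e9:.2f}B")
print(f"Cross-attention parameters: {cross_attn / 1e6:.1f}M "
      f"({100 * cross_attn / total:.2f}% of the U-Net)")
# Note: the paper's claim is finer-grained than this count; activation patching is used
# to single out the small subset of attention layers that actually drives text rendering.
```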

Community

Paper author · Paper submitter

Precise Parameter Localization for Textual Generation in Diffusion Models

💡 What do we do?

We show that the textual content generated in images by diffusion models is controlled by less than 1% of their parameters. Using an activation patching technique, we localize the responsible attention layers across diverse diffusion architectures such as SDXL (U-Net), DeepFloyd IF (pixel diffusion), and SD3 (diffusion transformer).

[Figure: t2i-textloc.gif]
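The sketch below illustrates the general activation-patching idea on cross-attention outputs using PyTorch forward hooks: cache the activations of a few candidate layers while generating with a source prompt, then replay them during a generation with a different prompt and check whether the rendered text follows the source. The checkpoint, layer indices, and two-pass protocol are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative activation-patching sketch (not the paper's exact procedure): cache the outputs
# of a few cross-attention ("attn2") layers while generating with a source prompt, then replay
# them into a generation with a different prompt. If the patched layers carry the textual
# content, the rendered text in the second image should follow the source prompt.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Cross-attention (text-conditioned) layers in the diffusers U-Net are named "...attn2".
cross_attn_names = [n for n, _ in pipe.unet.named_modules() if n.endswith("attn2")]
patched_names = cross_attn_names[30:35]  # illustrative subset, not the paper's localized layers

source_prompt = 'a street sign with the word "HELLO" written on it'
target_prompt = 'a street sign with the word "WORLD" written on it'
steps, seed = 30, 0
cache = {name: [] for name in patched_names}

def record_hook(name):
    def hook(module, inputs, output):
        cache[name].append(output.detach().clone())  # one entry per denoising step
    return hook

def patch_hook(name, counters):
    def hook(module, inputs, output):
        out = cache[name][counters[name]]
        counters[name] += 1
        return out  # returning a tensor from a forward hook replaces the module's output
    return hook

# Pass 1: record the cross-attention outputs for the source prompt.
handles = [pipe.unet.get_submodule(n).register_forward_hook(record_hook(n))
           for n in patched_names]
pipe(source_prompt, num_inference_steps=steps,
     generator=torch.Generator("cuda").manual_seed(seed))
for h in handles:
    h.remove()

# Pass 2: generate with the target prompt while replaying the cached activations.
counters = {n: 0 for n in patched_names}
handles = [pipe.unet.get_submodule(n).register_forward_hook(patch_hook(n, counters))
           for n in patched_names]
image = pipe(target_prompt, num_inference_steps=steps,
             generator=torch.Generator("cuda").manual_seed(seed)).images[0]
for h in handles:
    h.remove()
image.save("patched.png")
```

In the paper, a patching-based comparison of this kind is used to score how strongly each attention layer influences the rendered text, which is how the small responsible subset of layers is identified.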

🚀 What are the benefits?

📈 We use the localized layers to fine-tune diffusion models more efficiently, improving text quality while preserving generation diversity (a LoRA setup along these lines is sketched below).
🎨 We introduce a novel method for editing text in generated images.
🚔 We employ our method to remove toxic words from images without altering other image attributes.

[Figure: sd3.png]
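As a rough illustration of the fine-tuning setup, the sketch below attaches LoRA adapters only to the cross-attention projections of the SDXL U-Net, so that training touches just a small set of text-relevant layers. It assumes the diffusers and peft libraries; the attn2 name filter stands in for the specific layers the paper localizes and is not taken from the paper's code.

```python
# Illustrative sketch (assumed setup, not the paper's training code): add LoRA adapters only to
# the q/k/v projections inside cross-attention ("attn2") blocks of the SDXL U-Net, leaving the
# rest of the model frozen. The name filter stands in for the paper's localized layers.
import torch
from diffusers import UNet2DConditionModel
from peft import LoraConfig

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet", torch_dtype=torch.float16
)
unet.requires_grad_(False)  # freeze all base weights; only LoRA weights will be trained

# Restrict LoRA to the q/k/v projections of cross-attention layers.
target_modules = [
    name
    for name, _ in unet.named_modules()
    if ".attn2." in name and name.split(".")[-1] in {"to_q", "to_k", "to_v"}
]

lora_config = LoraConfig(r=16, lora_alpha=16, init_lora_weights="gaussian",
                         target_modules=target_modules)
unet.add_adapter(lora_config)  # diffusers' peft integration injects the adapters in place

trainable = sum(p.numel() for p in unet.parameters() if p.requires_grad)
total = sum(p.numel() for p in unet.parameters())
print(f"Trainable LoRA parameters: {trainable / 1e6:.1f}M "
      f"({100 * trainable / total:.3f}% of the U-Net)")
```

A standard diffusion training loop (noise-prediction loss on text-heavy data) would then update only the injected LoRA weights on the selected layers.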

👀 Interested? See more!

📄 PDF
📚 arXiv
🌐 Project page

