Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
Abstract
Subject-driven text-to-image generation aims to produce images of a new subject within a desired context by accurately capturing both the visual characteristics of the subject and the semantic content of a text prompt. Traditional methods rely on time- and resource-intensive fine-tuning for subject alignment, while recent zero-shot approaches leverage on-the-fly image prompting, often sacrificing subject alignment. In this paper, we introduce Diptych Prompting, a novel zero-shot approach that reinterprets subject-driven generation as an inpainting task with precise subject alignment by leveraging the emergent property of diptych generation in large-scale text-to-image models. Diptych Prompting arranges an incomplete diptych with the reference image in the left panel and performs text-conditioned inpainting on the right panel. We further prevent unwanted content leakage by removing the background of the reference image, and improve the fine-grained details of the generated subject by enhancing the attention weights between the two panels during inpainting. Experimental results confirm that our approach significantly outperforms zero-shot image prompting methods, producing images that are visually preferred by users. Additionally, our method supports not only subject-driven generation but also stylized image generation and subject-driven image editing, demonstrating versatility across diverse image generation applications. Project page: https://diptychprompting.github.io/
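To make the pipeline described above concrete, here is a minimal sketch of the diptych-inpainting idea using the diffusers FluxFillPipeline and rembg for background removal. This is not the authors' official implementation; the model ID, panel sizes, prompt wording, and sampler settings are illustrative assumptions, and it omits the paper's attention-weight enhancement.

```python
# Minimal sketch of Diptych Prompting (illustrative, not the official code).
# Assumes a recent diffusers release with FluxFillPipeline and the rembg package.
import torch
from diffusers import FluxFillPipeline
from PIL import Image
from rembg import remove  # background removal to avoid content leakage

pipe = FluxFillPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Fill-dev", torch_dtype=torch.bfloat16
).to("cuda")

panel_w, panel_h = 768, 768
reference = Image.open("reference_subject.png").convert("RGB").resize((panel_w, panel_h))

# Remove the reference background and composite the subject onto white.
cutout = remove(reference)                       # RGBA with transparent background
left_panel = Image.new("RGB", cutout.size, "white")
left_panel.paste(cutout, mask=cutout.split()[-1])

# Left panel: reference subject. Right panel: blank region to be inpainted.
diptych = Image.new("RGB", (2 * panel_w, panel_h), "white")
diptych.paste(left_panel, (0, 0))

# Mask covers only the right panel, so inpainting fills it conditioned on the left panel.
mask = Image.new("L", (2 * panel_w, panel_h), 0)
mask.paste(Image.new("L", (panel_w, panel_h), 255), (panel_w, 0))

prompt = (
    "A diptych with two side-by-side images of the same subject. "
    "On the left, a photo of the subject. "
    "On the right, the same subject wearing a red hat in a park."
)

result = pipe(
    prompt=prompt,
    image=diptych,
    mask_image=mask,
    height=panel_h,
    width=2 * panel_w,
    guidance_scale=30.0,
    num_inference_steps=50,
).images[0]

# Crop the generated right panel as the final subject-driven image.
result.crop((panel_w, 0, 2 * panel_w, panel_h)).save("subject_driven_output.png")
```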
Community
We introduce Diptych Prompting, a novel zero-shot subject-driven text-to-image generation approach that reinterprets the task as inpainting with precise subject alignment by leveraging the emergent property of diptych generation in large-scale text-to-image models.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- RelationBooth: Towards Relation-Aware Customized Object Generation (2024)
- MagicEraser: Erasing Any Objects via Semantics-Aware Control (2024)
- 3DIS: Depth-Driven Decoupled Instance Synthesis for Text-to-Image Generation (2024)
- TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control (2024)
- DisEnvisioner: Disentangled and Enriched Visual Prompt for Customized Image Generation (2024)
- Boundary Attention Constrained Zero-Shot Layout-To-Image Generation (2024)
- Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement (2024)
transformer_flux.py, line 91 has an error: the enhancement cannot be applied to attn_weight here.
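For readers following this comment: torch.nn.functional.scaled_dot_product_attention fuses the softmax and never exposes attn_weight, so any cross-panel enhancement has to be applied to the pre-softmax logits in a manual attention computation. The sketch below shows one plausible way to do this; the function name, mask convention, and scale factor are illustrative assumptions, not the repository's actual code.

```python
import torch

def attention_with_panel_enhancement(q, k, v, ref_key_mask, scale_factor=1.3):
    """Manual attention where query-key logits attending to reference-panel
    (left-panel) key tokens are amplified before the softmax.

    q, k, v:        (batch, heads, seq_len, head_dim)
    ref_key_mask:   bool tensor of shape (seq_len,), True for reference-panel key tokens
    scale_factor:   >1 strengthens attention toward the reference panel (illustrative value)
    """
    scale = q.shape[-1] ** -0.5
    attn_logits = (q @ k.transpose(-2, -1)) * scale        # (B, H, Lq, Lk)
    attn_logits[..., ref_key_mask] *= scale_factor         # enhance cross-panel attention
    attn_weight = attn_logits.softmax(dim=-1)
    return attn_weight @ v

# Example usage: the first n_ref_tokens of the latent sequence belong to the left panel.
# ref_mask = torch.zeros(seq_len, dtype=torch.bool)
# ref_mask[:n_ref_tokens] = True
# out = attention_with_panel_enhancement(q, k, v, ref_mask)
```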
Most of this work looks like what I did and published on 10/11/2024 on Civitai, which was inspired by In-Context LoRA.
https://civitai.com/models/933018?modelVersionId=1044405
Your work on civitai appears to involve inserting a specific subject into a target image using Flux-Fill combined with in-context LoRA.
We would like to clearly highlight the key differences between your method and our Diptych Prompting approach:
- Our Diptych Prompting is training-free, leveraging an off-the-shelf, high-performance text-to-image (TTI) model with inpainting. In contrast, the In-Context LoRA approach that inspired your work requires training a LoRA model on a moderate amount of data for image generation.
- We introduce a novel training-free internal attention control aimed at enhancing performance, clearly distinguishing our Diptych Prompting contribution from In-Context LoRA and your methodology.
- Our Diptych Prompting focuses on subject-driven TTI generation, while your method is aimed at image editing by inserting a specific subject into an existing target image.
Considering the publication timelines, we recognize your work as an excellent concurrent approach with certain similarities and have included a citation to it in our final draft.