arxiv:2406.13735

StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images

Published on Jun 19 · Submitted by aluo-x on Jun 21

Abstract

Understanding the semantics of visual scenes is a fundamental challenge in Computer Vision. A key aspect of this challenge is that objects sharing similar semantic meanings or functions can exhibit striking visual differences, making accurate identification and categorization difficult. Recent advancements in text-to-image frameworks have led to models that implicitly capture natural scene statistics. These frameworks account for the visual variability of objects, as well as complex object co-occurrences and sources of noise such as diverse lighting conditions. By leveraging large-scale datasets and cross-attention conditioning, these models generate detailed and contextually rich scene representations. This capability opens new avenues for improving object recognition and scene understanding in varied and challenging environments. Our work presents StableSemantics, a dataset comprising 224 thousand human-curated prompts, processed natural language captions, over 2 million synthetic images, and 10 million attention maps corresponding to individual noun chunks. We explicitly leverage human-generated prompts that correspond to visually interesting stable diffusion generations, provide 10 generations per phrase, and extract cross-attention maps for each image. We explore the semantic distribution of generated images, examine the distribution of objects within images, and benchmark captioning and open vocabulary segmentation methods on our data. To the best of our knowledge, we are the first to release a diffusion dataset with semantic attributions. We expect our proposed dataset to catalyze advances in visual semantic understanding and provide a foundation for developing more sophisticated and effective visual models. Website: https://stablesemantics.github.io/StableSemantics
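The dataset ties each cross-attention map to a noun chunk in the caption. As a rough illustration of that chunking step (the paper does not name its exact tooling, so spaCy and the model name below are assumptions), something like the following would enumerate the noun chunks that the per-image attention maps are keyed on:

```python
# Minimal sketch: list the noun chunks of a caption, i.e. the units that each
# cross-attention map in the dataset corresponds to. spaCy and "en_core_web_sm"
# are assumptions; the paper does not specify its chunker.
import spacy

nlp = spacy.load("en_core_web_sm")

caption = "A red fox sleeping under an old oak tree at dusk"  # hypothetical caption
doc = nlp(caption)

for chunk in doc.noun_chunks:
    # Each chunk (e.g. "A red fox", "an old oak tree") would index one attention map.
    print(chunk.text, chunk.start, chunk.end)
```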

Community

Paper author · Paper submitter

TL;DR: We created a dataset of captions, SDXL Lightning images, and cross-attention maps from the diffusion model that correspond to individual objects in each caption.

Captions were collected from the Stable Diffusion Discord, specifically the ones that made it to Showdown and Pantheon (these were selected based on human preference data from all the images users submitted). We used an LLM to clean these prompts into natural language captions. A total of 235k unique prompts were collected; 224k remained after NSFW filtering and 200k after length filtering. Each of the 200k prompts has 10 generations from different seeds.
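For a concrete picture of the generation step, here is a minimal sketch of producing 10 fixed-seed generations per cleaned prompt. It uses the stock SDXL base pipeline from diffusers as a stand-in; the dataset itself was generated with SDXL Lightning, so the checkpoint id and default sampler settings here are assumptions:

```python
# Minimal sketch: ten seeded generations per prompt. The checkpoint id is a
# stand-in for the SDXL Lightning setup used in the paper.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # assumed checkpoint, not the paper's
    torch_dtype=torch.float16,
).to("cuda")

prompt = "A red fox sleeping under an old oak tree at dusk"  # hypothetical cleaned caption
images = []
for seed in range(10):  # ten different seeds per prompt, as described above
    generator = torch.Generator(device="cuda").manual_seed(seed)
    images.append(pipe(prompt, generator=generator).images[0])
```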

Images & semantic maps will be released once the paper is accepted, but I can send over the prompts now if you PM me.

