arxiv:2411.13836

CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation

Published on Nov 21, 2024
Abstract

Contrastive Language-Image Pre-training (CLIP) exhibits strong zero-shot classification ability on various image-level tasks, which has motivated research on adapting CLIP to pixel-level open-vocabulary semantic segmentation without additional training. The key is to improve the spatial representation of image-level CLIP, for example by replacing the self-attention map at the last layer with a self-self attention map or an attention map derived from a vision foundation model. In this paper, we present a novel hierarchical framework, named CLIPer, that hierarchically improves the spatial representation of CLIP. The proposed CLIPer comprises an early-layer fusion module and a fine-grained compensation module. We observe that the embeddings and attention maps at early layers preserve spatial structural information. Inspired by this, we design the early-layer fusion module to generate segmentation maps with better spatial coherence. Afterwards, we employ the fine-grained compensation module to compensate for local details using the self-attention maps of a diffusion model. We conduct experiments on seven segmentation datasets. Our proposed CLIPer achieves state-of-the-art performance on these datasets. For instance, using ViT-L, CLIPer achieves mIoU of 69.8% and 43.3% on VOC and COCO Object, outperforming ProxyCLIP by 9.2% and 4.1%, respectively.
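The abstract mentions replacing the standard query-key attention map in CLIP's last layer with a self-self attention map. A minimal NumPy sketch of that idea is shown below; it is an illustration under assumed shapes (N patch tokens with d-dimensional features), not the actual CLIPer implementation, which operates inside CLIP's multi-head vision transformer.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_self_attention(q, v):
    """Self-self attention: queries attend to other queries (q @ q.T)
    instead of keys, which tends to keep attention spatially coherent.
    Returns the attended values and the attention map itself."""
    d = q.shape[-1]
    attn = softmax(q @ q.T / np.sqrt(d))
    return attn @ v, attn

# toy example: 4 patch tokens with 8-dim features
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
v = rng.standard_normal((4, 8))
out, attn = self_self_attention(q, v)
```

The only change from standard attention is that the similarity matrix is computed between the queries and themselves, so each token attends most strongly to tokens with similar query features, including itself.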
