arxiv:2411.13836

CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation

Published on Nov 21, 2024
Abstract

Contrastive Language-Image Pre-training (CLIP) exhibits strong zero-shot classification ability on various image-level tasks, which has motivated research on adapting CLIP to pixel-level open-vocabulary semantic segmentation without additional training. The key is to improve the spatial representation of image-level CLIP, for example by replacing the self-attention map at the last layer with a self-self attention map or an attention map derived from a vision foundation model. In this paper, we present a novel hierarchical framework, named CLIPer, that hierarchically improves the spatial representation of CLIP. The proposed CLIPer comprises an early-layer fusion module and a fine-grained compensation module. We observe that the embeddings and attention maps at early layers preserve spatial structural information. Inspired by this, we design the early-layer fusion module to generate segmentation maps with better spatial coherence. Afterwards, we employ the fine-grained compensation module to compensate for local details using the self-attention maps of a diffusion model. We conduct experiments on seven segmentation datasets. Our proposed CLIPer achieves state-of-the-art performance on these datasets. For instance, using ViT-L, CLIPer achieves mIoU of 69.8% and 43.3% on VOC and COCO Object, outperforming ProxyCLIP by 9.2% and 4.1%, respectively.
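The abstract mentions replacing the standard query-key attention map in CLIP's last layer with a self-self attention map. A minimal NumPy sketch of that idea is shown below; it is an illustration under assumed shapes (N patch tokens with d-dimensional features), not the actual CLIPer implementation, which operates inside CLIP's multi-head vision transformer.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_self_attention(q, v):
    """Self-self attention: queries attend to other queries (q @ q.T)
    instead of keys, which tends to keep attention spatially coherent.
    Returns the attended values and the attention map itself."""
    d = q.shape[-1]
    attn = softmax(q @ q.T / np.sqrt(d))
    return attn @ v, attn

# toy example: 4 patch tokens with 8-dim features
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
v = rng.standard_normal((4, 8))
out, attn = self_self_attention(q, v)
```

The only change from standard attention is that the similarity matrix is computed between the queries and themselves, so each token attends most strongly to tokens with similar query features, including itself.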
