CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation
Abstract
Contrastive Language-Image Pre-training (CLIP) exhibits strong zero-shot classification ability on various image-level tasks, which has motivated research on adapting CLIP to pixel-level open-vocabulary semantic segmentation without additional training. The key is to improve the spatial representation of image-level CLIP, for example by replacing the self-attention map at the last layer with a self-self attention map or a vision-foundation-model-based attention map. In this paper, we present a novel hierarchical framework, named CLIPer, that hierarchically improves the spatial representation of CLIP. The proposed CLIPer includes an early-layer fusion module and a fine-grained compensation module. We observe that the embeddings and attention maps at early layers preserve spatial structural information. Inspired by this, we design the early-layer fusion module to generate segmentation maps with better spatial coherence. Afterwards, we employ the fine-grained compensation module to compensate for local details using the self-attention maps of a diffusion model. We conduct experiments on seven segmentation datasets, where our proposed CLIPer achieves state-of-the-art performance. For instance, using ViT-L, CLIPer attains mIoU of 69.8% and 43.3% on VOC and COCO Object, outperforming ProxyCLIP by 9.2% and 4.1%, respectively.
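The abstract's early-layer fusion idea can be illustrated with a minimal sketch: average the attention maps of the early ViT layers (which are said to retain spatial structure), use the fused map to re-aggregate patch embeddings, and score the result against text embeddings. All tensor names, shapes, and the simple averaging scheme below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the early-layer fusion idea; shapes and names are assumptions.
import torch


def early_layer_fusion(attn_maps, patch_embeds, text_embeds, num_early_layers=6):
    """Fuse attention maps from early ViT layers to re-aggregate patch embeddings,
    then score each patch against text embeddings.

    attn_maps:    (L, N, N)  per-layer attention over N patch tokens (heads averaged)
    patch_embeds: (N, D)     patch embeddings projected into CLIP's joint space
    text_embeds:  (C, D)     text embeddings for C class prompts
    returns:      (N, C)     per-patch class logits (reshape to H x W x C for a mask)
    """
    # Average attention over the early layers, assumed to keep more spatial structure.
    fused_attn = attn_maps[:num_early_layers].mean(dim=0)           # (N, N)
    fused_attn = fused_attn / fused_attn.sum(dim=-1, keepdim=True)  # row-normalize

    # Re-aggregate patch embeddings with the fused, spatially coherent attention.
    fused_embeds = fused_attn @ patch_embeds                        # (N, D)

    # Cosine similarity against text embeddings gives coarse per-patch logits.
    fused_embeds = torch.nn.functional.normalize(fused_embeds, dim=-1)
    text_embeds = torch.nn.functional.normalize(text_embeds, dim=-1)
    return fused_embeds @ text_embeds.T                             # (N, C)


if __name__ == "__main__":
    L, N, D, C = 12, 196, 512, 5  # toy sizes: 14x14 patches, 5 classes
    logits = early_layer_fusion(
        torch.rand(L, N, N), torch.randn(N, D), torch.randn(C, D)
    )
    print(logits.shape)  # torch.Size([196, 5])
```

In the full method, the coarse logits produced this way would then be refined by the fine-grained compensation module using diffusion-model self-attention maps; that step is not shown here.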