arxiv:2206.10996

ProtoCLIP: Prototypical Contrastive Language Image Pretraining

Published on Jun 22, 2022

Abstract

Contrastive Language Image Pretraining (CLIP) has received widespread attention, since its learned representations can be transferred well to various downstream tasks. During the training process of the CLIP model, the InfoNCE objective aligns positive image-text pairs and separates negative ones. We show an underlying representation grouping effect during this process: the InfoNCE objective indirectly groups semantically similar representations together via randomly emerging within-modal anchors. Based on this understanding, in this paper, Prototypical Contrastive Language Image Pretraining (ProtoCLIP) is introduced to enhance such grouping by boosting its efficiency and increasing its robustness against the modality gap. Specifically, ProtoCLIP sets up prototype-level discrimination between image and text spaces, which efficiently transfers higher-level structural knowledge. Further, Prototypical Back Translation (PBT) is proposed to decouple representation grouping from representation alignment, resulting in effective learning of meaningful representations under a large modality gap. PBT also enables us to introduce additional external teachers with richer prior language knowledge. ProtoCLIP is trained with an online episodic training strategy, which allows it to be scaled up to unlimited amounts of data. We train ProtoCLIP on Conceptual Captions and achieve a +5.81% ImageNet linear probing improvement and a +2.01% ImageNet zero-shot classification improvement. On the larger YFCC-15M dataset, ProtoCLIP matches the performance of CLIP with 33% of the training time. Code is available at https://github.com/megvii-research/protoclip.
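
For intuition, below is a minimal sketch of the general idea of prototype-level discrimination between modalities: text embeddings are clustered into prototypes, and image embeddings are trained to predict the prototype assignment of their paired captions. This is not the authors' implementation (see the linked repository for that); the use of k-means, the temperature value, the number of prototypes, and all function names here are illustrative assumptions.

```python
# Minimal illustrative sketch of prototype-level cross-modal discrimination.
# NOT the authors' implementation; k-means prototypes, the 0.1 temperature,
# and all names here are assumptions made for illustration only.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans


def estimate_text_prototypes(text_embeds: torch.Tensor, num_prototypes: int = 512):
    """Cluster L2-normalized text embeddings into prototypes (k-means assumed)."""
    feats = F.normalize(text_embeds, dim=-1).detach().cpu().numpy()
    km = KMeans(n_clusters=num_prototypes, n_init=10).fit(feats)
    prototypes = torch.tensor(km.cluster_centers_, dtype=text_embeds.dtype)
    labels = torch.tensor(km.labels_, dtype=torch.long)
    return F.normalize(prototypes, dim=-1), labels


def prototype_discrimination_loss(image_embeds: torch.Tensor,
                                  prototypes: torch.Tensor,
                                  labels: torch.Tensor,
                                  temperature: float = 0.1) -> torch.Tensor:
    """Cross-entropy between image-to-prototype similarities and the
    prototype assignment of each image's paired caption."""
    prototypes = prototypes.to(image_embeds.device)
    labels = labels.to(image_embeds.device)
    logits = F.normalize(image_embeds, dim=-1) @ prototypes.t() / temperature
    return F.cross_entropy(logits, labels)
```

As described in the abstract, Prototypical Back Translation then decouples this grouping from the cross-modal alignment term; that step is not covered by the sketch above.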
