arxiv:2111.07832

iBOT: Image BERT Pre-Training with Online Tokenizer

Published on Nov 15, 2021

Abstract

The success of language Transformers is primarily attributed to the pretext task of masked language modeling (MLM), where texts are first tokenized into semantically meaningful pieces. In this work, we study masked image modeling (MIM) and indicate the advantages and challenges of using a semantically meaningful visual tokenizer. We present a self-supervised framework iBOT that can perform masked prediction with an online tokenizer. Specifically, we perform self-distillation on masked patch tokens and take the teacher network as the online tokenizer, along with self-distillation on the class token to acquire visual semantics. The online tokenizer is jointly learnable with the MIM objective and dispenses with a multi-stage training pipeline where the tokenizer needs to be pre-trained beforehand. We show the prominence of iBOT by achieving an 82.3% linear probing accuracy and an 87.8% fine-tuning accuracy evaluated on ImageNet-1K. Beyond the state-of-the-art image classification results, we underline emerging local semantic patterns, which help the models to obtain strong robustness against common corruptions and achieve leading results on dense downstream tasks, e.g., object detection, instance segmentation, and semantic segmentation.
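As a rough formal sketch of the objective described above (the notation here is a paraphrase, not quoted from the paper): with a student network $P_\theta$ and an exponential-moving-average teacher $P_{\theta'}$ serving as the online tokenizer, the masked-prediction term is a cross-entropy between the two output distributions, summed over the masked patch positions:

$$
\mathcal{L}_{\mathrm{MIM}} = -\sum_{i=1}^{N} m_i \, P_{\theta'}(u_i)^{\top} \log P_{\theta}(\hat{u}_i),
$$

where $m_i \in \{0, 1\}$ marks whether patch $i$ is masked, $u_i$ is the clean patch token seen by the teacher, and $\hat{u}_i$ is its masked counterpart seen by the student; an analogous cross-entropy on the class token across two augmented views supplies the visual semantics.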

Community

Proposes iBOT (Image BERT Pre-Training with Online Tokenizer): studies masked image modeling (MIM) with an online tokenizer. Performs self-distillation (knowledge distillation) on masked patch tokens, with the teacher network acting as the online tokenizer; self-distillation on the CLS token acquires visual semantics. The tokenizer (teacher) sees the whole image while the student sees the masked image, and the student has to predict the tokens of the masked patches through distillation from the tokenizer. Uses a cross-entropy loss for distillation (student inside the log term, teacher outside).

Two augmented views are generated from the same underlying image and block-wise masking is applied. MIM loss: the student gets the masked patches of a view and the teacher gets the unmasked (all-patches) version of the same view; the student has to mimic the teacher's output probability distribution; applied symmetrically to both views. CLS loss (the CLS token gives a classification-like summary of global context): distillation between the student's CLS token of a masked view and the teacher's CLS token of the other view (cross-view); a minimal sketch of both losses is given below. Tested with ViT (S, B, L) and Swin-T; the projection head (for CLS and patch tokens) is a shared 3-layer MLP; sharing the head lets the MIM and CLS objectives benefit each other better.

Better performance on ImageNet-1K (linear probing and kNN on frozen features); fine-tuning results on ImageNet-1K (also with pre-training on ImageNet-22K). Better than DINO, SimCLR, SwAV, and BYOL for self-supervised learning on ImageNet-1K; better than DINO for unsupervised learning on ImageNet-1K. Better on the downstream tasks of object detection and instance segmentation on COCO (compared to MoBY Swin-T and supervised baselines) and semantic segmentation on ADE20K (compared to DINO and BEiT). The ViT learns more semantic patterns (patch tokens carry semantic and class information); self-attention of the CLS token across different heads shows that each head learns distinct features. Appendix has algorithm pseudocode, multi-crop experiments, additional implementation details, results, ablations, and visualizations (along with sparse contextual matching). From ByteDance, Johns Hopkins, SJTU, UC Santa Cruz.
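A minimal PyTorch sketch of the two self-distillation losses summarized above, under stated assumptions: the function and variable names (`distill_ce`, `ibot_losses`), the temperatures, and the projection dimension are illustrative, not the official iBOT code; the EMA teacher update, teacher centering, and the shared projection head are omitted, and the inputs are assumed to be already-projected logits.

```python
# Sketch of iBOT-style self-distillation losses (illustrative, not the official implementation).
import torch
import torch.nn.functional as F


def distill_ce(teacher_logits, student_logits, teacher_temp=0.04, student_temp=0.1):
    """Cross-entropy between teacher and student distributions:
    teacher outside the log, student inside (temperatures are illustrative)."""
    t = F.softmax(teacher_logits / teacher_temp, dim=-1).detach()  # stop-gradient on teacher
    log_s = F.log_softmax(student_logits / student_temp, dim=-1)
    return -(t * log_s).sum(dim=-1)


def ibot_losses(cls_s1, cls_s2, cls_t1, cls_t2,
                patch_s1, patch_s2, patch_t1, patch_t2,
                mask1, mask2):
    """cls_*  : [B, K]    projected CLS tokens (student/teacher, view 1/2)
       patch_*: [B, N, K] projected patch tokens
       mask*  : [B, N]    boolean block-wise masks (True = patch masked for the student)"""
    m1, m2 = mask1.float(), mask2.float()

    # MIM loss: student sees the masked view, teacher sees the full view of the same image;
    # averaged over masked positions only, symmetrically over both views.
    mim1 = (distill_ce(patch_t1, patch_s1) * m1).sum() / m1.sum().clamp(min=1.0)
    mim2 = (distill_ce(patch_t2, patch_s2) * m2).sum() / m2.sum().clamp(min=1.0)
    loss_mim = 0.5 * (mim1 + mim2)

    # CLS loss: cross-view distillation (student view 1 vs. teacher view 2, and vice versa).
    loss_cls = 0.5 * (distill_ce(cls_t2, cls_s1).mean() + distill_ce(cls_t1, cls_s2).mean())

    return loss_mim + loss_cls


if __name__ == "__main__":
    B, N, K = 4, 196, 8192              # batch, patches per view, projection dim (illustrative)
    rand = lambda *s: torch.randn(*s)
    mask = torch.rand(B, N) < 0.3       # stand-in for block-wise masking
    loss = ibot_losses(rand(B, K), rand(B, K), rand(B, K), rand(B, K),
                       rand(B, N, K), rand(B, N, K), rand(B, N, K), rand(B, N, K),
                       mask, mask)
    print(loss.item())
```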

Links: GitHub, Colab
