arxiv:2109.14279

Localizing Objects with Self-Supervised Transformers and no Labels

Published on Sep 29, 2021

Abstract

Localizing objects in image collections without supervision can help to avoid expensive annotation campaigns. We propose a simple approach to this problem, that leverages the activation features of a vision transformer pre-trained in a self-supervised manner. Our method, LOST, does not require any external object proposal nor any exploration of the image collection; it operates on a single image. Yet, we outperform state-of-the-art object discovery methods by up to 8 CorLoc points on PASCAL VOC 2012. We also show that training a class-agnostic detector on the discovered objects boosts results by another 7 points. Moreover, we show promising results on the unsupervised object discovery task. The code to reproduce our results can be found at https://github.com/valeoai/LOST.

Community

Proposes LOST (Localizing Objects with Self-supervised Transformers): localizing objects in image collections using the latent activation features of vision transformers (ViTs) trained in a self-supervised (SSL) manner. Object discovery operates on a single image with no annotations, and detection is obtained by clustering, so the entire pipeline is annotation-free.

- Seed selection: uses the key facet of DINO's last attention layer to compute similarities between patches; foreground patches are less correlated with the rest of the image than background patches, so the least similar patch is chosen as the seed. Concretely: extract DINO patch features (keys of the last layer) at all spatial positions, build a patch similarity graph with a binary symmetric adjacency matrix (1 if two patch features have a positive dot product), take the row-wise sum to get each patch's degree (its number of positive correlations), and pick the argmin, i.e. the patch with the fewest connections (see the first sketch below).
- Seed expansion: grow the seed into a set by adding patches whose features are positively correlated with the initial seed features.
- Box extraction: mask out patches in the expansion set that are not positively correlated (positive dot product/cosine similarity) with every other patch in the set, then draw a tight bounding box around the final patch set (see the second sketch below).
- Detector training: LOST boxes can serve as self-supervised annotations to train class-agnostic (CAD) and class-aware (OD) object detectors. To get class labels, each LOST crop is rescaled, passed through DINO, and its CLS-token feature is clustered with k-means; clusters are named from ground-truth class annotations (if available at test time) using the Hungarian algorithm (see the third sketch below).
- Results: outperforms LOD, rOSD, and DINO-seg on single-object discovery (VOC, COCO); the ViT-S/16 backbone performs best. For OD, a Faster R-CNN (with a ResNet backbone pretrained like DINO) trained on LOST boxes achieves better mAP than one trained on rOSD detections. Works best for non-overlapping objects that occupy less area than the background.
- Appendix: ablations on transformer feature selection and the importance of seed expansion, an analysis of DINO-seg, and a k-means clustering study (choice of k and non-determinism); additional results on more datasets; a ViT (DeiT) trained with supervision on ImageNet does not work as well; more qualitative visualizations and training details for the Faster R-CNN OD setup.

From Valeo.ai, Inria, CNRS, NYU.
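
A minimal sketch of the seed-selection step described above, assuming `feats` holds the (num_patches, dim) key features of DINO's last attention layer; the names (`select_seed`, `feats`) are illustrative, not from the official repo:

```python
import numpy as np

def select_seed(feats: np.ndarray) -> int:
    """Return the index of the patch with the fewest positive correlations."""
    sims = feats @ feats.T               # pairwise dot products between patch features
    adj = (sims > 0).astype(np.int32)    # binary symmetric adjacency matrix
    degrees = adj.sum(axis=1)            # row-wise sum: positive correlations per patch
    return int(degrees.argmin())         # least-connected patch, assumed foreground

# toy usage with random features standing in for real DINO ViT-S/16 keys
feats = np.random.randn(14 * 14, 384)    # 14x14 patch grid, 384-dim features
seed = select_seed(feats)
```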
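
A second sketch covering seed expansion and box extraction, reusing `feats` and `seed` from above. This is a simplification: it expands to all patches positively correlated with the seed and boxes the connected component containing the seed, whereas the paper restricts expansion to the lowest-degree patches; the flood fill and the `lost_box` helper are assumptions for illustration:

```python
import numpy as np

def lost_box(feats: np.ndarray, seed: int, grid: tuple) -> tuple:
    """Return (x0, y0, x1, y1) in patch coordinates around the seed's component."""
    h, w = grid
    sims = feats @ feats[seed]           # correlation of every patch with the seed
    mask = (sims > 0).reshape(h, w)      # patches positively correlated with the seed

    # keep only the 4-connected component of the mask that contains the seed
    comp = np.zeros_like(mask)
    stack = [divmod(seed, w)]
    while stack:
        y, x = stack.pop()
        if 0 <= y < h and 0 <= x < w and mask[y, x] and not comp[y, x]:
            comp[y, x] = True
            stack += [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]

    ys, xs = np.nonzero(comp)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

x0, y0, x1, y1 = lost_box(feats, seed, grid=(14, 14))  # multiply by patch size (16) for pixels
```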
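
A third sketch of the class-labelling step for the class-aware detector: k-means on DINO CLS features of the LOST crops, then Hungarian matching of clusters to ground-truth names when those are available at evaluation time. `crop_feats` and `gt_labels` are random placeholders for values you would compute yourself:

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

crop_feats = np.random.randn(500, 384)           # CLS features of rescaled LOST crops
gt_labels = np.random.randint(0, 20, size=500)   # e.g. 20 PASCAL VOC classes

k = 20
clusters = KMeans(n_clusters=k, n_init=10).fit_predict(crop_feats)

# cost[i, j] = negated co-occurrence of cluster i with class j,
# so the optimal assignment maximizes agreement
cost = np.zeros((k, k))
for c, g in zip(clusters, gt_labels):
    cost[c, g] -= 1
rows, cols = linear_sum_assignment(cost)          # Hungarian algorithm
cluster_to_class = dict(zip(rows.tolist(), cols.tolist()))
pseudo_labels = np.array([cluster_to_class[c] for c in clusters])
```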

Links: website, PapersWithCode, GitHub


