arxiv:2308.00090

Visual Geo-localization with Self-supervised Representation Learning

Published on Jul 31, 2023

Abstract

Visual Geo-localization (VG) has emerged as a significant research area, aiming to identify geolocation based on visual features. Most VG approaches use learnable feature extractors for representation learning. Recently, Self-Supervised Learning (SSL) methods have also demonstrated performance comparable to supervised methods by using numerous unlabeled images for representation learning. In this work, we present a novel unified VG-SSL framework with the goal of enhancing the performance and training efficiency of SSL methods on a large VG dataset. Our work incorporates multiple SSL methods tailored for VG: SimCLR, MoCov2, BYOL, SimSiam, Barlow Twins, and VICReg. We systematically analyze the performance of different training strategies and study the optimal parameter settings for the adaptation of SSL methods for the VG task. The results demonstrate that our method, without the significant computation and memory usage associated with Hard Negative Mining (HNM), can match or even surpass the VG performance of the baseline that employs HNM. The code is available at https://github.com/arplaboratory/VG_SSL.
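The efficiency claim hinges on how training pairs are formed: instead of mining hard negatives, the framework only needs query-positive pairs plus randomly sampled negatives. A minimal sketch of that idea follows, assuming GPS-verified positives are precomputed; the names `build_training_pairs`, `positives_of`, and `neg_ratio` are hypothetical, not the paper's API.

```python
import random


def build_training_pairs(queries, positives_of, database, neg_ratio=1.0):
    """Sketch of HNM-free pair formation for a Siamese SSL loss.

    queries:      list of query image ids
    positives_of: dict mapping query id -> list of geographically close db ids
    database:     list of all database image ids
    neg_ratio:    database negative ratio, i.e. the fraction of the number of
                  queries sampled as negatives per epoch
    """
    # Query-positive pairs: each query with one GPS-verified positive.
    pairs = [(q, random.choice(positives_of[q])) for q in queries]

    # Randomly sampled negatives: any database image that is not a positive
    # for the query, paired with itself ("identical negative" pair).
    n_neg = min(int(neg_ratio * len(queries)), len(queries))
    for q in random.sample(queries, n_neg):
        negatives = [d for d in database if d not in positives_of[q]]
        pairs.append((random.choice(negatives),) * 2)
    return pairs
```

Random sampling avoids the memory and compute cost of caching embeddings for mining, at the price of occasionally drawing uninformative negatives.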

Community

Proposes VG-SSL (Visual Geo-localization with Self-Supervised Learning): adapts a range of SSL methods (SimCLR, MoCo v2, BYOL, SimSiam, Barlow Twins, and VICReg) to VG and shows they can match the performance of supervised methods without memory-heavy hard negative mining (HNM); SSL only requires selecting positive samples.

Summary of the methods: SimCLR and MoCo v2 use contrastive learning with the InfoNCE loss (MoCo adds a momentum encoder, ME). BYOL and SimSiam are self-distillation methods with a stop-gradient (SG) on the target branch, a predictor head (PR) on the online branch, and batch norm (BN) in the projector or predictor (BYOL also uses an ME), trained with an embedding-prediction loss. Barlow Twins and VICReg do information maximization with cross-correlation and variance-invariance-covariance (VIC) regularization losses, respectively (both use BN with large-dimensional projector embeddings, LP).

Pipeline: given a query and a database, retrieve positives and negatives from the database, group them into query-positive pairs and identical-negative pairs, pass them through a trainable feature (embedding) extractor, and apply the SSL loss. The losses: InfoNCE pulls positives together and pushes all other embeddings apart; the embedding-prediction loss makes the student branch (with a shallow MLP predictor) match the non-trainable stop-gradient teacher; the Barlow Twins loss builds a cross-correlation (CC) matrix from positive pairs and drives the diagonal toward 1 and the off-diagonal toward 0; the VICReg loss combines invariance, variance, and covariance terms, and its embeddings are not batch-normalized before the variance term, since normalization would fix the variance and make the term vacuous. (A minimal sketch of these losses follows below.)

Because mining can miss informative negative samples, negatives are instead drawn by random sampling (any database image that is not a positive for the query), controlled by a database negative ratio (the fraction of the number of queries sampled per epoch); query-positive pairs and identical-negative pairs are formed for sampling at the same ratio.

Uses ResNet-50 as the local feature extractor and NetVLAD as the global feature aggregator, trying out the different SSL losses. MoCo v2, BYOL, SimCLR, and Barlow Twins perform better than SimSiam and VICReg. Extended ablations in the appendix. From NYU.
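The three loss families above map to short PyTorch sketches. This is a reading aid, not the paper's code: the temperature, loss weights, and the simplified one-positive-per-row InfoNCE (SimCLR contrasts all 2N views) are assumptions.

```python
import torch
import torch.nn.functional as F


def info_nce_loss(z1, z2, temperature=0.1):
    """Contrastive loss (SimCLR/MoCo family): pull each query-positive
    pair together, push it apart from every other embedding in the batch.
    z1, z2: (N, D) embeddings of the two pair members."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                    # (N, N) similarities
    labels = torch.arange(z1.size(0), device=z1.device)   # positives on diagonal
    return F.cross_entropy(logits, labels)


def barlow_twins_loss(z1, z2, lambd=5e-3):
    """Information maximization (Barlow Twins): batch-normalize embeddings,
    build the (D, D) cross-correlation matrix, drive the diagonal toward 1
    (invariance) and the off-diagonal toward 0 (redundancy reduction)."""
    n = z1.size(0)
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.t() @ z2) / n
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lambd * off_diag


def vicreg_loss(z1, z2, sim_w=25.0, var_w=25.0, cov_w=1.0):
    """VICReg: invariance (MSE between pair embeddings) + variance hinge
    (keep per-dimension std above 1) + covariance (decorrelate dimensions).
    Note the embeddings are NOT batch-normalized here; normalization would
    satisfy the variance term trivially."""
    n, d = z1.shape
    inv = F.mse_loss(z1, z2)
    std1 = torch.sqrt(z1.var(dim=0) + 1e-4)
    std2 = torch.sqrt(z2.var(dim=0) + 1e-4)
    var = F.relu(1 - std1).mean() + F.relu(1 - std2).mean()
    z1c, z2c = z1 - z1.mean(0), z2 - z2.mean(0)
    cov1 = (z1c.t() @ z1c) / (n - 1)
    cov2 = (z2c.t() @ z2c) / (n - 1)
    off = lambda c: (c - torch.diag(torch.diagonal(c))).pow(2).sum() / d
    return sim_w * inv + var_w * var + cov_w * off(cov1) + cov_w * off(cov2)
```

With a ResNet-50 + NetVLAD embedding of each pair member, any of these can be dropped in as the training loss, which is essentially the comparison the paper runs.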

Links: PapersWithCode, GitHub

The GitHub link is invalid. Could you update it?

Hey @QianC95,
I'm unable to find the code implementation of this paper. It looks like the authors still haven't released the code.

