arxiv:2307.05845

PIGEON: Predicting Image Geolocations

Published on Jul 11, 2023

Abstract

Planet-scale image geolocalization remains a challenging problem due to the diversity of images originating from anywhere in the world. Although approaches based on vision transformers have made significant progress in geolocalization accuracy, success in prior literature is constrained to narrow distributions of images of landmarks, and performance has not generalized to unseen places. We present a new geolocalization system that combines semantic geocell creation, multi-task contrastive pretraining, and a novel loss function. Additionally, our work is the first to perform retrieval over location clusters for guess refinements. We train two models for evaluations on street-level data and general-purpose image geolocalization; the first model, PIGEON, is trained on data from the game of Geoguessr and is capable of placing over 40% of its guesses within 25 kilometers of the target location globally. We also develop a bot and deploy PIGEON in a blind experiment against humans, ranking in the top 0.01% of players. We further challenge one of the world's foremost professional Geoguessr players to a series of six matches with millions of viewers, winning all six games. Our second model, PIGEOTTO, differs in that it is trained on a dataset of images from Flickr and Wikipedia, achieving state-of-the-art results on a wide range of image geolocalization benchmarks, outperforming the previous SOTA by up to 7.7 percentage points on the city accuracy level and up to 38.8 percentage points on the country level. Our findings suggest that PIGEOTTO is the first image geolocalization model that effectively generalizes to unseen places and that our approach can pave the way for highly accurate, planet-scale image geolocalization systems. Our code is available on GitHub.

Community

Proposes PIGEON and PIGEOTTO for street-level and general-purpose (planet-scale) image geolocalization (VPR), end to end. PIGEON is trained on imagery from the game Geoguessr; PIGEOTTO is trained on Flickr and Wikipedia images.

- Geocell creation: either naive geocells, obtained by subdividing the planet into balanced rectangular sections, or semantic geocells built from GADM hierarchical administrative maps (country, administrative, granular), clustered with OPTICS and made contiguous via Voronoi tessellation.
- Haversine-smoothed loss: the one-hot geocell classification target is smoothed using the haversine distance (the great-circle distance between two latitude-longitude pairs on the Earth's surface); each geocell's smoothed weight is combined with the model's predicted probability for that geocell to produce the per-sample loss. The intuition is that neighboring geocells are likely to be visually similar, so they should receive partial credit.
- Architecture: CLIP's ViT-L/14 (336px) vision encoder with a linear layer on top to predict geocells. For PIGEON, which uses four-view image panoramas, the four image embeddings are averaged.
- Pretraining: CLIP training is continued in a multi-task fashion on synthetic geographic captions generated from metadata (location, climate, compass direction, season/month, and traffic). Auxiliary losses cover location (based on weighted haversine distances), climate (cross-entropy over 28 Köppen-Geiger climate zones), month (cross-entropy), and MSE regression over temperature, precipitation, elevation, and population density.
- Guess refinement: each geocell is clustered further with OPTICS; at inference, the cluster whose average image embedding (from the CLIP encoder with the linear layer) is closest to the query is predicted, and the best location within that cluster is selected by Euclidean embedding distance.
- Results: PIGEOTTO, trained on Flickr and Google Landmarks v2 (derived from Wikipedia), outperforms GeoDecoder, Translocator, ISNs, CPlaNet, and PlaNet on IM2GPS3k, YFCC4k (and 26k), and GWS15k (the most challenging), and is second to GeoDecoder on IM2GPS.
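The haversine-smoothed label idea above can be sketched as follows. This is a minimal illustration, not the paper's exact loss: the `tau` temperature and the exponential decay form are assumptions chosen for clarity, and `geocell_centroids` is a hypothetical list of geocell center coordinates.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in kilometers.
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def smoothed_labels(target, geocell_centroids, tau=75.0):
    # Replace the one-hot geocell target with weights that decay with
    # haversine distance from the true location, so nearby geocells get
    # partial credit. `tau` (km) is an assumed smoothing temperature.
    weights = [math.exp(-haversine_km(target[0], target[1], lat, lon) / tau)
               for lat, lon in geocell_centroids]
    total = sum(weights)
    return [w / total for w in weights]
```

Training would then use a cross-entropy between the model's predicted geocell distribution and these smoothed weights instead of the one-hot label.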
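The panorama averaging and within-geocell refinement steps can be sketched like this. A simplified sketch under stated assumptions: `clusters` is a hypothetical mapping from cluster id to `(embedding, (lat, lon))` pairs inside the predicted geocell, and plain Euclidean distance stands in for whatever similarity the released code uses.

```python
import numpy as np

def panorama_embedding(view_embeddings):
    # PIGEON embeds each of the four panorama views with CLIP's vision
    # encoder; here we average the per-view embeddings into a single
    # normalized image representation.
    emb = np.mean(np.asarray(view_embeddings, dtype=np.float64), axis=0)
    return emb / np.linalg.norm(emb)

def refine_within_geocell(query_emb, clusters):
    # Pick the cluster whose mean embedding is closest to the query,
    # then return the location inside it with the smallest Euclidean
    # embedding distance to the query.
    best_cluster = min(
        clusters,
        key=lambda cid: np.linalg.norm(
            query_emb - np.mean([e for e, _ in clusters[cid]], axis=0)))
    _, best_loc = min(clusters[best_cluster],
                      key=lambda pair: np.linalg.norm(query_emb - pair[0]))
    return best_loc
```

The two-stage selection (cluster first, then point within cluster) mirrors the retrieval-over-location-clusters refinement described in the paper, though the distance details here are illustrative.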
Appendix A has semantic geocell creation details. Appendix B has implementation details: hyperparameters for pretraining CLIP's vision encoder for image geolocalization and for fine-tuning via the linear projection layer. Appendix C has data sources; D has ablations on non-distance metrics; E has additional analysis; F has deployment details for Geoguessr, the Chrome extension, and the FastAPI endpoints for inference and statistics. From Stanford (Chelsea Finn).

Links: PapersWithCode, GitHub, Related works (GeoDecoder, Translocator, PlaNet), GADM Database
