arxiv:2203.16258

Image-to-Lidar Self-Supervised Distillation for Autonomous Driving Data

Published on Mar 30, 2022

Abstract

Segmenting and detecting objects in sparse Lidar point clouds are two important tasks in autonomous driving that allow a vehicle to act safely in its 3D environment. The best performing methods in 3D semantic segmentation or object detection rely on a large amount of annotated data. Yet annotating 3D Lidar data for these tasks is tedious and costly. In this context, we propose a self-supervised pre-training method for 3D perception models that is tailored to autonomous driving data. Specifically, we leverage the availability of synchronized and calibrated image and Lidar sensors in autonomous driving setups for distilling self-supervised pre-trained image representations into 3D models. Hence, our method does not require any point cloud or image annotations. The key ingredient of our method is the use of superpixels to pool 3D point features and 2D pixel features in visually similar regions. We then train a 3D network on the self-supervised task of matching these pooled point features with the corresponding pooled image pixel features. The advantages of contrasting regions obtained with superpixels are that: (1) grouping together pixels and points of visually coherent regions leads to a more meaningful contrastive task that produces features well adapted to 3D semantic segmentation and 3D object detection; (2) all the different regions have the same weight in the contrastive loss regardless of the number of 3D points sampled in these regions; (3) it mitigates the noise produced by incorrect matching of points and pixels due to occlusions between the different sensors. Extensive experiments on autonomous driving datasets demonstrate the ability of our image-to-Lidar distillation strategy to produce 3D representations that transfer well on semantic segmentation and object detection tasks.
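For reference, the region-level matching described above amounts to an InfoNCE-style contrastive loss over superpixels. A minimal sketch, with notation chosen here rather than quoted from the paper: $\mathcal{S}$ is the set of superpixels in a batch, $f_s$ and $g_s$ are the pooled-and-projected 3D and 2D features of superpixel $s$, and $\tau$ is a temperature:

$$
\mathcal{L} = -\frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \log \frac{\exp\left(\langle f_s, g_s \rangle / \tau\right)}{\sum_{s' \in \mathcal{S}} \exp\left(\langle f_s, g_{s'} \rangle / \tau\right)}
$$

Because the loss is written per superpixel rather than per point, every region contributes equally regardless of how many Lidar points fall inside it (advantage (2) above).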

Community

Introduces SLidR (Superpixel-driven LiDAR Representations), a self-supervised image-to-LiDAR distillation method:

- Given synchronized and calibrated camera and LiDAR sensors, superpixels are used to pool 2D pixel features and 3D point features over visually similar regions; a 3D network is then trained in a self-supervised (SSL), contrastive-learning (CL) fashion to match the pooled point features with the corresponding pooled pixel features. The resulting model is used for downstream segmentation and object detection.
- This is knowledge distillation (KD) from a 2D pre-trained teacher into a 3D student network. LiDAR and cameras have known extrinsics, so each 3D point can be projected onto a pixel of each camera, or onto none if it falls outside the view (a minimal projection sketch is given after this list).
- An InfoNCE contrastive loss is applied between the 3D point encoder and a pre-trained image backbone, a frozen modified ResNet-50 pre-trained with MoCo v2. Trainable projection heads embed both modalities in the same latent space: a linear layer for the LiDAR/3D features and a 1x1 convolution followed by upsampling for the image features.
- Superpixels (disjoint pixel groups that together cover the image) are computed with SLIC; each camera pixel and each projected LiDAR point is assigned to its superpixel, and the contrastive loss is taken across all scenes in the batch (see the pooling/loss sketch below).
- The 3D encoder is a sparse residual UNet with 3D sparse convolutions, taking voxels in cylindrical coordinates as input.
- Pre-trained on nuScenes with point-cloud augmentations (random flip, random rotation about the Z axis, random point dropping) and image augmentations (crops and horizontal flips); an augmentation sketch is also given below.
- Better semantic segmentation results on nuScenes and SemanticKITTI than an adapted/re-implemented PPKT (similar work, but contrasting pixels to points directly and using strided rather than dilated convolutions in the image backbone), PointContrast, and DepthContrast; evaluated both with linear probing (a linear classification head appended to the frozen backbone) and with fine-tuning of the entire model on small amounts of labelled data.
- Also better object detection mAP than PPKT on KITTI.
- Supplementary material includes further results (visualisations and ablations on design choices), data augmentations, baseline implementation details, and per-class segmentation performance.
- From Valeo.ai and LIGM (CNRS, France).
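A minimal sketch of the point-to-pixel projection assumed above, given known extrinsics and intrinsics; the matrix and image-size arguments are placeholders, not values from the paper or the nuScenes devkit:

```python
import numpy as np

def project_points_to_image(points_lidar, T_cam_from_lidar, K, img_h, img_w):
    """Project (N, 3) LiDAR points into one camera; return pixel coords and a validity mask.

    Points behind the camera or outside the image get mask = False
    (the "null" projection mentioned above).
    """
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])  # (N, 4) homogeneous
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]                     # rigid transform into camera frame

    in_front = pts_cam[:, 2] > 0.1                                      # keep points in front of the camera
    uv = (K @ pts_cam.T).T                                              # perspective projection
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)

    inside = (uv[:, 0] >= 0) & (uv[:, 0] < img_w) & (uv[:, 1] >= 0) & (uv[:, 1] < img_h)
    return uv.astype(np.int64), in_front & inside
```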
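A sketch of the superpixel pooling and InfoNCE loss, assuming point-to-superpixel and pixel-to-superpixel assignments are already available; function and argument names here are illustrative and do not come from the released SLidR code:

```python
import torch
import torch.nn.functional as F

def superpixel_infonce(point_feats, point_sp, pixel_feats, pixel_sp, n_sp, tau=0.1):
    """Contrast superpixel-pooled 3D features against superpixel-pooled 2D features.

    point_feats: (N, D) projected 3D point features, point_sp: (N,) superpixel id per point
    pixel_feats: (M, D) projected 2D pixel features, pixel_sp: (M,) superpixel id per pixel
    n_sp: number of superpixels (in practice, only superpixels containing points are kept)
    """
    d = point_feats.shape[1]

    def pool(feats, sp_ids):
        # Average-pool features inside each superpixel (scatter-mean via index_add_).
        summed = feats.new_zeros(n_sp, d).index_add_(0, sp_ids, feats)
        counts = feats.new_zeros(n_sp).index_add_(
            0, sp_ids, torch.ones(len(sp_ids), dtype=feats.dtype, device=feats.device))
        return summed / counts.clamp(min=1).unsqueeze(1)

    f3d = F.normalize(pool(point_feats, point_sp), dim=1)  # (S, D) pooled point features
    f2d = F.normalize(pool(pixel_feats, pixel_sp), dim=1)  # (S, D) pooled pixel features

    # InfoNCE: the i-th pooled 3D feature should match the i-th pooled 2D feature.
    logits = f3d @ f2d.t() / tau
    targets = torch.arange(n_sp, device=logits.device)
    return F.cross_entropy(logits, targets)

# The superpixels themselves can be computed with SLIC, e.g.
#   from skimage.segmentation import slic
#   sp_map = slic(image_np, n_segments=150, compactness=10)   # (H, W) superpixel ids
```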
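A sketch of the kind of point-cloud augmentations listed above (random flip, rotation about the Z axis, random point dropping); the probabilities and ranges are illustrative, not the paper's settings:

```python
import numpy as np

def augment_point_cloud(points, rng=None):
    """points: (N, 3) xyz coordinates. Returns an augmented copy."""
    rng = np.random.default_rng() if rng is None else rng
    pts = points.copy()

    # Random flips of the x and/or y axis.
    if rng.random() < 0.5:
        pts[:, 0] = -pts[:, 0]
    if rng.random() < 0.5:
        pts[:, 1] = -pts[:, 1]

    # Random rotation about the vertical (Z) axis.
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot_z = np.array([[c, -s, 0.0],
                      [s,  c, 0.0],
                      [0.0, 0.0, 1.0]])
    pts = pts @ rot_z.T

    # Randomly drop a fraction of the points.
    keep = rng.random(len(pts)) > 0.1
    return pts[keep]
```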

Links: GitHub, PapersWithCode
