Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction

Abstract

Modern methods for vision-centric autonomous driving perception widely adopt the bird's-eye-view (BEV) representation to describe a 3D scene. Despite its better efficiency than voxel representation, it has difficulty describing the fine-grained 3D structure of a scene with a single plane. To address this, we propose a tri-perspective view (TPV) representation which accompanies BEV with two additional perpendicular planes. We model each point in the 3D space by summing its projected features on the three planes. To lift image features to the 3D TPV space, we further propose a transformer-based TPV encoder (TPVFormer) to obtain the TPV features effectively. We employ the attention mechanism to aggregate the image features corresponding to each query in each TPV plane. Experiments show that our model trained with sparse supervision effectively predicts the semantic occupancy for all voxels. We demonstrate for the first time that using only camera inputs can achieve comparable performance with LiDAR-based methods on the LiDAR segmentation task on nuScenes. Code: https://github.com/wzzheng/TPVFormer.
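The point-feature rule described in the abstract (a 3D point's feature is the sum of its projected features on the three planes) can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the plane names `hw`, `dh`, `wd`, the index order within each plane, and the helper names `make_plane` and `tpv_point_feature` are assumptions made for this example, and it looks up the nearest grid cell by integer index rather than interpolating.

```python
from typing import List

# plane[a][b] is a feature vector of length C (hypothetical layout for this sketch)
Plane = List[List[List[float]]]


def make_plane(rows: int, cols: int, channels: int, value: float) -> Plane:
    """Build a constant-valued plane; a stand-in for learned TPV features."""
    return [[[value] * channels for _ in range(cols)] for _ in range(rows)]


def tpv_point_feature(hw: Plane, dh: Plane, wd: Plane,
                      h: int, w: int, d: int) -> List[float]:
    """Sum the features at the point's projection on each of the three planes."""
    f_hw = hw[h][w]  # projection onto the top (BEV) plane
    f_dh = dh[d][h]  # projection onto the first perpendicular plane
    f_wd = wd[w][d]  # projection onto the second perpendicular plane
    return [a + b + c for a, b, c in zip(f_hw, f_dh, f_wd)]


# Toy grid with H=2, W=3, D=4 and C=2 channels.
hw = make_plane(2, 3, 2, 1.0)
dh = make_plane(4, 2, 2, 10.0)
wd = make_plane(3, 4, 2, 100.0)
feature = tpv_point_feature(hw, dh, wd, h=1, w=2, d=3)
print(feature)  # [111.0, 111.0]
```

Storing three planes costs on the order of HW + DH + WD cells, compared with HWD cells for a dense voxel grid, while every voxel can still be assigned a distinct feature.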

Introduction

We implement TPVFormer and provide the results and checkpoints on the nuScenes dataset.

Usage

Training commands

In MMDetection3D's root directory, take the following steps to train the model:

First, download the pretrained backbone weights to checkpoints/.
Then, for example, to train TPVFormer on 8 GPUs, run:

bash tools/dist_train.sh projects/TPVFormer/config/tpvformer_8xb1-2x_nus-seg.py 8

Testing commands

In MMDetection3D's root directory, run the following command to test the model on 8 GPUs:

bash tools/dist_test.sh projects/TPVFormer/config/tpvformer_8xb1-2x_nus-seg.py ${CHECKPOINT_PATH} 8

Results and models

nuScenes

Backbone	Neck	Memory (GB)	Inference speed (fps)	mIoU	Downloads
ResNet101 w/ DCN	FPN	32.0	-	68.9	model | log
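For readers unfamiliar with the mIoU column, the metric is the mean over classes of the intersection-over-union, TP / (TP + FP + FN). The sketch below is a minimal illustration of that definition only; `mean_iou` is a hypothetical helper written for this example, and the evaluation code used to produce the table may differ (for instance in how ignored labels are handled or how counts are accumulated over the dataset).

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Average the per-class IoU = TP / (TP + FP + FN) over classes that occur."""
    ious = []
    for c in range(num_classes):
        tp = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, target) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, target) if p != c and t == c)
        union = tp + fp + fn
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


# Four points, two classes: class 0 has IoU 1/2, class 1 has IoU 2/3.
score = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
```

Here `score` is a fraction in [0, 1]; the table presumably reports the same quantity as a percentage.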

Citation

@article{huang2023tri,
    title={Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction},
    author={Huang, Yuanhui and Zheng, Wenzhao and Zhang, Yunpeng and Zhou, Jie and Lu, Jiwen},
    journal={arXiv preprint arXiv:2302.07817},
    year={2023}
}