arxiv:2402.04555

FM-Fusion: Instance-aware Semantic Mapping Boosted by Vision-Language Foundation Models

Published on Feb 7, 2024

Authors:

Abstract

Semantic mapping based on supervised object detectors is sensitive to the image distribution. In real-world environments, object detection and segmentation performance can drop substantially, which prevents the use of semantic mapping in a wider domain. On the other hand, vision-language foundation models demonstrate strong zero-shot transferability across data distributions, which provides an opportunity to construct generalizable instance-aware semantic maps. Hence, this work explores how to boost instance-aware semantic mapping using object detections generated by foundation models. We propose a probabilistic label fusion method to predict closed-set semantic classes from open-set label measurements. An instance refinement module merges the over-segmented instances caused by inconsistent segmentation. We integrate all the modules into a unified semantic mapping system. Reading a sequence of RGB-D input, our system incrementally reconstructs an instance-aware semantic map. We evaluate the zero-shot performance of our method on the ScanNet and SceneNN datasets. Our method achieves 40.3 mean average precision (mAP) on the ScanNet semantic instance segmentation task. It outperforms the traditional semantic mapping method significantly.
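The abstract does not spell out the label fusion model, so the following is only a minimal sketch of one way open-set label measurements could be fused into a closed-set class posterior. It assumes a naive Bayes update over frames, with each measurement's log-likelihood weighted by its detection score. The class list, the `LIKELIHOOD` table and its values, and the `fuse_labels` function are all hypothetical and are not taken from the paper.

```python
import math

# Hypothetical closed-set classes (illustrative only, not the paper's label set).
CLOSED_CLASSES = ["chair", "table", "sofa"]

# Hypothetical likelihood table: P(open-set label | closed-set class).
# In practice such a table would have to be estimated from data.
LIKELIHOOD = {
    "chair": {"chair": 0.70, "armchair": 0.20, "couch": 0.05, "desk": 0.05},
    "table": {"chair": 0.05, "armchair": 0.05, "couch": 0.05, "desk": 0.85},
    "sofa": {"chair": 0.10, "armchair": 0.30, "couch": 0.55, "desk": 0.05},
}


def fuse_labels(observations, eps=1e-6):
    """Fuse (open_label, score) measurements into a closed-set posterior.

    Assumption: measurements are conditionally independent given the class,
    and the detection score in [0, 1] tempers each log-likelihood term.
    """
    # Start from a uniform prior over the closed-set classes.
    log_post = {c: math.log(1.0 / len(CLOSED_CLASSES)) for c in CLOSED_CLASSES}
    for label, score in observations:
        for c in CLOSED_CLASSES:
            # Labels missing from the table get a small floor probability.
            lik = LIKELIHOOD[c].get(label, eps)
            log_post[c] += score * math.log(lik)
    # Normalize in log space for numerical stability.
    peak = max(log_post.values())
    unnorm = {c: math.exp(v - peak) for c, v in log_post.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}


if __name__ == "__main__":
    # Three frames observing the same instance with different open-set labels.
    frames = [("couch", 0.9), ("armchair", 0.6), ("couch", 0.8)]
    posterior = fuse_labels(frames)
    print(max(posterior, key=posterior.get), posterior)
```

Working in log space keeps the update stable over long RGB-D sequences, and the score weighting lets low-confidence detections influence the posterior less than confident ones.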
