arxiv:1910.02527

3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera

Published on Oct 6, 2019
Authors: Iro Armeni, Zhi-Yang He, JunYoung Gwak, Amir R. Zamir, Martin Fischer, Jitendra Malik, Silvio Savarese
Abstract

A comprehensive semantic understanding of a scene is important for many applications - but in what space should diverse semantic information (e.g., objects, scene categories, material types, texture, etc.) be grounded and what should be its structure? Aspiring to have one unified structure that hosts diverse types of semantics, we follow the Scene Graph paradigm in 3D, generating a 3D Scene Graph. Given a 3D mesh and registered panoramic images, we construct a graph that spans the entire building and includes semantics on objects (e.g., class, material, and other attributes), rooms (e.g., scene category, volume, etc.) and cameras (e.g., location, etc.), as well as the relationships among these entities. However, this process is prohibitively labor heavy if done manually. To alleviate this we devise a semi-automatic framework that employs existing detection methods and enhances them using two main constraints: I. framing of query images sampled on panoramas to maximize the performance of 2D detectors, and II. multi-view consistency enforcement across 2D detections that originate in different camera locations.

Community

Introduces the 3D scene graph (3DSG): a unified scene representation that encodes scene semantics (objects, materials, scene categories, locations, etc.) and the relations among them in a hierarchy built from a 3D mesh and registered panoramic views; proposes a semi-automatic pipeline to construct it. Extends the scene-graph idea from Visual Genome to 3D space and integrates the resulting modality into the Gibson environment database. Uses a predefined set of 3DSG attributes and relations between elements (objects, rooms, buildings, and cameras).

Input is a 3D mesh with registered RGB panoramas and camera parameters; output is an indoor scene graph with four layers (bottom to top): cameras (with pose), objects (with semantic class and other attributes), rooms (with spatial and aggregate attributes), and the building. Existing semantic detectors, boosted by 2D framing and 3D multi-view consistency, drive the annotation pipeline and minimize human labor.

Rectilinear framing (selective cropping) of the panoramas improves detection accuracy: sample rectilinear images densely on each panorama, run the 2D detector, and aggregate the detections with a weighted voting scheme (weight proportional to the detection confidence score and inversely proportional to the distance from the image center); retain the highest-scoring classes. The panorama is then segmented per class, and connected components give instance segmentation masks.

Directly projecting the registered panorama labels onto the 3D mesh yields inconsistencies (bleeding segmentation labels), so multi-view consistency is enforced: each camera's vote for a mesh face is weighted by how close the camera is to the face center of the mesh triangle (closer cameras have better visibility); a weighted majority vote retains the highest-scoring classes, and connected components are found in 3D (results can also be projected back to the 2D panoramas). A user-in-the-loop (manual) verification step on Amazon Mechanical Turk checks labels and segmentation masks.

Attributes and relations are computed with pre-existing methods: room category via 3D semantic parsing (classification), volume via the convex hull, material annotated manually, 2D amodal masks via ray tracing, etc. Detection uses Mask R-CNN with bells and whistles (ResNeXt-152 backbone with a feature pyramid network, FPN), pretrained on ImageNet and fine-tuned on COCO; the pipeline improves average precision and recall (AP and AR) in both 2D and 3D. The partially automatic method (with manual verification) saves time compared to a fully manual one. The scene graph is further evaluated on spatial order of object pairs (front-back and left-right), relative volume (smaller or larger), and amodal mask segmentation (the occluded part of an object); a U-Net is trained on projected masks for amodal segmentation.

The supplementary material (appendix) lists the 3DSG attributes and relationships (and how each is computed), additional analysis and results, details of the manual verification pipelines, and more experiments. Relationships include: amodal mask (object and camera) for occluded/visible segmentation; occlusion relationship (two objects and a camera); relative magnitude of physical quantities between objects; spatial order (location with respect to a query object in a given camera view, possibly across rooms); parent space (room ID of an object or camera); parent building; and same parent room (binary). Semantic analysis covers the distribution of objects per building, class-wise histograms, volume (from the 3D convex hull) per class, surface coverage (mesh area) per class, and nearest-object distributions. The paper also shows the web interfaces used to verify object labels and masks, add missing masks, and check completeness; masks are drawn manually.
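To make the four-layer hierarchy concrete, below is a minimal Python sketch of how a 3DSG could be represented. The class and attribute names are illustrative placeholders based on the summary above, not the paper's exact schema.

```python
# Minimal sketch (not the authors' code) of the four-layer hierarchy:
# building -> rooms -> objects, with cameras registered to their parent rooms.
# Attribute names are illustrative placeholders, not the paper's exact schema.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Camera:
    id: int
    pose: Tuple[float, ...]               # pose of the registered panorama
    parent_room: int                      # "parent space" relationship


@dataclass
class SceneObject:
    id: int
    semantic_class: str                   # e.g., "chair"
    material: str                         # annotated manually in the paper
    volume: float                         # from the 3D convex hull
    location: Tuple[float, float, float]
    parent_room: int


@dataclass
class Room:
    id: int
    scene_category: str                   # e.g., "kitchen"
    volume: float
    objects: List[SceneObject] = field(default_factory=list)
    cameras: List[Camera] = field(default_factory=list)


@dataclass
class Building:
    id: int
    rooms: List[Room] = field(default_factory=list)
```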
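The framing step above aggregates per-pixel detections from densely sampled rectilinear crops with a weight proportional to the detection confidence and inversely proportional to the distance from the crop center. A hedged NumPy sketch of that voting, assuming detections have already been mapped back to panorama coordinates (the paper's exact weighting formula may differ):

```python
# Sketch of per-pixel weighted voting over detections from sampled rectilinear
# crops. `detections` is an assumed format: (class_id, confidence, mask, center),
# where `mask` is a boolean array in panorama coordinates and `center` is the
# crop center projected onto the panorama.
import numpy as np

def aggregate_panorama_votes(detections, pano_shape, num_classes, eps=1e-6):
    votes = np.zeros((num_classes,) + tuple(pano_shape), dtype=np.float64)
    ys, xs = np.mgrid[0:pano_shape[0], 0:pano_shape[1]]
    for class_id, confidence, mask, (cy, cx) in detections:
        # Weight grows with detector confidence and shrinks with distance from
        # the crop center (detections near crop borders are less reliable).
        dist = np.sqrt((ys - cy) ** 2 + (xs - cx) ** 2)
        votes[class_id] += (confidence / (dist + eps)) * mask
    # Keep the highest-scoring class per pixel; unvoted pixels stay background (-1).
    labels = np.argmax(votes, axis=0)
    labels[votes.sum(axis=0) == 0] = -1
    return labels
```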
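The multi-view consistency step can be sketched the same way, as inverse-distance-weighted voting over mesh faces; the data layout below (per-camera face labels, camera positions, face centers) is an assumption for illustration, not the authors' implementation.

```python
# Sketch of multi-view label fusion on mesh faces. `face_labels[c, f]` is the
# class camera c assigns to face f (-1 if the face is not visible from c);
# `cam_positions` (num_cams, 3) and `face_centers` (num_faces, 3) are 3D points.
import numpy as np

def fuse_face_labels(face_labels, cam_positions, face_centers, num_classes):
    num_cams, num_faces = face_labels.shape
    votes = np.zeros((num_faces, num_classes))
    for c in range(num_cams):
        # Closer cameras see the surface better, so they get a larger vote.
        dist = np.linalg.norm(face_centers - cam_positions[c], axis=1)
        weight = 1.0 / (dist + 1e-6)
        visible = face_labels[c] >= 0
        votes[visible, face_labels[c, visible]] += weight[visible]
    fused = np.argmax(votes, axis=1)
    fused[votes.sum(axis=1) == 0] = -1    # faces never observed stay unlabeled
    return fused
```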
Data generation for the occlusion (amodal) masks renders a rectilinear image with the object of interest at its center and ray-traces the mesh (surface visibility and the order in which rays hit surfaces); samples are filtered by occlusion percentage (occluded pixels over occluded plus visible pixels) to a meaningful range (0.2 to 0.8). Training data uses random jitter and horizontal flips; the task is posed as dense semantic segmentation (empty, occluded, visible) with a U-Net. The authors are from Stanford (including Martin Fischer and Silvio Savarese) and UC Berkeley (Jitendra Malik).
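A small sketch of the occlusion-percentage filter used to select amodal-mask training samples, assuming boolean occluded/visible pixel masks from the ray-tracing step; the 0.2 to 0.8 range follows the notes above.

```python
# Keep a rendered sample only if its occlusion percentage (occluded pixels over
# occluded plus visible pixels) lies in a meaningful range. `occluded` and
# `visible` are assumed boolean pixel masks for the object of interest.
import numpy as np

def keep_amodal_sample(occluded: np.ndarray, visible: np.ndarray,
                       lo: float = 0.2, hi: float = 0.8) -> bool:
    occ, vis = occluded.sum(), visible.sum()
    if occ + vis == 0:
        return False
    return lo <= occ / (occ + vis) <= hi
```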

Links: website, Gibson environment database, Video, Supplementary material, PapersWithCode, GitHub
