Data-Driven 3D Scene Understanding | Computer Science Department at Princeton University

Report ID:

TR-020-18

Authors:

Song, Shuran

Date:

October 16, 2018

Pages:

141


Abstract:

Intelligent robots require advanced vision capabilities to perceive and interact with
the real physical world. While computer vision has made great strides in recent
years, its predominant paradigm still focuses on analyzing image pixels to infer
two-dimensional outputs (e.g., 2D bounding boxes or labeled 2D pixels), which remain
far from sufficient for real-world robotics applications.
This dissertation presents the use of amodal 3D scene representations that enable
intelligent systems not only to recognize what is seen (e.g., Am I looking at a chair?),
but also to predict contextual information about the complete 3D scene beyond visible
surfaces (e.g., What could be behind the table? Where should I look to find an exit?).
More specifically, it presents a line of work that demonstrates the power of these
representations. First, it shows how amodal 3D scene representations can be used to
improve performance on traditional tasks such as object detection. We present
SlidingShapes and DeepSlidingShapes for the task of amodal 3D object detection,
where the system is designed to fully exploit the 3D information provided
by depth images. Second, we introduce the task of semantic scene completion and
our approach SSCNet, whose goal is to produce a complete 3D voxel representation of
volumetric occupancy and semantic labels for a scene from a single-view depth map
observation. Third, we introduce the task of semantic-structure view extrapolation
and our approach Im2Pano3D, which aims to predict the 3D structure and semantic
labels for a full 360° panoramic view of an indoor scene given only a partial
observation. Finally, we present two large-scale datasets (SUN RGB-D and SUNCG)
that enable research on data-driven 3D scene understanding.
This dissertation demonstrates that leveraging complete 3D scene representations
not only significantly improves the performance of algorithms on traditional computer
vision tasks, but also paves the way for new scene understanding tasks that were previously
considered ill-posed given only 2D representations.