A 2D + 3D Rich Data Approach to Scene Understanding
Jianxiong Xiao, Massachusetts Institute of Technology
On your one-minute walk from the coffee machine to your desk each morning, you pass by dozens of scenes, such as a kitchen, an elevator, and your office. You effortlessly recognize them and perceive their 3D structure. Yet this one-minute scene-understanding problem has remained an open challenge in computer vision for decades. Recently, researchers have come to realize that big data is critical for building scene-understanding systems that can both recognize scene semantics and reconstruct 3D structure. In this talk, I will share my experience in leveraging big data for scene understanding, shifting the paradigm from 2D view-based categorization to 3D place-centric representations.
To push the traditional 2D representation to its limit, we built the Scene Understanding (SUN) Database, a large collection of images that exhaustively spans all scene categories. However, the lack of a "rich" representation still significantly limits the traditional recognition pipeline. An image is a 2D array, but the world is 3D and our eyes see it from a particular viewpoint; traditionally, neither fact is modeled. This paradigm shift toward rich representation also raises new challenges that require a new kind of big data: data with extra descriptions, which we call rich data. Specifically, we focus on one highly valuable kind of rich data, multiple viewpoints in 3D, and we built the SUN3D database to obtain an integrated "place-centric" representation of scenes. This new representation opens up exciting opportunities for integrating scene recognition over space and for obtaining scene-level reconstructions of large environments. It also has many applications, such as organizing big visual data into photo-realistic indoor 3D maps. Finally, I will discuss some open challenges and my future plans for rich data and representation.
Jianxiong Xiao is a Ph.D. candidate in the Computer Science and Artificial Intelligence Laboratory (CSAIL) at the Massachusetts Institute of Technology (MIT). Before that, he received a B.Eng. and an M.Phil. from the Hong Kong University of Science and Technology. His research interests are in computer vision, with a focus on scene understanding. His work received the Best Student Paper Award at the European Conference on Computer Vision (ECCV) in 2012 and has appeared in the popular press. Jianxiong was awarded the Google U.S./Canada Ph.D. Fellowship in Computer Vision in 2012 and the MIT CSW Best Research Award in 2011. More information can be found on his website: http://mit.edu/jxiao.