Pixel-Level Prediction: From Geometry to Semantics
Pixel-level prediction generalizes a wide range of computer vision tasks, including
semantic image segmentation and dense depth prediction. These tasks are fundamental
to image recognition and receive continual attention from the community. However,
although they share common traits that may admit a general solution, they are usually studied in isolation because of different domain characteristics. This thesis aims to study the essential problems behind these tasks and shed light on a general framework.
This thesis starts with an algorithm that can predict plausible depth from almost
identical images based on geometric optimization. The motion between those images is called “Accidental Motion”. The analysis of accidental motion shows that motion optimization has special convexity properties. This analysis leads to a reconstruction pipeline that can produce a plausible dense depth map for the reference image, which is shown to enable depth-based camera effects.
The second part then studies learning pixel representations to predict semantic
properties from a single reference image. Previous works usually use learned upsampling to recover the pixel-level information. This work proposes to use dilated convolution to transform classification networks such that high-resolution prediction is achieved without upsampling. Dilated convolution can also yield an exponential increase in the receptive field, which is ideal for learning global context. A context module based on this property is proposed that can improve network performance significantly and consistently. Dilation is still a standard component in state-of-the-art methods for semantic image segmentation.
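The receptive-field growth mentioned above can be checked with simple arithmetic. The following is a minimal sketch, not the thesis implementation: it assumes a stack of stride-1 convolutions with 3-wide kernels along one spatial axis, and the function name `receptive_field` and the dilation schedules are chosen here purely for illustration.

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field width (along one axis) after each stride-1 layer.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    width = 1  # a single pixel before any convolution
    widths = []
    for d in dilations:
        width += (kernel_size - 1) * d
        widths.append(width)
    return widths


# Hypothetical schedules: dilation doubling per layer vs. no dilation.
doubling = receptive_field([1, 2, 4, 8])
plain = receptive_field([1, 1, 1, 1])
print(doubling)  # [3, 7, 15, 31]
print(plain)     # [3, 5, 7, 9]
```

Under these assumptions, doubling the dilation at each layer roughly doubles the receptive field per layer, while the undilated stack grows only linearly, with the same number of weights in both cases.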
Further study of dilated residual networks shows that the same high-resolution
prediction can also improve image classification results. This indicates that no essential difference in network architecture exists between image classification and segmentation.
Further inspection of class activation maps and layer responses uncovers peculiar gridding patterns and their cause. This finding leads to new designs of convolutional networks that can remove the gridding artifacts and produce activations with better spatial consistency. The new networks can improve the performance of both image classification and semantic segmentation.
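One commonly cited intuition for gridding is that a dilated kernel samples its input on a sparse lattice, so neighboring output pixels can depend on disjoint sets of input pixels. The sketch below illustrates that intuition in one dimension only; it is an assumption-laden toy, not the analysis or the network design of the thesis, and `contributing_offsets` is a hypothetical helper name.

```python
def contributing_offsets(dilations, kernel_size=3):
    """Input offsets (relative to an output pixel) that can influence it
    after a stack of stride-1 dilated convolutions along one axis."""
    half = kernel_size // 2
    offsets = {0}
    for d in dilations:
        taps = [d * k for k in range(-half, half + 1)]
        offsets = {o + t for o in offsets for t in taps}
    return sorted(offsets)


# Two layers with dilation 2: only even offsets contribute, so odd
# neighbors are never seen (a grid-like sampling pattern).
gridded = contributing_offsets([2, 2])
# Following the dilated layer with an undilated one fills the gaps.
filled = contributing_offsets([2, 1])
print(gridded)  # [-4, -2, 0, 2, 4]
print(filled)   # [-3, -2, -1, 0, 1, 2, 3]
```

In this toy setting, mixing in a layer with smaller dilation makes the set of contributing inputs contiguous, which is one plausible way to read the goal of activations with better spatial consistency.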
The presented methods and results may inspire new research in building a unified
framework for image recognition of geometry and semantics.