An Empirical Approach to Computer Vision
Toward the goal of modeling perceptual grouping, we have constructed a novel dataset of 12,000 segmentations of 1,000 natural images by 30 human subjects. The subjects marked the locations of objects in the images, providing ground truth data for learning grouping cues and benchmarking grouping algorithms. We believe that a data-driven approach is critical for two reasons: (1) the data reflect ecological statistics that the human visual system has evolved to exploit, and (2) innovations in computational vision should be evaluated quantitatively.
I will first present local boundary models based on brightness, color, and texture cues, where the cues are individually optimized with respect to the dataset and then combined in a statistically optimal manner using classifiers. The resulting detector is shown to significantly outperform prior state-of-the-art algorithms. Next, we learn from the dataset how to combine the boundary model with patch-based features in a pixel affinity model, settling long-standing debates in computer vision with empirical results: (1) brightness boundaries are more informative than brightness patches, and vice versa for color; (2) texture boundaries and texture patches are the two most powerful cues; (3) proximity is not a useful cue for grouping; it is simply a result of the grouping process; and (4) boundary-based and region-based approaches each provide significant independent information for grouping.
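As a rough illustration of what combining cues with a classifier can look like, the sketch below fits a logistic regression that maps three per-pixel cue responses (brightness, color, texture) to a boundary probability. It is a minimal sketch under stated assumptions, not the detector described above: the choice of logistic regression, the function names fit_logistic and boundary_probability, and the synthetic training data are all hypothetical stand-ins for real cue responses and human-marked boundaries.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def boundary_probability(x, w):
    # x: cue responses for one pixel; w: one weight per cue plus a trailing bias.
    z = sum(wj * xj for wj, xj in zip(w, x)) + w[-1]
    return sigmoid(z)


def fit_logistic(X, y, lr=0.5, steps=500):
    # Batch gradient descent on the mean log loss.
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)  # last entry is the bias
    for _ in range(steps):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = boundary_probability(xi, w) - yi
            for j in range(d):
                grad[j] += err * xi[j]
            grad[d] += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


# Synthetic stand-in data (an assumption, not the human-labeled dataset):
# label 1 = boundary pixel, label 0 = non-boundary pixel. Each cue responds
# more strongly on boundaries, by a different hypothetical amount.
rng = random.Random(0)
shift = (1.0, 0.5, 1.5)  # hypothetical mean response gap per cue
y = [i % 2 for i in range(200)]
X = [[rng.gauss(0.0, 0.5) + yi * s for s in shift] for yi in y]

w = fit_logistic(X, y)
probs = [boundary_probability(x, w) for x in X]
accuracy = sum((p > 0.5) == (yi == 1) for p, yi in zip(probs, y)) / len(y)
print("weights (brightness, color, texture, bias):", [round(v, 2) for v in w])
print("training accuracy:", accuracy)
```

In a sketch like this, the fitted weights give a crude indication of how much each cue contributes once the others are taken into account, which is the kind of question the dataset makes it possible to answer empirically.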
Within this domain of image segmentation, this work demonstrates that from a single dataset encoding human perception on a high-level task, we can construct benchmarks for the various levels of processing required for the high-level task. This is analogous to the micro-benchmarks and application-level benchmarks employed in computer architecture and computer systems research to drive progress towards end-to-end performance improvement. I propose this as a viable model for stimulating progress in computer vision for tasks such as segmentation, tracking, and object recognition, using human ground truth data as the end-to-end goal.