Discovering Meaning in the Visual World

Fei-Fei Li

Computer Science, Princeton University

When humans encounter images or videos of the visual world, our visual system extracts a wealth of information in a fleeting moment. A large portion of this information is semantic: objects, scenes, purposeful motions and events. This ability still poses a major challenge to today's computer vision algorithms. In this talk, we begin by introducing a bag of words (BoW) model for natural scene categorization based on the Latent Dirichlet Allocation model, and show strong results on a dataset of thirteen scene classes. In object recognition, we present a recent algorithm that collects large object class datasets from the Internet via unsupervised incremental learning, using a non-parametric latent topic model. While bag of words models can characterize objects and images in a computationally efficient way, the classical representations do not encode spatial information, which limits BoW models in more detailed image understanding. We show briefly that by introducing a hierarchical region-based representation to capture the spatial coherence of the image, we achieve simultaneous segmentation and recognition of objects with far better results than traditional BoW models. We conclude the talk by showcasing a couple of recent and to-be-published works on human motion recognition in videos and human activity classification in static images.
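The point about bag of words models discarding spatial information can be illustrated with a minimal sketch. It is not the Latent Dirichlet Allocation model from the talk, only the histogram representation that such topic models are typically built on. The codebook, the 2-D descriptors and the function names below are hypothetical, chosen for illustration.

```python
from collections import Counter
from math import dist


def nearest_word(descriptor, codebook):
    """Return the index of the codebook entry closest to the descriptor (Euclidean)."""
    return min(range(len(codebook)), key=lambda k: dist(descriptor, codebook[k]))


def bow_histogram(descriptors, codebook):
    """Return the normalized histogram of visual-word counts for one image."""
    if not descriptors:
        return [0.0] * len(codebook)
    counts = Counter(nearest_word(d, codebook) for d in descriptors)
    total = len(descriptors)
    return [counts[k] / total for k in range(len(codebook))]


# Hypothetical toy data: three visual words and four patch descriptors.
# Real systems use high-dimensional local descriptors and a learned codebook.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
patches = [(0.1, -0.1), (0.9, 1.2), (5.2, 4.8), (1.1, 0.9)]

hist = bow_histogram(patches, codebook)

# Reordering the patches (a stand-in for moving them around the image)
# leaves the histogram unchanged: the representation keeps no layout.
shuffled = bow_histogram(list(reversed(patches)), codebook)

print(hist)              # [0.25, 0.5, 0.25]
print(hist == shuffled)  # True
```

Because the histogram depends only on which visual words occur and how often, any rearrangement of the same patches gives the same representation. This is the limitation that a region-based representation is meant to address.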

1) S. Savarese and L. Fei-Fei. 3D generic object categorization,
localization and pose estimation. IEEE Intern. Conf. in Computer Vision
(ICCV). 2007.

2) J. Li and L. Fei-Fei. What, where and who? Classifying event by
scene and object recognition. IEEE Intern. Conf. in Computer Vision
(ICCV). 2007.

3) L. Cao and L. Fei-Fei. Spatially coherent latent topic model for
concurrent object segmentation and classification. IEEE Intern. Conf.
in Computer Vision (ICCV). 2007.

4) L. Fei-Fei, R. Fergus and P. Perona. Learning generative visual
models for 101 object categories. Computer Vision and Image
Understanding. 2007.

5) J. Li, G. Wang and L. Fei-Fei. OPTIMOL: automatic Object Picture
collecTion via Incremental MOdel Learning. IEEE Computer Vision and
Pattern Recognition (CVPR). 2007.

6) J.C. Niebles and L. Fei-Fei. A hierarchical model of shape and
appearance for human action classification. IEEE Computer Vision and
Pattern Recognition (CVPR). 2007.

7) L. Fei-Fei, A. Iyer, C. Koch and P. Perona. What do we perceive in
a glance of a real-world scene? Journal of Vision, 7(1):10, 1-29,
http://journalofvision.org/7/1/10/, doi:10.1167/7.1.10. 2007.

8) J.C. Niebles, H. Wang and L. Fei-Fei. Unsupervised learning of human
action categories using spatial-temporal words. British Machine Vision
Conference (BMVC). 2006.

9) L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning
Natural Scene Categories. IEEE CVPR. 2005.