Content-Based Search of Non-Text Data: What Google Does Not Do

Kai Li

Computer Science, Princeton University

Commercial search engines such as Google have been quite successful at searching large collections of text documents, but less successful at handling complex data types such as images, video, audio, and scientific sensor data. The main challenge is that it is not clear how to extract features from such data effectively or how to perform high-dimensional similarity search over very large datasets. The goal of the Princeton Content-Aware Search Systems Project is to investigate how to attack this challenge. This talk will present our recent results in dimension reduction and indexing for similarity search. We will also describe a toolkit we have built and its applications in content-based search and image spam detection.