Content-Based Similarity Search

Overview

Digital data volume has been increasing at a phenomenal rate during the past decade. The ``Moore's law curve'' (doubling every 18 months) no longer refers only to the exponential improvement rate of processor performance, storage density and network bandwidth, but also to the data growth rates of many disciplines. The dominating data types are feature-rich data such as audio, digital photos, videos, and scientific sensor data. As we are moving into a digital society where all information is digitized and where the world is interconnected by digital means, it is highly desirable for next-generation systems to provide users with abilities to access, search, explore and manage feature-rich data.

Although several new operating systems attempt to provide users with content-based search capabilities, they are limited to text documents. A key challenge in implementing a content-based similarity search system for feature-rich data is that such data is noisy and complex. For example, consider two different photographs of an identical scene, or two separate recordings of a person speaking the same sentence. Despite the high degree of similarity between the two images or between the audio recordings, the digital representations are different at the bit level. Comparing noisy, feature-rich data requires matching based on similarity instead of exact match, and thus searching for noisy data requires similarity search instead of exact search. However, similarity search in high-dimensional spaces is notoriously difficult (the so called curse of dimensionality). Hence, practical advanced search solutions, such as database tools and search engines (e.g. Google), have been limited to searching for exact matches and tend to work only for text documents and text annotations. To date, there is no practical content-based search engine for massive amounts of inherently noisy, feature-rich data.

A key component in our research is a general-purpose similarity search engine. To deliver high-quality similarity search results with minimal CPU cycles and memory resources, we have developed novel techniques based on dimension-reduction ideas recently developed in the theory community. We use these to construct sketches -- tiny data structures that can be used to estimate properties of the original data -- from feature vectors as highly compact metadata for the similarity search engine. This approach allows us to attack the ``curse of dimensionality'' problem in the design of the similarity search engine for feature-rich data.

Princeton University Computer Science Department