Quick links

High-Dimensional Similarity Search for Large Datasets

Report ID:
August 2011
Download Formats:


The volume of images and other non-text data is growing exponentially in
todays digital universe. A popular way of extracting useful information
from such data is to conduct content-based similarity search, e.g., to
search for images with content similar to a query image. How to build
data structures to support efficient similarity search on a large scale
is an issue of increasing importance. The challenge is that feature-rich
data are usually represented as high-dimensional feature vectors, and
the curse of dimensionality dictates that as dimensionality grows, any
search strategy examines an increasingly large portion of the dataset
and eventually degenerates to brute-force scan. In this dissertation,
we study several key issues to improve the accuracy and efficiency of
high-dimensional similarity search. Specifically, we make the following
First, we enhance Multi-Probe LSH, the state-of-the-art high-dimensional
similarity search technique, by developing an accurate performance
model that predicts search accuracy with an error rate of less than 5%
using only a 1/10 sample of the whole dataset. We use this model to
realize automatic performance tuning, which would otherwise be tedious,
and to develop an adaptive query processing strategy to ensure the high
search quality of each individual query.

Second, we develop an efficient offline method for simultaneously finding
the K nearest neighbors of every point in the dataset. Our method is 2
to 16 times faster than the state-of-the-art technique when evaluated
with different datasets. Furthermore, we show how to use the offline
computed nearest-neighbor information to double the speed of online
similarity search.

Third, we develop compact representations of high-dimensional feature
vectors optimized for similarity search tasks. We present an asymmetric
distance estimation framework to exploit the information in the
uncompressed query data. We use that to further compress the indexed
dataset, by 10% to 40%, without compromising search quality.

Fourth, we develop a scheme to compactly represent sets of feature
vectors, an increasingly popular data representation that is more accurate
than single vectors, but also more expensive. Our method
dramatically reduces the matching cost in both space and time. In an
image classification task, our method achieves 25 times speedup over
one of the best existing techniques.

Finally, we demonstrate some of the proposed techniques by building a
large-scale near-duplicate image search engine. Our system serves more
than 50 million web images on a single commodity server and responds to
queries within a few seconds.

Follow us: Facebook Twitter Linkedin