Analysis of Filtering for Similarity Search Using Sketches
Sketches are tiny data structures that can be used to efficiently perform filtering high-dimensional data for similarity search. To design real systems with sketching techniques, an important design decision is the choice of sketch size given the targeted dataset size and desired filtering quality. Such design decisions need to be made without the dataset, or at least without the whole dataset. This paper provides analytical and experimental results to help system designers make such design decisions. We first show that the $\ell_1$ distances between data objects in these datasets fits lognormal distributions well. We then present a rank-based filtering model for the sketching algorithm that uses Hamming distance to approximate $\ell_1$ distance. Our experimental results show that the model gives conservative and good prediction for image, audio and 3D shape datasets. Finally, we show that the parameters of the model can be set with a small sample dataset and the resulting model can make good predictions for a large dataset.