Sparse spatial sample kernels - An efficient way of measuring remote sequence similarity

Vladimir Pavlovic
Computer Science, Rutgers University

Analysis of sequences, such as sequence classification or clustering, is a challenging problem that spans a number of fields in science and technology. For example, elucidation of protein functions from primary DNA sequence, "barcoding" of living species, understanding of text or music corpora are among the tasks that, in their core, rely on one's ability to accomplish classification of strings of varying lengths and content. In this talk I will present my group's recent work that aims to address the string classification problem in a simple, robust yet computationally highly efficient manner. I will discuss a new way to embed strings into a finite dimensional subspace which we denote as the Sparse Spatial Sample (SSS). The SSS is astoundingly simple in its representation, yet the subspace it determines appears to very closely match the sequence manifolds for many practical problems. In addition, the SSS representation gives rise to a new family of fast string matching algorithms. This new family of linear time algorithms improves theoretical complexity bounds of existing approaches while scaling well with respect to the sequence alphabet size, the number of allowed mismatches and the size of the dataset. In particular, on large alphabets with loose mismatch constraints our algorithms are several orders of magnitude faster than the existing algorithms for string comparison under the mismatch similarity measure. I will then present a number of results that demonstrate both the computational efficiency and the predictive ability induced by the SSS manifold. Experiments on music genre classification, text classification, protein remote homology and fold prediction, and DNA barcoding will illustrate some of the approach's benefits on a rather disparate set of problems. The scalability of the algorithms allows one to consider complex sequence transformations, modeled using longer string features and larger numbers of mismatches, leading to a state-of-the-art performance with significantly reduced running times.

Kuksa, P., Huang, P. & Pavlovic, V. (2008), "Scalable Algorithms for String Kernels with Inexact Matching", In Neural Information Processing Systems (NIPS). Vancouver, Canada. Dec. 2008.

Kuksa, P., Huang, P. & Pavlovic, V. (2008), "Fast Protein Homology and Fold Detection with Sparse Spatial Sample Kernels", In Int'l Conf. Pattern Recognition. Tampa, FL. Dec. 2008.

Huang, P. & Pavlovic, V. (2008), "Protein homology detection with biologically inspired features and interpretable statistical models", Int. J. Data Min. Bioinformatics. Vol. 2(2), pp. 157-175.