New directions in learning and clustering :   research directions

The starting point is the observation that no clear boundary separates the twin notions of clustering and learning.
Clustering is usually driven by the end goal of learning, but can also be viewed as a learning task in itself since it results in a more compact description of the data. By the same token all learning involves clustering of some sort. The project takes an integrated view of the entire problem of learning patterns in data, starting from streaming computations that  might produce representative sketches of the data as it streams by, to problems of clustering data into meaningful patterns (with attendant problems of outlier removal, multiobjective optimization etc.), to learning algorithms that fit sophisticated models (SVMs, bayesian nets, gaussian mixtures etc.) for inference and reasoning tasks. The project seeks to develop fundamental new algorithms for such problems, while bearing in mind that even defining the problems in this area involves an interesting blend of algorithms, statistics, economics, AI, and machine learning.  We seek to design algorithms with provable performance guarantees but also good practical performance. Ongoing investigations include the following.
 

bullet Studying the problem of compact representation of data, especially in the setting of metric spaces and geometric spaces. These connect up to fundamental problems in mathematics. One breakthrough result was that the dimension of l_1 normed spaces cannot be substantially reduced without greatly distorting their structure.
(Dimension reduction is often attempted in practice as a way to combat the "curse of dimensionality.")
bullet Use of approximation ratio to quantify the performance of clustering heuristics. The clustering output by the algorithm is near-optimal in the sense that the objective function is within some small factor of the best achievable. By an appropriate choice of the objective function, this can be viewed as a generalization of existing approaches such as maximum likelihood estimation. Some representative problems being studied: learning mixtures of gaussians, correlation clustering, k-median, etc.
bullet Better clustering via graph partitioning.  Graph partitioning (specifically, problems such as SPARSEST CUT, BALANCED SEPARATOR, etc.) are used as a subroutine in learning/clustering tasks such as Image segmentation, dividing a graph into small number of pieces so as to minimize the cross-edges, etc. We have developed new algorithms for SPARSEST CUT which advance the state of the art, and are also very efficient. Practical implementations are under development.
bullet Incorporating privacy considerations into data mining.