Comparison of Clustering Algorithms and Its Application to Document Clustering (thesis) | Computer Science Department at Princeton University

Report ID:

TR-758-06

Authors:

Chen, Jie

Date:

May 2006

Pages:

214

Download Formats:

[PDF]

Abstract:

We investigate the methodology to evaluate and compare the quality of clustering algorithms. We study the issues raised in evaluation, such as data generation and choice of evaluation metrics. We give head-to-head comparison of six important clustering algorithms from different research communities.

We generate data using an extended mixture of Gaussian model that controls data characteristics such as shape, volume and size of a class. The added control facilitates more thorough exploration into the parameter space of the generative model and therefore makes the algorithmic evaluation more comprehensive.

We summarize datasets by their data characteristics. Previous comparison conclusions based on specific datasets are hard to be applied to other datasets. In contrast, conclusions based on data characteristics can be easily generalized. Therefore, we can predict the performance of an algorithm. The prediction is a significant step forward for comparison studies.

We compare three evaluation indices from different research fields. We recommend the f-score measurement if a single evaluation index is used. However, it is more desirable to apply multiple indices and vote among them to determine the ranking of algorithms.

We isolate the objective function (sum-of-squares) and the optimization approach (greedy) used in the popular k-means algorithm. We investigate the effectiveness and efficiency in the four algorithms resulting from substituting another objective function (min-max cut) and another optimization approach (Kernighan-Lin). We illustrate by a quantitative study that contrary to conventional wisdom, k-means is not only fast but also produces quality clusters. It achieves 95% of the best quality among the candidate algorithms running in time an order of magnitude faster.

We present detailed studies of the algorithmic behavior in response to data characteristics. We give an alternative definition of cluster separation and show that the new definition measures the degree of cluster difficulty for spherical and ellipsoidal clusters more consistently.

We develop and systematically evaluate a practical version of a spectral clustering algorithm originally specified for provable guarantees of correctness. We observe that the modified algorithm can find perfect solutions when the clusters are well separated, where iterative algorithms such as k-means tend to miss the perfect solution.