Approximation Algorithms for Clustering (Thesis)

Report ID: TR-716-04
Authors:
Date: September 2004
Pages: 133
Download Formats: [PDF]

Abstract:

Clustering items into groups is a fundamental problem in the
information sciences. Many typical clustering optimization problems are
NP-hard and so cannot be expected to be solved optimally in a
reasonable amount of time. Although the use of heuristics is common, in
this dissertation we seek approximation algorithms, whose performance
ratio in relation to the optimal solution can be guaranteed and whose
running time is a polynomial function of the problem instance size. We
start by examining variants of the asymmetric k-center problem. We
demonstrate an O(log* n)-approximation algorithm for the asymmetric
weighted k-center problem. Here, the vertices have weights and we are
given a total budget for opening centers. In the p-neighbor variant
each vertex must have p (unweighted) centers nearby: we give an O(log*
k)-bicriteria algorithm using 2k centers, for small p. We also show
that the following three versions of the asymmetric k-center problem
are inapproximable: priority k-center, k-supplier, and outliers with
forbidden centers. The bulk of the dissertation concerns correlation
clustering: clustering a collection of elements based on pairwise
judgments of similarity and dissimilarity. The problem instance does
not include a distance relation between the elements. We partition the
elements into clusters so that the number of pairs correctly (resp.
incorrectly) classified with respect to the input judgment labeling is
maximized (resp. minimized). It is worthwhile studying both complete
instances, in which every pair is labeled, and general instances, in
which some input pairs might not have labels. Specifically, we
demonstrate a factor 4 approximation for minimization on complete
instances, and a factor O(log n) approximation for general instances.
For the maximization version, we give a factor 0.7664 approximation for
general instances and show that a PTAS is unlikely by proving
APX-hardness. We also prove the APX-hardness of minimization on
complete instances. We provide the first nontrivial approximation
algorithm for maximizing the correlation: the difference between the
number of pairs correctly classified and the number incorrectly
classified. The factor Ω(1/log n) algorithm is derived from an
approximation algorithm for maximizing a fairly general type of
quadratic program on the unit hypercube.
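
To make the weighted k-center variant concrete, one common way to write its objective is sketched below. The symbols d (the asymmetric distance), w (the vertex weights), and W (the budget on the total weight of opened centers) are our own notation for this illustration, not necessarily the formulation used in the thesis.

% Sketch of the asymmetric weighted k-center objective (our notation).
% d(c, v) is the distance from center c to vertex v; d need not be symmetric.
\min_{\substack{S \subseteq V \\ \sum_{c \in S} w(c) \le W}} \;
\max_{v \in V} \; \min_{c \in S} d(c, v)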
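As a concrete illustration of the correlation-clustering objective, the short Python sketch below counts, for a candidate partition, how many labeled pairs are classified correctly (a "+" pair kept together, or a "-" pair split apart) and how many are classified incorrectly. The function and variable names are ours, introduced only for this sketch; this is not code from the thesis.

# Minimal sketch of the correlation-clustering objective (illustration only).
# `labels` maps an unordered pair of elements to '+' (similar) or '-' (dissimilar);
# unlabeled pairs are simply absent, which models a "general" (incomplete) instance.
# `clustering` maps each element to a cluster identifier.

def agreements_and_disagreements(labels, clustering):
    agree = disagree = 0
    for (u, v), sign in labels.items():
        same_cluster = clustering[u] == clustering[v]
        if (sign == '+' and same_cluster) or (sign == '-' and not same_cluster):
            agree += 1      # pair classified correctly
        else:
            disagree += 1   # pair classified incorrectly
    return agree, disagree

# Example: three elements with one '+' and two '-' judgments.
labels = {('a', 'b'): '+', ('a', 'c'): '-', ('b', 'c'): '-'}
clustering = {'a': 0, 'b': 0, 'c': 1}
print(agreements_and_disagreements(labels, clustering))  # -> (3, 0)

In these terms, the maximization version maximizes the first count, the minimization version minimizes the second, and the correlation is their difference.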
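The kind of quadratic program referred to in the last sentence can be illustrated, in the two-cluster special case, as follows; the encoding x_i in {-1, +1} and the coefficients a_{ij} are our notation for this sketch only.

% Two-cluster illustration (our notation): a_{ij} = +1 for a "similar" label,
% a_{ij} = -1 for a "dissimilar" label, and a_{ij} = 0 for an unlabeled pair.
\max \; \sum_{i < j} a_{ij}\, x_i x_j
\quad \text{subject to} \quad x_i \in \{-1, +1\} \ \text{for all } i.
% The objective equals agreements minus disagreements for the two clusters
% {i : x_i = +1} and {i : x_i = -1}.

Maximizing such a form over the vertices of the hypercube is a special case of the fairly general quadratic programs for which the dissertation's approximation algorithm, and hence the Ω(1/log n) correlation guarantee, is obtained.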
