COS 226 Spring 2000, Programming Assignment 6

Program 6: Search for clusters

Search for clusters

Design and implement an algorithm for finding the smallest cluster of M points in a large random set of N points in the plane. A ``cluster'' is defined for the purposes of this assignment to be a set of points falling within a disc in the plane centered at one of the points in the point set. Assume that the point coordinates are random integers (say, 20 bit long; generate your own) and that there are fewer than 50,000 points. Also assume that M is much smaller than N, say, M <= 100.

As usual, you will first want to implement a brute-force method to solve the problem, to get some idea of how much computation is required and to have a reliable check for small point sets when debugging. The brute-force algorithm calculates, for every point, the distance to every other point, then finds the (M-1)-st smallest number in this list. This gives the size of the cluster centered at each point; the smallest of these N numbers is the desired answer.

We have previously studied several ways to find the (M-1)st smallest, etc.; a heap is probably the most appropriate for this assignment, though you may wish to try some other priority queue implementation first, as a part of the brute-force solution.

You should focus on keeping the number of ``distance computations'' (calculating the distance between two points) and ``distance comparisons'' (deciding whether the distance between one pair of points is lower than another pair) as low as possible. The latter is partially dependent on the priority queue implementation (that's why you should use a heap), but the primary goal for this problem is to eliminate computations involving points that are known to be far apart. You can use a grid method or a quad tree or a k-d tree. All of these methods will make it possible to find the M-1 points closest to each point with substantially less than N distance calculations per point. Your major task is to figure out how exactly to adapt one or more of these data structures/algorithms to the cluster problem.

Turn in your documented source code and runs for at least N=1000 and N=10,00, with M=10 and M=100. Show the cluster size (disk radius), number of distance computations, and number of distance comparisons for each run. If your program is fast enough, run it for N=100,000 as well.

Advice.

Think about and debug your priority queue routine first: it's straightforward, but you won't want to mix debugging that with debugging the geometric method. The grid method is perhaps the easiest to think about at first, but it may not be the easiest to get working. For any method that you try, remember that the smallest cluster size is a property of the point set, {\it not your algorithm or implementation}. If you get a different answer when you change some parameter, then you probably have a bug. Final hint: you can save a lot of time by not taking the square root until the end.

Due: 11:59PM Thursday, April 6.