COS 126
Closest Pairs
Due: Wed. 4/9, 11:59pm

Suppose that N random points have been thrown onto a chessboard. In this assignment, you'll write a program to find and plot all pairs of points that are closer to one another than to any other on a two-dimensional space, like a chessboard. You'll be using arrays of linked lists to make it possible to handle large sets of points. This problem is representative of "clustering" problems that arise in database and graphics problems. The following 3-part sequence will lead to the final solution in manageable steps.

Part 1: Warm Up

Write gridcount.c, a program that prints how many points fell into each square. The input is (x,y) coordinates in the unit square (that is, x and y are both between 0 and 1), one pair per line. You can use /u/cs126/bin/randpoints to check the input format and to generate test data for small cases. (The source code for randpoints is in /u/cs126/examples/randpoints.c). With no command-line arguments, randpoints generates 10 random points:

% /u/cs126/bin/randpoints
0.242578 0.013470
0.383139 0.414653
0.067769 0.993127
0.484308 0.765338
0.031834 0.030935
0.932640 0.887880
0.591330 0.478779
0.833354 0.186335
0.735653 0.115053
0.698659 0.355604

You can specify the number of points with a command-line argument, e.g., /u/cs126/bin/randpoints 100 emits 100 points. You can pipe the output of randpoint directly into your program, e.g.,

% /u/cs126/bin/randpoints 100 | a.out
   2   1   0   1   0   2   2   3
   2   0   2   3   1   1   3   1
   1   1   2   2   0   0   0   1
   4   2   2   4   1   0   3   2
   1   0   0   2   1   1   1   3
   6   0   2   0   1   1   1   2
   2   4   5   2   1   1   2   2
   2   1   1   1   1   2   0   2

Your program should represent a G×G chessboard with a two-dimensional array, int grid[G][G], where grid[i][j] is the number of points that land in the square at row i, column j. For example, with G = 10, the input line 0.123 0.456 causes grid[1][4] to be incremented. The output above shows the result for G = 8. Here's a outline of gridcount.c:

#include <stdio.h>

#define G 8

int grid[G][G];

void gridinit(void) { /* initialize grid */ }

void gridinsert(float x, float y) { /* increment grid[i][j] */ }

void gridcount(void) { /* print the grid */ }

int main(void) {
	float x, y;

	gridinit();
	while (scanf("%f %f\n", &x, &y) != EOF) {
		gridinsert(x, y);
	}
	gridcount();
	return 0;
}

Copy this code as a starting point for your gridcount.c. Implement the bodies of the functions gridinit, gridinsert, and gridcount; each body is only a few lines of code. Your goal is to be able to handle the data file /u/cs126/data/stars, which is a map of about 5000 stars in the sky projected onto the unit square. Here's the output with G = 8:

% a.out </u/cs126/data/stars
   1  12  50  86 122 166  23   0
   4  45  76  80 130 143 100   9
  29  64  74  92 117 114  72  43
  65  65  87  94 137 122  73  46
  52  79  85 111 149 110  65  55
  45  72  82 149 166  96  54  39
  16  62 107 162 111  67  67  10
   0  14  54 102  79  50   7   0

Do not submit gridcount.c. The purpose of this warm up is to familiarize yourelf with the problem and with the grid data structure to prepare for solving the more difficult problem described below. Run your program for 100 random points, 1000 random points and the star data, and try changing G. Note that G is built into the program; to change it, you have to change the #define line and recompile your program.

Part 2: Counting with Linked Lists

The first refinement to the program above is to replace the counts by linked lists of points. That is, in gridlistcount.c, grid[i][j] points to a linked list of the points that land in the i,j square. The number of points in each square can be computed by counting the number of nodes in each linked list. Here's an outline of gridlistcount.c; bold type identifies those lines that are different from the corresponding lines in gridcount.c.

#include <stdio.h>
#include "misc.h"

#define G 8

struct node {
	float x, y;
	struct node *next;
} *grid[G][G];

void gridinit(void) { /* initialize grid */ }

void gridinsert(float x, float y) { /* add a node for x,y to grid[i][j] */ }

void gridcount(void) { /* print the grid */ }

int main(void) {
	float x, y;

	gridinit();
	while (scanf("%f %f\n", &x, &y) != EOF) {
		gridinsert(x, y);
	}
	gridcount();
	return 0;
}

The linked lists are built from node structures; the next field points to the next node on a list. As each point is read, gridinsert creates a node for the point and adds it to appropriate linked list. gridcount traverses grid to count the number of points in each square, printing out the results in a G ×G array as shown above. The output of gridlistcount should be identical to the output of gridcount.

You'll need to allocate struct nodes for each point in gridinsert. Use emalloc described in lectures 14 and 15. Include misc.h as shown in the outline above, and compile your program with the command

% lcc -n -I/u/cs126/include gridlistcount.c /u/cs126/lib/libmisc.a

/u/cs126/lib/libmisc.a is the library that holds emalloc. See the lcc man page for information about the -n option; you can appreciate the value of this option only by attempting to debug without it.

Part 3: Closest Pairs

The third and last refinement is the program gridclose.c, which finds the closest pairs and plots the points. That is, for each poin P, find the point that is closest to P and draw a line between the two points. As in gridlistcount.c, build a linked list of all the points that fall into each square in a G×G grid. Then, for each point, look in the lists associated with the neighboring squares for the closed point. For large point sets, this approach makes the difference between being able to solve the problem and not being able to afford to wait for the answer. Here's a outline of gridclose.c; as above, bold type identifies those lines that are different from the corresponding lines in gridlistcount.c.

#include <stdio.h>
#include "misc.h"
#include "print.h"

#define G 20

struct point { float x, y; };
struct node {
	struct point pt;
	struct node *closest;
	struct node *next;
} *grid[G][G];

void gridinit(void) { /* initialize grid */ }

void gridinsert(float x, float y) { /* add a node for x,y to grid[i][j] */ }

void gridclosest(struct node *p) { /* find the node closest to p */ }

int main(void) {
	float x, y;

	printprolog(G);
	gridinit();
	while (scanf("%f %f\n", &x, &y) != EOF) {
		printpoint(x, y);
		gridinsert(x, y);
	}
	/* for every node t,
		set t->closest by calling gridclosest(t) */
	/* for every node t,
		call printpair(&t->pt, &u->pt) if t->closest = u and u->closest = t */ 
	printepilog();
	return 0;
}

You may copy this code as a starting point for your gridclose.c. Notice that the x,y coordinates are now packaged in point structures, node structures have a point field named pt, and node structures have new field, closest, which holds a pointer to another node.You'll be able to use your gridinit and gridinsert functions from gridlistcount.c with only a few modications.

The new function, gridclosest, sets the closest field in a node p to point to the node that holds the point closest to p. As suggested in the pseudocode comments above, you need to set the closest fields in main. Once all of the closest fields are set, make a pass through the nodes and call printpair(t, t->closest) for every node t for which t == t->closest->closest. You may ignore the possibility that a point is so isolated that there are no points in any nearby grid squares (for real data, one would presumably find that out by using gridcount).

The various print functions emit PostScript to draw points (printpoint) and lines between points (printpair). The source code for these functions is in /u/cs126/examples/print.h and /u/cs126/examples/print.c. You'll need to compile your program with the command line

% lcc -n -I/u/cs126/examples -I/u/cs126/include gridclose.c /u/cs126/examples/print.c /u/cs126/lib/libmisc.a

You can make this a bit less painful by using /u/cs126/bin/lcc instead of the default lcc; /u/cs126/bin/lcc automatically looks in /u/cs126/examples, /u/cs126/include and /u/cs126/lib for include files, source files, object files, and libraries, and it automatically searches /u/cs126/lib/libmisc.a when it builds an a.out. Thus, you can compile your program and /u/cs126/examples/print.c with the command

% /u/cs126/bin/lcc -n gridclose.c print.c

If you add /u/cs126/bin to your path, you can omit /u/cs126/bin/ in the command above.

Try your program on 100 random points and on the star data. Redirect the output of your program in a file and use gs, the Ghostscript previewer, to view the generated Postscript. Use commands like

% /u/cs126/bin/randpoints 100 | a.out >100.ps
% a.out </u/cs126/data/stars >stars.ps

Follow the links above to see the expected output. 100.ps should be 174 lines and stars.ps should be 6062 lines! Of course, you should test your program first on very small point sets.

Submitting Your Programs

Turn in your code and your documentation with the command

/u/cs126/bin/submit 8 readme gridlistcount.c gridclose.c

Do not turn in gridcount.c.

Extra Credit

Make your program work in the presence of isolated points.


Copyright © 1996 David R. Hanson / drh@cs.princeton.edu
$Revision: 1.2 $ $Date: 1996/11/08 16:51:11 $