COS 126
Counting Words
Due: Wed. 4/2/97, 11:59pm

Write, in wf.c, a program that prints, in lexicographic order, the identifiers in the standard input and the number of times each one appears. For example:

% a.out </u/cs126/examples/bsearch.c | pr -3 -t
1       Quicksort       5       i               2       quicksort
1       Read            4       if              5       return
1       and             2       include         1       scanf
2       argc            1       input           1       sort
2       argv            10      int             1       sscanf
4       array           1       integers        1       standard
4       bsearch         1       is              1       stdio
1       char            4       k               1       them
5       d               4       lb              1       to
1       element         6       m               4       ub
4       else            1       main            1       up
1       for             8       n               1       using
1       found           1       not             1       while
1       from            2       printf          10      x
2       h               5       q

a.out emits just a single-column list, but this output has been folded into 3 columns with the pr command to save space. This output shows, for example, that there are 10 occurrences of "int" and "x" in /u/cs126/examples/bsearch.c. In this assignment, you'll use pointers and structures to hold the identifiers and their counts in a table, and you'll adapt the recursive binary search in bsearch.c to search this table as you read identifiers from the input. You'll also learn more about multi-file programs, because you'll use /u/cs126/examples/getword.c.

Your program is actually more general than suggested above, because an "identifier" is defined only by getword, and you could use a different implementation of getword to count a different set of "words". The implementation in getword.c defines a word as a C identifier. To acknowledge this generality, we'll use "word" below instead of "identifier". Here's the general form for your program and pseudocode for the main function; you can copy this skeleton from this page and edit it into your wf.c.

#include <stdio.h>
#include "getword.h"

struct word {
	char *str;
	int count;
};
struct word words[1000], *ptrs[1000];

/* Function to search for a word in ptrs[0...] */

/* Function to add a word to words and ptrs */

int main(void) {
	for each word in the input
		if word is in ptrs[0...] then
			increment its count field
		else
			add word to words and ptrs
			set its count field to 1
	for each element in ptrs[0...]
		print its count and str fields
}

The global variable words is an array of struct word structures; each structure holds a pointer to a word in its str field and the number of times that word occurs in its count field. The global variable ptrs is an array of pointers to struct word structures. The elements in words are in the order in which the words appear in input; the elements in ptrs are in lexicographic order of the words in the struct word structures to which they point. As words are read and added to words and ptrs, the elements in ptrs are rearranged so that ptrs[0...] are kept in sorted order. ptrs and words are related in the same way as are deck and cards shown on page 13-7 in lecture 13.

The search function suggested in the pseudocode above looks for a word by doing a binary search of ptrs. Copy the binary search function described in lecture 10, available in /u/cs126/examples/bsearch.c, and edit it as necessary to search an array of pointers to struct word structures. You can adapt bsearch by changing only a few lines.

The add function suggested in the pseudocode above adds a word to words and ptrs. Adding a word to words is easy: Keep track of the number of elements currently used, fill in the next element, and increment the number of elements used. A similar scheme can be used to add a pointer to ptrs, but after you initialize the last element of ptrs, you need to move that element to its proper location. You do not have to sort ptrs; just use exchanges to percolate the new element down to its proper location. (Technically, this percolation is one step of an insertion sort.)

int getword(char *word, int size), declared in /u/cs126/examples/getword.h and defined in /u/cs126/examples/getword.c, reads the next word from the standard input, stores it as a null-terminated string in word, and returns its length. When the input is exhausted, getword returns EOF. getword assumes that word can hold up to size characters including the null character.

As you read in the words, you'll need to store the new words some place permanent, because the str fields point to them. You can store them end-to-end in a large, say, 8000-element, character array. The first time you see a word, append it and its terminating null character to this array; the address of its first character in this array is thus a character pointer to the word.

The figure below depicts the values of words, ptrs, and the input array for the one-line input

int main(int argc, char *argv[]) {

Note: You should not useand you may not usethe C library functions malloc, calloc, realloc, or free in this program. You many assume there are no more than 1000 words and that they fit in 8000 bytes.

You don't have to copy getword.h and getword.c; just compile getword.c directly with your wf.c with the command

% lcc -I/u/cs126/examples wf.c /u/cs126/examples/getword.c

The -Idir option causes lcc to look in dir for header files. You can also compile getword.c once and load the resulting object code when you compile wf.c. To compile just getword.c, use the command

% lcc -c -I/u/cs126/examples /u/cs126/examples/getword.c

The -c option causes lcc to leave the object code for getword.c in getword.o in your directory. You can load getword.o when you compile wf.c:

% lcc -I/u/cs126/examples wf.c getword.o

This assignment isn't difficult, but it's full of pitfalls for the unwary, because pointer bugs are particularly difficult to find. A common error is to dereference null pointers, e.g., using an expression like p->count when p is NULL. lcc's -n option can help catch these errors; -n causes lcc to generate code that checks for null pointers and, if it trips across one, to issue a one-line diagnostic and terminate the program. See the lcc man page for more information.

Turn in your code and your documentation with the command

/u/cs126/bin/submit 7 readme wf.c

Extra Credit

Add a command-line option "-n" to your program that causes it to print the identifiers in decreasing order of occurrence instead of in alphabetical order. Use the sorting code from /u/cs126/examples/quicksort.c, or use the qsort function from the C library (see Dietel and Deitel, p. 879). To use qsort, you'll need to include stdlib.h, and you'll need to use a different name for bsearch, because there's a bsearch in stdlib.h, too.


Copyright © 1996 David R. Hanson / drh@cs.princeton.edu
$Revision: 1.2 $ $Date: 1996/11/04 16:06:55 $