Princeton University
COS 217: Introduction to Programming Systems

Assignment 3: A Symbol Table ADT


Purpose

The purpose of this assignment is to help you learn/review (1) C dynamic memory management, and (2) how to create abstract data types (ADTs) in C. It also will give you the opportunity to gain more experience with the GNU/Linux programming tools, especially Bash, Emacs, and GDB.


Background

A symbol table is an unordered collection of bindings. A binding consists of a key and a value. A key is a string that uniquely identifies its binding; a value is data that is somehow pertinent to its key. A symbol table allows its client to insert (put) new bindings, to retrieve (get) the values of bindings with specified keys, and to remove bindings with specified keys. Symbol tables are used often in programming systems; compilers, assemblers, and execution profilers use them extensively.

There are several reasonable ways to implement a symbol table. A simple implementation might store the bindings in a linked list. Linked lists are described in Section 2.7 of The Practice of Programming (Kernighan and Pike) and Section 17.5 of C Programming: A Modern Approach (King). A more efficient implementation might use a hash table. Hash tables are described in Section 2.9 of The Practice of Programming (Kernighan & Pike) and Chapter 14 of Algorithms in C, Parts 1-4 (Sedgewick).


Your Task

Your task in this assignment is to create an ADT named SymTable. Each SymTable object should be a symbol table. A SymTable object should be generic. That is, a SymTable object should contain values which are void pointers, and thus can point to data of any type.

You should create two implementations of your SymTable ADT: one that uses a linked list and another that uses a hash table.


The SymTable Interface

Store your SymTable interface in a file named symtable.h. It should contain these function declarations:

SymTable_T SymTable_new(void);

void SymTable_free(SymTable_T oSymTable);

int SymTable_getLength(SymTable_T oSymTable);

void *SymTable_put(SymTable_T oSymTable, const char *pcKey, const void *pvValue);

void *SymTable_remove(SymTable_T oSymTable, const char *pcKey);

int SymTable_contains(SymTable_T oSymTable, const char *pcKey);

void *SymTable_get(SymTable_T oSymTable, const char *pcKey);

void SymTable_map(SymTable_T oSymTable,
   void (*pfApply)(const char *pcKey, void *pvValue, void *pvExtra),
   const void *pvExtra);

SymTable_new() should return a new SymTable object that contains no bindings.

SymTable_free() should free all memory occupied by oSymTable. If oSymTable is NULL, then the function should do nothing.

SymTable_getLength() should return the number of bindings in oSymTable. It should be a checked runtime error for oSymTable to be NULL.

If oSymTable does not contain a binding with key pcKey, then SymTable_put() should add a new binding to oSymTable consisting of key pcKey and value pvValue, and return NULL. Otherwise the function should replace the existing binding's value with pvValue, and return the old value. It should be a checked runtime error for oSymTable or pcKey to be NULL.

If oSymTable contains a binding with key pcKey, then SymTable_remove() should remove that binding from oSymTable and return the binding's value. Otherwise the function should not change oSymTable, and should return NULL. It should be a checked runtime error for oSymTable or pcKey to be NULL.

SymTable_contains() should return 1 (TRUE) if oSymTable contains a binding whose key is pcKey, and 0 (FALSE) otherwise. It should be a checked runtime error for oSymTable or pcKey to be NULL.

SymTable_get() should return the value of the binding within oSymTable whose key is pcKey, or NULL if no such binding exists. It should be a checked runtime error for oSymTable or pcKey to be NULL.

SymTable_map() should apply function *pfApply to each binding in oSymTable, passing pvExtra as an extra parameter. It should be a checked runtime error for oSymTable or pfApply to be NULL.

A SymTable object should "own" its keys. That is, a SymTable object is responsible for allocating and freeing the memory in which its keys reside. Specifically, the SymTable_put() function should not simply store the value of pcKey within the binding that it creates. Rather, the SymTable_put() function should make a copy of the string pointed to by pcKey, and store the address of that copy within the new binding. You will find the standard C functions strlen(), malloc(), and strcpy() useful for making the copy. A SymTable object also should free the memory in which its keys reside when that memory is no longer required.

Conversely, a SymTable should not own its values. Indeed it cannot own its values; since it cannot determine the types of its values, it cannot create copies of them.


The SymTable Linked List Implementation

Your SymTable linked list implementation should:


The SymTable Hash Table Implementation

Your SymTable hash table implementation should:

enum {HASH_MULTIPLIER = 65599};
...
static int SymTable_hash(const char *pcKey, int iBucketCount)

/* Return a hash code for pcKey that is between 0 and iBucketCount-1,
   inclusive.  Adapted from the COS 217 lecture notes. */

{
   int i;
   unsigned int uiHash = 0U;
   for (i = 0; pcKey[i] != '\0'; i++)
      uiHash = uiHash * (unsigned int)HASH_MULTIPLIER
               + (unsigned int)pcKey[i];
   return (int)(uiHash % (unsigned int)iBucketCount);
}

Implementing those features is worth 94% of the assignment. To receive the remaining 6%, your SymTable hash table implementation also should:

You will receive a better grade if you submit a working non-expanding hash table implementation instead of a non-working expanding hash table implementation. If your attempts to develop the expansion code fail, then leave the expansion code in your symtablehash.c file as comments, and describe your attempts in your readme file.


Logistics

Develop on hats using Emacs to create source code, gcc217 to build, and GDB to debug.

A client program named testsymtable.c is available in the /u/cos217/Assignment3 directory on hats. That program requires you to provide a single command-line argument, which should be an integer that specifies a binding count. The program tests your SymTable ADT by manipulating several SymTable objects. One of those SymTable objects contains the specified number of bindings. The program prints to stdout an indication of how much CPU time it consumed while manipulating that SymTable object. You should use testsymtable.c to test your SymTable ADT. You should create additional test programs, as you deem necessary, to test your SymTable ADT.

Create a "readme" text file that contains:

Submit your work electronically on hats via the command:

submit 3 symtable.h symtablelist.c symtablehash.c readme

Extra Credit (up to 10 points)

Create a program named concord. The concord program should be a multi-file program consisting of a file named concord.c and the files that comprise your SymTable module: symtable.h and either symtablelist.c or symtablehash.c. That is, your concord.c file should define a client of your SymTable module.

The program should read text from stdin. It may assume that no line of stdin contains more than 1023 characters. The program should write a concordance to stdout. The concordance should relate each "word" of stdin to the number of times it occurs in stdin.

The program should consider a "word" to be a sequence of contiguous alphabetic characters. The program should ignore differences in case; for example, the program should consider the words "Hello", "HELLO", and "hello" to be equivalent.

The program's output should be of the form:

word: count
anotherword: count
...

That is, each line written to stdout should consist of a word, a colon, a space, and an integer indicating how many times that word occurs in stdin. The words should consist of only lowercase letters. The words need not be in any particular order.

The /u/cos217/Assignment3 directory on hats contains an executable binary file named sampleconcord. Your concord program should produce exactly the same output as sampleconcord, except perhaps for the order of the words written to stdout. In other words, these two pipelines:

sampleconcord < somefile | sort
concord < somefile | sort

should cause the sort command to write exactly the same output to stdout.

The program should contain no memory leaks. For each call of malloc() or calloc() in your program, eventually there should be exactly one call of free().

Hint: You may find the standard C strtok() function to be useful. Pages 620-622 of our King book describe that function. For the concord program it is sufficient to call strtok() using this string to define token delimeters:

" \t\n`1234567890-=~!@#$%^&*()_+[]{}\\|;:'\",<.>/?"

Submit your work electronically on hats via the command:

submit 3 concord.c

Grading

We will grade your work on quality from the user's point of view and from the programmer's point of view. To encourage good coding practices, we will deduct points if gcc217 generates warning messages.

From the user's point of view, your module has quality if it behaves as it should. The correct behavior of the SymTable ADT is defined by the previous sections of this assignment specification.

In part, style is defined by the rules given in The Practice of Programming (Kernighan and Pike), as summarized by the Rules of Programming Style document. These additional rules apply:

Names: You should use a clear and consistent style for variable and function names. One example of such a style is to prefix each variable name with characters that indicate its type. For example, the prefix "c" might indicate that the variable is of type char, "i" might indicate int, "pc" might mean pointer to char, "ui" might mean unsigned int, etc. But it is fine to use another style -- a style which does not include the type of a variable in its name -- as long as the result is a readable program.

Line lengths: Limit line lengths in your source code to 72 characters. Doing so allows us to print your work in two columns, thus saving paper.

Comments: Each source code file should begin with a comment that includes your name, the number of the assignment, and the name of the file.

Comments: Each function should begin with a comment that describes what the function does from the caller's point of view. The function comment should:

Comments: Each structure type definition and each structure field definition should have a comment that describes it.

Comments: The interface of an ADT should contain a comment that describes what an object of that type is. It would be reasonable to place that comment adjacent to the definition of the opaque pointer.