Princeton University
COS 217: Introduction to Programming Systems

Assignment 3: A Symbol Table ADT

Purpose

The purpose of this assignment is to help you learn C dynamic memory management, and how to create abstract data types (ADTs) in C. It also will give you the opportunity to gain more experience with the GNU/Linux programming tools, especially bash, emacs, and gdb.

Rules

Implementing hash table expansion (as described below) is the "extra challenge" part of this assignment. While doing the "extra challenge" part of the assignment, you are bound to observe the course policies regarding assignment conduct as given in the course Policies web page, plus one additional policy: you may not use any "human" sources of information. That is, you may not consult with the course's staff members, the lab teaching assistants, other current students via Piazza, or any other people while working on the "extra challenge" part of an assignment, except for clarification of requirements.

The extra challenge part is worth 10 percent of this assignment. So if you don't do any of the "extra challenge" part and all other parts of your assignment solution are perfect and submitted on time, then your grade for the assignment will be 90 percent.

Background

A symbol table is an unordered collection of bindings. A binding consists of a key and a value. A key is a string that uniquely identifies its binding; a value is data that is somehow pertinent to its key. A symbol table allows its client to insert (put) new bindings, to retrieve (get) the values of bindings with specified keys, and to remove bindings with specified keys. Symbol tables are used often in programming systems; compilers, assemblers, and execution profilers use them extensively.

There are several reasonable ways to implement a symbol table. A simple implementation might store the bindings in a linked list. Linked lists are described in Section 2.7 of The Practice of Programming (Kernighan and Pike) and Section 17.5 of C Programming: A Modern Approach (King). A more efficient implementation might use a hash table. Hash tables are described in Section 2.9 of The Practice of Programming (Kernighan & Pike).

Your Task

Your task in this assignment is to create an ADT named SymTable. Each SymTable object should be a symbol table. A SymTable object should be generic. That is, a SymTable object should contain values which are void pointers, and thus can point to data of any type.

You should create two implementations of your SymTable ADT: one that uses a linked list and another that uses a hash table.

The `SymTable` Interface

Store your SymTable interface in a file named symtable.h. It should contain these function declarations:

SymTable_T SymTable_new(void);

void SymTable_free(SymTable_T oSymTable);

int SymTable_getLength(SymTable_T oSymTable);

int SymTable_put(SymTable_T oSymTable, const char *pcKey, const void *pvValue);

void *SymTable_replace(SymTable_T oSymTable, const char *pcKey,
   const void *pvValue);

int SymTable_contains(SymTable_T oSymTable, const char *pcKey);

void *SymTable_get(SymTable_T oSymTable, const char *pcKey);

void *SymTable_remove(SymTable_T oSymTable, const char *pcKey);

void SymTable_map(SymTable_T oSymTable,
   void (*pfApply)(const char *pcKey, void *pvValue, void *pvExtra),
   const void *pvExtra);

SymTable_new should return a new SymTable object that contains no bindings, or NULL if insufficient memory is available.

SymTable_free should free all memory occupied by oSymTable.

SymTable_getLength should return the number of bindings in oSymTable.

If oSymTable does not contain a binding with key pcKey, then SymTable_put should add a new binding to oSymTable consisting of key pcKey and value pvValue and return 1 (TRUE). Otherwise the function should leave oSymTable unchanged and return 0 (FALSE). If insufficient memory is available, then the function should leave oSymTable unchanged and return 0 (FALSE).

If oSymTable contains a binding with key pcKey, then SymTable_replace should replace the binding's value with pvValue and return the old value. Otherwise it should leave oSymTable unchanged and return NULL.

SymTable_contains should return 1 (TRUE) if oSymTable contains a binding whose key is pcKey, and 0 (FALSE) otherwise.

SymTable_get should return the value of the binding within oSymTable whose key is pcKey, or NULL if no such binding exists.

If oSymTable contains a binding with key pcKey, then SymTable_remove should remove that binding from oSymTable and return the binding's value. Otherwise the function should not change oSymTable and return NULL.

SymTable_map should apply function *pfApply to each binding in oSymTable, passing pvExtra as an extra parameter. That is,the function should call (*pfApply)(pcKey, pvValue, pvExtra) for each pcKey/pvValue binding in oSymTable.

A SymTable object is responsible for allocating the memory in which its keys reside. Specifically, SymTable_put should not simply store the value of pcKey within the binding that it creates. Instead SymTable_put should make a defensive copy of the string to which pcKey points, and store the address of that copy within the new binding. You will find the standard C functions strlen, malloc, and strcpy useful for making the copy. Thereafter a SymTable object should own its keys. That is, a SymTable object should free the memory in which its keys reside when that memory is no longer required.

Conversely, a SymTable object is not responsible for allocating the memory in which its values reside. Specifically, SymTable_put should simply store the value of pvValue within the binding that it creates. SymTable_put should not make a defensive copy of the object to which pvValue points. In fact SymTable_put cannot make a copy of the object pointed to by pvValue; since SymTable_put cannot determine the type of that object, SymTable_put cannot make a copy of that object. Thus a SymTable object should not own its values; it should not free the memory in which its values reside. (That memory might not even be in the heap! Freeing it could be a disaster!)

The `SymTable` Linked List Implementation

Your SymTable linked list implementation should:

Reside in a file named symtablelist.c.
Validate function parameters by calling the standard assert macro.
Contain no memory leaks. Your SymTable ADT will use dynamically allocated memory. It should explicitly free all dynamically allocated memory when that memory is no longer required. For each chunk of dynamically allocated memory there should be exactly one call of free.
Avoid gross inefficiencies. In particular, it would be grossly inefficient to traverse the list multiple times when a single time would suffice.
Define the SymTable_getLength function so it runs in constant time.

The `SymTable` Hash Table Implementation

Your SymTable hash table implementation should:

Reside in a file named symtablehash.c.
Contain 509 buckets initially.
Use a reasonable hash function. You are welcome to use this one.

/* Return a hash code for pcKey that is between 0 and iBucketCount-1,
   inclusive.  Adapted from the COS 217 lecture notes. */

static int SymTable_hash(const char *pcKey, int iBucketCount)
{
   enum {HASH_MULTIPLIER = 65599};
   int i;
   unsigned int uiHash = 0U;
   assert(pcKey != NULL);
   for (i = 0; pcKey[i] != '\0'; i++)
      uiHash = uiHash * (unsigned int)HASH_MULTIPLIER
               + (unsigned int)pcKey[i];
   return (int)(uiHash % (unsigned int)iBucketCount);
}

Validate function parameters by calling the standard assert macro.
Contain no memory leaks. Your SymTable ADT will use dynamically allocated memory. It should explicitly free all dynamically allocated memory when that memory is no longer required. For each chunk of dynamically allocated memory there should be exactly one call of free.
Avoid gross inefficiencies. In particular, it would be grossly inefficient to hash the given key multiple times when a single time would suffice.
Define the SymTable_getLength function so it runs in constant time.
Expand. That is, for efficiency your implementation should dynamically increase the number of buckets (and, necessarily, reposition all bindings) whenever a call of SymTable_put causes the number of bindings to become too large. If an expansion attempt fails because of insufficient memory, then your implementation simply should proceed with the execution of SymTable_put.

Concerning hash table expansion:

Use this sequence of integers as bucket counts: 509, 1021, 2039, 4093, 8191, 16381, 32749, and 65521. (Those integers are primes that are close to powers of two.) That is, when SymTable_put detects that the new binding count exceeds 509, it should increase the SymTable object's bucket count to 1021. When the function detects that the new binding count exceeds 1021, it should increase the bucket count to 2039. Etc. When SymTable_put detects that the new binding count exceeds 65521, it should not increase the bucket count. Thus 65521 is the maximum number of buckets that a SymTable object should contain.
You will receive a better grade if you submit a working non-expanding hash table implementation instead of a non-working expanding hash table implementation. If your attempts to develop the expansion code fail, then leave the expansion code in your symtablehash.c file as comments, and describe your attempts in the "what bugs are in your submission" section of your readme file.

Logistics

Develop on nobel using emacs to create source code, gdb to debug, and meminfo to check for dynamic memory management errors.

Critique your code using the splint tool. Each time splint generates a warning on your code, you should either (1) edit your code to eliminate the warning, or (2) explain your disagreement with the warning in your readme file.

Similarly, critique your code using the critTer tool. Each time critTer generates a warning on your code, you should either (1) edit your code to eliminate the warning, or (2) explain your disagreement with the warning in your readme file.

A client module defined in testsymtable.c is available in the /u/cos217/Assignment3 directory on nobel. That module requires you to provide a single command-line argument, which should be an integer that specifies a binding count. The module tests your SymTable ADT by manipulating several SymTable objects. One of those SymTable objects contains the specified number of bindings. The module writes to the standard output stream an indication of how much CPU time it consumed while manipulating that SymTable object. Use testsymtable.c to test your SymTable ADT. Create additional test programs, as you deem necessary, to test your SymTable ADT.

Create a readme file by copying the file /u/cos217/Assignment3/readme to your project directory, and editing the copy by replacing each area marked "<Complete this section.>" as appropriate.

In particular, place in your readme file the CPU times reported by the testsymtable.c client module with binding counts 100, 1000, 10000, 100000, and 1000000 using (1) your linked list implementation, (2) your non-expanding hash table implementation, and (3) your expanding hash table implementation. You can create a non-expanding hash table implementation by temporarily commenting-out your expansion code; don't forget to "comment in" that code before submitting your work. If the CPU time consumed is more than 5 minutes, then the testsymtable.c client module will abort execution. In that case you should write "More than 5 minutes."

Submit your work electronically on nobel using this command:

submit 3 symtable.h symtablelist.c symtablehash.c readme

Grading

We will grade your work on quality from the user's and programmer's points of view. From the user's point of view, your module has quality if it behaves as it should. The correct behavior of the SymTable ADT is defined by the previous sections of this assignment specification.

From the programmer's point of view, a program has quality if it is well styled and thereby easy to maintain. In part, good style is defined by the splint and critTer tools, and by the rules given in The Practice of Programming (Kernighan and Pike) as summarized by the Rules of Programming Style document.

The more course-specific style rules listed in the previous assignment specifications also apply, as do these:

Modularity: A module's interface should not reveal the module's data. Data should be encapsulated with functions. When defining an ADT, use opaque pointer type definitions to hide structure type definitions from the ADT's clients.
ADT Comments: The interface (.h) of an ADT should contain a comment that describes what an object of that type is. The comment should appear immediately before the definition of the opaque pointer type.
Structure Type Definition Comments: Compose a comment for each structure type definition. A structure type definition's comment should immediately precede the structure type definition.
Field Definition Comments: Compose a comment for each field definition that occurs within a structure type definition. A field definition's comment should immediately precede the field definition.

To encourage good coding practices, we will deduct points if gcc217 generates warning messages.

Note: `splint` Warnings on `testsymtable.c`

When given the testsymtable.c file, splint generates warning messages about the use of the sprintf function, and suggests using the snprintf function instead.

splint complains because sprintf does not check the length of the array (alias buffer) into which it assigns characters, and so could cause a buffer overflow if the given array is too short. It suggests using snprintf because that function allows the caller to specify the length of the array.

Don't be concerned about those warning messages. You need not copy them to your readme file or explain them. Contrary to splint's suggestion, testsymtable.c uses sprintf instead of snprintf because:

In testsymtable.c the number of characters being assigned is known, and so the array certainly is not too short.
snprintf is not part of the C90 standard, and so is not declared in stdio.h. So gcc217 complains about the use of snprintf. A splint warning is less important than a gcc217 warning.

Note: `critTer` Warnings on `testsymtable.c`

When given the testsymtable.c file, critTer generates warning messages about the use of "magic numbers," the number of statments in some functions, the number of function definitions in the file, and the number of lines in the file. Do not be concerned about those warning messages. You need not copy those warning messages to your readme file or explain them. We judged that using magic numbers and defining long functions in testsymtable.c was clearer than the alternative. And splitting testsymtable.c into multiple files would have complicated the build process unnecessarily.

This assignment was created by Robert M. Dondero, Jr.
with input from other faculty members

Princeton University COS 217: Introduction to Programming Systems