Princeton University
COS 217: Introduction to Programming Systems

Assignment 3: A Hash Table ADT

Purpose

The purpose of this assignment is to help you learn about composition of ADTs and software performance analysis. It will also give you more experience with the GNU/UNIX programming tools, especially gprof.

Background

A hash table is a collection of buckets. Each bucket is a collection of bindings. Each binding consists of a key and a value. A key uniquely identifies its binding; a value is data that is somehow pertinent to its key. The mapping of bindings to buckets is determined by the hash table's hash function. A hash function accepts a binding's key and bucket count, and returns the number of the bucket in which that binding should reside.

A collision occurs when more than one binding is mapped to the same bucket. A good hash function distributes bindings fairly uniformly across the hash table's buckets, and thus minimizes collisions. A bad hash function is one that generates many collisions. With a good hash function, the retrieve/insert/delete performance of a hash table is very fast.

Chapter 8 of the Hanson textbook describes the interface and implementation of a hash table ADT, and also shows an example program that uses it.

Note that a hash table essentially has the same functionality as a set, as defined in Assignment 2: both allow insertion, deletion, and retrieval of bindings.

Your Tasks

The assignment consists of two parts. In Part 1 your task is to create a Table ADT. A Table should be implemented as a hash table in which each bucket is a Set (as defined in Assignment 2). Thus each Table is composed of Sets; effectively, the Table ADT is a user/client of the Set ADT. You should also create an associated TableIter ADT to allow iteration across a Table's items. You will find it necessary to implement your TableIter ADT so it uses the SetIter ADT from Assignment 2.

In principle, the user/client/customer of the Table ADT need not be concerned about its implementation; the user/client/customer is concerned only with the Table's interface. However in practice the user/client/customer of the Table ADT must choose a Set implementation (either the one that uses arrays or the one that uses a tree) and a bucket count. In Part 2 of the assignment your job is to analyze the efficiency of the Table ADT under various conditions, and thus provide guidance to the user/client/customer about which implementation and bucket count to choose.

Part 1: The Table ADT

The Table Interface

The Table interface should contain these function declarations. Note that, except for the Table_new function, they are essentially identical to those of the Set ADT from Assignment 2. One additional difference: whereas the Set_map is required to iterate over its Set's bindings in order (as determined by the compare function), the Table_map function need not iterate over its Table's bindings in any particular order.

Table_T     Table_new(int iBucketCount, 
               int (*pfCompare)(const void *pvKey1, const void *pvKey2),
               int (*pfHash)(const void *pvKey, int iBucketCount));
void        Table_free(Table_T oTable);
void        Table_clear(Table_T oTable);
int         Table_getLength(Table_T oTable); 
int         Table_put(Table_T oTable, const void *pvKey, void *pvValue);
int         Table_remove(Table_T oTable, const void *pvKey);
const void *Table_getKey(Table_T oTable, const void *pvKey);
void       *Table_getValue(Table_T oTable, const void *pvKey);
void        Table_map(Table_T oTable, 
               void (*pfApply)(const void *pvKey, void **ppvValue, void *pvExtra),
               void *pvExtra);

The Table_new function should return a new Table. The parameter iBucketCount is the number of buckets that the Table will contain. The parameter pfCompare is a pointer to a compare function that the Table will use to compare keys. The function *pfCompare should return a negative integer if there is a "less than" relationship between the two given keys, a positive integer if there is a "greater than" relationship between the two given keys, and zero if the two given keys are equal. The parameter pfHash is a pointer to a hash function that the Table will use to assign bindings to buckets. It should be a checked runtime error for iBucketCount to be less than 1, for pfCompare to be NULL, or for pfHash to be NULL..

The TableIter Interface

The TableIter interface is essentially identical to that of the SetIter ADT from Assignment 3. One additional difference: whereas a SetIter is required to iterate over its Set's bindings in order (as determined by the compare function), a TableIter need not iterate over its Table's bindings in any particular order.

TableIter_T TableIter_new(Table_T oTable); 
void        TableIter_free(TableIter_T oTableIter); 
int         TableIter_selectionIsValid(TableIter_T oTableIter);
void        TableIter_selectFirst(TableIter_T oTableIter);
void        TableIter_selectNext(TableIter_T oTableIter);
const void *TableIter_getSelectedKey(TableIter_T oTableIter);
void       *TableIter_getSelectedValue(TableIter_T oTableIter);

The Implementations

Of course you should implement your Table ADT as a hash table. Each bucket of a Table should be a Set, as specified in Assignment 2. In that manner a Table should be composed of Sets. You should use the given Set and SetIter ADTs (and not your own).

Part 2: The Performance Analysis

Suppose you have sold your Table et al ADTs to customers. Further suppose that your customers are requesting some guidance concerning efficient use of your ADTs.

Generic Analysis

Customer #1 has purchased your ADTs with the intention of using them in a variety of projects. The customer asks you for general guidelines about which ADT to use when. In particular, the customer asks for guidelines organized along four dimensions:

Table implementation: Table/Set/array, Table/Set/tree
Function: put, remove, get
Bucket count: 1, 100
Data size: 10 items, 100 items, 1000 items, 10000 items

(Note that when a Table's bucket count is 1, its performance is essentially the same as that of a single Set. Thus, effectively, the customer is asking for an analysis of four ADTs: Set/array, Set/tree, Table/Set/array, Table/Set/tree.)

Use the statistics produced by the given timetable program (described below) and your knowledge of the underlying ADT implementations to formulate a four sentence response to Customer #1. Your response should be of the form:

We recommend that you use the Set/array ADT when...
We recommend that you use the Set/tree ADT when...
We recommend that you use the Table/Set/array ADT when...
We recommend that you use the Table/Set/tree ADT when...

Place that response in your readme file.

Specific Analysis

Customer #2 has a more specific question. The customer will be using your Table/Set/tree ADT in an application that is very similar to timetable, where the bucket count is 100 and the number of items is 10000. In the hope of improving efficiency, the customer is considering rewriting the hash and compare functions in assembly language. The customer anticipates that the translation to assembly language will double the speed of those functions. More precisely, the time spent executing those functions and their descendants will be cut in half. The customer asks if the translation to assembly language is worth the effort. That is, how much will doubling the speed of the hash and compare functions impact the efficiency of the overall program?

Use the statistics produced by the gprof tool to formulate a one sentence response to Customer #2. Be as specific as possible about the performance gain. Place the response in your readme file.

Logistics

You should develop on arizona. Use Emacs to create source code. Use the "make" tool to automate the build process. Use gdb to debug. Use gprof to analyze performance..

The directory /u/cs217/Assignment3 contains files that pertain to the assignment:

set.h contains an interface for the Set ADT.
setarray.o and settree.o contain array and binary search tree (respectively) implementations of the Set ADT.
testtable.c contains a minimal test program that illustrates typical usage of the Table and TableIter ADTs. You can use it as the basis for developing your own programs to test your Table and TableIter ADTs.
timetable.c contains a program that you can use to further test your Table ADT, and to collect performance statistics.
stats.h and stats.c define an ADT that is used by the timetable program.
test10.txt, test100.txt, test1000.txt, and test10000.txt contain data that the timetable program can read.

You should create these files:

table.h should contain the interface for the Table and TableIter ADTs.
table.c should contain the implementation of the Table and TableIter ADTs.
readme should contain your name, and any information that will help us to grade your work in the most favorable light. In particular you should describe all known bugs in your readme file. It should also contain the performance analyses described above.

Submit your work electronically via the command /u/cs217/bin/submit 3 table.h table.c readme.

Grading

Your grade for the assignment will be based upon the quality of your Table and TableIter ADTs. As always, your ADTs will be graded on correctness and understandability/design. To encourage good coding practices, you will lose points for any warning messages generated by gcc during the compilation of your work. Your grade will also be based upon the quality of your performance analyses.

Princeton University COS 217: Introduction to Programming Systems