### HASH TABLES

Brute force approach. All data is just a sequence of bits. Can treat the key as a gigantic number and use it as an array index. Requires an amount of memory that is exponential in the number of bits in the key.

Hashing. Instead of using the entire key, represent entire key by a smaller value. In Java, we hash objects with a hashCode() method that returns an integer (32 bit) representation of the object.

hashCode() to index conversion. To use hashCode() results as an index, we must convert the hashCode() to a valid index. Modulus alone does not work since hashCode() may be negative, and in Java the % operator returns a negative result for a negative left operand. Taking the absolute value then the modulus also doesn't work since Math.abs(Integer.MIN_VALUE) is negative (it overflows and returns Integer.MIN_VALUE). We use hashCode() & 0x7FFFFFFF instead before taking the modulus.

Hash function. Converts a key to a value between 0 and M-1. In Java, this means calling hashCode(), setting the sign bit to 0, then taking the modulus.
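The conversion described above can be sketched in a few lines of Java. The class name `HashIndex` and the method name `indexFor` are made up for illustration; only `hashCode()`, `Math.abs`, and the `& 0x7FFFFFFF` mask come from the notes.

```java
class HashIndex {
    // Maps any key to an index in [0, m): clear the sign bit, then take the modulus.
    // (indexFor is a hypothetical helper name, not part of any library.)
    static int indexFor(Object key, int m) {
        return (key.hashCode() & 0x7FFFFFFF) % m;
    }

    public static void main(String[] args) {
        int m = 5;
        Integer negative = -7; // Integer.hashCode() returns the int value itself

        System.out.println(negative.hashCode() % m);        // -2: not a valid index
        System.out.println(Math.abs(Integer.MIN_VALUE));    // -2147483648: still negative
        System.out.println(indexFor(negative, m));          // 1
        System.out.println(indexFor(Integer.MIN_VALUE, m)); // 0
    }
}
```

Note that masking does not map -7 to the same bin as 7; it only guarantees a valid index, which is all a hash function needs.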

Designing good hash functions. Requires a blending of sophisticated mathematics and clever engineering; beyond the scope of this course. Most important guideline is to use all the bits in the key. If hashCode() is known and easy to invert, adversary can design a sequence of inputs that result in everything being placed in one bin. Or if hashCode() is just plain bad, same thing can happen.
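As a rough illustration of "use all the bits" and of how a known hash function can be attacked, here is a sketch of a string hash built with Horner's method and multiplier 31 (the factor mentioned under the A level problems). The class name `StringHashSketch` is hypothetical. For strings, this formula matches what Java documents for `String.hashCode()`.

```java
class StringHashSketch {
    // Horner's method: every character of the key contributes to the result.
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // 'A'*31 + 'a' = 65*31 + 97 = 2112 and 'B'*31 + 'B' = 66*31 + 66 = 2112.
        System.out.println(hash("Aa")); // 2112
        System.out.println(hash("BB")); // 2112

        // Colliding blocks can be concatenated to build many more collisions.
        System.out.println(hash("AaBB") == hash("BBAa")); // true
    }
}
```

Because "Aa" and "BB" collide, any strings of equal length built by concatenating those two blocks collide as well, which is one way an adversary could push many keys into a single bin.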

Uniform hashing assumption. For our analyses below, we assume that our hash function distributes all input data evenly across bins. This is a strong assumption and never exactly satisfied in practice.

Collision resolution. Two philosophies for resolving collisions were discussed in class: separate chaining and open addressing. We didn't use the term open addressing, but it refers to schemes that use empty array entries to handle collisions, e.g. linear probing.

Separate-chaining hash table. Key-value pairs are stored in an array of M linked lists of nodes. Hash function tells us which of these linked lists to use. Get and insert both require potentially scanning through the entire list for that index.

Resizing separate-chaining hash tables. Understand how resizing may lead to objects moving from one linked list to another, since the index of a key depends on M. Primary goal is to keep M proportional to N, so that the average list length N / M stays bounded by a constant.
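A minimal sketch of a separate-chaining table with resizing, to make the mechanics concrete. The class name `ChainingSketch`, the doubling policy, and the threshold N > 2M are assumptions chosen for illustration, not necessarily what was used in class. Delete is omitted.

```java
class ChainingSketch<K, V> {
    private static class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;
        Node(K key, V value, Node<K, V> next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    private Node<K, V>[] bins; // bins[i] is the head of the i-th linked list
    private int n;             // number of key-value pairs stored

    @SuppressWarnings("unchecked")
    ChainingSketch(int m) {
        bins = (Node<K, V>[]) new Node[m];
    }

    private int index(K key, int m) {
        return (key.hashCode() & 0x7FFFFFFF) % m;
    }

    V get(K key) {
        for (Node<K, V> x = bins[index(key, bins.length)]; x != null; x = x.next) {
            if (x.key.equals(key)) return x.value;
        }
        return null; // not found
    }

    void put(K key, V value) {
        int i = index(key, bins.length);
        for (Node<K, V> x = bins[i]; x != null; x = x.next) {
            if (x.key.equals(key)) { x.value = value; return; }
        }
        bins[i] = new Node<>(key, value, bins[i]);
        n++;
        // Assumed policy: double M whenever the average list length exceeds 2.
        if (n > 2 * bins.length) resize(2 * bins.length);
    }

    @SuppressWarnings("unchecked")
    private void resize(int newM) {
        Node<K, V>[] old = bins;
        bins = (Node<K, V>[]) new Node[newM];
        for (Node<K, V> head : old) {
            for (Node<K, V> x = head; x != null; x = x.next) {
                // The index depends on M, so a key may land in a different list.
                int i = index(x.key, newM);
                bins[i] = new Node<>(x.key, x.value, bins[i]);
            }
        }
    }

    int size() { return n; }
    int numBins() { return bins.length; }
}
```

With this policy, N / M never exceeds 2 after an insert completes, which is the sense in which M stays proportional to N.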

Performance of separate-chaining hash tables. Cost of a given get, insert, or delete is given by number of entries in the linked list that must be examined.

• The expected amortized search and insert time (under uniform hashing assumption) is proportional to N / M, which is no larger than some constant (due to resizing).
• We note that the expected length of the largest bin is Theta(log N / log log N) when M is proportional to N. This grows slowly, but is not quite constant. The derivation is far beyond the scope of the class.

Linear-probing hash tables. If the space that should be occupied by a key is already occupied by something else, try the space to the right. If that's occupied, go right more. Repeat. This philosophy works for get and insert.
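The probing rule above can be sketched as follows. This is a simplified illustration under stated assumptions: keys are strings, values are ints, the table is never resized and never allowed to fill up (a full table would make `put` loop forever), and "going right" wraps around to index 0 at the end of the array. The class name `LinearProbingSketch` is hypothetical.

```java
class LinearProbingSketch {
    private final String[] keys; // null marks an empty slot
    private final int[] values;

    LinearProbingSketch(int m) {
        keys = new String[m];
        values = new int[m];
    }

    private int index(String key) {
        return (key.hashCode() & 0x7FFFFFFF) % keys.length;
    }

    // Assumes at least one slot stays empty; otherwise this loop never ends.
    void put(String key, int value) {
        int i = index(key);
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) % keys.length; // occupied by another key: try the slot to the right
        }
        keys[i] = key;
        values[i] = value;
    }

    // Returns null if the key is absent; an empty slot ends the search.
    Integer get(String key) {
        for (int i = index(key); keys[i] != null; i = (i + 1) % keys.length) {
            if (keys[i].equals(key)) return values[i];
        }
        return null;
    }
}
```

For example, the strings "Aa" and "BB" both have hashCode() 2112, so in a table of size 8 both hash to index 0; whichever is inserted second ends up in slot 1, and `get` finds it by scanning right from slot 0.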

Performance of linear-probing hash tables. As before, performance determined by number of entries in the key array that must be examined.

• If N / M is a constant (bounded away from 1), then the expected amortized search and insert time (under uniform hashing assumption) is a constant. For N / M = 0.5, expected cost of a search hit is 3/2 and expected cost of a search miss is 5/2. If N / M approaches 1, the costs blow up; see Knuth's parking problem.
• The expected length of the longest cluster is Theta(log N). This is beyond the scope of the course.
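For reference, the 3/2 and 5/2 figures above are consistent with the classical approximations from Knuth's analysis of linear probing under the uniform hashing assumption. Writing the load factor as $\alpha = N/M$ (a symbol introduced here, not used in class), the expected number of probes is approximately:

$$
\text{search hit} \approx \frac{1}{2}\left(1 + \frac{1}{1-\alpha}\right),
\qquad
\text{search miss} \approx \frac{1}{2}\left(1 + \frac{1}{(1-\alpha)^2}\right)
$$

Plugging in $\alpha = 1/2$ gives $\frac{1}{2}(1 + 2) = \frac{3}{2}$ for a hit and $\frac{1}{2}(1 + 4) = \frac{5}{2}$ for a miss. As $\alpha \to 1$ both expressions grow without bound, which is the blow-up mentioned above.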

### Recommended Problems

#### C level

1. Textbook 3.4.5
2. Consider a symbol table that uses strings containing only the digits 0-9 as keys, and uses single characters as values. Suppose that the hashCode() of these strings is given simply by the sum of their digits, e.g. the hashCode() of "342" is 3+4+2=9.
    1. Given a hash table of initial size 5, convert each hashCode() into an index using the modulus operator. Fill in the index column of the table below. The first two indices have been filled in for you.

        | Key | Value | hashCode() | Index |
        |-----|-------|------------|-------|
        | 13  | A     | 4          | 4     |
        | 15  | B     | 6          | 1     |
        | 2   | C     | 2          |       |
        | 34  | D     | 7          |       |
        | 16  | E     | 7          |       |
        | 100 | F     | 1          |       |

    2. Draw the table that results if the six keys above are inserted into the symbol table and we use separate chaining to resolve collisions. You may assume that the table does NOT resize during these insertions.

#### B level

1. (Continued from above) Suppose we now insert the key-value pair ("81", G), and that this insert results in resizing the table to size 10. What is the size of the longest linked list after insertion is complete?
2. Textbook 3.4.13, 3.4.14
3. Textbook 3.4.15
4. For the symbol table applications below, pick the best symbol table implementation from this list (A. Standard BST, B. Red-black BST, C. Hash table, D. Ordered Array, E. Unordered Array, F. Heap).
    1. ----- Lookup table for computing sin(theta), where theta is one of 1,000,000 possible angles spaced evenly between 0 and π.
    2. ----- Database that maps sound data (from a file) to artist name.
    3. ----- Fastest guaranteed insert, delete and search for an arbitrary numerical data set.
5. Fall 2014 Midterm, #5

#### A level

1. Textbook 3.4.23, here R is the multiplicative factor (we used 31 in class).
2. Fall 2012 Midterm, #7