For hashing, it is certainly possible to use a function that does not create collisions of words. (Recall that a collision occurs when two words have the same hash number as assigned by the hash function.) For example, we could write the ASCII for each word and interpret it as a non-negative binary number to get the conversion to a number. Then no two words would convert to the same number. What are the problems with using this method of converting a word to a number for use with an index? Hint: what is the range of possible numbers, even if we assume all words have less than 30 characters?
Let me illustrate this with an example. The word is "antidisestablishmentarianism." What will the index of this word be?
The hexadecimal equivalent of this word is 414e5449 44495345 53544142 4c495348 4d454e54 41524941 4e49534d, a 56-digit number in base 16. In decimal, that's a number somewhere in the ten thousand billion billion billion billion billion billion billions, give or take.
So, if you wanted to store this word in a hash table using this hash method, you'd need a table at least ten thousand billion billion ... well, at least that many table entries. This is a pretty large table. If each hash table entry held a single byte, you could hold about 640 million table entries on one CD-ROM, or the entire hash table on a tightly-packed cube of CDs about 6 million billion miles on a side (if there was that much matter in the universe, which there isn't.) This is the biggest problem with using this method: the universe is too small.
Note also that the grand majority of the hash entries wouldn't ever be used, since most hash values are not words when the numeric values are converted to characters. There are only so many millions of English words, so a great deal of this unbelievably huge table is a waste of memory.
They key here is seeing how we can turn a loop, in which we repeat the same procedure over and over, into recursive program, in which a procedure calls a simpler version of itself. Note that the algorithm divides one number (n) by another number (m) to get a remainder (r,) and if the remainder is not zero then we perform the procedure all over again, dividing one number by another to get a remainder, except swapping around the values of m, n and r before doing so. Think of this as a procedure performing another version of itself.
Here's a possible solution:
GCD( x, y ), where x<= y set r = the remainder when y is divided by x; if (r is equal to 0) The GCD is the value of x [[why? because x evenly divides y. Hence x is the GCD of x and y]] else The GCD is the value of GCD( r, x )
The procedure performs a division with remainder, and if the remainder isn't zero then it calls another version of itself (which performs a division with remainder, and if the remainder isn't zero it calls another version of itself, et cetera.)
Part b Give an argument to convince someone that the algorithm stops, i.e. explain why the while loop eventually stops.
We start by plugging in two values x and y, with x less than y. The third number, r, is guaranteed to be smaller than both x and y (why?) Now, the GCD function calls another version of the GCD function, but instead of plugging in x and y again, it plugs in two different values, the first of which is smaller than x and the second of which is smaller than y.
In other words, Each time the function is called, the numbers plugged into it are smaller than they were last time. These are positive whole numbers, so we can't decrease in value forever. The lowest we can get is x=1 and y=2, and 1 divides evenly into 2. The program may stop before that, but it will certainly have to stop eventually.
The reason the algorithm works is that, if y divided by x gives the remainder r, then the gcd( x, y) = gcd( r, x).
Proof: Suppose we want to know the gcd of x and y. Suppose r is the remainder when y is divided by x. In other words,
y = x*q + r, for some number q.
If a number divides into both x and y, then by the above equation it has to divide into r as well. Further, from the above equation, we can see that any number dividing x and r also divides y.
Thus, any number that is a divisor of both x and y is also a divisor of both r and x; and any number that is a divisor of both r and x is a divisor of both x and y. From this one can see that gcd(x,y) divides into gcd(r,x), and gcd(r,x) divides gcd(x,y). This is only possible if they are equal.
A very informal estimate of the speed: when we divide a number y by a number x, (with x less than y) what do we expect the remainder to be? Convince yourself that it must be less than y/2. 100 divided by anything less than 100, for instance, gives a remainder less than 50.
Every iteration of gcd(x,y), the larger of the two values x and y is replaced with something less than half its size. How long does it take to reach 0 if we keep cutting a number y in half? On the order of log y iterations.