COS 111, Fall 1998 - Problem Set 10

Due by 5pm, Thursday Jan. 7, 1999

Problem 1 In class, we discussed how search services build an index that records, for each word in the index, which Web pages contain that word. The search engine uses this index to find the Web pages that satisfy a query. The method of organizing this index is called hashing. You can find a summary of the technique, as presented in class, at the end of this page.

Part a Using the example in the summary at the end of this page, add the word "eel" to the index. Show what the array contains after this addition.
Part b Continuing with the array you obtained from Part a, add the word "zoo" to the index. Show what the data array contains after this addition.

Problem 2 For hashing, it is certainly possible to use a function that does not create collisions of words. For example, we could write the ASCII for each word and interpret it as a non-negative binary number to get the conversion to a number. Then no two words would convert to the same number. What are the problems with using this method of converting a word to a number for use with an index?
Hint: what is the range of possible numbers?

Problem 3 Brookshear, Chapter 10 Review Problem 14 (pg 391).

Problem 4 Brookshear, Chapter 10 Review Problem 17 (pg 392).

Problem 4 (1/15/99: Oops, Problem 5)
Part a Explain the concept of "code as code" used by Pamela Samuelson (and attributed to Larry Lessig) in her article "Encoding the Law into Digital Libraries".
Part b When you visit a Web page, you are able to view the source file, i.e. read the contents of the html file, as well as view the display of that file by the browser. This is very useful for seeing how a Web page was created. However, it also gives you the ability to copy the layout of a Web page, which may violate the intellectual property rights of the author of the page. What are the pros and cons of allowing Web page authors to deny viewing of their source files? Do you think this is technologically feasible?


Summary

How search engines store indices: Hashing

This is the method we discussed in class for building an index of words that gives the Web pages containing each word.

The search service has a spider that wanders around the Web finding pages. The index records for each word seen by the spider, the list of Web pages seen that contain the word. Call this list a word's "index entry". When the search engine wants to find all the pages containing a word, it looks up the word in the index and obtains the list of pages.

The details of the method for building the index are as follows:

A confusing point may be that there are two kinds of lists. Type 1: Each index entry is a list of Web pages containing a specific word. Type 2: An array element contains a linked-list of locations of index entries for different words that convert to the same number.

The technique of converting a set of items into a range of numbers and using those numbers as element numbers of an array is called hashing. Each array element stores information related to the corresponding item or items. In our case the set of items is the set of words.

An Example

Here is an extremely artificially small example, but it should illustrate the method.

Suppose we choose an array size of 5. Furthermore, suppose we are only going to deal with words of length three or less. We decided to calculate the number corresponding to a word by adding up the alphabetic position of the first letter of the word (1 to 26, ignore capitalization), twice the alphabetic position of the second letter of the word, and thrice the alphabetic position of the third letter of the word, and then dividing by 5 and taking the remainder. For example, the number corresponding to "cat" is 0, obtained by taking the remainder of
( (3=position of "c") + (2 * (1=position of "a")) + (3 * (20=position of "t")) ) / 5 = ( 3+2+60 )/ 5.

Suppose we have indexed five words: cat, dog, sod, ox and bat. Our index structure looks as follows:

ARRAY FOR INDEX:

array
element
number            contents
______________________________________
             __________________
  0         |      cat        |
            |-----------------|  
            |loc. of URL list------------->URL list of pages containing cat 
            |-----------------|               elsewhere in memory or on disk
            |   next |        |
            |--------|--------|
                     |
                     V
             __________________
            |      dog        |
            |-----------------|  
            |loc. of URL list------------->URL list of pages containing dog  
            |-----------------|               elsewhere in memory or on disk
            |     nil         |
            |-----------------|
                    
_______________________________________

            __________________
  1         |      sod        |
            |-----------------|  
            |loc. of URL list------------->URL list of pages containing sod  
            |-----------------|               elsewhere in memory or on disk
            |     nil         |
            |-----------------|

_______________________________________


  2


                EMPTY




________________________________________


            __________________
  3         |      ox         |
            |-----------------|  
            |loc. of URL list------------->URL list of pages containing ox  
            |-----------------|               elsewhere in memory or on disk
            |     nil         |
            |-----------------|


________________________________________


            __________________
  4         |      bat        |
            |-----------------|  
            |loc. of URL list------------->URL list of pages containing bat  
            |-----------------|               elsewhere in memory or on disk
            |     nil         |
            |-----------------|


________________________________________


Note that even though we have five array elements and five words, array element 2 is empty while array element 0 contains a linked list of two words. This is because our conversion of words to numbers assigned both "cat" and "dog" the number 0.
Back to Schedule and Assignments