Problem 1 In class, we discussed how search services build an index that records, for each word in the index, which Web pages contain that word. The search engine uses this index to find the Web pages that satisfy a query. The method of organizing this index is called hashing. You can find a summary of the technique, as presented in class, at the end of this page.
Part a Using the example in the summary at the end of this page,
add the word "eel" to the index. Show what the array contains
after this addition.
Part b Continuing with the array you obtained from Part a,
add the word "zoo" to the index. Show what the data array
contains after this addition.
Problem 2 For hashing, it is certainly possible to use a function that
does not create collisions of words. For example, we could write the
ASCII for each word and interpret it as a non-negative binary number to
get the conversion to a number. Then no two words would convert to the same
number.
What are the problems with using this method of converting
a word to a number for use with an index?
Hint: what is the range of possible numbers?
Problem 3 Brookshear, Chapter 10 Review Problem 14 (pg 391).
Problem 4 Brookshear, Chapter 10 Review Problem 17 (pg 392).
Problem 4 (1/15/99: Oops, Problem 5)
Part a Explain the concept of "code as code" used by Pamela Samuelson (and attributed
to Larry Lessig) in her article "Encoding the Law into Digital Libraries".
Part b When you visit a Web page, you are able to view the
source file, i.e. read the contents of the html file, as well as view the display
of that file by the browser. This is very useful for seeing how a Web page
was created. However, it also gives you the ability to copy the layout
of a Web page, which may violate the intellectual property rights of the
author of the page. What are the pros and cons of allowing Web page authors
to deny viewing of their source files? Do you think this is technologically
feasible?
This is the method we discussed in class for building an index of words that gives the Web pages containing each word.
The search service has a spider that wanders around the Web finding pages. The index records for each word seen by the spider, the list of Web pages seen that contain the word. Call this list a word's "index entry". When the search engine wants to find all the pages containing a word, it looks up the word in the index and obtains the list of pages.
The details of the method for building the index are as follows:
The technique of converting a set of items into a range of numbers and using those numbers as element numbers of an array is called hashing. Each array element stores information related to the corresponding item or items. In our case the set of items is the set of words.
Here is an extremely artificially small example, but it should illustrate the method.
Suppose we choose an array size of 5. Furthermore, suppose we are only going
to deal with words of length three or less. We decided to calculate the
number corresponding
to a word by adding up the alphabetic position of the first letter of the
word (1 to 26, ignore capitalization), twice the alphabetic position
of the second letter of the word, and thrice the alphabetic position
of the third letter of the word, and then dividing by 5 and taking the
remainder. For example, the number corresponding to "cat" is 0, obtained
by taking the remainder of
( (3=position of "c") + (2 * (1=position of "a")) + (3 * (20=position of "t")) ) / 5
= ( 3+2+60 )/ 5.
Suppose we have indexed five words: cat, dog, sod, ox and bat. Our index structure looks as follows:
ARRAY FOR INDEX: array element number contents ______________________________________ __________________ 0 | cat | |-----------------| |loc. of URL list------------->URL list of pages containing cat |-----------------| elsewhere in memory or on disk | next | | |--------|--------| | V __________________ | dog | |-----------------| |loc. of URL list------------->URL list of pages containing dog |-----------------| elsewhere in memory or on disk | nil | |-----------------| _______________________________________ __________________ 1 | sod | |-----------------| |loc. of URL list------------->URL list of pages containing sod |-----------------| elsewhere in memory or on disk | nil | |-----------------| _______________________________________ 2 EMPTY ________________________________________ __________________ 3 | ox | |-----------------| |loc. of URL list------------->URL list of pages containing ox |-----------------| elsewhere in memory or on disk | nil | |-----------------| ________________________________________ __________________ 4 | bat | |-----------------| |loc. of URL list------------->URL list of pages containing bat |-----------------| elsewhere in memory or on disk | nil | |-----------------| ________________________________________Note that even though we have five array elements and five words, array element 2 is empty while array element 0 contains a linked list of two words. This is because our conversion of words to numbers assigned both "cat" and "dog" the number 0.