COS 435, Spring 2002 - Problem Set 3

Due at 11am, Wednesday April 3, 2002.

Special Notice

This is a short problem set to give you some work with indexing and searching before beginning the take-home exam next Thursday, April 4. Solutions to this problem set will be provided in class on the date due. Therefore no late submissions will be accepted unless there are extraordinary circumstances and/or prior arrangements. Problem set 4, the last problem set, will be substantially larger. Problem set 3 will be a smaller percentage of your problem set grade.

Collaboration Policy

You may discuss problems with other students in the class. However, each student must write up his or her own solution to each problem independently. That is, while you may formulate the solutions to problems in collaboration with classmates, you must be able to articulate the solutions on your own.


Problems

Problem 1 Consider an inverted file of occurrence lists accessed through a B+ tree on the index terms. Each leaf of the B+ tree holds a sublist of alphabetically consecutive index terms, and, with each term, a pointer to the list of occurrences within documents for that term. Suppose there are 16 million index terms for a collection of 32 million documents of total size 200 gigabytes. We would like each internal node of the B+ tree and each leaf of the B+ tree to fit in one 8-kilobyte page of the file system.

Part a Recall that a B+ tree has a parameter m called the order of the tree, and each internal node of a B+ tree has between m and 2m children (except the root, which has between 2 and 2m). Find a value for the order m of the B+ tree so that one 8 kilobyte page can be assigned to each internal node and leaf, and so that an internal node will fill, but not overflow, its page when it has 2m children. Assume that each index term is represented using 8 bytes. If you need to make additional assumptions, state what assumptions you are making.

Part b For your m of Part a, what is the estimated height of the B+ tree?

Problem 2 In Problem 1, we assumed each index term was represented using 8 bytes. Clearly, many English words are longer than 8 characters, so it cannot be that the full ASCII representation is being used for each index term. Assuming the index terms are English words, what technique or techniques might be used to allow index terms to be represented in 8 bytes? For any technique you suggest, explain the costs of this technique both in preprocessing and when considering the processing of a single-word query.


A.S. LaPaugh Fri Mar 29 14:37:56 EST 2002