COS 597A, Fall 2008: Problem Set 5

COS 597A, Fall 2008 - Problem Set 5

Due at 3:00pm, Wednesday Novermber 19, 2008.

Collaboration Policy

You may discuss problems with other students in the class. However, each student must write up his or her own solution to each problem independently. That is, while you may formulate the solutions to problems in collaboration with classmates, you must be able to articulate the solutions on your own.

Late Penalties

10% of the earned score if submitted after class but by 6:00pm the day due.
20% of the earned score if submitted by 5pm on Friday 11/21/08.
40% of the earned score if submitted by 5pm on Monday 11/24/08.
No credit if submitted later than the 40% penalty deadline.

Chapter, exercise and page numbers refer to the course text Database System Concepts, Fifth Edition, by Silberschatz, Korth, and Sudarshan

Problem 1:
Recall that an inverted index for a collection of documents is a compilation of information about the occurences of terms in the documents. The inverted index has one entry for each term appearing in any document. The entry for a term contains a postings list. A postings list for a term contains information about the occurrences of the term in the collection. A postings list is usually organized document by document. The actual information about each occurrence can vary. In the simpliest form, a postings list is just a list of the documents containing the term. In the richest form, a posting list records the position of each occurrence of the term within each document and possible additional information about the occurrence.

For this problem, consider a collection of Web documents in which each document has a title and text contents. Documents are linked to each other, and the anchor text labeling the link from one document to another document is identified. For example, "the bulletin board for cos597A" is anchor text in this document linking to the course Announcements page. For this problem, anchor text is viewed as part of the regular text contents of the document containing it (e.g. this document), but is counted as special "anchor text" for the document pointed to (e.g. the course Announcements page). Note that the Announcements page does not contain the word "bulletin", but including the anchor text associates the Announcements page with the term "bulletin".

The collection will be searched using text queries, each query consisting of a list of one or more terms. There are no phrase queries, Boolean operations in a query, or other advanced query specifications. Consider the following definition of a document satisfying a query when searching the collection:

A document satisfies a single-term query if and only if the query term is in anchor text pointing to the document or in the title of the document.
A document satisfies a multi-term query if one of the following is true:

The document contains only some of the query terms, but at least one query term appears either in anchor text pointing to the document or in the title of the document.
The document contains all the query terms; each term may appear in any of the title, anchor text pointing to, or text contents of the document.

The ranking function for documents satisfying a query is not fully specified but must satisfy the following three conditions:

Terms in a title are most valuable – compared to terms in anchor text or terms in text contents,
Terms in anchor text pointing to a document are more valuable than terms in the text contents of the document,
All occurrences of terms in the text contents of a document are equally valuable, regardless of position, font, or other attributes.

Part A : Suggest a ranking function to rank the documents satisfying a query under the conditions above. It is fine to suggest a pair of ranking functions – one for singe-term queries and one for multi-term queries.

Part B: Describe one or more data structures to be used to record the information about terms in documents. These data structures together should make up the inverted index. The resulting inverted index must allow single-term queries to be processed without considering the postings for documents containing the query term only in their text contents.

Part C : Given your solution to Part B, can the ranking function you have suggested in Part A be computed while satisfying documents are being found? Explain briefly.

Problem 2:
In class we executed inserts and delete on an order one B+ tree, finishing with tree shown in slide #11 of the slides for B+ tree insert, delete examples (link to pdf).

Part a: Delete the entry with B+ tree key value 29 from the B+ tree. Describe each step of the deletion.
Part b: Delete the entry with B+ tree key value 27 from the B+ tree resulting from Part a. Describe each step of the deletion.

Problem 3:
Analyze the number of block reads and block writes required to delete a value from a B+ tree leaf when the deletion requires a merge of the leaf with a sibling but causes no merge of siblings or redistribution of key values between siblings at the parent-level. Include the block reads necessary to find the leaf; assume the height of the B+ tree is 3. Don't forget that in a B+ tree, a linked list of the leaves is maintained.

Problem 4:
Let 0, 1, 2, 3, 64, 128, 192, 4, 8, 16, 32 be a sequence of hash key values to be inserted in a dynamic hashing index. You may view each value in the sequence to be the result of applying some base hash function h to the actual search key value.

Part a: Use extendable hashing starting with 4 empty buckets and h₀(x)= h(x) mod 4. Assume each bucket can hold four data entries. Show the initial (empty) configuration of the directory and buckets. Insert each value of the sequence in order. Show the configuration each time the directory changes, including pointer changes. Show the final configuration. Assuming that each bucket fits in one block and 64 directory entries fit in one block, how many blocks are used?

Part b: Suppose we place a limit on directory splitting by requiring that at least half the buckets be as the same depth as the directory before the directory is split on an insertion. Otherwise, overflow blocks are used. Redo Part a under this strategy. Do you think this is a good idea? Why?