COS 597A Fall 2008: Solns PS 5

COS 597A, Fall 2008

Solutions to Problem Set 5: Problems 1 and 3

Problem 1:
Part A: Many ranking functions are possible. One choice is to use the vector model with weighted tf.idf values for components of each document vector: weighting occurrences in the title very highly, occurrences in anchor text moderately, and occurrences in the text contents lightly.   The query vector can contain a 1 for each term in the query and 0 elsewhere. The normalized dot product can be used to give a score for each document with respect to the query. For single-term queries, documents with the term only in the text contents are not considered since they do not satisfy the query.

Another possibility for multi-term queries is to rank all documents containing all terms of the query higher than any document that contains only some of the terms. Within each group, the above tf.idf score with preferential weighting for title, followed by anchor text, followed by text content can be used.

Part B: A good solution is to keep three postings lists for each term: documents containing the term in their title, documents containing the term in anchor text pointing to them, and documents containing the term in their text contents. Documents can appear on more than one of these 3 lists for a term. For each document on a list, the number of occurrences of the term of the appropriate type for the list should be recorded. (For the title list, the number will usually be 1, but for the anchor text list and the text contents list, the number will often be higher.) Each list should be sorted on document ID to allow for linear-time merging of lists.   Only the posting lists for a term in titles and in anchor text will be used for a single-term query; they are merged, taking the union, to get the list of satisfying documents.    For multi-term queries, the lists for all terms (three lists per term) are merged. All documents that appear on at least one of the lists for every term in the query satisfy the query, and all documents that appear on either the title list or anchor text list for any of the terms of the query satisfy the query.   Therefore, the merging operation will be a combination of unions and intersections computed in one pass through the sorted lists.

Detail not required: The data structure for the inverted index could be a B+ tree or a hash table on the terms, but each B+ tree leaf or hash table entry must contain 3 pointers: one to each of the posting lists for the term.

Part C:   For many ranking functions, the ranking function can be computed while satisfying documents are being found. Consider the vector model with weighted tf.idf values described in Part A. In the posting lists for a term, we have stored the number of occurrences of each type for each document.   As we merge all the postings lists relevant to a query, for each document we can calculate the value of the weighted term frequency for each query term. After the lists are merged, the final tf.idf values and the normalized dot product can be calculated.

Problem 3:
Every path from root to leave of a height-3 B+ tree is of length 3 and contains 4 nodes, including the root and leaf. Reading these nodes requires reading 4 blocks, assuming none of the nodes is already in memory buffer. Assume the leaf has two siblings. If both siblings of the leaf are checked (reached from the parent) before determining a merge is necessary, 2 additional blocks are read. Assume the leaf merges with its right sibling. (Merging with the left sibling is analogous.) Use the block containing the right sibling for the merged node so that the leaf pointers to/from the merged node and the next leaf to the right are correct. The pointers to/from the left sibling must be updated. If the block for the left sibling has remained in the buffer, only a write of the updated left sibling is needed. The merged node must be written. The key and pointer no longer needed in the parent must be deleted. Assuming the parent block also remained in the buffer, this is 1 more write to update the parent. The total is 6 blocks read and 3 blocks written.

Even if only one sibling is checked before merging, another leaf must be read and written to keep the leaf pointers correct.

This solution assumes the leaves are doubly-linked as presented in class. The text book uses only singly linked lists - left to right. In this case, when merging a leaf with its right sibling, using the block containing the leaf for the merged node means the pointer from the preceding leaf in the linked list is correct, saving a write.