COS 597A, Fall 2008
Solutions to Problem Set 5: Problems 1
and 3
Problem 1:
Part A: Many ranking
functions are possible. One choice is to use the vector model
with weighted tf.idf values for components of each document
vector: weighting
occurrences in the title very highly, occurrences in anchor text
moderately, and occurrences in the text contents lightly.
The query vector can contain a 1 for each term in the query and 0
elsewhere. The normalized dot product can be used to give a score
for each document with respect to the query. For
single-term queries, documents with the term only in the text contents
are not considered since they do not satisfy the query.
Another possibility for multi-term queries is to rank all documents
containing all terms of the query higher than any document that
contains
only some of the terms. Within each group, the above tf.idf score
with preferential weighting for title, followed by anchor text,
followed by text content can be used.
Part B: A good solution
is to keep three postings lists for each term: documents
containing the term in their title, documents containing the term
in anchor text pointing to them, and documents containing the term in
their text contents. Documents can appear on more than one of
these 3
lists for a term. For each document on a list, the number of
occurrences of the term of the appropriate type for the list should be
recorded. (For the title list, the number will usually be 1, but
for the
anchor text list and the text contents list, the number will often be
higher.) Each list should be sorted on document ID to allow
for linear-time merging of lists. Only the posting lists
for a term in titles and in anchor text will be used for a single-term
query; they are merged, taking the union, to get the list of
satisfying
documents. For multi-term queries, the lists for all
terms (three lists per term) are merged. All documents that
appear on
at least one of the lists for every term in the query satisfy the
query, and all documents that appear on either the title list or anchor
text list for any of the terms of the query satisfy the
query. Therefore, the merging operation will be a
combination of unions and intersections computed in one pass through
the sorted lists.
Detail not required: The data structure for the inverted index
could be a B+ tree or a hash table on the terms, but each B+ tree leaf
or hash table entry must contain 3 pointers: one to each of the posting
lists for the term.
Part C: For many
ranking functions, the ranking function can be computed while
satisfying documents are being found. Consider the vector model
with weighted tf.idf values described in Part A. In the posting
lists for a term, we have stored the number of occurrences of each type
for each document. As we merge all the postings lists
relevant to a query, for each document we can
calculate the value of the weighted term frequency for each query
term. After the lists are merged, the final tf.idf values and the
normalized dot product can be calculated.
Problem 3:
Every path from root to
leave of a height-3 B+ tree is of length 3 and contains 4 nodes,
including the root and leaf. Reading these nodes requires reading
4 blocks, assuming none of the nodes is already in memory
buffer. Assume the leaf has
two siblings. If both siblings of the leaf are checked
(reached from the parent) before determining a merge is necessary, 2 additional blocks are read.
Assume the leaf merges with its right sibling. (Merging
with the left sibling is analogous.) Use the block containing the
right
sibling for the merged node so that the leaf pointers to/from the
merged node and the next leaf to the right are correct. The
pointers to/from the left sibling must be updated. If the block
for the left sibling has remained in the buffer, only a write of the updated left sibling
is needed. The merged node must
be written. The key and pointer no longer needed in the
parent must be deleted. Assuming the parent block also remained
in the buffer, this is 1 more write
to update the parent. The total is 6 blocks read and 3 blocks written.
Even if only one sibling is checked before merging, another leaf must
be read and
written to keep the leaf pointers correct.
This solution assumes the leaves are doubly-linked as presented in
class. The text book uses only singly linked lists - left to
right. In this case, when merging a leaf with its right sibling,
using the block containing the leaf for the merged node means the
pointer from the preceding leaf in the linked list is correct, saving a
write.