Problem 1:
Recall that an inverted index
for a collection of documents is a compilation of information about the
occurences of terms in the documents. The inverted index has one
entry for each term appearing in any document. The entry for a
term contains a postings list.
A postings list for a term contains information about the occurrences
of the term in the collection. A postings list is usually
organized document by document. The actual information about each
occurrence can vary. In the simpliest form, a postings list is
just a list of the documents containing the term. In the richest
form, a posting list records the position of each occurrence of the
term within each document and possible additional information about the
occurrence.
For this
problem, consider a collection of Web documents in which each document
has a title and text
contents. Documents are linked to each
other, and the anchor text labeling the link from one
document to
another document is identified. For example, "the
bulletin board for cos597A" is anchor text in this document linking
to the course Announcements page. For this problem, anchor text
is viewed as part of the regular text contents of the document
containing it (e.g. this document), but is counted as special "anchor
text" for the document pointed to (e.g. the course Announcements
page). Note that the Announcements page does not contain the word
"bulletin", but including the anchor text associates the Announcements
page with the term "bulletin".
The
collection will be searched using text queries, each query consisting
of a list
of one or more terms. There are no
phrase queries, Boolean operations in a query, or other advanced query
specifications. Consider the following definition of a document satisfying a query when searching the
collection:
Part A : Suggest a ranking function to rank the documents satisfying a query under the conditions above. It is fine to suggest a pair of ranking functions – one for singe-term queries and one for multi-term queries.
Part B:
Describe one
or more data structures to
be used to record the information about terms in documents. These data structures together should make up
the inverted index. The resulting
inverted index must allow single-term queries to be processed without
considering the postings for documents containing the query term only
in their text contents.
Part C : Given your solution to Part B, can the
ranking
function you have suggested in Part A be computed while satisfying
documents are being found? Explain briefly.
Problem 2:
In
class we executed inserts and delete on an order one B+ tree, finishing
with tree shown in slide #11 of the slides for B+
tree insert,
delete examples (link to pdf).
Problem 3:
Analyze the number of block reads and block writes required to
delete a value from a B+ tree leaf when the deletion requires a merge
of the leaf with a sibling but causes no merge of siblings or
redistribution of key values between siblings at the
parent-level. Include the block reads necessary to find the leaf;
assume the height of the B+ tree is 3. Don't forget that
in a B+ tree, a linked list of the leaves is maintained.