COS 435, Spring 2006 - Problem Set 3

Due at 5pm, Friday March 31, 2006 (NOTE EXTENSION from original deadline of Mar. 30).

Collaboration Policy

You may discuss problems with other students in the class. However, each student must write up his or her own solution to each problem independently. That is, while you may formulate the solutions to problems in collaboration with classmates, you must be able to articulate the solutions on your own.


Lateness Policy

A late penalty will be applied unless there are extraordinary circumstances and/or prior arrangements.


Problems


Problem 1
In this problem you will compute hub and authority weights and pageranks for the following graph:

               +-----------<-----------+
               |                       |
               v                       |
Node_1 ---> Node_2 ---> Node_3 ---> Node_4
                           ^         |
                           |         |
                           +----<----+

That is, there are edges from Node 1 to Node 2, from Node 2 to Node 3, from Node 3 to Node 4, from Node 4 to Node 2 and from Node 4 to Node 3.
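
For reference, the corresponding adjacency matrix E, where E[i,j] = 1 exactly when there is an edge from Node i to Node j, is:

        [ 0 1 0 0 ]
    E = [ 0 0 1 0 ]
        [ 0 0 0 1 ]
        [ 0 1 1 0 ]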

Both the hub and authority weights and the pageranks are obtained from iterative formulae. You may write a small program to calculate them, use a math computation package such as Matlab, or do the calculation by hand -- whatever you find easiest. If possible, give the values after each iteration to show how the values converge. If you do the calculations by hand, you need to do enough iterations to see that the values are converging; however, I do not ask that you do more than 5 iterations. If you do the calculations by computer, say what convergence criterion you are using. Be sure to read the notes below on the algorithms before beginning.

Specifically compute:

Also:


Notes on the HITS (hubs and authorities) algorithm: 

Use the algorithm I gave in class as pseudo-code, with a the vector of authority values, h the vector of hub values, and E the adjacency matrix:

initialize a = (1, 1, ..., 1)^T and h = (1, 1, ..., 1)^T;
repeat until convergence {
    a_new = E^T h;
    h_new = E a;
    a = normalized a_new;
    h = normalized h_new;
}
The normalization simply divides each vector component by the vector's Euclidean length, i.e. the square root of the sum of the squares of the vector components.  Note that this normalization step differs from that in the reading I assigned in Mining the Web: Discovering Knowledge from Hypertext Data.  Instead it follows the original paper by Kleinberg. 
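
For concreteness, here is a minimal Python sketch of this iteration for the Problem 1 graph (nodes are indexed 0-3 for Nodes 1-4; a fixed iteration count stands in for a real convergence test):

from math import sqrt

# Adjacency matrix of the Problem 1 graph: E[i][j] = 1 iff there is an
# edge from Node i+1 to Node j+1.
E = [[0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1],
     [0, 1, 1, 0]]

def normalized(v):
    # Divide each component by the vector's Euclidean length.
    length = sqrt(sum(x * x for x in v))
    return [x / length for x in v]

a = [1.0] * 4                 # authority weights
h = [1.0] * 4                 # hub weights
for _ in range(50):
    a_new = [sum(E[k][i] * h[k] for k in range(4)) for i in range(4)]  # E^T h
    h_new = [sum(E[i][k] * a[k] for k in range(4)) for i in range(4)]  # E a
    a, h = normalized(a_new), normalized(h_new)

print("authorities:", a)
print("hubs:", h)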

Notes on the pagerank algorithm:
Let pr denote the vector of pagerank values, E the adjacency matrix, n the number of vertices in the graph, q the "random jump" parameter, and t_k the outdegree of vertex k. Use the algorithm:
initialize pr = (1/n, 1/n, ..., 1/n)^T;
repeat until convergence {
    for i from 1 to n {
        pr_new[i] = q/n + (1 - q) * SUM_k ( E[k,i] * pr[k] / t_k )
    }
    pr = pr_new;
}
In class I used q rather than q/n in the update equation.  However, by using q/n, the components of pr always sum to 1, as one wants if pr represents the probabilities of being at the different vertices.  No normalization step is necessary. The q/n term is used in the reading I assigned in Mining the Web: Discovering Knowledge from Hypertext Data (denoted d/N there).
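
A matching Python sketch of the pagerank iteration, with q = 0.15 assumed purely for illustration (substitute whatever value of q you are asked to use):

# Same adjacency matrix as in the HITS sketch above.
E = [[0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1],
     [0, 1, 1, 0]]
n = 4
q = 0.15                            # assumed "random jump" parameter
t = [sum(row) for row in E]         # outdegrees; every vertex here has >= 1

pr = [1.0 / n] * n
for _ in range(1000):
    pr_new = [q / n + (1 - q) * sum(E[k][i] * pr[k] / t[k] for k in range(n))
              for i in range(n)]
    converged = max(abs(x - y) for x, y in zip(pr_new, pr)) < 1e-10
    pr = pr_new
    if converged:
        break

print("pageranks:", pr, "(sum =", sum(pr), ")")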


Problem 2 

Consider an inverted index containing, for each term, the posting list (i.e. the list of documents and occurrences within documents) for that term.  The posting lists are accessed through a B+ tree with the terms serving as search keys.  Each leaf of the B+ tree holds a sublist of alphabetically consecutive terms, and, with each term, a pointer to the posting list for that term.  
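To make the layout concrete, here is a toy Python sketch of one leaf (the terms and byte offsets are invented for illustration):

# One B+ tree leaf: alphabetically consecutive terms, each paired with a
# pointer (modeled here as a byte offset) to that term's posting list.
leaf = [("dual", 325280), ("dune", 328976), ("dung", 341448)]

def posting_list_offset(leaf, term):
    # Linear scan for brevity; a real implementation would binary search.
    for t, offset in leaf:
        if t == term:
            return offset
    return None

print(posting_list_offset(leaf, "dune"))   # -> 328976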

Part a. An artificially small example of a B+ tree is shown here (pdf).  (Note only part of the tree is shown in detail.)   What nodes of the example B+ tree are visited to find the posting list for "dune"?

Part b. Suppose there are 2 million terms for a collection of 32 million documents of total size 200 gigabytes. We would like each internal node of the B+ tree and each leaf of the B+ tree to fit in one 8-kilobyte page of the file system. Recall that a B+ tree has a parameter m called the order of the tree, and each internal node of a B+ tree has between m+1 and 2m+1 children (except the root, which has between 2 and 2m+1). Assume that each term is represented using 16 bytes, and each pointer to a child in the tree or to a posting list is represented using 8 bytes. Find a value for the order m of the B+ tree so that one 8-kilobyte page can be assigned to each internal node and leaf, and so that an internal node will fill, but not overflow, its page when it has 2m+1 children. If you need to make additional assumptions, state what assumptions you are making.
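
As a starting point for the arithmetic, here is a rough Python sketch under one possible layout assumption -- a full internal node with 2m+1 children holds 2m separator terms and 2m+1 child pointers, and a leaf holds term/pointer pairs; other layouts are defensible and lead to a slightly different m:

TERM_BYTES = 16           # sizes from the problem statement
PTR_BYTES = 8
PAGE_BYTES = 8 * 1024

def internal_node_bytes(m):
    # Assumed layout: 2m separator terms + 2m+1 child pointers.
    c = 2 * m + 1
    return (c - 1) * TERM_BYTES + c * PTR_BYTES

# Largest m whose full internal node still fits in one page.
m = max(m for m in range(1, 10000) if internal_node_bytes(m) <= PAGE_BYTES)
leaf_entries = PAGE_BYTES // (TERM_BYTES + PTR_BYTES)  # term + pointer per entry
print(m, internal_node_bytes(m), leaf_entries)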

Part c.  For your m of Part b, estimate the height of the B+ tree.  (Giving a range of heights is fine.)  Also estimate the amount of memory needed to store the tree, including leaves but not including the posting lists themselves.

Part d.  Estimate the aggregate size of the posting lists.


Problem 3

Part a.  Give the Elias-delta code for each of these integers:  60, 600, 4100.  How many bits does each encoded integer require? 
Part b.  Give the Golomb code with b=256 for the same integers as in Part a.  How many bits does each encoded integer require?
Part c.  Is one code preferable if you are encoding a sequence of gaps, where the gaps have average size of  600 and range from 60 to 1200?  What if the gaps have average size 600 and range from 6 to 6000?
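
If you want to sanity-check your hand encodings, here is a rough Python sketch. It assumes one common set of conventions (Elias delta: the Elias gamma code of the bit length of x, followed by the low-order bits of x; Golomb: encode x-1, writing the quotient in unary as 1-bits terminated by a 0, and the remainder in 8 bits since b = 256 is a power of two). If the course's conventions are the mirror image, the bits flip but the lengths are unchanged.

def elias_delta(x):
    # x occupies N+1 bits; emit gamma(N+1), then the low-order N bits of x.
    n_plus_1 = x.bit_length()
    gamma = "0" * (n_plus_1.bit_length() - 1) + bin(n_plus_1)[2:]
    return gamma + bin(x)[3:]

def golomb_b256(x):
    # Quotient in unary (1-bits ended by a 0), then an 8-bit binary remainder.
    q, r = divmod(x - 1, 256)
    return "1" * q + "0" + format(r, "08b")

for x in (60, 600, 4100):
    d, g = elias_delta(x), golomb_b256(x)
    print(x, d, len(d), g, len(g))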