COS 435, Spring 2008 - Problem Set 4

Due at 5:00pm,  Monday April 7,  2008.  (Note you have until the end of the business day.)

Collaboration Policy

You may discuss the general methods of solving the problems with other students in the class. However, each student must work out the details and write up his or her own solution to each problem independently. 


Lateness Policy

A late penalty will be applied, unless there are extraordinary circumstances and/or prior arrangements:

Problems


Problem 1 (Compression )

Part a.
  Give the variable-byte encoding  for each of these integers:  60, 4100, 230-1, 264-1.  How many bits does each encoded integer require? 
Part b.  Give the Elias-delta code for the same integers as in Part a.   How many bits does each encoded integer require?    Note that Introduction to Information Retrieval  does not cover Elias-delta encoding.  Please use the class notes for March 12 for the definitions of Elias-gamma and Elias-delta encodings.
Part c.  Which of the variable-byte and Elias-delta encodings, if any, is preferable if you are encoding the sequences of gaps between document IDs in the posting lists for the terms in an inverted index?  Assume there are about 10 billion documents.  You may make other reasonable assumptions, but be sure to state them.  Justify your answer.


Problem 2 (Collaborative Filtering)

For this problem you need the equations for user-based  "memory-based" collaborative filtering .  The equations were given in class on March 26. 

The table below gives the ratings of six items by three users.  The rating scale is 1 (poorest) through 5 (best) and "X" indicates not rated.


User 1 
User 2 User 3
Item 1 3
X
2
Item 2 X
1
2
Item 3 4
2
X
Item 4 5
1
X
Item 5 3
5
5
Item 6 5
1
X


Part a. 
What is the similarity of User 1 and User 3?  of User 2 and User 3?   

Part b.
Compute the predicted ratings by User 3  for items 3, 4 and 6.   For each item,  calculate the prediction  using  the ratings of both User 1 and User 2.

Part c.
In the equations used in Part a and Part b, the average rating by a user is subtracted  from that user's  ratings to account for shifts in the way different users rate (think "grade inflation").  Different users also use different sub-ranges of values for their ratings.  For example User 2 has used the full range 1 through 5 in her ratings, while User 1 has only used the range 3 through 5 (think "grade compression").  How might the collaborative filtering predictions take into account the differences in the range of values users use?  You need not work out a full set of equations -- just give some thoughts.


Problem 3 (Clustering)
Consider a "wagon-wheel graph": a graph consisting of 9 nodes -- a, b, c, d, e, f , g, h -- in a cycle and a ninth node, x, connected to each of the eight nodes. The edges have the following similarity weights (higher number is more similar).  If there is no edge between a pair of nodes, the similarity weight is 0.  You may view a figure of this graph here.
(a,b) has weight 10 
(b,c) has weight 4
(c,d) has weight 10
(d,e) has weight 4
(e,f) has weight 10
(f,g) has weight 4
(g,h) has weight 10
(h,a) has weight 4

(x,a) has weight 9
(x,b) has weight 1
(x,c) has weight 8
(x,d) has weight 8
(x,e) has weight 1
(x,f) has weight 1
(x,g) has weight 8
(x,h) has weight 8


Part a.

Find clusters for this graph using hierarchical agglomerative clustering with single-link cluster similarity.  If, during any iteration of the agglomerative algorithm, there is more than one pair of most-similar clusters, break the tie arbitrarily.  Visualize the results of  the n-1 merges by drawing the dendrogram in the style of Introduction to Information Retrieval, Section 17.1.  (Caution: on March 26, I discussed types of  distance measures between clusters.  As usual,  smallest distance (closest) translates to largest similarity (most similar).)

Part b.
Using the hierarchy of clusterings you found in Part a, what clustering do you get if you want 3 clusters?

Part c.
Calculate the cost of your clustering of  Part b under the cut-based cost used by min-max cut clustering (see class notes for 4/2/08) .

Part d.
Calculate the cost of the 3- clustering with cluster1={a,b}, cluster2={c,d,,x, g, h}and cluster3={e,f} under the cut-based cost used by min-max cut clustering.   Which of  this clustering and the clustering of Part b is better under this cost model?  Which clustering do you prefer subjectively? Why?

Part e.
Do you think complete-link agglomerative clustering or group-average agglomerative clustering would have performed better than the single-link agglomerative clustering did?  Justify your answer, but do not do complete calculations of the clusterings -- group-average clustering, in particular, is time-consuming to compute since we are not using objects that are vectors.