COS 435, Spring 2008 - Problem Set 4
Due at 5:00pm, Monday April 7, 2008. (Note you
have until the end of the business day.)
Collaboration Policy
You may discuss the general methods of solving the problems with other
students in the class. However,
each student must work out the details and write up his or her own
solution to each problem
independently.
Lateness Policy
A late penalty will be applied, unless there
are extraordinary circumstances and/or prior arrangements:
- Penalized 10%
of the earned score if
submitted by 5:00 pm Tuesday (4/8/08).
- Penalized 25% of the earned score if
submitted by 5:00 pm Saturday (4/12/08).
- Penalized 50% if submitted later than
5:00 pm Saturday (4/12/08).
Problems
Problem 1 (Compression )
Part a. Give the variable-byte encoding
for each of these integers: 60, 4100, 230-1, 264-1.
How many bits
does each encoded integer require?
Part b. Give the
Elias-delta code for the same integers as in Part a.
How many bits does each encoded integer require? Note
that Introduction
to Information Retrieval does not cover Elias-delta
encoding. Please use the class notes for March 12 for the
definitions of Elias-gamma and Elias-delta encodings.
Part c. Which of the
variable-byte and Elias-delta encodings, if any, is preferable if you
are encoding the sequences of gaps between document IDs in the posting
lists for the terms in an inverted index? Assume there are about
10 billion documents. You may make other reasonable assumptions,
but be sure to state them. Justify your answer.
Problem 2 (Collaborative Filtering)
For this problem you need the equations for user-based
"memory-based" collaborative filtering . The equations were given
in class on March 26.
The table below gives the ratings of six items by three users.
The rating scale is 1 (poorest) through 5 (best) and "X" indicates not
rated.
|
User 1
|
User 2 |
User 3 |
| Item 1 |
3
|
X
|
2
|
| Item 2 |
X
|
1
|
2
|
| Item 3 |
4
|
2
|
X
|
| Item 4 |
5
|
1
|
X
|
| Item 5 |
3
|
5
|
5
|
| Item 6 |
5
|
1
|
X
|
Part a.
What is the similarity of User 1 and User 3? of User 2 and User
3?
Part b.
Compute the predicted ratings by User 3 for items 3, 4 and
6. For each item, calculate the prediction
using the ratings of both User 1 and User 2.
Part c.
In the equations used in Part a and Part b, the average rating by a
user is subtracted from that user's ratings to account for
shifts in the way different users rate (think "grade inflation").
Different users also use different sub-ranges of values for their
ratings. For example User 2 has used the full range 1 through 5
in her ratings, while User 1 has only used the range 3 through 5 (think
"grade compression"). How might the collaborative filtering
predictions take into account the differences in the range of values
users use? You need not work out a full set of equations -- just
give some thoughts.
Problem 3 (Clustering)
Consider a "wagon-wheel graph": a graph consisting of 9 nodes -- a, b,
c, d, e, f , g, h -- in a cycle and
a ninth node, x, connected to each of the eight nodes. The edges have
the following similarity weights (higher number is more
similar). If there is no edge between a pair of nodes, the
similarity weight is 0. You may view a figure of this graph here.
(a,b) has weight 10
(b,c) has weight 4
(c,d) has weight 10
(d,e) has weight 4
(e,f) has weight 10
(f,g) has weight 4
(g,h) has weight 10
(h,a) has weight 4
(x,a) has weight 9
(x,b) has weight 1
(x,c) has weight 8
(x,d) has weight 8
(x,e) has weight 1
(x,f) has weight 1
(x,g) has weight 8
(x,h) has weight 8
Part a.
Find clusters for this graph using hierarchical agglomerative
clustering with single-link cluster similarity. If, during any
iteration of the agglomerative algorithm, there is more than one pair
of most-similar clusters, break the tie arbitrarily. Visualize
the results of the n-1 merges by drawing the dendrogram in the
style of Introduction
to Information Retrieval, Section 17.1. (Caution: on
March 26, I discussed types of distance
measures between clusters. As usual, smallest
distance (closest) translates to largest similarity (most similar).)
Part b.
Using the hierarchy of clusterings you found in Part a, what clustering
do you get if you want 3 clusters?
Part c.
Calculate the cost of your clustering of Part b under the
cut-based
cost used by
min-max cut clustering (see class notes
for 4/2/08) .
Part d.
Calculate the cost of the 3- clustering with cluster1={a,b},
cluster2={c,d,,x, g, h}and cluster3={e,f}
under the cut-based
cost used
by min-max cut clustering.
Which of this clustering and the clustering of Part b is better
under this cost model? Which clustering do you prefer
subjectively? Why?
Part e.
Do you think complete-link agglomerative clustering or group-average
agglomerative clustering would have performed better than the
single-link agglomerative clustering did? Justify your answer,
but do not do complete calculations of the clusterings -- group-average
clustering, in particular, is time-consuming to compute since we are
not using objects that are vectors.