COS 435, Spring 2006: Problem Sets

COS 435, Spring 2006 - Problem Set 4

Due at 5pm, Wednesday May 3, 2006.

Collaboration Policy

You may discuss problems with other students in the class. However, each student must write up his or her own solution to each problem independently. That is, while you may formulate the solutions to problems in collaboration with classmates, you must be able to articulate the solutions on your own.

Lateness Policy

A late penalty will be applied, unless there are extraordinary circumstances and/or prior arrangements:

Penalized 10% of the earned score if submitted by 5pm Friday May 5, 2006
Penalized 25% of the earned score if submitted by 5pm Friday May 12, 2006.
Penalized 50% of the earned score if submitted by 5pm Tuesday May 16, 2006.

Problems

Problem 1 (Clustering)
Consider a "wagon-wheel graph": a graph consisting of 9 nodes -- a, b, c, d, e, f , g, h -- in a cycle and a ninth node, x, connected to each of the eight nodes. The edges have the following similarity weights (higher number is more similar). You may view a figure of this graph here.

(a,b) has weight 10 
(b,c) has weight 4 
(c,d) has weight 10
(d,e) has weight 4
(e,f) has weight 10 
(f,g) has weight 4 
(g,h) has weight 10
(h,a) has weight 4

(x,a) has weight 9
(x,b) has weight 1
(x,c) has weight 8
(x,d) has weight 8
(x,e) has weight 1
(x,f) has weight 1
(x,g) has weight 8
(x,h) has weight 8

Part a.
Find clusters for this graph using hierarchical agglomerative clustering. In specific, use the algorithm which starts with n clusters for an n-node graph, one node in each cluster, and in each of n-1 stages merges the two clusters that are most similar (breaking ties arbitrarily). Measure the similarity between clusters as the similarity of the most similar pair of nodes, one from each cluster (i.e. single-link cluster similarity). Show the result of each of the n-1 stages.

Part b.
Using the hierarchy of clusterings you found in Part a, what clustering do you get if you want 3 clusters?

Part c.

Calculate the cost of your clustering of Part b under the cut-based cost used by min-max cut clustering (see class notes for 10/30/06) .

Part d.
Calculate the cost of the 3- clustering with cluster1={a,b, cluster2={c,d,,x, g, h}and cluster3={e,f} under the cut-based cost used by min-max cut clustering. Which of this clustering and the clustering of Part b is better under this cost model? Which clustering do you prefer subjectively? Why?

Problem 2 (Mining Association Rules)

Consider 12 customer transactions at a produce store. Each transaction is a set of items purchased:

{apples, lettuce, oranges, pears, tomatoes}
(apples. lettuce, celery, tomatoes, peppers }
{lettuce, tomatoes, cucumbers}
{apples, oranges, pears, pineapple}
{apples, oranges, pears, pineapple, grapefruit}
{apples, oranges. pears}
{apples, oranges}
{apples, pears, peaches, grapefruit}
{apples, pears}
{apples, pears, peaches}
{apples, oranges, grapefruit}
{apples, pears, lettuce}

Part a.
What is the support value for the set of items {apples, oranges, pears}? (For the definition of support, refer to the class presentation on April 11 or the assigned paper Mining association rules between sets of items in large databases )

Part b.
What association rules can be generated from the candidate set {apples, oranges, pears}? What is the confidence of each rule? (Recall that an association rule has a set of items as an antecedent and a single item as a consequent.)

Problem 3 (Collaborative Filtering)

For this problem you need the equations for user-based, "memory-based" collaborative filtering. The equations were given in class on April 13 and are summarized here. You may also find them in the paper posted for April 13 Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extension (click on "IEEE xplore" for "Full Article Text" from this abstract page or try direct link to pdf from IEEE xplore - latter not guaranteed). The Pearson correlation coefficient to measure similarity is Equation 12 of that paper and the average rating by a user is defined in equation 11. We are using equation 10c in that paper to calculate the predicted rating of item i by a user u, with the normalizing factor k as given just above equation 11.

The table below gives the ratings of six items by three users. The rating scale is 1 (poorest) through 5 (best) and "X" indicates not rated.

	User 1	User 2	User 3
Item 1	3	X	2
Item 2	X	1	2
Item 3	4	2	X
Item 4	5	1	X
Item 5	3	5	5
Item 6	5	1	X

Part a.
What is the similarity of User 1 and User 3? of User 2 and User 3?

Part b.
Compute the predicted ratings by User 3 for items 3, 4 and 6. For each item, calculate the prediction using the ratings of both User 1 and User 2.

Part c.
In the equations used in Part a and Part b, the average rating by a user is subtracted from that user's ratings to account for shifts in the way different users rate (think "grade inflation"). Different users also use different sub-ranges of values for their ratings. For example User 2 has used the full range 1 through 5 in her ratings, while User 1 has only used the range 3 through 5 (think "grade compression"). How might the collaborative filtering predictions take into account the differences in the range of values users use? You need not work out a full set of equations -- just give some thoughts.