COS 435, Spring 2002: Problem Sets

COS 435, Spring 2002 - Problem Set 4

Due at 11am, Monday April 22, 2002.

Collaboration Policy

You may discuss problems with other students in the class. However, each student must write up his or her own solution to each problem independently. That is, while you may formulate the solutions to problems in collaboration with classmates, you must be able to articulate the solutions on your own.

Lateness Policy

A late penalty will be applied, unless there are extraordinary circumstances and/or prior arrangements:

Penalized 10% of the earned score if submitted after class but by 5pm Monday.
Penalized 30% of the earned score if submitted by 11am Wednesday (4/24/02).
No credit if submitted later than 11am Wednesday (2/24/02).

Problems

Problem 1 Section 12.5.2 of Modern Information Retreival describes the use of RED, GREEN, and BLUE averages for an image to serve as features on which images can be compared. One problem with this approach is that a picture that has all pixels with RED at 50% intensity and BLUE at 50% intensity (and no GREEN) will "look" the same as an image with all pixels in the left half at full RED intensity (no BLUE or GREEN) and all pixels in the right half at full BLUE intensity (no RED or GREEN). Suppose that instead we divide the image into four regions by cutting it in half vertically and horizontally and take the RED, BLUE and GREEN averages for each region.

Part a How would two images be compared if we use RED, BLUE and GREEN averages in each of the four regions for each image (so that an image is characterized by 3 vectors of 4 components: one vector for RED, one for BLUE, and one for GREEN)? What would be the similarity function?

Part b Would the use of four regions help the inaccuracy described at the beginning of this problem? Why or why not? In general, what are the pros and cons of dividing each image into four regions?

Problem 2 Consider a "wagon-wheel graph": a graph consisting of 6 nodes -- a, b, c, d, e, f -- in a cycle and a seventh node, x, connected to each of the six nodes. The edges have the following similarity weights (higher number is more similar):

(a,b) has weight 10 
(b,c) has weight 4 
(c,d) has weight 4 
(d,e) has weight 10
(e,f) has weight 4 
(f,a) has weight 4 

(x,a) has weight 1
(x,b) has weight 1
(x,c) has weight 8
(x,d) has weight 1
(x,e) has weight 1
(x,f) has weight 8

Find clusters for this graph using hierarchical agglomerative clustering. In specific, use the algorithm which starts with n clusters for an n-node graph, one node in each cluster, and in each of n-1 stages merges the two clusters that are most similar (breaking ties arbitrarily). Measure the similarity between clusters as the similarity of the most similar pair of nodes, one from each cluster. Show the result of each of the n-1 stages.

Problem 3

Part a Find the minimum cut tree of the graph of Problem 2.

Part b Cluster the graph of Problem 2 using the minimum cut tree of Part a. Use the clustering algorithm that finds the centroid of the minimum cut tree and removes it to create the first-level clustering. How many levels of clustering can you get using this minimum cut tree?

Problem 4 Consider the graph that is simply a 4-cycle -- a,b,c,d -- with edge similarity weights:

(a,b) has weight 4
(b,c) has weight 1
(c,d) has weight 2
(d,a) has weight 2

Cluster this graph by applying the clustering algorithm based on random walks, as described in
David Harel and Yehuda Koren, Clustering spatial data using random walks, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001, pg. 281 - 286.
Use neighborhood similarity with k=2. That is, compute the values

  <=2
P      (i,j)
 visit

for each pair of nodes (i,j). Use the similarity measure derived from these values and a threshold of your choosing to decompose the graph into clusters. Why did you choose that threshold? Are there other reasonable choices?