COS 435 Spr 09: PS 4

COS 435, Spring 2009 - Problem Set 4

Due at 3:00pm, Thursday, April 2, 2009

Collaboration and Reference Policy

You may discuss the general methods of solving the problems with other students in the class. However, each student must work out the details and write up his or her own solution to each problem independently.

Some problems have been used in previous offerings of COS 435. You are NOT allowed to use any solutions posted for previous offerings of COS 435 or any solutions produced by anyone else for the assigned problems. You may use other reference materials; you must give citations to all reference materials that you use.

Lateness Policy

A late penalty will be applied, unless there are extraordinary circumstances and/or prior arrangements:

Penalized 10% of the earned score if submitted by noon Friday (4/3/09).
Penalized 25% of the earned score if submitted by noon Monday (4/6/09).
Penalized 50% if submitted later than noon Monday (4/6/09).

Problems

Problem 1 (Distributed Index Construction)

Part a.

The technique of distributed index construction using MapReduce follows the block sort-based algorithm for building an index by assembling and sorting (term,docID) pairs. A second algorithm, single-pass in memory indexing, is described in Section 4.3 of Introduction to Information Retrieval. Give a high-level description of a method of implementing single-pass in memory indexing as a distributed computation. Your method should be able to use a variable number of machines. Be clear about what is written to disk as intermediate results. The final result should be the full inverted index, sorted by term, on disk.

Part b.

Modify your method of Part a to construct a collection of inverted indexes, each recording term occurrences for a subset of documents. The full set of documents is randomly partitioned into these subsets, which do not overlap.

Problem 2 (Compression )

Part a.
Give the variable-byte encoding for each of these integers: 60, 4100, 2³⁰-1, 2⁶⁴-1. How many bits does each encoded integer require?

Part b.
Give the Elias-delta code for the same integers as in Part a. How many bits does each encoded integer require? Note that Introduction to Information Retrieval does not cover Elias-delta encoding. Please use the class notes posted under March 5 for the definitions of Elias-gamma and Elias-delta encodings.

Problem 3 (Collaborative Filtering)

For this problem you need the equations for user-based "memory-based" collaborative filtering . The equations were given in class, and the slides are posted under March 24.

The table below gives the ratings of six items by three users. The rating scale is 1 (poorest) through 5 (best) and "X" indicates not rated.

	User 1	User 2	User 3
Item 1	3	X	2
Item 2	X	1	2
Item 3	4	2	X
Item 4	5	1	X
Item 5	3	5	5
Item 6	5	1	X

Part a.
What is the similarity of User 1 and User 3? of User 2 and User 3?

Part b.
Compute the predicted ratings by User 3 for items 3, 4 and 6. For each item, calculate the prediction using the ratings of both User 1 and User 2.

Part c.
In the equations used in Part a and Part b, the average rating by a user is subtracted from that user's ratings to account for shifts in the way different users rate (think "grade inflation"). Different users also use different sub-ranges of values for their ratings. For example User 2 has used the full range 1 through 5 in her ratings, while User 1 has only used the range 3 through 5 (think "grade compression"). How might the collaborative filtering predictions take into account the differences in the range of values users use? You need not work out a full set of equations -- just give some thoughts.