COS
435, Spring 2009 - Problem Set 4
Due
at 3:00pm, Thursday,
April 2, 2009
Collaboration and Reference
Policy
You may discuss the
general
methods of solving the problems with other students in the class.
However, each
student must work out the details and write up his or her own solution
to each
problem independently.
Some problems have
been used in
previous offerings of COS 435. You are NOT allowed to use any solutions
posted
for previous offerings of COS 435 or any solutions produced by anyone
else for
the assigned problems. You may use
other reference materials; you must give citations to all reference
materials
that you use.
Lateness Policy
A late penalty will be applied,
unless there are extraordinary circumstances and/or prior arrangements:
- Penalized 10%
of the earned score if submitted by noon Friday (4/3/09).
- Penalized 25% of the earned score if
submitted by noon Monday (4/6/09).
- Penalized 50% if submitted later than
noon Monday (4/6/09).
Problems
Problem 1 (Distributed Index
Construction)
Part a.
The technique of distributed index construction using MapReduce follows the block sort-based algorithm for
building an index by assembling and sorting (term,docID) pairs. A
second algorithm, single-pass in
memory indexing, is described in Section 4.3 of Introduction
to Information Retrieval. Give a high-level
description of a method of implementing single-pass in memory
indexing as a distributed computation. Your
method should be able to use a variable number of machines. Be
clear about what is written to disk as intermediate results. The
final result should be the full inverted index, sorted by term, on disk.
Part b.
Modify your method of Part
a to construct a collection of inverted indexes, each recording
term occurrences for a subset of documents. The full set of
documents is randomly partitioned into these subsets, which do not
overlap.
Problem 2 (Compression )
Part a.
Give the variable-byte encoding
for each of these integers: 60, 4100, 230-1, 264-1.
How many bits
does each encoded integer require?
Part b.
Give the
Elias-delta code for the same integers as in Part a.
How many bits does each encoded integer require? Note
that Introduction
to Information Retrieval does not cover Elias-delta
encoding. Please use the class notes posted under March 5
for the
definitions of Elias-gamma and Elias-delta encodings.
Problem 3 (Collaborative Filtering)
For this problem you need the equations for user-based
"memory-based" collaborative filtering . The equations were given
in class, and the slides are posted under March 24.
The table below gives the ratings of six items by three users.
The rating scale is 1 (poorest) through 5 (best) and "X" indicates not
rated.
|
User 1
|
User 2 |
User 3 |
Item 1 |
3
|
X
|
2
|
Item 2 |
X
|
1
|
2
|
Item 3 |
4
|
2
|
X
|
Item 4 |
5
|
1
|
X
|
Item 5 |
3
|
5
|
5
|
Item 6 |
5
|
1
|
X
|
Part a.
What is the similarity of User 1 and User 3? of User 2 and User
3?
Part b.
Compute the predicted ratings by User 3 for items 3, 4 and
6. For each item, calculate the prediction
using the ratings of both User 1 and User 2.
Part c.
In the equations used in Part a and Part b, the average rating by a
user is subtracted from that user's ratings to account for
shifts in the way different users rate (think "grade inflation").
Different users also use different sub-ranges of values for their
ratings. For example User 2 has used the full range 1 through 5
in her ratings, while User 1 has only used the range 3 through 5 (think
"grade compression"). How might the collaborative filtering
predictions take into account the differences in the range of values
users use? You need not work out a full set of equations -- just
give some thoughts.