COS 435, Spring 2011 - Problem Set 4

Due at 1:30pm,  Wednesday,  March  30,  2011


Collaboration and Reference Policy

You may discuss the general methods of solving the problems with other students in the class. However, each student must work out the details and write up his or her own solution to each problem independently.

Some problems have been used in previous offerings of COS 435. You are NOT allowed to use any solutions posted for previous offerings of COS 435 or any solutions produced by anyone else for the assigned problems.   You may use other reference materials; you must give citations to all reference materials that you use.


 

Lateness Policy

A late penalty will be applied, unless there are extraordinary circumstances and/or prior arrangements:
 


Problems


Problem 1 (Distributed computing)

Part a.
A collection
of Web documents is processed  and one of the results is a set of tuples of the form: (docURL,  list of URLs), one tuple per docURL.  The list of URLs is the list of all URLs that the document identified by docURL links to, i.e. the list of URLs appear as values of "href = ..." in the document.  This is the "to list"  - the list of URLs the document points to.   We want to produce the "from list" for the collection in the form (docURL, list of URLs), where the list of URLs is the list of all URLs that point to the document
identified by docURL.  

Use the MapReduce programming model to map the set of tuples (docURL, "to list" of URLs) to the set of tuples (docURL, "from list" of URLs).

Part b.
A problem with using URLs is that a document may be referred to by several URLs, e.g. due to mirror sites or shortcuts.  What we really want is a set of tuples (docID, "from list" of docIDs), where each document has a unique docID.   Suppose that the "to list" tuples are of the form (docID,  list of URLs), built by processing the document by docID but leaving the URLs appearing in the document as is.   Suppose there is also a set of tuples (URL, docID) giving the unique document ID associated with each URL.  Give one or more successive MapReduce computations to build the "from list" as
tuples (docID, "from list" of docIDs) from the new "to list" and URL-docID mappingYou may build on Part a or start over.

Notes: 
Slide 10 of the distributed query execution and index building slides suggests that the results of Reduce must be in terms of the intermediate key.  The intermediate key and the output key are the same for the inverted index example, but not in general.  The form of the output is decided by the user writing the Reduce function.   Also, successive MapReduce computations need not have their inputs restricted to the output of the previous MapReduce computation.


Problem 2 (Compression
)

Part a.
 
Give the variable-byte encoding  for each of these integers:  60, 32774
, 230-1.  How many bits does each encoded integer require?   How many bytes?

Part b.
 
Give the Elias-gamma code for the same integers as in Part a.   How many bits does each encoded integer require?    How many bytes (no fractions of bytes!)?


Part c.
Give the Elias-delta code for the same integers as in Part a.   How many bits does each encoded integer require?     How many bytes?  Note that Introduction to Information Retrieval  does not cover Elias-delta encoding.  Please use the class notes posted under March 2 for the definitions of Elias-gamma and Elias-delta encodings.

Part d.

Decode the following sequence of numbers encoded with the  Elias-delta encoding:

100111100100101101010