## COS 435, Spring 2012 - Problem Set 4

Due at 1:30pm, Wednesday, April 4, 2012.

## Collaboration and Reference Policy

You may discuss the general methods of solving the problems with other students in the class. However, each student must work out the details and write up his or her own solution to each problem independently.

Some problems have been used in previous offerings of COS 435. You are NOT allowed to use any solutions posted for previous offerings of COS 435 or any solutions produced by anyone else for the assigned problems.   You may use other reference materials; you must give citations to all reference materials that you use.

## Lateness Policy

A late penalty will be applied, unless there are extraordinary circumstances and/or prior arrangements:
• No penalty if in Prof. LaPaugh's office by 5pm Wednesday (4/4/12).
• Penalized 10% of the earned score if submitted by 11:59pm Wednesday (4/4/12).
• Penalized 25% of the earned score if submitted by 5pm Friday (4/6/12).
• Penalized 50% if submitted later than 5pm Friday (4/6/12).

## Problems

Problem 1 (Compression)

Part a.

Give the variable-byte encoding for each of these integers: 4100, 2^90 - 1, 2^128 - 1. How many bits does each encoded integer require? How many bytes?
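
As one way to check your byte counts, here is a minimal Python sketch of variable-byte encoding under the convention used in *Introduction to Information Retrieval* (7 payload bits per byte; the continuation bit is 0 on every byte except the last, where it is 1). The function name is just illustrative.

```python
def vb_encode(n):
    """Variable-byte encode a nonnegative integer: 7 payload bits per
    byte; the high (continuation) bit is set to 1 only on the last byte."""
    out = []
    while True:
        out.insert(0, n % 128)   # prepend the low-order 7 bits
        if n < 128:
            break
        n //= 128
    out[-1] += 128               # mark the final byte
    return bytes(out)
```

For example, `vb_encode(214577)` produces the three bytes 00001101 00001100 10110001, matching the worked example in the book.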

Part b.

Give the Elias-delta code for the same integers as in Part a. How many bits does each encoded integer require? How many bytes (no fractions of bytes!)? Note that *Introduction to Information Retrieval* does not cover Elias-delta encoding. Please use the class notes posted under March 5 (notes on compression, Part 2) for the definitions of the Elias-gamma and Elias-delta encodings.
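
If it helps to check your hand encodings, here is a small Python sketch of the two codes under the standard definitions (gamma: length-minus-one zeros followed by the binary representation; delta: the gamma code of the bit length followed by the binary representation without its leading 1). Confirm these agree with the class notes before relying on them.

```python
def elias_gamma(n):
    """Elias-gamma code of n >= 1: (len-1) zeros, then n in binary."""
    b = format(n, "b")
    return "0" * (len(b) - 1) + b

def elias_delta(n):
    """Elias-delta code of n >= 1: gamma code of n's bit length,
    then n's binary representation with its leading 1 dropped."""
    b = format(n, "b")
    return elias_gamma(len(b)) + b[1:]
```

For example, `elias_gamma(9)` is `0001001` and `elias_delta(10)` is `00100010`.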

Part c.
Decode the following sequence of numbers encoded with the  Elias-delta encoding:

100111100100101101010
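
Decoding reverses the construction: count the leading zeros to gamma-decode the bit length of the next number, then read that many remaining bits with the implicit leading 1 reattached. A sketch you can use to verify your hand decoding (shown here on a different example, not the sequence above):

```python
def elias_delta_decode(bits):
    """Decode a concatenation of Elias-delta codewords into a list of ints."""
    out, i = [], 0
    while i < len(bits):
        # Gamma-decode the bit length of the next number.
        zeros = 0
        while bits[i + zeros] == "0":
            zeros += 1
        nbits = int(bits[i + zeros : i + 2 * zeros + 1], 2)
        i += 2 * zeros + 1
        # Reattach the implicit leading 1, then read the low-order bits.
        out.append(int("1" + bits[i : i + nbits - 1], 2))
        i += nbits - 1
    return out
```

For instance, `elias_delta_decode("100100010")` splits the string into the codewords `1` and `00100010`, giving the integers 1 and 10.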

Problem 2 (Web crawling) (from a 2010 exam 2 problem)

Consider a Web crawler that uses F different priority levels for fetching URLs, based on the frequency of change of the URL. The crawler also uses different minimum delays between requests for different hosts. It will contact a host known to have a large capacity for handling requests more frequently than a host that has less capacity. For each host h that has been contacted, the earliest next contact time t_h is recorded. Assume the fetching priorities and minimum delays between requests are independent.
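
For concreteness, the per-host bookkeeping described above can be sketched as follows; the dictionary names and the example delays are purely hypothetical, not part of the problem.

```python
# Hypothetical per-host state: min_delay[h] reflects the host's capacity
# (a high-capacity host gets a shorter minimum delay); t_next[h] plays
# the role of the earliest next contact time t_h in the problem.
min_delay = {"big-host.example": 1.0, "small-host.example": 10.0}
t_next = {}

def record_contact(host, now):
    """After contacting `host` at time `now`, record its earliest next contact time."""
    t_next[host] = now + min_delay[host]

def may_contact(host, now):
    """A URL on `host` may be fetched only once its next contact time has passed."""
    return now >= t_next.get(host, 0.0)
```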

Part A: In the Mercator Web crawler, the URL frontier is managed by two sets of first-in, first-out (FIFO) queues. Give an example in which a crawler thread must wait before fetching a URL even though there is a URL in one of the front queues that could be fetched immediately. Your example should show as much of the state of the front and back queues as necessary to make clear both that the state is legal and that the crawler thread is waiting unnecessarily.

Part B: Consider replacing each FIFO front queue in the Mercator URL frontier with a priority queue sorted on the earliest next contact time of the host of each URL in the queue. That is, the next element removed from the kth front priority queue is the URL with the earliest next contact time among all the URLs with fetch priority k. Does this eliminate the situation in which a crawler thread must wait before fetching a URL even though there is a URL in one of the front queues that could be fetched immediately? Does using such a priority queue for each of the F front queues cause new problems? Explain all your answers.
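
To make the proposed data structure concrete (this is only an illustration of what "a priority queue sorted on earliest next contact time" means, not an answer to the question), one front queue might look like:

```python
import heapq

class FrontPriorityQueue:
    """One front queue for a single fetch priority level, ordered by the
    earliest next contact time t_h of each URL's host (names are illustrative)."""

    def __init__(self):
        self._heap = []  # entries: (t_h of the URL's host, URL)

    def push(self, url, t_host):
        heapq.heappush(self._heap, (t_host, url))

    def pop(self):
        """Remove and return the URL whose host can be contacted soonest."""
        t_host, url = heapq.heappop(self._heap)
        return url
```

When answering the question, consider how the ordering key relates to the state of the hosts at the time a URL is actually removed.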

Part C: What information is needed to compute the earliest next contact times for all previously seen hosts? Where is this information stored? When is it updated? Be explicit.