You may discuss the general methods of solving the problems with
other students in the class. However, each student must work out
the details and write up his or her own solution to each problem
independently.
Some problems have been used in previous offerings of COS 435.
You are NOT allowed to use any solutions posted for previous
offerings of COS 435 or any solutions produced by anyone else for
the assigned problems. You may use other reference
materials; you must give citations to all reference materials that
you use.
Problem 1 (Compression
)
Part a.
Give the variable-byte
encoding for each of these integers: 4100, 290-1, 2128-1. How many bits does
each encoded integer require?
How many bytes?
Part b.
Give the Elias-delta
code for the same integers as in Part a. How many bits
does each encoded integer require?
How
many bytes (no fractions of bytes!)? Note that Introduction
to
Information Retrieval does not cover
Elias-delta encoding. Please use the class notes posted under March 5 (notes
on compression, Part 2) for the definitions of Elias-gamma and
Elias-delta encodings.
Part c. Decode the
following sequence of numbers encoded with the
Elias-delta encoding:
100111100100101101010
Problem 2 (Web crawling) (from a 2010 exam 2 problem)
Consider a Web crawler
that uses F different priority levels for fetching URLs, based
on the frequency of change of the URL. The crawler also uses different minimum
delays between requests for different hosts. It will contact a
host known to have a large capacity for handling requests more
frequently than a host that has less capacity. For each host,
h, that has been contacted, the earliest next contact time th
is recorded. Assume the fetching
priorities and minimum delays between requests are independent.
Part A: In the Mercator
Web crawler, the URL frontier is managed by two sets of first in
first out (FIFO) queues. Give
an example in which a crawler thread must wait before fetching a
URL even though there is a URL that could be fetched immediately
in one of the front queues.
Your example should show as much of the state of the
front and back queues as necessary to make clear that the state
is legal and crawler thread is waiting unnecessarily.
Part B: Consider replacing each FIFO
front queue in the
Mercator URL frontier with a priority queue that is sorted on
earliest next contact time of the host of each URL in the queue. That is, the next
element removed from the kth front priority queues is
the URL with the earliest next contact time among all the URLs
with fetch priority k. Does
this eliminate the situation that a crawler thread must wait
before fetching a URL even though there is a URL that could be
fetched immediately in one of the front queues? Does using such a
priority queue for each of the F front queues cause new
problems? Explain
all your answers.
Part C: What information is
needed to compute the earliest next contact times for
all previously seen hosts?
Where is this information stored? When is it updated? Be explicit.