COS 435, Spring 2008: Problem Set 3

COS 435, Spring 2008 - Problem Set 3

Due at 5:00pm, Friday, March 7, 2008.

Collaboration Policy

You may discuss the general methods of solving the problems with other students in the class. However, each student must work out the details and write up his or her own solution to each problem independently.

Lateness Policy

A late penalty will be applied, unless there are extraordinary circumstances and/or prior arrangements:

Penalized 10% of the earned score if submitted by 5:00 pm Monday (3/10/08). (Note extended period of only 10% penalty.)
Penalized 25% of the earned score if submitted by 1:30 pm Wednesday(3/12/08).
Penalized 50% if submitted later than 1:30 pm Wednesday (3/12/08).

Problems

Problem 1:
Part A. Implement the HITS algorithm as an iterative algorithm as discussed in class. Use the iterative version appearing in slide 14 (page 7 of the posted pdf) of the class presentation. This should be a straightforward modification of your PageRank program for Assignment 2. Use the same convergence criterion that you used in Assignment 2, Problem 2, Part b and print the hub and authority values after each iteration. Run your HITS implementation for the same 3 graphs that you used in Assignment 2, Problem 2.

Part B. Compare your HITS results to your PageRank results of Assignment 2. Is the convergence criterion met more or less quickly? How do the hub and authority values of nodes compare to their PageRank, and is this more or less in agreement with your intuition? (Remember that the authority value is the primary value used for returning ranked results of a search.)

What to hand in: Hand in your program for Part a and files or printouts containing the output of this program for each of the graphs. Also hand in your answers to the questions in Part b.

Problem 2:
This problem is our class experiment with evaluating search engines. We will compare the two most popular search engines: Google, and Yahoo. (See comScore's January 2008 U.S. Search Engine Rankings.)   This is only meant to be an exercise, so I do not expect we can do a thorough enough job to call the study valid. But it will have the components of a full evaluation and hopefully we will get something interesting.

Part A: Choose an information need. The information need should require gathering information about a subject from several Web sites with good information. An examples of an activity that would provide an appropriate information need is doing a report for a course.   (I suggested in class that you record your recent information searches in anticipation of this exercise.) You should choose an information need that that you think is neither too easy nor too difficult for a search engine. For example, one expects looking for information on the flu to yeild essentially 100% relevant pages - too easy; conversely, looking for information on the history of the LaPaugh family in Europe might (at best) yield one relevant result in 20 - too hard.

Write a description of your information need that can be used to judge whether any given Web search result is relevant or not. (See the examples of TREC description and narrative sections on slides 14 and 15 of the presentation on evaluation of retrieval systems (pdf).) Once you have your information need described, write one query that you will use on both search engines to capture the information need.   The query should have the following properties:

The query should be constructed to avoid anticipated ambiguity. Use several search terms if necessary.
Do not use advanced search operations for either search engine. Exception: you may use quotes around phrases, e.g. "computer architecture".
Make sure you use the same query for both search engines.

Before proceeding to Part B, submit your description of information need and your query to Professor LaPaugh by email for approval. This is primarily to make sure no two people have the same information need or query.

Part B:
Run your query on on each of Google and Yahoo . Consider only the regular search results, not sponsored links. Also, ignore the clustering by site in Google; count each result returned. If you are having trouble with several results in languages other than English, you can go to the advanced search and choose English only, but then do this for both of the search engines. (In my trials, I did not get foreign-language results with a regular search, so this may not be an issue.) Record the first 30 results returned.

Pooling: To get a pool for hand assessment, take the the first 20 results from each search engine. Remove duplicates, and visit each result to decide relevance. Score each result as relevant or irrelevant.
Record the size of the pool (number of unique results produced by the combined results 1 - 20 of each search engine). Also record the number of relevant results in the pool.

Scoring: After constructing the pool, go back and score each of the first 30 results returned by each search engine based on your scoring of the pool. If a result does not appear in the pool, it receives a score of irrelevant. If a document appears twice under different URLs in the list for one search engine, count it only for it's better ranking for that search engine and delete any additional appearances within the same list. In this case there will be less than 30 distinct documents returned by the search engine. Do not go back to the search engine to get more documents. Keep only what was returned in the first 30, with their original ranks. For each search engine, calculate the following measures

Precision at rank 5
Precision at rank 10
Precision at rank 20
The average over the precisions at each point that a relevant docment appears in the first 20 (yes 20, not 30). See slide 11 of the presentation on evaluation of retrieval systems (pdf). (I will be computing the mean average precision over all your query results.)
If all the relevant docments in the pool appear in the first 30 returned results, record the rank at which the last relevant result appears. This is the point of 100% recall of the relevant documents in the pool. If some relevant documents are missing in the first 30 returned results, note this and record the recall of the relevant documents in the pool within these 30 results.

The first 4 measures are ways of capturing the quality of the first 20 results, which is about as far as most people look. The fifth measure gives credit to one search engine for finding relevant documents returned earlier by the other search engine.

What to hand in for Part B: Email to me and Joe the query used, the pool size and number of relevant result in the pool and for each search engine, the 5 scores described above. We will be averaging these results across the class, so please report each number on a separate line, clearly labeled as to what it is.

Part C:
What observations do you make about usability issues (user friendliness) of each search engine - separate from the quality of results you have been assessing in Part B?