COS 435, Spring 2009 - Problem Set 3

Due at 3:00pm, Tuesday, March 3, 2009


Collaboration and Reference Policy

You may discuss the general methods of solving the problems with other students in the class. However, each student must work out the details and write up his or her own solution to each problem independently.

Some problems have been used in previous offerings of COS 435. You are NOT allowed to use any solutions posted for previous offerings of COS 435 or any solutions produced by anyone else for the assigned problems.   You may use other reference materials; you must give citations to all reference materials that you use.


 

Lateness Policy

A late penalty will be applied unless there are extraordinary circumstances and/or prior arrangements.

 


Problems

Problem 1. 

Part a. Implement the PageRank algorithm in Java (preferred) or C as an iterative algorithm, as discussed in class.  There are 3 graphs, described in a pdf file, on which to test your algorithm.  Text input files for these graphs are in files HW3_graph1.txt, HW3_graph2.txt, and HW3_graph3.txt.  For each graph, run the algorithm for 50 iterations.
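As a rough illustration only (not the required solution), here is a minimal Java sketch of the iterative computation.  It assumes the input files list one directed edge per line as a pair of integer node ids ("source target"), uses a damping factor of 0.85, and starts from a uniform vector; check the actual file format in the pdf and the conventions given in class before adapting anything here.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.StringTokenizer;
    import java.util.TreeSet;

    public class PageRank {
        public static void main(String[] args) throws IOException {
            // Assumed input format: one directed edge "source target" per line;
            // check the actual format described in the assignment's pdf.
            Map<Integer, List<Integer>> outLinks = new HashMap<Integer, List<Integer>>();
            Set<Integer> nodes = new TreeSet<Integer>();
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().length() == 0) continue;
                StringTokenizer st = new StringTokenizer(line);
                int u = Integer.parseInt(st.nextToken());
                int v = Integer.parseInt(st.nextToken());
                List<Integer> out = outLinks.get(u);
                if (out == null) { out = new ArrayList<Integer>(); outLinks.put(u, out); }
                out.add(v);
                nodes.add(u);
                nodes.add(v);
            }
            in.close();

            int n = nodes.size();
            double d = 0.85;  // damping factor (an assumed value; use what was given in class)
            Map<Integer, Double> pr = new HashMap<Integer, Double>();
            for (int v : nodes) pr.put(v, 1.0 / n);   // uniform initial PageRank

            for (int iter = 0; iter < 50; iter++) {
                Map<Integer, Double> next = new HashMap<Integer, Double>();
                for (int v : nodes) next.put(v, (1 - d) / n);   // random-jump term
                for (int u : nodes) {
                    List<Integer> out = outLinks.get(u);
                    if (out == null || out.isEmpty()) {
                        // Dangling node: one common convention spreads its rank over all nodes.
                        for (int v : nodes) next.put(v, next.get(v) + d * pr.get(u) / n);
                    } else {
                        for (int v : out) next.put(v, next.get(v) + d * pr.get(u) / out.size());
                    }
                }
                pr = next;
            }
            for (int v : nodes) System.out.println(v + "\t" + pr.get(v));
        }
    }

Run as, for example, java PageRank HW3_graph1.txt; it prints each node id with its PageRank value after 50 iterations, one per line.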

Part b. Examine the data you obtained in Part a.  Has the computation for each graph converged?  State what you think the convergence criterion should be in terms of the difference between PageRank values of successive iterations.  This convergence criterion should apply to all inputs, not just a particular graph.
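If it helps when examining the data, one common way to quantify the change between successive iterations is the L1 distance between the old and new PageRank vectors (stored here as maps from node id to value, as in the sketch above); whether this, a maximum per-node change, or something else is the right criterion is exactly what Part b asks you to argue.

    import java.util.Map;

    public class Convergence {
        // L1 distance between two PageRank vectors keyed by node id.
        // Declaring convergence when this drops below a small threshold
        // (e.g. 1e-6) is one common choice; the threshold is illustrative only.
        static double l1Change(Map<Integer, Double> oldPr, Map<Integer, Double> newPr) {
            double delta = 0.0;
            for (Map.Entry<Integer, Double> e : newPr.entrySet()) {
                delta += Math.abs(e.getValue() - oldPr.get(e.getKey()));
            }
            return delta;
        }
    }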
 
Part c.  Examine the PageRank results for the three graphs.  Do they match your intuition about the importance of various nodes?  Explain.   How does the difference in structure between Graph 2 and Graph 3 affect their PageRanks?   Does this make sense?
 

What to hand in:  Hand in your program for Part a and files or printouts containing the output of this program for each of the graphs.  Also hand in your answers to the questions in Part b and Part c.


Problem 2.

This problem is our class experiment with evaluating search engines.  As voted in class, we will compare two popular search engines: Google and MSN-Live Search.  (You may be interested in comScore's January 2009 U.S. Search Engine Rankings.)  This is only meant to be an exercise, so I do not expect we can do a thorough enough job to call the study valid.  But it will have the components of a full evaluation, and hopefully we will get something interesting.

Part A:  Choose an information need.  The information need should require gathering information about a subject from several Web sites with good information.  An example of an activity that would provide an appropriate information need is doing a report for a course.  You should choose an information need that you think is neither too easy nor too difficult for a search engine.  For example, one expects looking for information on the flu to yield essentially 100% relevant pages - too easy; conversely, looking for information on the history of the LaPaugh family in Europe might (at best) yield one relevant result in 20 - too hard.

Write a description of your information need that can be used to judge whether any given Web search result is relevant or not.  (See the examples of TREC description and narrative sections in slides 23 and 24 of the class presentation on the evaluation of retrieval systems.)  Once you have your information need described, write one query that you will use on both search engines to capture the information need.   The query should have the following properties:

Before proceeding to Part B, submit your description of information need and your query to Professor LaPaugh by email for approval.  This is primarily to make sure no two people have the same information need or query.

Part B:
Run your query on each of Google and MSN-Live Search.   Consider only the regular search results, not sponsored links. Ignore “image results”, “video results”, and “news results”  - these are not counted as part of the first 10 results on the first results page.  Also, ignore the clustering by site in Google; count each result returned.   If you are having trouble with several results in languages other than English, you can go to the advanced search and choose English only, but then do this for both of the search engines. (In my trials, I did not get foreign-language results with a regular search, so this may not be an issue.)   Record the first 20 results returned.

Pooling:  To get a pool for hand assessment, take the first 15 results from each search engine.  Remove duplicates, and visit each result to decide relevance.  Score each result as relevant or irrelevant according to your description of Part A.   Record the size of the pool (number of unique results produced by the combined results 1 - 15 of each search engine).  Also record the number of relevant results in the pool.
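If you want to keep the bookkeeping straight, the pooling step amounts to taking the union of the two top-15 lists; a small Java sketch is below.  The URL strings are placeholders, and the relevance judgments themselves are still made by hand against your Part A description.

    import java.util.Arrays;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    public class Pooling {
        public static void main(String[] args) {
            // Hypothetical top-15 URL lists for the two engines (placeholders only).
            List<String> engine1Top15 = Arrays.asList("urlA", "urlB", "urlC" /* ... 15 in all */);
            List<String> engine2Top15 = Arrays.asList("urlB", "urlD", "urlE" /* ... 15 in all */);

            // The pool is the union of the two top-15 lists with duplicates removed.
            Set<String> pool = new LinkedHashSet<String>();
            pool.addAll(engine1Top15);
            pool.addAll(engine2Top15);
            System.out.println("Pool size: " + pool.size());

            // Each pooled URL is then visited and judged relevant or irrelevant
            // by hand against the Part A description; record how many are relevant.
        }
    }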

Scoring:  After constructing the pool, go back and score each of the first 20 results returned by each search engine based on your scoring of the pool.  If a result does not appear in the pool, it receives a score of irrelevant.  If a document appears twice under different URLs in the list for one search engine, count it only at its better rank for that search engine and delete any additional appearances within the same list.  In this case there will be fewer than 20 distinct documents returned by the search engine.  Do not go back to the search engine to get more documents.  Keep only what was returned in the first 20, with their original ranks.  For each search engine, calculate the following measures:

  1. Precision at rank 5
  2. Precision at rank 10
  3. Precision at rank 15
  4. If all the relevant documents in the pool appear in the first 20 returned results, record the rank at which the last relevant result appears.  This is the point of 100% recall of the relevant documents in the pool.  If some relevant documents are missing in the first 20 returned results, note this and record the recall of the relevant documents in the pool within these 20 results.

The first 3 measures are ways of capturing the quality of the highly ranked results.  The fourth measure gives credit to one search engine for finding relevant documents returned earlier by the other search engine. 
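As a sketch of the arithmetic only, the following Java fragment shows how the four measures could be computed once each of the first 20 results from one engine has been judged against the pool; the relevance values and the pool count in main are made-up placeholders, not data from any real run.

    import java.util.Arrays;
    import java.util.List;

    public class Scores {
        // Precision at rank k: fraction of the first k results judged relevant.
        static double precisionAt(List<Boolean> relevant, int k) {
            int hits = 0;
            for (int i = 0; i < k && i < relevant.size(); i++) {
                if (relevant.get(i)) hits++;
            }
            return (double) hits / k;
        }

        public static void main(String[] args) {
            // Hypothetical relevance judgments for one engine's first 20 results
            // (true = judged relevant according to the pooled assessment).
            List<Boolean> rel = Arrays.asList(
                true, false, true, true, false, true, false, false, true, false,
                false, true, false, false, true, false, false, false, false, false);

            System.out.println("Precision at 5:  " + precisionAt(rel, 5));
            System.out.println("Precision at 10: " + precisionAt(rel, 10));
            System.out.println("Precision at 15: " + precisionAt(rel, 15));

            // For the fourth measure: the rank of the last relevant result, and the
            // recall of the pool's relevant documents within these 20 results.
            int relevantInPool = 10;   // placeholder; use the count from your own pool
            int relevantFound = 0, lastRelevantRank = -1;
            for (int i = 0; i < rel.size(); i++) {
                if (rel.get(i)) { relevantFound++; lastRelevantRank = i + 1; }
            }
            System.out.println("Recall within first 20: " + (double) relevantFound / relevantInPool);
            System.out.println("Rank of last relevant result: " + lastRelevantRank);
        }
    }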

What to hand in for Part B:  Email to Professor LaPaugh and Chong Wang the query used, the pool size and the number of relevant results in the pool, and, for each search engine, the 4 scores described above.  These results will be averaged across the class, so please report each number on a separate line, clearly labeled as to what it is.

Part C: 
What observations do you make about usability issues  (user friendliness) of each search engine - separate from the quality of results you have been assessing in Part B?