COS 435, Spring 2002 - Problem Set 2

Due 3pm Thursday March 28, 2002.

Collaboration Policy

You may discuss problems with other students in the class. However, each student must write up his or her own solution to each problem independently. That is, while you may formulate the solutions to problems in collaboration with classmates, you must be able to articulate the solutions on your own.


Lateness Policy

A late penalty will be applied unless there are extraordinary circumstances or prior arrangements.

Problems

Problem 1 In this problem you will compute hub and authority weights and PageRanks for the following graph:

                                         |------<------^
                                         V             |
           Node_1 --->  Node_2 ---> Node_3 ---> Node_4-|
                             ^                         |
                             |                         |
                             |------------<------------

That is, there are edges from Node 1 to Node 2, from Node 2 to Node 3, from Node 3 to Node 4, from Node 4 to Node 2, and from Node 4 to Node 3.

Both hub and authority weights and PageRanks are obtained from iterative formulae. You may write a small program to calculate them, use a math computation package such as Matlab, or do the calculation by hand -- whatever you find easiest. If possible, give the values after each iteration to show how they converge. If you do the calculations by hand, you need to do enough iterations to see that the values are converging; however, I do not ask that you do more than 5 iterations. If you do the calculations by computer, say what convergence criterion you are using.
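If you go the programming route, a small sketch along the following lines may be a helpful starting point. This is a minimal Python sketch, not required code: the unit starting weights, the unit-length normalization of the hub/authority vectors, and the PageRank damping factor of 0.85 are my assumptions, so substitute whatever conventions we used in class.

    import math

    # Edges of the Problem 1 graph as (source, destination) pairs.
    edges = [(1, 2), (2, 3), (3, 4), (4, 2), (4, 3)]
    n = 4

    def hits(iterations=5):
        hub = [1.0] * n
        auth = [1.0] * n
        for _ in range(iterations):
            # Authority weight: sum of hub weights of pages linking here.
            auth = [sum(hub[u - 1] for (u, v) in edges if v == i + 1)
                    for i in range(n)]
            # Hub weight: sum of authority weights of pages linked to.
            hub = [sum(auth[v - 1] for (u, v) in edges if u == i + 1)
                   for i in range(n)]
            # Normalize each vector to unit length (one common convention).
            norm_a = math.sqrt(sum(x * x for x in auth))
            norm_h = math.sqrt(sum(x * x for x in hub))
            auth = [x / norm_a for x in auth]
            hub = [x / norm_h for x in hub]
            print("auth:", auth, "hub:", hub)
        return auth, hub

    def pagerank(iterations=5, d=0.85):
        out_degree = [sum(1 for (u, v) in edges if u == i + 1)
                      for i in range(n)]
        rank = [1.0 / n] * n
        for _ in range(iterations):
            rank = [(1 - d) / n
                    + d * sum(rank[u - 1] / out_degree[u - 1]
                              for (u, v) in edges if v == i + 1)
                    for i in range(n)]
            print("pagerank:", rank)
        return rank

    hits()
    pagerank()

Calling hits() and pagerank() prints the weights after every iteration, which is exactly the trace of converging values the problem asks you to report.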

Specifically calculate:

Problem 2 This problem is our class experiment with evaluating search engines. This is only meant to be an exercise, so I do not expect we can do a thorough enough job for the study to be valid. But it will have all the components of a full evaluation, and hopefully we will get something interesting.

Each student will run one query on each of three search engines: Google, Ask Jeeves, and MSN Search. I chose these for their popularity and their distinct underlying search indexes. Consider only the regular search results, not special listings such as MSN Search's "Top 10 Most Popular Sites for ..." or Ask Jeeves' "premier lists". Also, ignore the clustering by site in Google; count each page returned. (This is different from what I had originally said in class, but since the other search engines don't cluster, we will consider each page in a Google cluster separately.) If many of the pages returned are in languages other than English, you can go to the advanced search and choose English only, but then do this for all of the search engines. (In my trials, I did not get foreign-language pages with a regular search, so this may not be an issue.)

For each search engine, you will rate the first 30 documents returned. To get a pool for hand assessment, take the first 20 documents from each search engine. Collect the 60 documents from the 3 search engines, remove duplicates, and visit each page to decide relevance. Rate each page as one of:

After constructing the pool, go back and rate each of the first 30 documents returned by each search engine. If a document does not appear in the pool, it receives a rating of 0 (irrelevant). If a document appears twice under different URLs in the list for one search engine, count it only at its better rank for that search engine and delete any additional appearances within the same list. In this case there will be fewer than 30 distinct documents returned by the search engine. Do not go back to the search engine to get more documents; keep only what was returned in the first 30, in the order they were ranked, and give the now-empty final positions ratings of 0.

For each search engine, calculate the discounted cumulative gain (see the paper in your readings, "Evaluation by highly relevant documents") for each document rank from 1 through 30. You should turn in three length-30 lists of discounted cumulative gains -- one list for each search engine. Please do this by email so I can combine them easily. I will do the averaging across queries for each search engine and compare the results. In your email, also include any observations you think interesting or relevant about the search results overall or for a particular engine. Also save all the data you have collected -- just in case.
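If you would rather compute the DCG lists programmatically, here is a minimal Python sketch. It assumes the log-base-2 discount from that paper (no discount for ranks below the log base, gain divided by log2 of the rank from there on); the example ratings at the bottom are made up.

    import math

    def dcg_at_each_rank(ratings, b=2):
        # ratings[i] is the relevance rating of the document at rank i + 1;
        # pad with 0s (irrelevant) so the list covers exactly 30 ranks.
        ratings = (list(ratings) + [0] * 30)[:30]
        dcgs = []
        total = 0.0
        for i, gain in enumerate(ratings, start=1):
            # Ranks below the log base b get no discount; after that the
            # gain is divided by log_b of the rank.
            total += gain if i < b else gain / math.log(i, b)
            dcgs.append(total)
        return dcgs

    # Hypothetical ratings for one engine's first few results.
    print(dcg_at_each_rank([2, 0, 1, 3, 0, 1]))

The returned length-30 list is the per-rank DCG for one search engine, so running this once per engine gives the three lists to email.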

The query assigned to each participant is:

Note: if you have been assigned a query rather than having chosen one, and you really want to do something different, email me with the new query.

The Search Engine Watch site has a lot of useful information about search engines, including a list of which search sites use which engines -- go to "Reviews, Ratings & Tests", then to "Search Engine Alliances Chart" (thanks to Ben for pointing out this page).