COS 435, Spring 2002 - Problem Set 2

Due 3pm Thursday March 28, 2002.

Collaboration Policy

You may discuss problems with other students in the class. However, each student must write up his or her own solution to each problem independently. That is, while you may formulate the solutions to problems in collaboration with classmates, you must be able to articulate the solutions on your own.


Lateness Policy

A late penalty will be applied unless there are extraordinary circumstances or prior arrangements.

Problems

Problem 1 In this problem you will compute hub and authority weights and PageRanks for the following graph:

                                         |------<------^
                                         V             |
           Node_1 --->  Node_2 ---> Node_3 ---> Node_4-|
                             ^                         |
                             |                         |
                             |------------<------------

That is, there are edges from Node 1 to Node 2, from Node 2 to Node 3, from Node 3 to Node 4, from Node 4 to Node 2, and from Node 4 to Node 3.

Both hub and authority weights and PageRanks are obtained from iterative formulae. You may write a small program to calculate them, use a math computation package such as Matlab, or do the calculation by hand -- whatever you find easiest. If possible, give the values after each iteration to show how they converge. If you do the calculations by hand, you need to do enough iterations to see that the values are converging; however, I do not ask that you do more than 5 iterations. If you do the calculations by computer, say what convergence criterion you are using.
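If you go the programming route, a small sketch along the following lines may be a helpful starting point. This is a minimal Python sketch, not required code: the unit starting weights, the unit-length normalization of the hub/authority vectors, and the PageRank damping factor of 0.85 are my assumptions, so substitute whatever conventions we used in class.

    import math

    # Edges of the Problem 1 graph as (source, destination) pairs.
    edges = [(1, 2), (2, 3), (3, 4), (4, 2), (4, 3)]
    n = 4

    def hits(iterations=5):
        hub = [1.0] * n
        auth = [1.0] * n
        for _ in range(iterations):
            # Authority weight: sum of hub weights of pages linking here.
            auth = [sum(hub[u - 1] for (u, v) in edges if v == i + 1)
                    for i in range(n)]
            # Hub weight: sum of authority weights of pages linked to.
            hub = [sum(auth[v - 1] for (u, v) in edges if u == i + 1)
                   for i in range(n)]
            # Normalize each vector to unit length (one common convention).
            norm_a = math.sqrt(sum(x * x for x in auth))
            norm_h = math.sqrt(sum(x * x for x in hub))
            auth = [x / norm_a for x in auth]
            hub = [x / norm_h for x in hub]
            print("auth:", auth, "hub:", hub)
        return auth, hub

    def pagerank(iterations=5, d=0.85):
        out_degree = [sum(1 for (u, v) in edges if u == i + 1)
                      for i in range(n)]
        rank = [1.0 / n] * n
        for _ in range(iterations):
            rank = [(1 - d) / n
                    + d * sum(rank[u - 1] / out_degree[u - 1]
                              for (u, v) in edges if v == i + 1)
                    for i in range(n)]
            print("pagerank:", rank)
        return rank

    hits()
    pagerank()

Calling hits() and pagerank() prints the weights after every iteration, which is exactly the trace of converging values the problem asks you to report.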

Specifically calculate:

Problem 2 This problem is our class experiment with evaluating search engines. This is only meant to be an exercise, so I do not expect we can do a thorough enough job for the study to be valid. But it will have all the components of a full evaluation, and hopefully we will get something interesting.

Each student will run one query on each of three search engines: Google, Ask Jeeves, and MSN Search. I chose these for their popularity and their distinct underlying search indexes. Consider only the regular search results, not special listings such as MSN Search's "Top 10 Most Popular Sites for ..." or Ask Jeeves' "premier lists". Also, ignore the clustering by site in Google; count each page returned. (This is different from what I had originally said in class, but since the other search engines don't cluster, we will consider each page in a Google cluster separately.) If many of the pages returned are in languages other than English, you can go to the advanced search and choose English only, but then do this for all of the search engines. (In my trials, I did not get foreign-language pages with a regular search, so this may not be an issue.)

For each search engine, you will rate the first 30 documents returned. To get a pool for hand assessment, take the first 20 documents from each search engine. Collect the 60 documents from the 3 search engines, remove duplicates, and visit each page to decide relevance. Rate each page as one of:

After constructing the pool, go back and rate each of the first 30 documents returned by each search engine. If a document does not appear in the pool, it receives a rating of 0 (irrelevant). If a document appears twice under different URLs in the list for one search engine, count it only at its better rank for that search engine and delete any additional appearances within the same list. In this case there will be fewer than 30 distinct documents returned by the search engine. Do not go back to the search engine to get more documents; keep only what was returned in the first 30, in the order they were ranked, and give the now-empty final positions ratings of 0.

For each search engine, calculate the discounted cumulative gain (see the paper in your readings, "Evaluation by highly relevant documents") for each document rank from 1 through 30. You should turn in three length-30 lists of discounted cumulative gains -- one list for each search engine. Please do this by email so I can combine them easily. I will do the averaging across queries for each search engine and compare the results. In your email, also include any observations you think interesting or relevant about the search results overall or for a particular engine. Also save all the data you have collected -- just in case.
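If you would rather compute the DCG lists programmatically, here is a minimal Python sketch. It assumes the log-base-2 discount from that paper (no discount for ranks below the log base, gain divided by log2 of the rank from there on); the example ratings at the bottom are made up.

    import math

    def dcg_at_each_rank(ratings, b=2):
        # ratings[i] is the relevance rating of the document at rank i + 1;
        # pad with 0s (irrelevant) so the list covers exactly 30 ranks.
        ratings = (list(ratings) + [0] * 30)[:30]
        dcgs = []
        total = 0.0
        for i, gain in enumerate(ratings, start=1):
            # Ranks below the log base b get no discount; after that the
            # gain is divided by log_b of the rank.
            total += gain if i < b else gain / math.log(i, b)
            dcgs.append(total)
        return dcgs

    # Hypothetical ratings for one engine's first few results.
    print(dcg_at_each_rank([2, 0, 1, 3, 0, 1]))

The returned length-30 list is the per-rank DCG for one search engine, so running this once per engine gives the three lists to email.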

The query assigned to each participant is:

Note: if you have been assigned a query rather than having chosen one, and you really want to do something different, email me with the new query.

The Search Engine Watch site has a lot of useful information about search engines, including a list of which search sites use which engines -- go to "Reviews, Ratings & Tests", then to "Search Engine Alliances Chart" (thanks to Ben for pointing out this page).