COS 435, Spring 2009 - Problem Set 1

Due at 3:00pm,  Thursday,  Feb. 12,  2009.

Collaboration and Reference Policy

You may discuss the general methods of solving the problems with other students in the class. However, each student must work out the details and write up his or her own solution to each problem independently.

Some problems have been used in previous offerings of COS 435. You are NOT allowed to use any solutions posted for previous offerings of COS 435 or any solutions produced by anyone else for the assigned problems.   You may use other reference materials; you must give citations to all reference materials that you use.


Lateness Policy

A late penalty will be applied, unless there are extraordinary circumstances and/or prior arrangements:

Problems

Problem 1 
Consider combining the vector model with the "set of terms" model for documents and queries.  In this case,  for a dictionary of t index terms, each document is a t-dimensional vector whose jth component is a 1 if the document contains one or more instance of the jth term and is a 0 otherwise.  Query vectors are defined analogously.  


Problem 2

An Introduction to Information Retrieval Exercise 6.18 at the end of Section 6.4:  (Paraphrasing) Consider a query q and documents d1, d2,..., in the (general) vector model.  Show that if q and the di are all normalized to unit vectors, then the rank order produce by ranking the documents di in order of increasing Euclidean distance from q is identical to the rank order produced by ranking the documents di in order of decreasing dot product with q.


Problem 3
In class we discussed ways one might add ranking to the Boolean model of queries.  In this problem, consider more specifically combining Boolean queries with a scoring function for ranking.  In the combined model, a document will still only satisfy a query if it evaluates to "true".   Propose a specific method to rank the documents that satisfy a query.  You may use a vector scoring technique or specify another technique for scoring.  Your method should apply to any Boolean query, i.e. any Boolean combination of  terms, and must use the "bag of terms" model for documents.  If you do not use a vector scoring technique, specify your technique as an algorithm in enough detail that someone else can execute it, but do not give implementation details.   If you use a vector scoring technique, specify how your document vectors are defined, how query vectors are obtained from the Boolean queries,  and what vector-based metric is used.   Whatever type of technique you use, do a small example of your choosing to illustrate your technique.

Problem 4
To use the vector model, is not necessary to use the "set of terms" or "bag of terms" model for documents.   The values of components of the document vector can reflect other aspects of document contents besides term inclusion and term frequency - as long as query vectors and a vector-based scoring function can be specified.   Propose a formula for term weights for document vectors that uses term frequency but also accounts for the fact that words appearing early in a document tend to have higher importance in the document than terms appearing later.  Make "appearing early" concrete.