COS 435, Spring 2009: Problem Set 1

Due at 3:00pm, Thursday, Feb. 12, 2009.

Collaboration and Reference Policy

You may discuss the general methods of solving the problems with other students in the class. However, each student must work out the details and write up his or her own solution to each problem independently.

Some problems have been used in previous offerings of COS 435. You are NOT allowed to use any solutions posted for previous offerings of COS 435 or any solutions produced by anyone else for the assigned problems. You may use other reference materials; you must give citations to all reference materials that you use.

Lateness Policy

A late penalty will be applied, unless there are extraordinary circumstances and/or prior arrangements:

Penalized 10% of the earned score if submitted by noon Friday (2/13/09).
Penalized 25% of the earned score if submitted by noon Monday (2/16/09).
Penalized 50% of the earned score if submitted later than noon Monday (2/16/09).

Problems

Problem 1
Consider combining the vector model with the "set of terms" model for documents and queries. In this case, for a dictionary of t index terms, each document is a t-dimensional vector whose j^th component is 1 if the document contains one or more instances of the j^th term and is 0 otherwise. Query vectors are defined analogously.

Part a: Consider the distance (dissimilarity) metric and the dot product (similarity) metric, both without normalization by the length of the vectors: Dist(d, q) = √(Σ_i (d_i − q_i)²) and d•q = Σ_i (d_i · q_i). Express each metric as a function of the number of distinct terms ("1"s) in document d, the number of distinct terms in query q, and the number of terms shared by d and q.
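As a way to check an answer to Part a numerically, the definitions above can be sketched in a few lines of Python. This is only an illustrative sketch: the helper names (to_binary_vector, dist, dot) and the four-term dictionary are assumptions made for the example, not part of the assignment, and the code evaluates the metrics directly from their definitions rather than from the closed forms the problem asks for.

```python
import math

def to_binary_vector(terms, dictionary):
    """Return the 0/1 vector of a document or query over an ordered dictionary.

    Component j is 1 if the j-th dictionary term occurs at least once in
    `terms`, and 0 otherwise (the "set of terms" model).
    """
    present = set(terms)
    return [1 if term in present else 0 for term in dictionary]

def dist(d, q):
    """Unnormalized Euclidean distance between two vectors."""
    return math.sqrt(sum((di - qi) ** 2 for di, qi in zip(d, q)))

def dot(d, q):
    """Unnormalized dot product of two vectors."""
    return sum(di * qi for di, qi in zip(d, q))

# Hypothetical dictionary, document, and query, chosen only for illustration.
dictionary = ["apple", "banana", "cherry", "date"]
d = to_binary_vector(["apple", "banana", "apple", "cherry"], dictionary)
q = to_binary_vector(["banana", "date"], dictionary)

print(d, q)        # the 0/1 vectors
print(dist(d, q))  # distance computed from the definition
print(dot(d, q))   # dot product computed from the definition
```

Repeated terms in the input (here "apple") do not change the vector, which is the difference between the "set of terms" and "bag of terms" models.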

Part b: Under this 0/1 vector model, do the dot product metric and the distance metric produce the same ordering of documents by similarity to a query? Here documents are ordered by increasing distance from the query and by decreasing dot product with the query. If yes, why? If not, give an example and describe the difference between the similarity rankings.

Problem 2
An Introduction to Information Retrieval, Exercise 6.18 at the end of Section 6.4: (Paraphrasing) Consider a query q and documents d₁, d₂, ..., in the (general) vector model. Show that if q and the d_i are all normalized to unit vectors, then the rank order produced by ranking the documents d_i in order of increasing Euclidean distance from q is identical to the rank order produced by ranking the documents d_i in order of decreasing dot product with q.
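The setup of this exercise can be explored numerically before attempting the proof. The sketch below is a sanity check on one small, made-up example and is not a substitute for the argument the problem asks for; the function names and the three document vectors are assumptions chosen for illustration.

```python
import math

def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dot(u, v):
    """Dot product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical query and documents, normalized to unit vectors.
q = normalize([1.0, 2.0, 0.0])
docs = {
    "d1": normalize([2.0, 1.0, 1.0]),
    "d2": normalize([0.0, 3.0, 4.0]),
    "d3": normalize([1.0, 2.0, 0.5]),
}

# Rank by increasing distance from q and by decreasing dot product with q.
by_distance = sorted(docs, key=lambda name: euclidean(docs[name], q))
by_dot = sorted(docs, key=lambda name: dot(docs[name], q), reverse=True)

print(by_distance)
print(by_dot)
```

Comparing the two printed lists for several choices of vectors may suggest which algebraic identity relates the squared distance to the dot product.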

Problem 3
In class we discussed ways one might add ranking to the Boolean model of queries. In this problem, consider more specifically combining Boolean queries with a scoring function for ranking. In the combined model, a document still satisfies a query only if the query evaluates to "true" on that document. Propose a specific method to rank the documents that satisfy a query. You may use a vector scoring technique or specify another technique for scoring. Your method should apply to any Boolean query, i.e., any Boolean combination of terms, and must use the "bag of terms" model for documents. If you do not use a vector scoring technique, specify your technique as an algorithm in enough detail that someone else can execute it, but do not give implementation details. If you use a vector scoring technique, specify how your document vectors are defined, how query vectors are obtained from the Boolean queries, and what vector-based metric is used. Whichever type of technique you use, work through a small example of your choosing to illustrate it.

Problem 4
To use the vector model, it is not necessary to use the "set of terms" or "bag of terms" model for documents. The values of the components of the document vector can reflect other aspects of document contents besides term inclusion and term frequency, as long as query vectors and a vector-based scoring function can be specified. Propose a formula for term weights for document vectors that uses term frequency but also accounts for the fact that terms appearing early in a document tend to have higher importance in the document than terms appearing later. Make "appearing early" concrete.