COS 435, Spring 2008 - Problem Set 1

Due at 1:30pm,  Wednesday,  Feb. 20,  2008.

Collaboration Policy

You may discuss the general methods of solving the problems with other students in the class. However, each student must work out the details and write up his or her own solution to each problem independently. 


Lateness Policy

A late penalty will be applied, unless there are extraordinary circumstances and/or prior arrangements:

Problems

Problem 1 
Consider combining the vector model with the "set of terms" model for documents and queries.  In this case,  for a dictionary of t index terms, each document is a t-dimensional vector whose jth component is a 1 if the document contains one or more instance of the jth term and is a 0 otherwise.  Query vectors are defined analogously.  


Problem 2


Problem 3
In class we discussed ways one might add ranking to the Boolean model of queries.  In this problem, consider more specifically combining Boolean queries with the vector model for ranking.  In the combined model, a document will still only satisfy a query if it evaluates to "true".   How might you rank the documents that satisfy the query using a vector scoring technique?  What  query vector would you use for scoring, and what metric would you use?  Do a small example of your choosing to illustrate your technique.


Problem 4
What is the computational cost (running-time) of doing a comparison of a query to all documents after latent semantic indexing has been used to express the term-document matrix C in terms of matricies U'k, Σ'k, and V'k for the rank-k approximation (using our class notation)? Do not include the preprocessing cost to find matrices U'k, Σ'k, and V'k.   Unit-cost computations are scalar addition, multiplication, comparison and other basic program steps, not vector operations.  You should list each step of the computation to compare a query expressed as a vector of weights for t index terms to all N documents. Your analysis should be in terms of t, N, and the reduced rank k after latent semantic indexing as been applied.