You may discuss the general methods of solving the problems with other
students in the class. However,
each student must work out the details and write up his or her own
solution to each problem
independently.
Problem 1
Consider combining the vector model with the "set of terms" model
for documents and queries. In this case, for a dictionary
of t index terms, each document is a t-dimensional
vector whose jth component is a 1 if the document contains
one or more instance of the jth term and is a 0
otherwise. Query vectors are defined analogously.
Problem 2
Problem 3
In class we discussed ways one might add ranking to the Boolean model
of queries. In this problem, consider more specifically combining
Boolean queries with the vector model for ranking. In the
combined model, a document will still only satisfy a query if it
evaluates to "true". How might you rank the documents that
satisfy the query using a vector scoring technique? What
query vector would you use for scoring, and what metric would you
use? Do a small example of your choosing to illustrate your
technique.
Problem 4
What is the computational cost (running-time) of
doing a comparison of a query to all documents after latent semantic
indexing has been used to express the term-document matrix C in terms
of matricies U'k, Σ'k,
and V'k for the
rank-k approximation (using our class notation)?
Do not include the preprocessing cost to find matrices U'k,
Σ'k, and V'k.
Unit-cost computations are scalar addition, multiplication, comparison
and other basic program steps, not vector
operations. You should list each step of the computation to
compare a query expressed as a vector of weights for t index
terms to all N documents. Your analysis should be in terms of t,
N, and the reduced rank k after latent semantic indexing
as
been applied.