COS 435, Spring 2002 - Problem Set 1

Due at 11am, Monday February 25, 2002.

Collaboration Policy

You may discuss problems with other students in the class. However, each student must write up his or her own solution to each problem independently. That is, while you may formulate the solutions to problems in collaboration with classmates, you must be able to articulate the solutions on your own.


Lateness Policy

A late penalty will be applied, unless there are extraordinary circumstances and/or prior arrangements:

Problems

Problem 1 Suppose you have a set of t index terms and a set of documents represented as t-dimensional 0/1 vectors over those index terms. In this problem we consider only queries that contain exactly one query term, and so exactly one "1" in their vector representations.

Problem 2 In "Indexing by latent semantic analysis." by Deerwester, S. et. al. is an example of the latent semantic indexing calculation for a matrix of 12 terms by 9 documents. The original matrix is in Table 2 on page 10. It is presented as two groups of documents (c1-c5 and m1-4). In the Appendix is given the final K, S, D decomposition for rank 2. (I am using the notation we used in class. Note that this paper uses notation "T" rather than "K" for the left singular vector maxtrix.)

Problem 3 What is the computational cost (running-time) of doing a comparison of a query to all documents after latent semantic indexing has been used to express the term-document matrix M in terms of matricies K, S, and D? Do not include the preprocessiong cost to find matrices K, S, and D. You should list each step of the computation to compare a query expressed as a vector of weights for t index terms to all N documents. Your analysis should be in tersm of t, N, and the reduced rank s after latent semantic indexing as been applied.

Problem 4 Consider a hybrid query denoted (t1, t2, ... tk, MUST(s1, s2, ... sm)), where t1, ... tk and s1, ... sm are index terms. The semantics of this query is that only documents containing all terms s1, s2, .. sm are considered matches to the query. These matching documents are ranked by considering the index terms t1, ..., tk and using a tf.idf ranking. Note that if one wishes to use a term both for filtering and for ranking, the term must be listed among the si and the tj, respectively.