COS 435, Spring 2006: Problem Sets

COS 435, Spring 2006 - Problem Set 2

Due at 11am, Thursday, March 9, 2006.

Collaboration Policy

You may discuss problems with other students in the class. However, each student must write up his or her own solution to each problem independently. That is, while you may formulate the solutions to problems in collaboration with classmates, you must be able to articulate the solutions on your own.

Lateness Policy

A late penalty will be applied, unless there are extraordinary circumstances and/or prior arrangements:

Penalized 10% of the earned score if submitted by 11am Friday (3/10/06).
Penalized 25% of the earned score if submitted by 11am Monday(3/13/06).
Penalized 50% if submitted later than 11am Monday(3/13/06).

Problems

Problem 1

We have discussed in class that one of the benefits of latent semantic indexing is that it reduces the dimension of the document vectors. A much simpler way to reduce the dimension is to reduce the number of terms. Consider the following algorithm for choosing only K terms from among the full set of index terms as representative terms for a collection of documents:

Compute the document frequency of each index term. (There is no need to use the log function; just compute the basic document frequency.)
Sort the index terms by document frequency and find the median.
Choose K index terms that have document frequency values equal to the median, breaking ties arbitrarily. If there are not K terms that have the median value of document frequency, choose the first K terms in increasing order by document frequency that have document frequencies equal to or larger than the median.

Part a. There is an example of the latent semantic indexing calculation for a matrix of 12 terms by 9 documents in the paper "Indexing by latent semantic analysis." by Deerwester, S. et. al. whose precision-recall results we looked at in class on 2/28/06. The original matrix is in Table 2 on page 10. It is presented as two groups of documents (c1-c5 and m1-4). Execute the new algorithm described above for K=2 on the same example. (You may do this by hand or write a short program in the language of your choice. An ascii file of the 0/1 entries of the 12 rows × 9 columns matrix can be found here.) What are the two terms used?

Part b. Is the new algorithms successful on the example of Part a? What are you using as criteria for success? How does the result compare
to the rank 2 latent semantic indexing representation presented in the paper? (You will find the actual final decomposition for rank 2 LSI in the paper's Appendix and find plots of terms and documents in the 2 dimensions within the paper. Note that the paper uses different notation than we used in class: they use X= TSD^_T to denote the SVD and we used D=KSP, "D" playing a different role in our notation. )

Part c. What characteristics should a collection of document have to have the best results from using only a subset of the index terms as described above? Are certain values of K better than others?

Problem 2 What is the computational cost (running-time) of doing a comparison of a query to all documents after latent semantic indexing has been used to express the term-document matrix D in terms of matricies K_s, S_s, and P_s for the rank-s approximation (using our class notation)? Do not include the preprocessing cost to find matrices K_s, S_s, and D_s. You should list each step of the computation to compare a query expressed as a vector of weights for t index terms to all N documents. Your analysis should be in terms of t, N, and the reduced rank s after latent semantic indexing as been applied.

Problem 3 Consider a hybrid query denoted (t1, t2, ... tk, MUST(s1, s2, ... sm)), where t1, ... tk and s1, ... sm are index terms. The semantics of this query is that only documents containing all terms s1, s2, .. sm are considered matches to the query. These matching documents are ranked by considering the index terms t1, ..., tk and using a tf.idf ranking. Note that if one wishes to use a term both for filtering and for ranking, the term must be listed among the si and the tj, respectively.

Part a Show how to represent such a query on a set of documents using a Bayesian inference network. Give both the network structure and the link matrix at each node used to calculate the probability of that node given the probability of the node's parents. You need not specify the link matrix by specifying each entry if you can specify the calculation as an equation. (HINT: Combine separate sub-networks for the tf.idf ranking and the MUST portion.)
Part b Consider the example of Problem 1 (originally from the paper by Deerwester et. al.). Suppose the query is (system, interface, MUST(EPS)). Show how to score documents c2, c3 and c4.