COS 435, Spring 2002: Problem Sets

COS 435, Spring 2002 - Problem Set 1

Due at 11am, Monday February 25, 2002.

Collaboration Policy

You may discuss problems with other students in the class. However, each student must write up his or her own solution to each problem independently. That is, while you may formulate the solutions to problems in collaboration with classmates, you must be able to articulate the solutions on your own.

Lateness Policy

A late penalty will be applied, unless there are extraordinary circumstances and/or prior arrangements:

Penalized 10% of the earned score if submitted after class but by 5pm Monday.
Penalized 30% of the earned score if submitted by 11am Wednesday (2/28/02).
No credit if submitted later than 11am Wednesday (2/28/02).

Problems

Problem 1 Suppose you have a set of t index terms and a set of documents represented as t-dimensional 0/1 vectors over those index terms. In this problem we consider only queries that contain exactly one query term, and so exactly one "1" in their vector representations.

Part a: Express each of the distance dissimilarity metric and the cosine similarity metric as a function of the number of terms ("1"s) in a document. Treat the case that a document contains the query term and the case that a document does not contain the query term separately.
Part b: For this restricted situation of one index term per query, do the cosine metric and the distance metric produce the same ordering of documents by similarity to the query? If so, explain why. If not, describe the difference between the similarity rankings.
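
For reference, the two metrics in Problem 1 can be computed directly on 0/1 vectors. The sketch below is illustrative only: it assumes that the "distance" metric means Euclidean distance, and the toy vectors and the function names euclidean_distance and cosine_similarity are made up for this example.

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors (assumed meaning of 'distance')."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; defined as 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy data with t = 4 index terms; the query has exactly one "1".
query = [1, 0, 0, 0]
docs = {
    "d1": [1, 1, 0, 0],  # contains the query term, 2 terms total
    "d2": [0, 1, 1, 0],  # does not contain the query term, 2 terms total
    "d3": [1, 1, 1, 1],  # contains the query term, 4 terms total
}

for name, vec in docs.items():
    print(name, euclidean_distance(query, vec), cosine_similarity(query, vec))
```

Varying the number of "1"s in the toy documents is one way to check a closed-form answer to Part a numerically.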

Problem 2 The paper "Indexing by latent semantic analysis" by Deerwester, S. et al. includes an example of the latent semantic indexing calculation for a matrix of 12 terms by 9 documents. The original matrix is in Table 2 on page 10. It is presented as two groups of documents (c1-c5 and m1-m4). The Appendix gives the final K, S, D decomposition for rank 2. (I am using the notation we used in class. Note that this paper uses the notation "T" rather than "K" for the left singular vector matrix.)

Part a: Calculate the 9 by 2 matrix DS for this example. Are the two original groups of documents evident from the values in DS? Justify your answer.
Part b: For the query "trees, graphs", what is the modified query vector for this example?
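
As a reminder of the shapes involved in Problem 2, the rank-s decomposition can be written as follows. This is a sketch that assumes the class notation matches the paper's decomposition with K substituted for T; here t is the number of terms, N the number of documents, and s the reduced rank.

```latex
M \approx K\,S\,D^{T}, \qquad
K \in \mathbb{R}^{t \times s},\quad
S \in \mathbb{R}^{s \times s}\ \text{(diagonal)},\quad
D \in \mathbb{R}^{N \times s}
```

For the example in the paper, t = 12, N = 9, and s = 2, so the product DS is 9 by 2, one row per document.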

Problem 3 What is the computational cost (running time) of comparing a query to all documents after latent semantic indexing has been used to express the term-document matrix M in terms of matrices K, S, and D? Do not include the preprocessing cost to find matrices K, S, and D. You should list each step of the computation to compare a query expressed as a vector of weights for t index terms to all N documents. Your analysis should be in terms of t, N, and the reduced rank s after latent semantic indexing has been applied.

Problem 4 Consider a hybrid query denoted (t1, t2, ..., tk, MUST(s1, s2, ..., sm)), where t1, ..., tk and s1, ..., sm are index terms. The semantics of this query is that only documents containing all terms s1, s2, ..., sm are considered matches to the query. These matching documents are ranked by considering the index terms t1, ..., tk and using a tf.idf ranking. Note that if one wishes to use a term both for filtering and for ranking, the term must be listed among the si and the tj, respectively.

Part a: Show how to represent such a query on a set of documents using a Bayesian inference network. Give both the network structure and the mapping at each node used to calculate the probability of that node given the probability of the node's parents.
Part b: Consider the example of Problem 2 (originally from the paper by Deerwester et al.). Suppose the query is (system, interface, MUST(EPS)). Show how to score documents c2, c3 and c4.
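
The filter-then-rank semantics of the hybrid query in Problem 4 can be sketched outside the inference-network setting as plain code. This is only an illustration of the semantics, not the Bayesian network representation that Part a asks for. It assumes one common tf.idf variant (raw term frequency times log(N/df), summed over the ranking terms); the function name hybrid_query and the toy collection are hypothetical.

```python
import math
from collections import Counter

def hybrid_query(docs, rank_terms, must_terms):
    """Return (doc_id, score) pairs for documents containing every MUST term,
    sorted by a simple tf.idf score over the ranking terms (assumed variant)."""
    n_docs = len(docs)
    counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}

    # Document frequency of each term, computed over the whole collection.
    df = Counter()
    for term_counts in counts.values():
        df.update(term_counts.keys())

    results = []
    for doc_id, term_counts in counts.items():
        # Filtering step: skip documents missing any MUST term.
        if not all(term_counts[s] > 0 for s in must_terms):
            continue
        # Ranking step: sum tf * idf over the ranking terms only.
        score = sum(
            term_counts[t] * math.log(n_docs / df[t])
            for t in rank_terms
            if df[t] > 0
        )
        results.append((doc_id, score))

    results.sort(key=lambda pair: pair[1], reverse=True)
    return results

# Hypothetical toy collection; the query is (apple, pie, MUST(crust)).
docs = {
    "a": ["apple", "pie", "apple", "crust"],
    "b": ["apple", "tart"],
    "c": ["pie", "crust", "crust"],
    "d": ["tart", "lemon"],
}
ranked = hybrid_query(docs, rank_terms=["apple", "pie"], must_terms=["crust"])
print(ranked)
```

Note that "crust" only filters here: because it is not also listed among the ranking terms, it contributes nothing to the score, which matches the remark in the problem statement about listing a term in both places.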