COS 435, Spring 2011: Problem Set 1

COS 435, Spring 2011 - Problem Set 1

Due at 1:30pm, Wednesday, Feb. 16, 2011.

Collaboration and Reference Policy

You may discuss the general methods of solving the problems with other students in the class. However, each student must work out the details and write up his or her own solution to each problem independently.

Some problems have been used in previous offerings of COS 435. You are NOT allowed to use any solutions posted for previous offerings of COS 435 or any solutions produced by anyone else for the assigned problems. You may use other reference materials; you must give citations to all reference materials that you use.

Lateness Policy

A late penalty will be applied, unless there are extraordinary circumstances and/or prior arrangements:

No penalty if in Prof. LaPaugh's office by 5pm Wednesday (2/16/11).
Penalized 10% of the earned score if submitted by 11:59 pm Wed (2/16/11).
Penalized 25% of the earned score if submitted by 5pm Friday (2/18/11).
Penalized 50% if submitted later than 5pm Friday (2/18/11).

Problems

Problem 1
In the vector model, the document vectors are determined at the time the documents are initially processed, just as the inverted index is built at the time the documents are processed. In fact, the set of vectors for the documents can be concisely stored in an inverted index.

Part a.
What does the inverted index look like if the only information stored is that found in the document vectors? Be precise.

Part b.
We discussed many factors that could be considered in ranking documents with respect to a query. Name two or more properties of document contents that could be used with our "enhanced document model" (slides 27 and 28 of classic information retrieval), but not in the vector model. Explain why. (Clarification: "containing the query term" and "query term appears in title" are examples of properties of document contents. Both of these can be used in the vector model.)

Problem 2
In class we discussed ways one might add ranking to the Boolean model of queries. In this problem, we consider in more detail combining Boolean queries with a scoring function for ranking. In the combined model, a document will still only satisfy a query if it evaluates to "true". Consider only Boolean queries in which NOT is used only to negate terms, not longer expressions. That is, (NOT("cat") OR NOT("dog")) is fine, but NOT("cat" AND "dog") is not. (Aside beyond scope of this course: DeMorgan's Law tells us we lose no expressive power by doing this.)

Part a:
One way to assign scores to the satisfying documents for a Boolean query that meets our condition on the use of NOT is to add up the frequencies in the documents of all terms that appear in the Boolean query in un-negated form, even for those terms that also appear in negated form. What is the result of applying this scoring method to a collection of documents for the query (("cat" AND "dog") OR (NOT("cat") AND NOT("dog")))? (Terms "cat" and "dog" appear in both negated and un-negated form.) What problem or problems do you find with the result?

Part b:
Propose a specific method of scoring documents that tries to deal with the problem(s) you found in Part a. Your method should apply to any Boolean query, i.e. any Boolean combination of terms, that meets our condition on the use of NOT. Assume the "bag of words" model for documents, but you need not actually make use of the term frequencies. Specify your scoring technique as a formula or as an algorithm with enough detail that someone else can execute it (do not give implementation details). Give intuitive justification for your method. Do a small example of your choosing to illustrate your technique. Note that the intent is not for you to spend hours working up a complicated algorithm with a complicated justification, but rather to propose an idea and explain it in enough detail that we understand what you are proposing.

Problem 3
Consider a collection of 25,000 documents. Each document is assigned a non-empty ordered list of at least 1 and at most 10 keywords by the authors of the document. The keywords come from a fixed list of 100 keywords given to the authors by the publisher. The order of the keywords in the list of keywords for a document reflects the order of importance of the keywords in the document: the first keyword is the most important keyword, although it could be tied in importance with the second keyword, etc. (Imagine the ACM giving authors of all technical papers a list of 100 keywords related to computer science topics and asking each author to choose and order by importance at least 1 and at most 10 keywords that best describe the topic of the paper.)

A query on the collection of documents is an arbitrarily long list of keywords from among the 100 keywords. Any scoring of documents with respect to a query should have the following properties:

For a singe keyword query (e.g. “security”), if the keyword appears at position j in the list of keywords for document A and appears at position k > j in the list of keywords for document B, then A scores as a better match than B. (For example, if A has keyword list “fraud, security, privacy” and B has keyword list “privacy, copyright, security, encryption”, then A scores higher than B as a match for the query “security” because “security” is in position 2 in A’s list and in position 3 in B’s list.) Note that the order from the beginning of the list of keywords for a document is used; the size of the list of keywords for a document is not considered in this scoring.

For a query containing more than one keyword, if the list of keywords for document A contains more of the query keywords than the list of keywords for document B, then A scores as a better match than B.

Note that the list of keywords for a query is not ordered by importance.

Part a

Propose a 100-dimensional vector model for documents and queries that tries to capture the scoring properties given above. What are the components w_i,dof a document vector d? Of a query vector? What is the similarity (or distance) function between document vectors and query vectors for your model?

Part b

Show the vectors for the following 3 documents and 2 queries and the score for each document with respect to each query:

document A has keyword list “agents, interfaces”

document B has keyword list “network, protocol, interfaces, instrumentation”

document C has keyword list “interfaces, proxy, agents”

query 1: interfaces

query 2: interfaces, agents

Part c

How well does your vector model capture the scoring properties given at the beginning of this problem description? Be specific.

Part d
Does your vector model scoring distinguish between different documents that contain the same number of the query keywords, but different query keywords, for a multi-keyword query? Explain.

Part e
Does your vector model scale to full-text documents and a large lexicon of terms? Explain. Do you think it is practical?