## COS 435, Spring 2012 - Problem Set 1

Due at 1:30pm,  Wednesday,  Feb. 15,  2012.

## Collaboration and Reference Policy

You may discuss the general methods of solving the problems with other students in the class. However, each student must work out the details and write up his or her own solution to each problem independently.

Some problems have been used in previous offerings of COS 435. You are NOT allowed to use any solutions posted for previous offerings of COS 435 or any solutions produced by anyone else for the assigned problems.   You may use other reference materials; you must give citations to all reference materials that you use.

## Lateness Policy

A late penalty will be applied, unless there are extraordinary circumstances and/or prior arrangements:
• No penalty if in Prof. LaPaugh's office by 5pm Wednesday (2/15/12).
• Penalized 10% of the earned score if submitted by 11:59 pm Wed (2/15/12).
• Penalized 25% of the earned score if submitted by 5pm Friday  (2/17/12).
• Penalized 50% if submitted later than 5pm Friday  (2/17/12).

## Problems

Problem 1
In class we discussed ways one might add ranking to the Boolean model of queries.  In this problem, we consider in more detail combining Boolean queries with a scoring function for ranking.  In the combined model, a document will still only satisfy a query if it evaluates to "true".    Consider only Boolean queries in which NOT is used only to negate terms, not longer expressions.  That is,    (NOT("cat") OR NOT("dog")) is fine, but  NOT("cat" AND "dog")is not.   (Aside beyond scope of this course:  DeMorgan's Law tells us we lose no expressive power by doing this.)

Consider the following method for assigning scores to the satisfying documents for our restricted Boolean query.

• For every occurence of a un-negated term in the query, we give that occurrence 1 point if the document contains the term and 0 points otherwise;  For every occurrence of a negated term in the query, we give that occurrence 1 point if the document does not contain the term and 0 otherwise;
• For an AND form (A AND B), the whole form gets a number of points equal to the smaller of the points for A and the points for B
• For an OR form (A OR B), the whole form gets the average of the number of points for A and the number of points for B
• The score for the document is the number of points for the entire query

For example, to score a document containing  terms "cat" and "dog" but not "horse" for the query (("cat" AND NOT("horse")) or ("dog" AND "horse")), each occurrence of "cat" and"dog" in the query gets 1 point, the occurrence of "horse" gets 0 points, and the occurrence of NOT("horse") gets 1 point.  Then ("cat" AND NOT("horse")) gets the smaller of 1 and 1, so 1;  ("dog" AND "horse")  gets the smaller of 1 and 0, so 0.  Finally (("cat" AND NOT("horse")) OR ("dog" AND "horse")) gets the average of scores 1 and 0, so 1/2.  Therefore, the document receives a score of 1/2.

For each of the Boolean queries below, describe the result of applying this scoring method to a collection of documents. That is, describe, depending on how a document satisfies the query, what its score will be.  For each Boolean query, describe whether the score makes sense in terms of how relevant documents are to the query,  and what, if any,  problems you find with the scoring method.

1. Query (("cat" AND "dog") OR (NOT("cat") AND NOT("dog")))
2. Query (("cat" AND "dog") OR ("dog" AND "horse"))
3. Query (("dog" OR NOT("cat")) AND ("bird" OR "fish"))

Problem 2
Consider a collection of 25,000 documents.  Each document is assigned a non-empty ordered list of at least 1 and at most 10 keywords by the authors of the document.  The keywords come from a fixed list of 100 keywords given to the authors by the publisher. The order of the keywords in the list of keywords for a document reflects the order of importance of the keywords in the document: the first keyword is the most important keyword, although it could be tied in importance with the second keyword, etc.  (Imagine the ACM giving authors of all technical papers a list of 100 keywords related to computer science topics and asking each author to choose and order by importance at least 1 and at most 10 keywords that best describe the topic of the paper.)

A query on the collection of documents is an arbitrarily long list of keywords from among the 100 keywords.  Any scoring of documents with respect to a query should have the following properties:

• For a singe keyword query (e.g. “security”), if the keyword appears at position j in the list of keywords for document A and appears at position k > j in the list of keywords for document B, then A scores as a better match than B.  (For example, if  A has keyword list “fraud, security, privacy” and B has keyword list “privacy, copyright, security, encryption”,  then A scores higher than B as a match for the query “security” because “security” is in position 2 in A’s list and in position 3 in B’s list.)  Note that the order from the beginning of the list of keywords for a document is used;  the size of the list of keywords for a document is not considered in this scoring.
• For a query containing more than one keyword, if the list of keywords for document A contains more of the query keywords than the list of keywords for document B, then A scores as a better match than B.

Note that the list of keywords for a query is not ordered by importance.

Part a

Propose a 100-dimensional vector model for documents and queries that tries to capture the scoring properties given above.  What are the components wi,d of a document vector d?  Of a query vector? What is the similarity (or distance) function between document vectors and query vectors for your model?

Part b

Show the vectors for the following 3 documents and 2 queries and the score for each document with respect to each query:

document A has keyword list “agents, interfaces”

document B has keyword list “network, protocol, interfaces, instrumentation”

document C has keyword list “interfaces, proxy, agents”

query 1:  interfaces

query 2: interfaces, agents

Part c

How well does your vector model capture the scoring properties given at the beginning of this problem description?  Be specific.

Part d
Does your vector model scoring distinguish between different documents that contain the same number of the query keywords, but different query keywords, for a multi-keyword query?  Explain.

Part e
Does your vector model scale to full-text documents and a large lexicon of terms?  Explain.   Do you think it is practical?