COS 435, Spring 2012 - Problem Set 1

Due at 1:30pm,  Wednesday,  Feb. 15,  2012.

Collaboration and Reference Policy

You may discuss the general methods of solving the problems with other students in the class. However, each student must work out the details and write up his or her own solution to each problem independently.

Some problems have been used in previous offerings of COS 435. You are NOT allowed to use any solutions posted for previous offerings of COS 435 or any solutions produced by anyone else for the assigned problems.   You may use other reference materials; you must give citations to all reference materials that you use.


Lateness Policy

A late penalty will be applied, unless there are extraordinary circumstances and/or prior arrangements:

Problems


Problem 1
In class we discussed ways one might add ranking to the Boolean model of queries.  In this problem, we consider in more detail combining Boolean queries with a scoring function for ranking.  In the combined model, a document will still only satisfy a query if it evaluates to "true".    Consider only Boolean queries in which NOT is used only to negate terms, not longer expressions.  That is,    (NOT("cat") OR NOT("dog")) is fine, but  NOT("cat" AND "dog")is not.   (Aside beyond scope of this course:  DeMorgan's Law tells us we lose no expressive power by doing this.)   

 
Consider the following method for assigning scores to the satisfying documents for our restricted Boolean query. 


For example, to score a document containing  terms "cat" and "dog" but not "horse" for the query (("cat" AND NOT("horse")) or ("dog" AND "horse")), each occurrence of "cat" and"dog" in the query gets 1 point, the occurrence of "horse" gets 0 points, and the occurrence of NOT("horse") gets 1 point.  Then ("cat" AND NOT("horse")) gets the smaller of 1 and 1, so 1;  ("dog" AND "horse")  gets the smaller of 1 and 0, so 0.  Finally (("cat" AND NOT("horse")) OR ("dog" AND "horse")) gets the average of scores 1 and 0, so 1/2.  Therefore, the document receives a score of 1/2.

 
For each of the Boolean queries below, describe the result of applying this scoring method to a collection of documents. That is, describe, depending on how a document satisfies the query, what its score will be.  For each Boolean query, describe whether the score makes sense in terms of how relevant documents are to the query,  and what, if any,  problems you find with the scoring method.


  1. Query (("cat" AND "dog") OR (NOT("cat") AND NOT("dog")))  
  2. Query (("cat" AND "dog") OR ("dog" AND "horse"))
  3. Query (("dog" OR NOT("cat")) AND ("bird" OR "fish"))



 

Problem 2
Consider a collection of 25,000 documents.  Each document is assigned a non-empty ordered list of at least 1 and at most 10 keywords by the authors of the document.  The keywords come from a fixed list of 100 keywords given to the authors by the publisher. The order of the keywords in the list of keywords for a document reflects the order of importance of the keywords in the document: the first keyword is the most important keyword, although it could be tied in importance with the second keyword, etc.  (Imagine the ACM giving authors of all technical papers a list of 100 keywords related to computer science topics and asking each author to choose and order by importance at least 1 and at most 10 keywords that best describe the topic of the paper.)

 

A query on the collection of documents is an arbitrarily long list of keywords from among the 100 keywords.  Any scoring of documents with respect to a query should have the following properties:

 

Note that the list of keywords for a query is not ordered by importance.


Part a 

Propose a 100-dimensional vector model for documents and queries that tries to capture the scoring properties given above.  What are the components wi,d of a document vector d?  Of a query vector? What is the similarity (or distance) function between document vectors and query vectors for your model?

 

Part b

Show the vectors for the following 3 documents and 2 queries and the score for each document with respect to each query:

 

document A has keyword list “agents, interfaces”

document B has keyword list “network, protocol, interfaces, instrumentation”

document C has keyword list “interfaces, proxy, agents”

 

query 1:  interfaces

query 2: interfaces, agents

 

Part c

How well does your vector model capture the scoring properties given at the beginning of this problem description?  Be specific.


Part d
Does your vector model scoring distinguish between different documents that contain the same number of the query keywords, but different query keywords, for a multi-keyword query?  Explain.


Part e
Does your vector model scale to full-text documents and a large lexicon of terms?  Explain.   Do you think it is practical?