You may discuss the general methods of solving the problems with
other students in the class. However, each student must work out the
details and write up his or her own solution to each problem
independently.
Some problems have been used in previous offerings of COS 435. You
are NOT allowed to use any solutions posted for previous offerings of
COS 435 or any solutions produced by anyone else for the assigned
problems. You may use other reference materials; you must
give citations to all reference materials that you use.
Problem 1
In
class we discussed ways one might add ranking to the Boolean model of
queries. In this problem, we consider in more detail combining
Boolean queries with a scoring function for ranking. In the
combined model, a document will still only satisfy a query if it
evaluates to "true". Consider only Boolean queries in
which NOT is used only to
negate terms, not longer expressions. That is, (NOT("cat") OR
NOT("dog")) is fine, but NOT("cat" AND "dog")is not.
(Aside beyond scope of this course: DeMorgan's Law tells us we
lose no expressive power by doing this.)
Consider the
following method for assigning scores to the satisfying
documents for our restricted Boolean query.
For example, to score a document containing terms "cat" and "dog" but not "horse" for the query (("cat" AND NOT("horse")) or ("dog" AND "horse")), each occurrence of "cat" and"dog" in the query gets 1 point, the occurrence of "horse" gets 0 points, and the occurrence of NOT("horse") gets 1 point. Then ("cat" AND NOT("horse")) gets the smaller of 1 and 1, so 1; ("dog" AND "horse") gets the smaller of 1 and 0, so 0. Finally (("cat" AND NOT("horse")) OR ("dog" AND "horse")) gets the average of scores 1 and 0, so 1/2. Therefore, the document receives a score of 1/2.
For each of the
Boolean queries below, describe the result of applying this scoring
method to a collection of documents. That is, describe,
depending on how a document satisfies the query, what its score will
be. For each Boolean
query, describe whether the score makes sense in terms of how relevant
documents are to the query, and what, if any, problems you
find with the scoring method.
Problem 2
Consider a collection of 25,000 documents. Each
document
is assigned a non-empty ordered list of at least 1 and
at most 10 keywords by the authors of the document.
The keywords come from a fixed list of 100 keywords given to the
authors by the publisher. The order of the keywords in the list of
keywords for a document reflects the order of importance of the
keywords in the document: the first keyword is the most important
keyword, although it could be tied in importance with the second
keyword, etc. (Imagine the ACM giving
authors of all technical papers a list of 100 keywords related to
computer science topics and asking each author to choose and order by
importance at least 1 and at most 10 keywords that best describe the
topic of the paper.)
A query on
the collection of documents is an arbitrarily long list of keywords
from among the 100 keywords. Any scoring
of documents with respect to a query should have the following
properties:
Note that the
list of keywords for a query is not
ordered by importance.
Part a
Propose a
100-dimensional vector model for documents and queries that tries to
capture the scoring properties given above. What
are the components wi,d of a document vector d? Of a
query vector? What is the similarity (or distance) function between
document vectors and query vectors for your model?
Part b
Show the
vectors for the following 3 documents and 2 queries and the score for
each document with respect to each query:
document A has keyword list “agents,
interfaces”
document B has keyword list “network,
protocol, interfaces, instrumentation”
document C has keyword list “interfaces,
proxy, agents”
query 1: interfaces
query 2: interfaces, agents
Part c
How well does
your vector model capture the scoring properties given at the beginning
of this problem description? Be specific.
Part d
Does your vector model scoring distinguish between different documents
that contain the same number of the query keywords, but different query
keywords, for a multi-keyword query? Explain.
Part e
Does your vector model scale to full-text documents and a large
lexicon of terms? Explain. Do you think it is
practical?