You may discuss the general methods of solving the problems with other
students in the class. However,
each student must work out the details and write up his or her own
solution to each problem
independently.
Some problems have been used in previous offerings of COS 435. You
are NOT allowed to use any solutions posted for previous offerings of
COS 435 or any solutions produced by anyone else for the assigned
problems. You may use other reference materials; you must
give citations to all reference materials that you use.
Problem 1
Consider combining the vector model with the "set of terms" model
for documents and queries. In this case, for a dictionary
of t index terms, each document is a t-dimensional
vector whose jth component is a 1 if the document contains
one or more instance of the jth term and is a 0
otherwise. Query vectors are defined analogously.
Problem 2
An Introduction to Information
Retrieval
Exercise 6.18 at the end of Section 6.4: (Paraphrasing) Consider
a query q
and documents d1,
d2,...,
in the (general) vector model. Show that if q and the di
are all normalized to unit vectors, then the rank order produce by
ranking the documents di in
order of increasing Euclidean distance from q is identical
to the rank order produced by ranking the documents di in
order of decreasing dot product with q.
Problem 3
In class we
discussed ways one might add ranking to the Boolean model
of queries. In this problem, consider more specifically combining
Boolean queries with a scoring function for ranking. In the
combined model, a document will still only satisfy a query if it
evaluates to "true". Propose a specific method to rank the
documents that
satisfy a query. You may use a vector scoring technique or
specify another technique for scoring. Your method should apply
to any Boolean query, i.e. any
Boolean combination of terms, and must use the "bag of terms"
model for documents. If you do not use a vector scoring
technique, specify your technique as an algorithm in enough detail that
someone else can execute it, but do not give implementation details.
If you use a vector scoring technique, specify how your
document vectors are defined, how query vectors are obtained from the
Boolean queries, and what vector-based metric is
used. Whatever type of technique you use, do a small example of
your choosing to
illustrate your
technique.
Problem 4
To use the vector model, is not necessary to use the "set of terms" or
"bag of terms" model for documents. The values of
components of the document vector can reflect other aspects of document
contents besides term inclusion and term frequency - as long as query
vectors and a vector-based scoring function can be
specified. Propose a formula for term weights for document
vectors that uses term frequency but also accounts for the fact that
words appearing early in a document tend to have higher importance in
the document than terms appearing later. Make "appearing early"
concrete.