Here are two quotations from the opposing candidates in the 2004 Presidential Debates:

I proposed a constitutional amendment. The reason I did so was because I was worried that activist judges are actually defining the definition of marriage, and the surest way to protect marriage between a man and woman is to amend the Constitution.

I believe marriage is between a man and a woman. But I also believe that because we are the United States of America, we're a country with a great, unbelievable Constitution, with rights that we afford people, that you can't discriminate in the workplace.

In this assignment, your task will be to write a computer program that can (usually) predict whether a quote such as these are more likely to have been said by George W. Bush or John F. Kerry. There are many reasonable approaches to this problem. In this assignment, we will consider one of these: We will begin by building a statistical model of the kind of language used by each of the candidates. Then, given a new piece of text, we will determine who said it by measuring how well it fits each of the statistical models, and selecting the one that gives the best fit.

**Markov models.**
The probabilistic language models we will be using are called
*Markov models*.
Markov models are at the heart of modern speech
recognition systems, and are used for a broad range of natural language
processing tasks, as well as many other problems relevant to artificial intelligence.
They are widely used to model
all sorts of dynamical processes in engineering, mathematics, finance and many
other areas.

A Markov model defines a probability distribution over sequences of symbols.
Said differently, a Markov model defines a probabilistic mechanism for randomly
generating sequences over some alphabet of symbols.
In a *zero-th order Markov model*, the symbols in the sequence are generated
with a fixed probability that does not in any way depend on other symbols in the sequence.
For instance, the alphabet might consist of just the letters `a` and `b`,
and the zero-th
order Markov model might specify that each symbol in the sequence is `a` with
probability 2/3 and `b` with probability 1/3. Thus, such a zero-th order Markov
model might generate a sequence like the following:

a b a b a a a b b b a a a a b b b a a b a a a a a b a a a a b a a a b b

In a *first order Markov model*, the probability of the current symbol that is
being generated can depend on preceding symbol. For instance, consider a
sequence in which every occurrence of `b` is always followed by `a`, while an
`a` is followed
with equal probability by either `a` or `b`. Such a model might
generate a sequence like the following.

b a b a b a b a a b a b a b a b a a b a a b a a a a a a b a b a a a a b

In this example, the sequence is likely to have about the same proportion of `a`'s
and `b`'s as in the preceding example. However, the model is first order
rather than zero-th order because whether or not the current symbol is `a` or
`b`
significantly affects the probability of what symbol will be generated next.

This idea can be generalized to a *k-th order Markov model* in which the
probability of the current symbol can depend on the preceding *k* symbols.

Markov models are frequently used to build a probabilistic model or statistical
estimate of a natural language. In modeling ordinary English text, a zero-th
order model is clearly inadequate for most purposes, capturing only the
frequencies of each letter.
On the other hand, a first order Markov model
can capture the notion, for instance, that a `q` is nearly always
followed by `u`.
Higher order Markov models can capture the tendency of the letters
`proba` to be followed by `b` or `t`.
(In other language applications, entire
words are regarded as the "letters"
of the "alphabet" so that the Markov model
provides a model of the sequences of *words* rather than *letters*
that are likely to be observed.
We will not consider such a framework in
this assignment, but you are welcome to explore it on your own.)

**Language modeling.**
The first part of the assignment is to implement
a class called `MarkovModel.java` that can build
a *k*-th order Markov model from text stored in a given file.
(Eventually, you will build two objects using this class,
one for Bush and one for Kerry.)
Constructing a *k*-th order Markov model
from text sequences is mostly a matter of counting.
For each sequence *p* of
length *k* (let us call it the *context*), we need to estimate the probability
of *p* being followed by each letter *c* in our alphabet.
Given a text sample, this probability can be estimated by the number of times that *p* is
followed by *c*, divided by the number of times that *p* appears as a
context at all.
That is, we can estimate the probability of *c* following *p*
by *N*(*p·c*) / *N*(*p*), where
*N*(*p·c*) denotes the number of times we observe the concatenated
sequence *p·c*, and *N*(*p*) denotes the number of times that
we observe *p*.

Unfortunately, this estimate will be problematic if
some of the counts are zero, as is certain to happen on real data.
Therefore, instead, we will use a different "smoothed" estimate, namely
(*N*(*p·c*) + 1) / (*N*(*p*) + *S*),
where *S* is the size of our alphabet.
This form of smoothing,
called *Laplace smoothing*, has theoretical justification coming from probability
theory, and ensures that zero counts will not be a problem.

For example, if the input text is
`"aabcabaacaac"`, and we
are using a second-order Markov model (*k* = 2) and the three-letter alphabet
is `{a, b, c}`,
then we can estimate the probability that `"aa"` is
followed by `'c'`; by (2 + 1) / (3 + 3) = 1/2 since
*N*(`"aa"`) = 3, *N*(`"aac"`) = 2 and *S* = 3.
Similarly, the probabilities that `"aa"` is followed by
`'a'` and `'b'` is 1/6 and 1/3, respectively.
Note that the probabilities sum to 1.

One thing to notice here: To handle the beginning and end of the string, we
treat the string as circular. Thus, *N*(`"caa"`) = 2 instead of 1.

So, in constructing a Markov model, your first job is to write code that will
compute the appropriate counts *N*(·) by scanning the text
file and counting how many times all sequences
of the required lengths appear in the file.
These counts should be stored appropriately for later use.
You should also compute and record the alphabet size *S*.
Organize your program by creating a data type `MarkovModel`
with the following API:

Now, the first line below computes the number of occurrences of each substring from s of size k and k+1; the second line prints it to standard output; and remaining lines compute and print Laplace-smoothed probability estimates.public class MarkovModel { public MarkovModel(int k, String corpus) // build an order-k Markov model from corpus public double laplace(String s) // return laplace-smoothed probability estimate of s public String toString() // return a string representation of this model }

For example, whenMarkovModel model = new MarkovModel(k, corpus); StdOut.println(model); StdOut.printf("%.4f\n", model.laplace("aac")); StdOut.printf("%.4f\n", model.laplace("aaa")); StdOut.printf("%.4f\n", model.laplace("aab"));

alphabet size S = 3 "aa" 3 "ab" 2 "ac" 2 "ba" 1 "bc" 1 "ca" 3 "aab" 1 "aac" 2 "aba" 1 "abc" 1 "aca" 2 "baa" 1 "bca" 1 "caa" 2 "cab" 1 0.5000 0.1667 0.3333

**Determining the likelihood of a new sequence.**
Next, we will want to use the Markov model that you constructed from data
to compute a measure of how
well that model fits a new text sequence.
Later, we will use this measure
to determine which of two models (one for Bush and one for Kerry) better fits a
test sequence.
To compute a measure of fit, we will compute the
probability of the model generating the new sequence, a quantity usually called
the *likelihood* of the sequence under the model. For each symbol *c*
in the sequence, we can compute the probability of observing *c* under the model,
given its *k*-letter context *p*.
In particular, this is just the Laplace-smoothed
estimate given above.
To compute the likelihood, or probability of the
entire sequence, we can multiply these probabilities together,
for all symbols in the sequence. Although correct mathematically,
the resulting product might be indistinguishable from zero using
floating point arithmetic.
Therefore, rather than
multiplying probabilities, you will need to work with *log* probabilities.
That is, for each symbol *c* in the sequence, you will need to compute the
*log* of
its probability, and you will then *add* (rather than multiply) these log
probabilities to arrive at the final log likelihood of the sequence.
Keep in mind that all of these log probabilities will be negative (since
probabilities are never more than one).

Your second job is to compute, for a given string, and for each position *i* in
the string, an estimate of the log probability of the character at position *i*,
given its *k*-character context. The log likelihood of this entire sequence
can then be computed as the sum of these log probabilities.

For instance, continuing the example above, having constructed the second order
(*k* = 2) Markov model,
suppose we are given a new string `"aabca"`. We can compute the log
probability of each of the five symbols in this string as follows (where the
first column gives the context of each of the symbols in the second column):

context c log probability ------------------------------------------------ "aa" 'b' log((1 + 1) / (3 + 3)) = -1.0986 "ab" 'c' log((1 + 1) / (2 + 3)) = -0.9163 "bc" 'a' log((1 + 1) / (1 + 3)) = -0.6931 "ca" 'a' log((2 + 1) / (3 + 3)) = -0.6931 "aa" 'a' log((0 + 1) / (3 + 3)) = -1.7918 TOTAL log likelihood = -5.1930 AVERAGE log likelihood = -1.0386

One thing to notice here: To handle the beginning and end of the string, we
again treat the string as circular. Thus, in `"aabca"`,
the character following the context `"ca"` is `'a'`.

**Choosing the best model.**
Now that we have a method of measuring the
fit of a model to a particular string, we can choose the best model to be the
one that maximizes the likelihood of the string. In other words, we can
simply measure the likelihood of the string under each model (Bush or Kerry)
and choose the one that is greatest.

Write a program `TopModel.java` as follows:
its first argument is the order parameter *k*; its
next two arguments are the names of the files containing excerpts
from the two candidates (or other text that you might choose to experiment with).
The remaining arguments are the names of the files containing text
sequences you wish to classify.
Your program should build models for each of the two candidates,
read in each text sequence, and classify it using the output
format described below.

**Input format.**
Each file is a sequence of Unicode characters.
We provide data files from the three presidential debates.
You should train using the first two debates and test on the third.
Read the files using the `In.java` library that we provide.

For amusement, you might also see how your Bush/Kerry models predict when attempting to classify quotes from John Edwards and Dick Cheney taken from the vice presidential debate. Or you might use the data from the Obama/McCain and Biden/Palin debates.In in = new In("filename.txt"); String corpus = in.readAll();

**Output format.**
For each text sequence,
your program should output the filename of the text block and its *average*
likelihood under each model (that is, the likelihood as computed above, divided by the
length of the sequence—this is simply to prevent unusually long sequences
from seeming more significant than they are).
Also, print the difference of
these two numbers; the *sign* of the difference will indicate which model is more
likely, and the *absolute value* of the difference can be interpreted as a measure of
confidence in this prediction.

To get an idea of why the algorithm is making its predictions, you should also
find and print the ten positions in the test sequence where the difference in
the log probabilities of the two models are greatest.
That is, if the log
probability of character `i` (given its context) under model `j`
is `l _{j}[i]`, then
you should find the 10 indices

%java TopModel 2 bush1+2.txt kerry1+2.txt bush3-00.txt bush3-01.txt bush3-00.txt -2.1131 -2.1895 +0.0764 "spr" -2.715 -4.963 +2.247 "eek" -5.617 -3.457 -2.160 "eek" -5.617 -3.457 -2.160 "ari" -6.011 -3.901 -2.110 "siv" -3.843 -5.900 +2.057 "siv" -3.843 -5.900 +2.057 "siv" -3.843 -5.900 +2.057 "ban" -2.338 -4.174 +1.836 "sig" -2.744 -4.514 +1.769 "fea" -2.645 -4.401 +1.756 bush3-01.txt -2.1467 -2.1607 +0.0140 "Of " -2.351 -4.331 +1.979 " Of" -2.543 -4.382 +1.839 " My" -1.674 -3.308 +1.634 "My " -1.376 -2.983 +1.607 "sal" -4.956 -3.418 -1.538 "sal" -4.956 -3.418 -1.538 "nen" -2.275 -3.703 +1.428 " Go" -2.970 -4.357 +1.386 "k I" -5.380 -4.003 -1.377 "dis" -1.946 -3.292 +1.346

The first line indicates that for the file `bush3-00.txt`,
the average likelihood under the model built using `bush1+2.txt` was
`-2.1131`,
and `-2.1895` under the `kerry1+2.txt` model.
The difference of these numbers is `+0.0764`,
which, being positive, indicates an overall prediction that this quote
was uttered by Bush.
The next line indicates that the most significant difference in
log probabilities between the two models came in predicting
`'r'` under context `"sp"` where the Bush log probability was
`-2.715` and the Kerry log probability was
`-4.963`, the difference being `+2.247`.

When printing characters from the text, you should convert all white space to ordinary spaces.

**Analysis.**
Analyze your approach to this problem giving estimates of
its time and space requirements. Also, critically explore the
effectiveness of this technique for this task.

**Submission.**
Submit `TopModel.java`, `MarkovModel.java`, and
any other files needed by your program (excluding those
in `stdlib.jar` and `adt.jar`).
Finally, submit a
readme.txt file and answer the questions.

**Midterm evaluation.**
Please fill out the following anonymous
questionnaire to provide us with feedback
to help us improve this course.

This assignment was developed by Rob Schapire (and slightly adapted by Kevin Wayne).