COS 226 Programming Assignment Checklist: The Markovian Candidate

Goals

Here are the main goals of the assignment.

Use a symbol table in an interesting domain.
Learn about natural language processing, machine learning and statistical modeling.

Frequently Asked Questions

Is it important that my program implements the MarkovModel interface? Yes, we will test this piece independently. It need not print out the k-grams in the given order, e.g., it's fine if you organize the strings of length k and k+1 in the same data structures.

Do I need to remove punctuation, change characters to lower case, or otherwise modify the text? No. For building and using the Markov models, use the Unicode text exactly as given. The only processing you should do is to change all whitespace characters to ordinary spaces whenever you print anything out (but not when building or using the Markov models).

My program does not always predict the correct speaker on this task (for instance, it sometimes predicts that Bush was the speaker on a Kerry quote). Is this okay? Yes, even a correct implementation might not get perfect accuracy using this technique (or any technique).

How do I set the alphabet size S for Laplace smoothing? Use the actual number of distinct characters in the given training file. (In reality, it would be better to use the number of distinct characters in the union of the two training files and all of the test files.)

How do I handle the beginning of a text block where there are fewer than k characters in the context? Treat the string as if it were a circular string. So if k = 3 and the string is "abcdefg", the context c would be gab.

Do I need to handle the case when k is greater than the length of the test input? No. It's fine to assume k is less than the length of either of the training inputs or of the test input.

Can I use ST.java, SET.java, MaxPQ.java, and MinPQ.java from the booksite? Absolutely. You are also free to use java.util.HashMap, java.util.HashSet, java.util.TreeMap, and java.util.TreeSet if you prefer.

What is N(p) for an order 0 Markov chain? When p is the empty string (string of length 0), it's equal to the total number of characters in the input.

Given a string s, how do I access its ith character? Extract a substring? Use s.charAt(i) to get the ith character. Use s.substring(i, i+k) to get the k-character substring starting at i, or s.substring(i) for the substring beginning at i and continuing to the end of the string.

Given a string s, how can I convert all of its whitespace characters to ordinary spaces? Use the String library function s.replaceAll("\\s", " ");. The \s matches any sequence of white space characters; the extra \ is needed to "escape" the usual meaning of \ inside strings. You should only do this when printing out, not when creating the Markov model or computing likelihoods. Note that s.replaceAll() does not change s itself, but returns the resulting string.

When computing the average likelihood, should I include a (k+1)-gram multiple times if it appears multiple times? Yes, absolutely. Also, include duplicates when printing the 10 indices for which the difference in log probabilities is greatest.

Am I required to format floating point numbers exactly as in the reference solution? No, this is not a requirement, but if you wish, you can use System.out.printf() or String.format().

My toString() method takes quadratic time. What can I do to make it run in linear time? Strings are immutable. One consequence is that repeated string concatenation is inefficient. Use the StringBuilder library. See the Polygon.java example.

Why do I get slightly different answers when I execute on Windows? It could be the way your operating system separates lines of text. Unix uses "\n"; Windows uses "\r\n". If your alphabet size is one bigger than expected and all your lines have one extra character, this is the reason.

Input, Output, and Testing

Input. Here is a directory containing all of the data from the debates. We are providing data files from the three debates, one file per speaker per debate with text blocks corresponding to the actual responses. These are in the files bush1.txt, kerry1.txt, bush2.txt, etc. The first two debates have also been combined into the files bush1+2.txt and kerry1+2.txt. You should train using the first two debates and test on the third.

To run your program on a large number of inputs, take advantage of the wildcard capabilities of your operating system (Windows, Linux, and OS X).

%java TopModel 2 bush1+2.txt kerry1+2.txt bush3-*

For amusement, you might also see how your Bush/Kerry models do at classifying quotes from Barack Obama and John McCain or from Joe Biden and Sarah Palin from the 2008 debates.

We have also provided texts (which students submitted last semester). These can be used to detect differences in literature across centuries. There are two feeder files. One, 19th.txt, consists of 'A Christmas Carol', 'Tom Sawyer', and 'The Scarlett Letter' (All taken from project Gutenberg). The other, 20th.txt, consisted of 'This Side of Paradise', 'The Portrait of the Artist as a Young Man', and 'Heart of Darkness'. The remaining files are particular fictional texts such as 'Moby Dick', 'The Call of the Wild', 'Sons and Lovers', and 'Alice in Wonderland'.

We will test your program using similar, but not necessarily the same data.

Output. To allow automatic testing, your program must use the output format described on the assignment.

You are encouraged to test your program carefully on small files where you can compute the correct outputs by hand.

Possible progress steps

Here are some suggestions of how you might make progress on this assignment.

Figure out an overall solution to the problem that will be sufficiently efficient both in time and space. Keep in mind that your program should be efficient even for fairly large values of the order parameter k (say, as big as 100). This rules out solutions that are exponential in k. Your program also must work for k as small as zero.
Download the directory markov containing the data files.
Once you have your MarkovModel.java class working, write a main() to test it. For instance, you might want to read in a file, then print out the count of how often each k-letter context appears in the file, and how often each of these contexts appears with each following symbol. You might also check that the log probabilities are being computed correctly
Before applying your program to the data that we have provided, you should test it extensively using small files of your own design where you can easily check the answer by hand.