COS 226 HW#4 Checklist

COS 226 Programming Assignment #4

Checklist

Goals

Here are the main goals of the assignment:

Use a symbol table in an interesting domain.
Use characters and Java strings.
Learn about natural language processing, machine learning and statistical modeling.

Frequently Asked Questions

What characters may appear in the input? The input consists of UNICODE text (although we don't plan on feeding you anything outside the ASCII subset of UNICODE). Do not assume it is just white space and lowercase letters. You can assume that the text blocks and tags will not include the delimiter character '|'.

Do I need to remove punctuation, change characters to lower case, or otherwise modify the text? No. For building and using the Markov models, you should use the UNICODE text exactly as given. The only processing you should do is to change all white space characters to ordinary spaces whenever you print anything out (but not when building or using the Markov models).

My program does not always predict the correct speaker on this task (for instance, it sometimes predicts that Bush was the speaker on a Kerry quote). Is this okay? Yes, even a correct implementation might not get perfect accuracy using this technique (or any technique).

How do I set the alphabet size S for Laplace smoothing? If you are only using the Bush/Kerry debate files that we are providing, it is okay to simply hard-code S to the constant 77, the actual number of unique unicode symbols appearing in these files. This is what we did in the reference solutions. A more general solution, which is not required, is to count the total number of unique characters appearing in the text of all the training files, and to then add one. If you use this solution, be sure to clearly document it since your results might be slightly different than those in the reference solutions.

How do I handle the beginning of a text block where there are fewer than k characters in the context? You have two options here, which actually are equivalent in the sense that they should give the same results. The first option is to allow contexts which are shorter than k characters long. For instance, with k=2, in the text block "abcdefg", the context for the letter 'a' would be the empty string "", the context for the letter 'b' would be the one-character string "a", and for all other letters, we would use two-letter contexts as usual (for instance, the context for 'c' would be "ab"). The second option is to pretend that the text block has been prepended with k repetitions of a special character that does not appear elsewhere in the text, such as '|'. Thus, the context for 'a' would be "||", the context for 'b' would be "|a", the context for 'c' would be "ab", etc. In the first approach, contexts can vary in length; in the second approach, they all have length exactly k.

Can I use the hash table class HashMap from Java's library? Yes. Keep in mind that Java's symbol tables cannot store two elements with the same key.

Can I use a Java system sort (such as Arrays.sort) and/or the priority queue implementation from the last assignment? Yes.

Given a string s, how do I access its i-th character? Use s.charAt(i).

Given a string s, is there an efficient way to get a substring? Use s.substring(i, i+k) to get the k-character substring starting at i, or s.substring(i) for the substring beginning at i and continuing to the end of the string.

Given a string s, how do I easily convert all of its white space characters to ordinary spaces? Use the String library function s.replaceAll("\\s", " ");. The \s matches any sequence of white space characters; the extra \ is needed to "escape" the usual meaning of \ inside strings. You should only do this when printing out, not when creating the Markov model or computing likelihoods.

How do I determine if a given character c is a white space character? Use Character.isWhitespace(c).

How do I compute logs? Use Math.log.

How do I convert a String s to an int? Use Integer.parseInt(s).

How do I access the command-line parameters? If your main method has the signature public static void main(String argv[]), then the first command-line parameter is stored in argv[0], the second in argv[1], etc.

Am I required to format floating point numbers exactly as in the reference solution? No, this is not a requirement, but if you wish, you can use the Java library DecimalFormat.

Where can I get documentation on the standard Java classes, including the ones mentioned above? See the documentation for the Java standard platform. Keep in mind that we are asking that you stick to Java version 1.4.2.

Input, Output, and Testing

Input. Here is a directory containing all of the data from the debates. We are providing data files from the three debates, one file per speaker per debate with text blocks corresponding to the actual responses. These are in the files bush1.txt, kerry1.txt, bush2.txt, etc. The first two debates have also been combined into the files bush1+2.txt and kerry1+2.txt. You should train using the first two debates and test on the third. For amusement, you might also see how your Bush/Kerry models do at classifying quotes from John Edwards and Dick Cheney taken from the vice presidential debate. These are in cheney.txt and edwards.txt. We will test your program using similar, but not necessarily the same, data.

These files can be read using the MarkovIn.java class. This class can be initialized to read from a data file using the MarkovIn(String filename) constructor, or from standard input using the MarkovIn() constructor. Once initialized. you can read the next tag using the readTag() method, followed by the next text block using the readText() method. Note that the final line terminator in each block of text is discarded. An example main that uses these is included. Note that this program uses the first character appearing in the file as a text block delimiter ('|' in all of the files that we are providing).

Output. To allow automatic testing, your program must use the output format described on the main webpage for this assignment. (It is okay if the floating point numbers are not formatted in exactly the same way, but otherwise, you should use the given format as closely as possible.)

Reference solutions. Reference solutions are being provided only for k=2 in the file refk=2.txt. This file was created using the command

% cat bush3.txt kerry3.txt | java BestModel 2 bush1+2.txt kerry1+2.txt

You are encouraged to test your program carefully on small files where you can compute the correct outputs by hand.

Submission and program report

Using whiteboard, you should submit Markov.java and BestModel.java, as well as any other files needed by your program (other than MarkovIn.java). You also need to submit a program report electronically using whiteboard. See the assignments page for more details. The program report should include:

A high-level description of the design of your algorithm, and what influenced your decisions.

An analysis of your algorithm's performance characteristics.

A critical exploration of the effectiveness of this method for this task.

Optionally, a description of other tasks you might have tried your program on.

Possible progress steps

Here are some suggestions of how you might make progress on this assignment.

Figure out an overall solution to the problem that will be sufficiently efficient both in time and space. Keep in mind that your program should be efficient even for fairly large values of the order parameter k (say, as big as 100). This rules out solutions that are exponential in k. Your program also must work for k as small as zero.
Download the directory markov containing the data files and MarkovIn.java. This directory is mirrored on arizona at ~cos226/ftp/spring05/markov.
Once you have your Markov.java class working, write a main to test it. For instance, you might want to read in a file, then print out the count of how often each k-letter context appears in the file, and how often each of these contexts appears with each following symbol. You might also check that the log probabilities are being computed correctly
Before applying your program to the data that we have provided, you should test it extensively using small files of your own design where you can easily check the answer by hand.