COS 226 Programming Assignment Checklist: Language Modeling

Goals

Here are the main goals of the assignment.

Frequently Asked Questions

What is the input alphabet? The inputs consist of UNICODE text (although we don't plan on feeding you anything outside the ASCII subset of UNICODE). Do not assume it is just whitespace and lowercase letters.

My random process dead ends because it generates a match for for the final K characters of the input file, but this string does not appear anywhere else in the input. That's OK. If you prefer, you can repeatedly restart your code from the beginning until you get the right number of characters. An alternative is to treat the input string as circular so that there are no dead ends.

Can I use the hash table in Java's library? Yes. Recall that Java's symbol tables cannot store two elements with the same key. In this application, you won't be storing duplicate keys. You may use SymbolTable.java - it implements a symbol table using Java's HashMap library, and also includes a show method.

Should my program generate a different output each time I run it? Yes.

How can I be sure that I'm returning a random character from the correct probability distribution? If you have an array suffixes of N possible suffix characters, and suffixes[i] appears freq[i] times, then the following code will return the character suffixes[i] with probability freq[i] / F, where F is the sum of the frequencies.

int r = (int) (Math.random() * F);
int sum = 0;
for (int i = 0; i < N; i++) {
    sum = sum + freq[i];
    if (r < sum) return suffixes[i];
}
This code can be easily modified to work with linked lists.

How much memory should my program use? Note that it's not necessary to store the input. Also, if the input is large, your language model may take up less space than the original input, e.g., an order 5 Markov model of 1 billion DNA base pairs will have at most 4^5 = 1,024 keys, and each key will have at most 4 suffix characters.

How do I read in characters from standard input? The library StdIn.java is intended for parsing tokens. Use CharStdIn.java for character input.

Given a string s, how do I access its ith character? Use s.charAt(i).

Given a string s, is there an efficient way to get a substring? Use s.substring(i, i+k) to get the k-character substring starting at i. In Java, this operation takes constant time, regardless of the size of k. This is possible because strings are immutable, so the library can reuse the part of the array in the original string.

Given a string s, how do I easily convert all of its whitespace characters to ' ' or '_'? Use the String library function s.replaceAll("\\s", "_");. The \s matches any sequence of whitespace character; the extra \ is needed to "escape" the usual meaning of \ inside strings. You should only do this when printing out the language model itself, not when storing the language model or printing random text.

How do I determine with a given character c is a whitespace character? Use if (Character.isWhitespace(c)).

Do I need to print the distinct keys or the frequency lists in lexicographic order like in the reference solutions? No, you are free to print them out in whatever order is most convenient. However, if you want to compare your solutions with our reference solutions, the lexicographic ordering might make things easier.

Can I exploit the fact that I'm always consecutively hashing K-character keys that differ in only the first and last characters. Yes, this can speed up your algorithm by a factor of K. If K is large, this is significant. You are not required to implement this idea, but if you finish early, it's worth thinking about.

I get an OutOfMemoryException. Is there anything I can do? Depending on your operating system, you may have to ask the Java Virtual Machine for more main memory. The following requests 200MB of memory.

java -Xmx200m TextGenerator 10 < mobydick.txt
Of course, using less memory is also an option.

Input, Output, and Testing

Input and output. Here are a number of sample input files. They are the same ones from last week.

Reference solutions.   For reference, we provide executable code for our solution in Windows, Solaris, and OS X. We encourage you to develop a more efficient program! It is certainly possible that you may be able to beat our solution, and it is also possible that you will get most/all of the credit even if your method is not quite as efficient.

Note that our program ignores the last character read in because this is always a newline character in our sample files. This helps prevent the dead-ending on some small test files.

Submission and readme

You should submit at least the following three programs, together with a readme.txt file: Markov.java (a data structure corresponding to a K-tuple of letters, with the breakdown of the next possible letters with their frequencies), LanguageModeler.java (corresponding to part 1), and TextGenerator.java (corresponding to part 2). There is no need to submit FrequencyCounter.java this this is subsumed by LanguageModeler.java.

Here is a template readme.txt file. Use a plain text editor to edit it (no wordpad, msword, etc). It should contain the following information:

  • A high-level description for the design of your algorithm, and what influenced your decisions.
  • An analysis of your performance characteristics, especially that of space usage.
  • An argument why your code generates each output with the correct probability.
  • Submit one or two of the most amusing excerpts that your program generated.
  • Unix users may find the shell script timeit.csh useful for computing a table of CPU times.

    Possible progress steps

    This assignment is carefully organized in several pieces, with checkpoints along the way. It is tempting to race ahead and try to write all of the pieces at once. This is a mistake often made by novices. Although it will appear that you are making rapid progress, should you encounter a bug, you won't know in which part of your code it is buried. You will save time in the long run by diligently testing and debugging each piece of code as you write it.