Programming Assignment Checklist: Markov Model of Natural Language

Pair programming. On this assignment, just like the last one, you are encouraged (not required) to work with a partner provided you practice pair programming (same rules as last assignment).

Frequently Asked Questions

What are the main goals of this assignment? Use a symbol table, learn about natural language processing, and learn about Markov models.

What preparation do I need before beginning this assignment? Review the Parameterized data types section of the textbook on pages 566-570 to make sure you know how to use Generics.

Review the beginning of Section 4.4 (pages 608-617) to make sure you understand the functionality of Symbol Tables and how to write a Symbol Table client.

You will be writing a client for the Symbol Table object class ST.java. Here is a subset of the API for ST.java:

public class ST<Key, Value>         // Note: Key must be Comparable
------------------------------------------------------------------------
             ST()                   // create a symbol table
       void  put(Key key, Value v)  // put key-value pair into the table
      Value  get(Key key)           // return value paired with key
                                    // or null if no such value
    boolean  contains(Key key)      // is there a value paired with key?
    public int size()               // return the number of key-value pairs in the symbol table

What is the origin of the Markov text generator? It was first described by Claude Shannon in 1948. The first computer version was apparently written by Don P. Mitchell, adapted by Bruce Ellis, and popularized by A. K. Dewdney in the Computing Recreations section of Scientific American. Brian Kernighan and Rob Pike revived the program in a University setting and described it as an example of design in The Practice of Programming. The program is also described in Jon Bentley's Programming Pearls.

Do I have to implement the prescribed API? Yes, or you will lose a substantial number of points.

How do I read in the input text from standard input? Use StdIn.readAll().

Given a string s, is there an efficient way to extract a substring? Use s.substring(i, i+k) to get the k-character substring starting at i.

How do I emulate the behavior of a circular string? There are a number of approaches. One way is to concatenate the first k characters to the end of the input text.

How do I convert a char to an int? A char is a 16-bit (unsigned) integer. Java will automatically promote a char to an int if you use it as an index into an array.

How do I use StdRandom.discrete() to pick the next character? The argument to StdRandom.discrete() needs to be a double[] holding the probabilities of each character being next. StdRandom.discrete() will return an integer which is the index of the random pick based on those probabilities. That index is the integer value of the next character.

My rand() method calls StdRandom.discrete() as recommended in the Possible Progress Steps, but I get the following error message when I run: java.lang.AssertionError. What does that mean? It means that the probabilities don't sum up to 1. Double check how you are computing the values for the array you send to StdRandom.discrete(). The array elements are the probabilities of each possible event, so the sum of the array elements should be 1. (To learn how to use assertions, see pp. 446-447.)

Should my program generate a different output each time I run it? Yes.

My random text ends in the middle of a sentence. Is that OK? Yes, that's to be expected. We recommend using a StdOut.println(); statement to ensure that there is a new line generated after the last character.

For which values of k should my program work? It should work for all well-defined values of k from and including 0 to and including the length of the input text. Naturally, as k gets larger, your program will use more memory and take longer.

I get an OutOfMemoryException. How do I tell Java to use more of my computer's memory? Depending on your operating system, you may have to ask the Java Virtual Machine for more main memory beyond the default.

% java-introcs -Xmx500m TextGenerator 7 1000 < input.txt

The 500m means 500MB, and you should adjust this number depending on the size of the input text.

What is a StringBuilder object? StringBuilder is part of the standard Java library. It is an object that we use because of its more efficient handling of large strings — we saw earlier that repeatedly concatenating characters to a String takes quadratic time, whereas this takes linear time for a StringBuilder. Here is a subset of the StringBuilder API with some methods you might find useful for this assignment. You do not need to use all of them.

public class StringBuilder
-------------------------------------------------------------------------------------------------------
              StringBuilder(String s) // create a StringBuilder initialized to the contents of String s
StringBuilder append(char c)          // append the string representation of the char to this sequence
String        toString()              // returns a String representing the data in this sequence
String        substring(int i, int j) // returns the length (j-i) substring including position i, 
                                      // up to and excluding position j (like String.substring())

Testing

Thoroughly test your MarkovModel. We provide a main() as a start to your testing.

    public static void main(String[] args) {
        MarkovModel mod1 = new MarkovModel("banana", 2);
        StdOut.println("freq(\"an\", 'a')    = " + mod1.freq("an", 'a'));
        StdOut.println("freq(\"na\", 'b')    = " + mod1.freq("na", 'b'));
        StdOut.println("freq(\"na\", 'a')    = " + mod1.freq("na", 'a'));
        StdOut.println("freq(\"na\")         = " + mod1.freq("na"));
        StdOut.println();

        String text = "one fish two fish red fish blue fish"; 
        MarkovModel mod2 = new MarkovModel(text, 4);
        StdOut.println("freq(\"ish \", 'r') = " + mod2.freq("ish ", 'r'));
        StdOut.println("freq(\"ish \", 0)   = " + mod2.freq("ish ", (char) 0));
        StdOut.println("freq(\"ish \")      = " + mod2.freq("ish "));
        StdOut.println("freq(\"tuna\")      = " + mod2.freq("tuna"));
    }

If your method is working properly, you will get the following output:

% java-introcs MarkovModel
freq("an", 'a')    = 2
freq("na", 'b')    = 1
freq("na", 'a')    = 0
freq("na")         = 2

freq("ish ", 'r') = 1
freq("ish ", 0)   = 0
freq("ish ")      = 3
freq("tuna")      = 0

Note that this does not test your rand() or gen() methods.

You can use print statements to test rand() to make sure that each non-zero entry of the array passed to StdRandom.discrete() is accurate. (But make sure to remove these print statements from your rand() method before submitting your code.)

Alternatively, you can write a loop to call rand() many times and count how many times a particular character is returned. (e.g., With mod1 above, rand("na") should generate 'b' half of the time. With mod2 above, rand("fish") should generate 'o' one quarter of the time.)

To test gen(), use a text that has no repetition (e.g., "abc") and set order k=1. This should only be able to generate text where "a" is followed by "b", "b" is followed by "c", "c" is followed by "a". Then use a text with easy to compute repetition (e.g., "abac"). It should generate "a" followed half the time by "b" and half the time by "c". "b" or "c" should always be followed by "a".

An order-0 Markov model generates a random sequence of letters where each letter appears with probability proportional to its frequency in the input text. For input17.txt there are 9 g's, 7 a's, and 1 c. So we want the probability of generating a 'g' to be 9/17, an 'a' to be 7/17, and a 'c' to be be 1/17. In a sequence of 100 characters, we'd therefore expect on average about 53 g's, 41 a's, and 6 c's.

% java-introcs TextGenerator 0 100 < input17.txt
gaaagaacagcagacgacggaagaaggaggaaaaggaggggaggggggaggaggaagggagaaaggagacagcggaggggacgggaggagaggaggagag

For input17.txt, the next character after "ga" is 'a' with probability 1/5 and 'g' with probability 4/5. If you run the following command 10 times, you should expect (on average) to see "gag" 8 times and "gaa" 2 times.

% java-introcs TextGenerator 2 3 < input17.txt
gag

Possible Progress Steps

These are purely suggestions for how you might make progress. You do not have to follow these steps.

Download the files mentioned on the assignment page. Make sure that ST.java and the testing .txt files are in the exact same directory as the .java files you are writing. You might also want to use StdRandom.java, but it is a part of stdlib.jar, so you probably won't need to download it.
Review the material in the textbook on symbol tables as well as IntegerSort.java.
Create an instance variable for your symbol table where the key is a String. There is more than one possible type for the value of your symbol table.
- One strategy is for the value to be another symbol table. This symbol table will have a character as the key and a count of frequencies (how many times one character appears after the key in the text) as a value.
- Another strategy is for the value to be an integer array. Each entry of the integer array is a count of frequencies.
Please note that there is no reason to save the original text or the circular text as an instance variable. That would be a waste of space. Your MarkovModel object does not need either of these strings after the symbol table is built.
Write your constructor to create the circular version of the text string. Then initialize and populate your symbol table, using the symbol table methods contains(), get() and put().

OPTIONAL: You may want to write a private helper method that prints the symbol table created by the constructor. This can certainly help you debug SMALL test cases. Use the enhanced for loop to access each element in your symbol table. Here's some pseudo-code:

// helper method that prints each entry in the symbol table
private void printTable() {

   // st is the symbol table object (ST st = ...)
   for (String key : st) {
       StdOut.print (key + ": ")
       ... entry = st.get(key);

       // code for printing entry
       ...

       // new line before next entry
       StdOut.println();
    }

}

Write the order() method. This should be a one-liner.
Using the symbol table instance variable, write the freq() methods.
Using the main() provided above test this part of your MarkovModel data type before continuing. You may wish to add more testing of the constructor and freq() methods.
Write the rand() method. To generate a random character with probability proportional to its frequency you may call or refer to StdRandom.discrete() on p. 226 of the textbook.
It may not be obvious from your final results if rand() is working as you intended, so be sure to test it thoroughly. Then test your complete MarkovModel data type before continuing.
Write the gen() method. To generate a trajectory, create a StringBuilder object which consists of the kgram argument. Then write a loop to generate the remaining characters, one character at a time using other methods from MarkovModel and append each character to the StringBuilder object. (Do not use String concatenation. It is very slow.) You will need to convert the StringBuilder object to a String object to match the method return type.
Write the client program TextGenerator.java. Create an instance of the MarkovModel class with the string read from StdIn. Use the gen() method from the MarkovModel class to generate a T-character trajectory. Output the trajectory, followed by a new line.
Make sure to test long inputs (we provide several) and long outputs with different orders. If the order and input size are fixed, your program should run in time that is linear in the output size. If the order and output size are fixed, your program should run in time that is linear in the input size.

Enrichment

Here's a website that generate pseudo-random computer science papers. It uses something called a context free grammar instead of a Markov chain, but otherwise is similar in spirit to what you are doing on this assignment.

What can I do with my random text? CNN reported the famous Frist filibuster of 2005 where one former student read his output.

We are implementing the model Shannon described in his landmark paper. But the "one reads until this letter is encountered" method in his quote on the assignment page is, ironically, not a statistically accurate example of his model. If we run your program with input "wawawaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaawy" then 'w' should be followed by 'a' 75% of the time, while the "read until" model will follow 'w' with 'a' only 15% of the time. What would be involved in simulating this other model? Which one do you think gives more realistic text?

Here are Garfield comics generated by a Markov chain.