Programming Assignment Checklist: Markov Model of Natural Language

Goals
Here are the main goals of the assignment.

Use a graph-like data structure.
Learn about natural language processing.
Learn about Markov chains.

Frequently Asked Questions

What is the origin of the Markov text generator? It was first described by Claude Shannon in 1948. The first computer version was apparently written by Don P. Mitchell, adapted by Bruce Ellis, and popularized by A. K. Dewdney in the Computing Recreations section of Scientific American. Brian Kernighan and Rob Pike revived the program in a University setting and described it as an example of design in The Practice of Programming. The program is also described in Jon Bentley's Programming Pearls.

How do I read in characters from standard input? At this time StdIn doesn't support readAll, so you should use In instead.

In stdin = new In();
String text = stdin.readAll();

Given a string s, is there an efficient way to get a substring? Use s.substring(i, i+k) to get the k-character substring starting at i. In Java, this operation takes constant time, regardless of the size of k. This efficient construction is possible because strings are immutable, which enables the library to reuse the relevant part of the original string's character array. See Section 3.1 for more information on strings.

Why treat the text as a cyclic string? This prevents the Markov chain from ever getting "stuck", i.e., entering a state with no outgoing transitions.

How do I emulate the behavior of a cyclic string? There are a number of approaches. The easiest might be to concatenate the first k characters to the end of the text.

Should my program generate a different output each time I run it? Yes.

For which values of k should my program work? It's fine to assume k > 0. Naturally, as k gets larger, your program will use more memory and take longer.

I get an OutOfMemoryException. How do I tell Java to use more of my computer's memory? Depending on your operating system, you may have to ask the Java Virtual Machine for more main memory beyond the 64MB default.


java -Xmx100m TextGenerator 7 1000 < input.txt

The 100m means 100MB, and you should adjust this number depending on the size of the input text.

Why not store the probability of each transition instead of using parallel edges? It's easier to implement this way. Storing the probabilities would be more efficient and more general, but it's not necessary on this assignment.

Input, Output, and Testing

Input. Here are a number of additional sample input files.

Submission

Submission. Submit MarkovChain.java, TextGenerator.java, AdjList.java and readme.txt. Do not modify or submit In.java or SymbolTable.java.

readme.txt. Use the following readme file template.

Possible Progress Steps

These are purely suggestions for how you might make progress. You do not have to follow these steps.

Download the directory markov. It contains an implemention of Graph.java, SymbolTable.java, and AdjList.java which you will want to use as your starting point.
Modify AdjList to include a method random() that returns a random element (a String) of the adjacency list (or null if the list is empty). Be sure to test out your routine on inputs of size 0, 1, and 2. Make sure that it returns each element with equal probability. You may use the program TestAdjList.java to do this.
Rename Graph to MarkovChain and modify it to model a Markov chain. This involves replacing addEdge with addTransition and replacing neighbors with next. Be sure to thoroughly test your modifications. The last thing you want is to have an undiscovered bug in MarkovChain when you're writing the client. You may use TestMarkovChain.java to help you test. It implements the example in the assignment description.
Write the client program TextGenerator.java in small steps. Start by reading in the data and printing it back out. Next, add a command line parameter k, and print out all substrings of length k. Next print out all pairs of adjacent k-grams instead, and make it work with cyclic wrap-around. Then, insert these pairs into the Markov chain and print out the Markov chain. Now, remove all of the debugging print statements and simulate a random trajectory through the Markov chain. Thoroughly test and debug each piece as soon as you write it.

Enrichment

Here's a website that generate pseduo-random computer science papers. It uses something called a context free grammar instead of a Markov chain, but otherwise is similar in spirit to what you are doing on this assignment.
What can I do with my random text? CNN reports that one former student read his output at the Frist filibuster.