Programming Assignment Checklist: Markov Model of Natural Language

Frequently Asked Questions

What are the main goals of this assignment? Use a graph-like data structure, learn about natural language processing, and learn about Markov chains.

What is the origin of the Markov text generator? It was first described by Claude Shannon in 1948. The first computer version was apparently written by Don P. Mitchell, adapted by Bruce Ellis, and popularized by A. K. Dewdney in the Computing Recreations section of Scientific American. Brian Kernighan and Rob Pike revived the program in a University setting and described it as an example of design in The Practice of Programming. The program is also described in Jon Bentley's Programming Pearls.

How do I read in characters from standard input? Use StdIn.readAll().

Given a string s, is there an efficient way to get a substring? Use s.substring(i, i+k) to get the k-character substring starting at i. In Java, this operation takes constant time, regardless of the size of k. This efficient construction is possible because strings are immutable, which enables the library to reuse the relevant part of the original string's character array. See Section 3.1 for more information on strings.

Why treat the text as a cyclic string? This prevents the Markov chain from ever getting "stuck", i.e., entering a state with no outgoing transitions.

How do I emulate the behavior of a cyclic string? There are a number of approaches. The easiest might be to concatenate the first k characters to the end of the text.

Should my program generate a different output each time I run it? Yes.

My random text ends in the middle of a sentence. Is that OK? Yes, that's to be expected.

For which values of k should my program work? It's fine to assume k > 0. Naturally, as k gets larger, your program will use more memory and take longer. Also, you can assume k <= M.

I get an OutOfMemoryException. How do I tell Java to use more of my computer's memory? Depending on your operating system, you may have to ask the Java Virtual Machine for more main memory beyond the 64MB default.

java -Xmx100m TextGenerator 7 1000 < input.txt
The 100m means 100MB, and you should adjust this number depending on the size of the input text.

Why not store the probability of each transition instead of using parallel edges? It's easier to implement this way. Storing the probabilities would be more efficient and more general, but it's not necessary on this assignment.

Submission

readme.txt. Use the following readme file template.

Possible Progress Steps

These are purely suggestions for how you might make progress. You do not have to follow these steps.

  1. Download the directory markov. It contains an implemention of Graph.java, ST.java, SET.java, and Queue.java which you can use as a starting point.

  2. Rename Queue.java to a file RandomQueue.java and modify it to model a random queue.

  3. Rename Graph to MarkovChain and modify it to model a Markov chain.

  4. Write the client program TextGenerator.java in small steps.

Enrichment