COS 126

Markov Model of Natural Language
Programming Assignment


Use a Markov chain to create a statistical model of a piece of English text. Simulate the Markov chain to generate stylized pseudo-random text.

Perspective. In the 1948 landmark paper A Mathematical Theory of Communication, Claude Shannon founded the field of information theory and revolutionized the telecommunications industry, laying the groundwork for today's Information Age. In this paper, Shannon proposed using a Markov chain to create a statistical model of the sequences of letters in a piece of English text. Markov chains are now widely used in speech recognition, handwriting recognition, information retrieval, data compression, and spam filtering. They also have many scientific computing applications, including the GeneMark algorithm for gene prediction, the Metropolis algorithm for measuring thermodynamic properties, and Google's PageRank algorithm for Web search. For this assignment, we consider a whimsical variant: generating stylized pseudo-random text.

Markov model of natural language. Shannon approximated the statistical structure of a piece of text using a simple mathematical model known as a Markov model. A Markov model of order 0 predicts that each letter in the alphabet occurs with a fixed probability. We can fit a Markov model of order 0 to a specific piece of text by counting the number of occurrences of each letter in that text, and using these frequencies as probabilities. For example, if the input text is "gagggagaggcgagaaa", the Markov model of order 0 predicts that each letter is 'a' with probability 7/17, 'c' with probability 1/17, and 'g' with probability 9/17 because these are the fraction of times each letter occurs. The following sequence of letters is a typical example generated from this model:

g a g g c g a g a a g a g a a g a a a g a g a g a g a a a g a g a a g ...
A Markov model of order 0 assumes that each letter is chosen independently. This independence does not match the statistical properties of English text, because there is a high correlation among successive letters in a word or sentence. For example, 'w' is more likely to be followed by 'e' than by 'u', while 'q' is more likely to be followed by 'u' than by 'e'.
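
Fitting an order 0 model amounts to counting letters. The following is a minimal sketch of that counting step, not a required part of the assignment; the class name OrderZeroSketch and the method letterCounts are hypothetical names chosen for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; the name is not part of the assignment API.
class OrderZeroSketch {
    // Count how many times each character occurs in the text.
    static Map<Character, Integer> letterCounts(String text) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (char c : text.toCharArray()) {
            counts.merge(c, 1, Integer::sum);   // add 1 to the count for c
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "gagggagaggcgagaaa";
        // Each count divided by the text length is the order 0 probability.
        for (Map.Entry<Character, Integer> e : letterCounts(text).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue() + "/" + text.length());
        }
    }
}
```

For the 17-character example string, this prints the fractions 7/17, 1/17, and 9/17 for 'a', 'c', and 'g'.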

We obtain a more refined model by allowing the probability of choosing each successive letter to depend on the preceding letter or letters. A Markov model of order k predicts that each letter occurs with a fixed probability, but that probability can depend on the previous k consecutive letters. Let a k-gram mean any k consecutive letters. Then for example, if the text has 100 occurrences of "th", with 60 occurrences of "the", 25 occurrences of "thi", 10 occurrences of "tha", and 5 occurrences of "tho", the Markov model of order 2 predicts that the next letter following the 2-gram "th" is 'e' with probability 3/5, 'i' with probability 1/4, 'a' with probability 1/10, and 'o' with probability 1/20.

A brute-force solution. Claude Shannon proposed a brute-force scheme to generate text according to a Markov model of order 1:

“To construct [a Markov model of order 1], for example, one opens a book at random and selects a letter at random on the page. This letter is recorded. The book is then opened to another page and one reads until this letter is encountered. The succeeding letter is then recorded. Turning to another page this second letter is searched for and the succeeding letter recorded, etc. It would be interesting if further approximations could be constructed, but the labor involved becomes enormous at the next stage.”
Your task is to write a Java program to automate this laborious task in a more efficient way — Shannon's brute-force approach is prohibitively slow when the size of the input text is large.
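
To see why the brute-force scheme is slow, it may help to transcribe Shannon's procedure literally. The sketch below is one possible reading of it for order 1; the class name ShannonBruteForceSketch is hypothetical, and wrapping the scan around at the end of the text is an assumption that Shannon's description does not address.

```java
import java.util.Random;

// Hypothetical class name, used only for illustration.
class ShannonBruteForceSketch {
    // Generates n characters by following Shannon's order 1 procedure.
    // Assumption: the scan wraps around at the end of the text.
    static String generate(String text, int n, Random rng) {
        if (n == 0) return "";
        StringBuilder out = new StringBuilder();
        // "Open the book at random and select a letter."
        char current = text.charAt(rng.nextInt(text.length()));
        out.append(current);
        while (out.length() < n) {
            // "Open to another page and read until this letter is encountered."
            int i = rng.nextInt(text.length());
            while (text.charAt(i) != current) {
                i = (i + 1) % text.length();
            }
            // "The succeeding letter is then recorded."
            current = text.charAt((i + 1) % text.length());
            out.append(current);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(generate("gagggagaggcgagaaa", 20, new Random()));
    }
}
```

Each generated character can require a scan over much of the input, so the running time grows with the product of the output length and the input length.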

Markov model data type. Create an immutable data type MarkovModel to represent a Markov model of order k from a given text string. The data type must implement the following API:

public class MarkovModel
----------------------------------------------------------------------------------------
// Note: all of the below constructors/methods should be public.

       MarkovModel(String text, int k)// create a Markov model of order k from given text
                                      // Assume that text has length at least k.

   int order()                        // order k of Markov model

   int freq(String kgram)             // number of occurrences of kgram in text
                                      // (throw an exception if kgram is not of length k)

   int freq(String kgram, char c)     // number of times that character c follows kgram
                                      // (throw an exception if kgram is not of length k)

  char rand(String kgram)             // random character following given kgram
                                      // (Throw an exception if kgram is not of length k.
                                      //  Throw an exception if no such kgram.)

String gen(String kgram, int T)       // generate a String of length T characters
                                      // by simulating a trajectory through the corresponding
                                      // Markov chain.  The first k characters of the newly
                                      // generated String should be the argument kgram.
                                      // Throw an exception if kgram is not of length k.
                                      // Assume that T is at least k.

Throw a RuntimeException in each case where the API above calls for an exception.
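
One way to handle the length checks is a small helper that every method calls first. This is only a sketch: the class KgramCheckSketch and the method requireOrderK are hypothetical names, and using IllegalArgumentException (a subclass of RuntimeException) is a choice made here, not something the API prescribes.

```java
// Hypothetical class; it shows only the argument check, not the model itself.
class KgramCheckSketch {
    private final int k;   // order of the model

    KgramCheckSketch(int k) {
        this.k = k;
    }

    // Throws a RuntimeException subclass if kgram does not have exactly k characters.
    void requireOrderK(String kgram) {
        if (kgram == null || kgram.length() != k) {
            throw new IllegalArgumentException("kgram must have length " + k + ": " + kgram);
        }
    }

    public static void main(String[] args) {
        KgramCheckSketch check = new KgramCheckSketch(2);
        check.requireOrderK("ga");    // passes silently
        check.requireOrderK("gag");   // throws IllegalArgumentException
    }
}
```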

To avoid dead ends, treat the input text as a circular string: the last character is considered to precede the first character. For example, if k = 2 and the text is the 17-character string "gagggagaggcgagaaa", then the salient features of the Markov model are captured in the table below:

               frequency of   probability that
                next char       next char is 
kgram   freq    a   c   g        a    c    g
----------------------------------------------
 aa      2      1   0   1       1/2   0   1/2  
 ag      5      3   0   2       3/5   0   2/5  
 cg      1      1   0   0        1    0    0
 ga      5      1   0   4       1/5   0   4/5  
 gc      1      0   0   1        0    0    1  
 gg      3      1   1   1       1/3  1/3  1/3  
----------------------------------------------
        17      7   1   9
Note that the frequency of "ag" is 5 (and not 4) because we are treating the string as circular.
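
The counts in the table can be reproduced with a short program. The sketch below is one possible approach, assuming the text has length at least k; the class name CircularCountsSketch and the method count are hypothetical. It makes the text circular by appending its first k characters, then records the character that follows each k-gram.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class name, used only for illustration.
class CircularCountsSketch {
    // Maps each k-gram to {next character -> count}, treating text as circular.
    // Assumes text.length() >= k.
    static Map<String, Map<Character, Integer>> count(String text, int k) {
        Map<String, Map<Character, Integer>> table = new TreeMap<>();
        String circular = text + text.substring(0, k);   // wrap the first k characters
        for (int i = 0; i < text.length(); i++) {
            String kgram = circular.substring(i, i + k);
            char next = circular.charAt(i + k);
            table.computeIfAbsent(kgram, key -> new TreeMap<>())
                 .merge(next, 1, Integer::sum);
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, Map<Character, Integer>> table = count("gagggagaggcgagaaa", 2);
        for (Map.Entry<String, Map<Character, Integer>> e : table.entrySet()) {
            int freq = 0;   // frequency of the k-gram is the sum of its follower counts
            for (int c : e.getValue().values()) freq += c;
            System.out.println(e.getKey() + "  " + freq + "  " + e.getValue());
        }
    }
}
```

Because every position in the circular text starts exactly one k-gram, the k-gram frequencies sum to the length of the text (17 in the example).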

A Markov chain is a stochastic process where the state change depends only on the current state. For text generation, the current state is a k-gram. The next character is selected at random, using the probabilities from the Markov model. For example, if the current state is "ga" in the Markov model of order 2 discussed above, then the next character is 'a' with probability 1/5 and 'g' with probability 4/5. The next state in the Markov chain is obtained by appending the new character to the end of the k-gram and discarding the first character. A trajectory through the Markov chain is a sequence of such states. Below is a possible trajectory consisting of 9 transitions.

trajectory:          ga  -->  ag  -->  gg  -->  gc  -->  cg  -->  ga  -->  ag  -->  ga  -->  aa  -->  ag
probability for a:       1/5      3/5      1/3       0        1       1/5      3/5      1/5      1/2
probability for c:        0        0       1/3       0        0        0        0        0        0
probability for g:       4/5      2/5      1/3       1        0       4/5      2/5      4/5      1/2
Treating the input text as a circular string ensures that the Markov chain never gets stuck in a state with no next characters.
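
A single transition has two parts: choosing the next character in proportion to its count, and sliding the k-gram window forward. The following sketch illustrates both under the assumption that the follower counts for the current state are already available as arrays; the names TransitionSketch, pick, and nextState are hypothetical.

```java
import java.util.Random;

// Hypothetical class name, used only for illustration.
class TransitionSketch {
    // Returns symbols[i] with probability counts[i] / (sum of counts).
    static char pick(char[] symbols, int[] counts, Random rng) {
        int total = 0;
        for (int c : counts) total += c;
        int r = rng.nextInt(total);          // uniform in [0, total)
        for (int i = 0; i < counts.length; i++) {
            r -= counts[i];
            if (r < 0) return symbols[i];
        }
        throw new IllegalStateException("unreachable when total > 0");
    }

    // Next state: append c to the k-gram and discard its first character.
    static String nextState(String kgram, char c) {
        return (kgram + c).substring(1);
    }

    public static void main(String[] args) {
        char[] symbols = {'a', 'c', 'g'};
        int[] countsAfterGa = {1, 0, 4};     // the row for "ga" in the table above
        char c = pick(symbols, countsAfterGa, new Random());
        System.out.println("ga --> " + nextState("ga", c));
    }
}
```

A symbol with count 0 is never selected, so the chain only visits k-grams that occur in the circular text.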

To generate random text from a Markov model of order k, set the initial state to k characters from the input text. Then, simulate a trajectory through the Markov chain by performing T − k transitions, appending the random character selected at each step. For example, if k = 2 and T = 11, the following is a possible trajectory leading to the output gaggcgagaag.

trajectory:          ga  -->  ag  -->  gg  -->  gc  -->  cg  -->  ga  -->  ag  -->  ga  -->  aa  -->  ag
output:              ga        g        g        c        g        a        g        a        a        g
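
The generation loop can be sketched compactly if, instead of storing counts, the model stores for each k-gram a string of every character that follows it (with repeats). Picking a uniformly random character from that string is equivalent to picking in proportion to the counts. This is a simplification chosen for illustration, not necessarily how MarkovModel should be organized; the names GenerateSketch and generate are hypothetical, and the sketch assumes the start k-gram has length k and T is at least k.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical class name, used only for illustration.
class GenerateSketch {
    // Generates T characters starting from the k-gram start.
    // Assumes text.length() >= k, start.length() == k, and T >= k.
    static String generate(String text, int k, String start, int T, Random rng) {
        // followers.get(kgram) lists every character that follows kgram, with repeats.
        Map<String, StringBuilder> followers = new HashMap<>();
        String circular = text + text.substring(0, k);
        for (int i = 0; i < text.length(); i++) {
            followers.computeIfAbsent(circular.substring(i, i + k), key -> new StringBuilder())
                     .append(circular.charAt(i + k));
        }

        StringBuilder out = new StringBuilder(start);
        String state = start;
        while (out.length() < T) {           // performs T - k transitions
            StringBuilder f = followers.get(state);
            if (f == null) throw new IllegalArgumentException("no such kgram: " + state);
            char c = f.charAt(rng.nextInt(f.length()));
            out.append(c);
            state = (state + c).substring(1);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(generate("gagggagaggcgagaaa", 2, "ga", 11, new Random()));
    }
}
```

The output is random, so repeated runs print different 11-character strings, each beginning with "ga".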

Text generation client. Write a client program TextGenerator that takes two command-line integers k and T, reads the input text from standard input, and builds a Markov model of order k from the input text. Then, starting with the k-gram consisting of the first k letters of the input text, it prints out T characters generated by simulating a trajectory through the corresponding Markov chain. You may assume that the text has length at least k, and also that T ≥ k.

% more input17.txt
gagggagaggcgagaaa

% java TextGenerator 2 11 < input17.txt 
gaggcgagaag

% java TextGenerator 2 11 < input17.txt 
gaaaaaaagag
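
The input-handling part of such a client might look like the sketch below. It uses only standard Java I/O, which is an assumption made here (the course's own input library may be more convenient), and InputStream.readAllBytes requires Java 9 or later. The class name ClientInputSketch and the method readAll are hypothetical, and the sketch stops short of building the model. Note that it keeps every character of the input, including any trailing newline.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical class name; it covers only argument parsing and input reading.
class ClientInputSketch {
    // Reads the entire stream as one String (Java 9+).
    static String readAll(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int k = Integer.parseInt(args[0]);   // order of the model
        int T = Integer.parseInt(args[1]);   // number of characters to print
        String text = readAll(System.in);
        String start = text.substring(0, k); // first k characters of the input
        // A full client would now build the model and print gen(start, T).
        System.out.println("k=" + k + ", T=" + T + ", first k-gram=\"" + start + "\"");
    }
}
```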

Experimentation. Once you get the program working, test it on different inputs of different sizes and different orders. Does increasing the order have the effect you expect? Try your model on something that you have written or some other text you know well. Make sure to test both long inputs (we provide several) and long outputs.

Files provided. We provide a number of sample test files this week. As well, we provide the readme.txt template, the abbreviated partner readme.txt template, and a copy of ST.java, which is not installed by default. Obtain the files either as markov.zip or through the ftp site's markov directory, following the instructions from previous weeks.

Deliverables. Submit MarkovModel.java, TextGenerator.java, and readme.txt. If working in a pair, one student should submit these, and the other should only submit the abbreviated partner readme.txt. Include in your readme.txt two of the most entertaining language-modeling fragments that you discover.

If you and your partner both do the extra credit separately, you are both allowed to upload .java files to facilitate testing.

Extra credit. Imagine you receive a message where some of the characters have been corrupted by noise. We represent unknown characters by the ~ symbol (we assume we don't use ~ in our messages). Add a method replaceUnknown to MarkovModel.java that decodes a noisy message by replacing each ~ with the most likely character given our order k Markov model, and conditional on the surrounding text:

String replaceUnknown(String corrupted)  // replace unknown characters with most probable characters
(Note: Ignore the API warning that you will get because this method is not part of the original API for MarkovModel. Do not ignore any other warnings.)

Assume unknown characters are at least k characters apart and also appear at least k characters away from the start and end of the message. Test your new method by writing a client program FixCorrupted.java that takes as arguments the model order and the noisy string. The program should print out the most likely original string:

Original    : it was the best of times, it was the worst of times.
Noisy       : it w~s th~ bes~ of tim~s, i~ was ~he wo~st of~times.

%  java FixCorrupted 4 "it w~s th~ bes~ of tim~s, i~ was ~he wo~st of~times." < wiki_100k.txt 
it was the best of times, it was the worst of times.

%  java FixCorrupted 2 "it w~s th~ bes~ of tim~s, i~ was ~he wo~st of~times." < wiki_100k.txt 
it was the best of times, is was the woust of times.
This maximum-likelihood approach doesn't always get it perfect, but it fixes most of the missing characters correctly.

Here are some details on what it means to find the most likely replacement for each ~. For each unknown character, you should consider all possible replacement characters. You want the replacement character that makes sense not only at the unknown position (given the previous characters) but also when the replacement is used in the context of the k subsequent known characters. For example, we expect the unknown character in "was ~he wo" to be 't' and not simply the most likely character in the context of "was ". You can compute the probability of each hypothesis by multiplying the probabilities of generating each of k+1 characters in sequence: the missing one, and the k next ones.
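
The scoring rule described above can be sketched as follows. This is a self-contained illustration rather than a drop-in addition to MarkovModel: the class name ReplaceUnknownSketch and the helpers prob and likelihood are hypothetical, the candidate characters are assumed to be exactly those that occur in the training text, and the k characters after each ~ are assumed to be known.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical class name, used only for illustration.
class ReplaceUnknownSketch {
    private final int k;
    private final Map<String, Integer> counts = new HashMap<>();  // k-grams and (k+1)-grams
    private final Set<Character> alphabet = new TreeSet<>();      // candidate replacements

    ReplaceUnknownSketch(String text, int k) {
        this.k = k;
        String circular = text + text.substring(0, k);
        for (int i = 0; i < text.length(); i++) {
            counts.merge(circular.substring(i, i + k), 1, Integer::sum);
            counts.merge(circular.substring(i, i + k + 1), 1, Integer::sum);
            alphabet.add(text.charAt(i));
        }
    }

    // Probability that c follows kgram under the model (0 if kgram is unseen).
    private double prob(String kgram, char c) {
        int total = counts.getOrDefault(kgram, 0);
        return total == 0 ? 0.0 : (double) counts.getOrDefault(kgram + c, 0) / total;
    }

    // Product of the probabilities of the k+1 characters s[pos..pos+k].
    private double likelihood(String s, int pos) {
        double p = 1.0;
        for (int i = pos; i <= pos + k; i++) {
            p *= prob(s.substring(i - k, i), s.charAt(i));
        }
        return p;
    }

    // Replaces each ~ with the candidate that maximizes the likelihood above.
    String replaceUnknown(String corrupted) {
        StringBuilder s = new StringBuilder(corrupted);
        for (int pos = 0; pos < s.length(); pos++) {
            if (s.charAt(pos) != '~') continue;
            char best = '~';
            double bestP = -1.0;
            for (char c : alphabet) {
                s.setCharAt(pos, c);
                double p = likelihood(s.toString(), pos);
                if (p > bestP) { bestP = p; best = c; }
            }
            s.setCharAt(pos, best);
        }
        return s.toString();
    }

    public static void main(String[] args) {
        ReplaceUnknownSketch model = new ReplaceUnknownSketch("gagggagaggcgagaaa", 2);
        System.out.println(model.replaceUnknown("ag~cg"));   // prints aggcg
    }
}
```

In the small example, 'a' is the most likely character after "ag" (probability 3/5), but "ga" is never followed by 'c' in the text, so that hypothesis has probability 0. The hypothesis 'g' has probability 2/5 × 1/3 × 1 = 2/15 and is chosen.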

This assignment was developed by Bob Sedgewick and Kevin Wayne, based on the classic idea of Claude Shannon.
Copyright © 2004.



Example 1 input: news item

Microsoft said Tuesday the company would comply with a preliminary ruling by Federal District Court Judge Ronald H. Whyte that Microsoft is no longer able to use the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software Developers Kit for Java.

"We remain confident that once all the facts are presented in the larger case, the court will find Microsoft to be in full compliance with its contract with Sun," stated Tom Burt, Associate General Counsel for Microsoft Corporation. "We are disappointed with this decision, but we will immediately comply with the Court's order."

Microsoft has been in the forefront of helping developers use the Java programming language to write cutting-edge applications. The company has committed significant resources so that Java developers have the option of taking advantage of Windows features when writing software using the Java language. Providing the best tools and programming options will continue to be Microsoft's goal.

"We will continue to listen to our customers and provide them the tools they need to write great software using the Java language," added Tod Nielsen, General Manager for Microsoft's Developer Relations Group/Platform Marketing.

Example 1 output: random news item, using input as an order 7 model

Microsoft said Tuesday the court will find Microsoft's goal.

"We will continue to listen to our customers and programming option of taking advantage of Windows features when writing software using the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software using the best tools a nd programming language. Providing the Java language. Providing the Java programming language to write great software Developers Kit for Java.

"We remain confident that once all the facts are presented in the forefront of helping developers have the option of taking advantage of Windows features when writing software Developers use the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software using the best tools a nd provide them the tools they need to write cutting-edge applications. The company would comply with this decision, but we will immediately comply with this decision, but we will immediately comply with a preliminary ruling by Federal District Court Judge Ronald H. Whyte that Microsoft is no longer able to use the Java language," added Tod Nielsen, General Manager for Microsoft's goal.




Example 2 input: Speech to class of 2001, excerpts [link to full text]

Welcome to Princeton. This may be your first Princeton lecture, but it's not a typical one. For one thing it's the only time you'll be in a class of size more than 1000! Also, lectures usually involve slides or vugraphs, or at least a blackboard. When Hal told me this lecture would be in this room and that no audio-visual aids would be possible, I realized the challenge: we've all been on vacation all summer, and now we have to deal in ideas, face-to-face. No slides. No movies. No organist. Not even any Internet access. Well, at least the experience ties in with the topic of this lecture, as you'll see.

Many of you have probably not done much academic work since you opened that thick envelope from Fred Hargadon. Right? The purpose of this lecture is to set your minds in motion, because you'll need them in gear at full speed when classes start on Thursday.

The topic that I've chosen for this purpose is the prospect of having all knowledge online, and its implications. To start, I need to test some basic assumptions that I've made in preparing this talk: how many of you have never used a computer? how many of you use electronic mail? how many of you have ever used the Internet? how many use it regularly? how many run companies that produce Web pages? OK. Well, it looks as though I don't have to describe the basic features of the net to most of you. I'm not going to assume much, anyway.

You can find a link to a web page for this lecture on my home page. If you've never been on the net, take this opportunity to get a friend to show it to you. Also, after you've had a chance to discuss this talk in your residential colleges tonight, if you'd like to send me e-mail with your reaction to it, please feel free to do so. I'll collect the mail that I get and put it on the web page.

SUMMARY OF BUSH ARTICLE

I'd like to begin with a brief summary of the article "As We May Think", which was written by Vannevar Bush in 1945. The article was written at the end of World War II. Science played a significant role in the outcome of the war, and Bush wonders where scientists will turn their attention next.

Example 2 output: random speech, using input as an order 7 model, excerpts [link to full text]

Welcome to life before you were born.

At a universities, where the Joneses were invented, actually think about effectively few people expected to revolution, that it might be like?

Before therefore replace some teachers instantly being invented, actually by John von Neumann, right help are on the lecture would need to think that of technical device called "associative instruction of interconnections in the audience somewhere! I was at Xerox PARC. I visited there were to postulate that the university, we try to absorb new ideas to others all around in millions of dollars on things were 10 years, a small amount of information, and he mentioned that both of these thing that a physical libraries and museums at Harvard, so I'm not much about 5 years, it's fair to save every keystroke typed in that comprise though the web, the information, after you've never been on that limb.

SUMMARY OF BUSH ARTICLE

No argument with it? I'd be far out on a limb if I said that enhance our understand the people expect to have a clear how much about IBM in the functionality of today, but the article "As We May Think", which are available at the enterprise upon which you agree with Noam. Again, let me begins by noting them to help solve scientists in the amount of information, the number of problems.

Still, Bush did hit the nail on the verge of breaking down. Why? First, he says, there are now found in T-shirts and sandals, drinking personal attention next.

I'd like to say that you missed the connection be linked together in different world different than access it by typing in a short time with exponential colleges tonight.




Example 3 input: As You Like It, excerpts [link to full text]

	[Enter DUKE SENIOR, AMIENS, and two or three Lords,
	like foresters]

DUKE SENIOR	Now, my co-mates and brothers in exile,
	Hath not old custom made this life more sweet
	Than that of painted pomp? Are not these woods
	More free from peril than the envious court?
	Here feel we but the penalty of Adam,
	The seasons' difference, as the icy fang
	And churlish chiding of the winter's wind,
	Which, when it bites and blows upon my body,
	Even till I shrink with cold, I smile and say
	'This is no flattery: these are counsellors
	That feelingly persuade me what I am.'
	Sweet are the uses of adversity,
	Which, like the toad, ugly and venomous,
	Wears yet a precious jewel in his head;
	And this our life exempt from public haunt
	Finds tongues in trees, books in the running brooks,
	Sermons in stones and good in every thing.
	I would not change it.

AMIENS	Happy is your grace,
	That can translate the stubbornness of fortune
	Into so quiet and so sweet a style.

DUKE SENIOR	Come, shall we go and kill us venison?
	And yet it irks me the poor dappled fools,
	Being native burghers of this desert city,
	Should in their own confines with forked heads
	Have their round haunches gored.

Example 3 output: random Shakespeare, using order 6 model, excerpts [link to full text]

DUKE SENIOR	Now, my co-mates and thus bolden'd, man, how now, monsieur Jaques,
	Unclaim'd of his absence, as the holly!
	Though in the slightest for the fashion of his absence, as the only wear.

TOUCHSTONE	I care not for meed!
	This I must woo yours: your request than your father: the time,
	That ever love I broke
	my sword upon some kind of men
	Then, heigh-ho! sing, heigh-ho! sing, heigh-ho! sing, heigh-ho! unto the needless stream;
	'Poor deer,' quoth he,
	'Call me not so keen,
	Because thou the creeping hours of the sun,
	As man's feasts and women merely players:
	Thus we may rest ourselves and neglect the cottage, pasture?

	[Exit]

	[Enter DUKE FREDERICK	Can in his time in my heartily,
	And have me go with your fortune
	In all this fruit
	Till than bear
	the arm's end: I will through
	Cleanse the uses of the way to look you.
	Know you not, master,
	Sighing like upon a stone another down his bravery is not so with his effigies with my food:
	To speak my mind, and inquisition
	And unregarded age in corners throat,
	He will come hither:
	He dies that hath engender'd:
	And you to
	the bed untreasured of the brutish sting it.