COS 126Markov Model of Natural Language |
Programming AssignmentDue: 11:59pm |

Use a Markov chain to create a statistical model of a piece of English text. Simulate the Markov chain to generate stylized pseudo-random text.

**Perspective.**
In the 1948 landmark paper
A Mathematical Theory of Communication,
Claude Shannon founded the field of information theory and
revolutionized the telecommunications industry, laying the groundwork for today's
Information Age.
In this paper, Shannon proposed using a *Markov chain* to create a statistical
model of the sequences of letters in a piece of English text.
Markov chains are now widely used in
speech recognition, handwriting recognition, information retrieval,
data compression, and spam filtering.
They also have many scientific computing applications including:
the genemark algorithm for gene prediction, the Metropolis
algorithm for measuring thermodynamical properties,
and Google's PageRank algorithm for Web search.
For this assignment, you will consider a whimsical variant: generating
stylized pseudo-random text.

**Markov model of natural language.**
Shannon approximated the statistical structure of a piece of text using
a simple mathematical model known as a *Markov model*.
A Markov model of *order 0* predicts that each letter in the
alphabet occurs with a fixed probability.
We can fit a Markov model of order 0 to a specific piece of text by
counting the number of occurrences of each letter in that text,
and using these counts as probabilities.
For example, if the input text is `"agggcagcgggcg"`, then
the Markov model of order 0 predicts that each letter is
`'a'` with probability 2/13,
`'c'` with probability 3/13,
and
`'g'` with probability 8/13. The following sequence of letters
is a typical example generated from this model.

An order 0 model assumes that each letter is chosen independently. This does not coincide with the statistical properties of English text since there is a high correlation among successive letters in an English word or sentence. For example, the lettersa g g c g a g g g a g c g g c a g g g g . . .

We obtain a more refined model by allowing the probability of
choosing each successive letter to depend on the preceding letter or letters.
An *Markov model of order k* predicts that each letter occurs with
a fixed probability, but that probability can depend on the previous
*k* consecutive letters (*k*-gram).
For example, if the text has 100 occurrences of `"th"`,
with 60 occurrences of `"the"`,
25 occurrences of `"thi"`,
10 occurrences of `"tha"`,
and 5 occurrences of `"tho"`, the Markov model of order 2 predicts
that the next letter following the 2-gram `"th"` is
`'e'` with probability 3/5,
`'i'` with probability 1/4,
`'a'` with probability 1/10, and
`'o'` with probability 1/20.

**A brute force solution.**
Shannon proposed an interesting scheme to generate text according to
a Markov model of order 1.

To construct [an order 1 model] for example, one opens a book at random and selects a letter at random on the page. This letter is recorded. The book is then opened to another page and one reads until this letter is encountered. The succeeding letter is then recorded. Turning to another page this second letter is searched for and the succeeding letter recorded, etc. It would be interesting if further approximations could be constructed, but the labor involved becomes enormous at the next stage.Your task is write an efficient Java program to automate this laborious task. Shannon's brute force approach yields a statistically correct result, but, as Shannon observed, it is prohibitively slow when the size of the input text N is large. An alternate approach is to create a "Markov chain" and simulate a trajectory through it. This approach is described below.

**Markov chain.**
For the purpose of this assignment, a *Markov chain* is comprised of
a set of states, one distinguished state called the start state,
and a set of transitions from one state to another.
The figure below illustrates a Markov chain with 5 states and 14 transitions.

*Simulating a Markov chain.*
To simulate a *trajectory* through the Markov chain, begin at the start state.
At each step select one of the leaving arcs uniformly at random, and move
to the neighboring state.
For example, if the Markov chain is in state `bab`, then it will
transition to state `abb` with probability 3/4 and to state
`aba` with probability 1/4.
The following are the first ten steps of a possible trajectory beginning
from `bbb`:

bbb -> bbb -> bba -> bab -> abb -> bbb -> bbb -> bba -> bab -> abb

*Markov chain data type.*
Create a data type `MarkovChain` to represent a Markov chain of strings.
In addition to a constructor, the data type must have three public methods.

`addTransition(v, w)`: add a transition from state`v`to state`w`.`next(v)`: pick a transition leaving state`v`uniformly at random, and return the resulting state.`toString()`: return a string representation of this Markov chain.

**Text generation.**
We now describe how to generate pseudo-random text according to a Markov model
of order *k* by building a Markov chain from the piece of text,
and simulating a trajectory through the Markov chain.
Include a state for each of the distinct *k*-grams that occur in the text.
Also include *N* transitions, one from each *k*-gram in the text
to the next (overlapping) *k*-gram. treat the text as a cyclic string
to handle the boundary cases.
For example, if *k* = 3 and the text consists of
`bbbabbabbbbaba`, then the first two transitions to add
are from `bbb` to `bba` and from `bba` to `bab`,
and the last (cyclic) one is from `abb` to `bbb`.
The full Markov chain for k = 3 is illustrated in the figure from the previous
section.

Implement a client program `TextGenerator.java` that takes
two command line inputs *k* and *M*,
reads the text from standard input, builds the Markov chain associated
with the order *k* Markov model, and prints out *M*
pseudo-random characters according to the model.
Build the Markov chain as described above.
Each step of the Markov chain generates one new letter in the output.
Let `s` be the string corresponding to the first *k*
characters of the original text.
Print out `s` and start the Markov chain at state `s`.
Simulate the Markov chain for *M - k* steps, printing out
the last letter of each resulting state.

**Amusement.**
Once you get the program working, you will find that the
random text generated from low-order models starts to sound more and more
like the original text as you increase the order, as illustrated in
the examples below. There are many more apparent pearls of wisdom in
the full texts on the Web.
As you can see, there are limitless opportunities for amusement here.
Try your model on some of your own text, or find something interesting
on the net. You need not limit yourself to the English language.

**Deliverables.**
Submit `MarkovChain.java`, `AdjList.java`
and `TextGenerator.java`, along with
two of the most entertaining language-modeling fragments that you discover.

This assignment was developed by Bob Sedgewick and Kevin Wayne, based on the classic idea of Claude Shannon.

Copyright © 2004.

"We remain confident that once all the facts are presented in the larger case, the court will find Microsoft to be in full compliance with its contract with Sun," stated Tom Burt, Associate General Counsel for Microsoft Corporation. "We are disappointed with this decision, but we will immediately comply with the Court's order."

Microsoft has been in the forefront of helping developers use the Java programming language to write cutting-edge applications. The company has committed significant resources so that Java developers have the option of taking advantage of Windows features when writing software using the Java language. Providing the best tools and programming options will continue to be Microsoft's goal.

"We will continue to listen to our customers and provide them the tools they need to write great software using the Java language," added Tod Nielsen, General Manager for Microsoft's Developer Relations Group/Platform Marketing.

"We will continue to listen to our customers and programming option of taking advantage of Windows features when writing software using the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software using the best tools a nd programming language. Providing the Java language. Providing the Java programming language to write great software Developers Kit for Java.

"We remain confident that once all the facts are presented in the forefront of helping developers have the option of taking advantage of Windows features when writing software Developers use the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software using the best tools a nd provide them the tools they need to write cutting-edge applications. The company would comply with this decision, but we will immediately comply with this decision, but we will immediately comply with a preliminary ruling by Federal District Court Judge Ronald H. Whyte that Microsoft is no longer able to use the Java language," added Tod Nielsen, General Manager for Microsoft's goal.

Many of you have probably not done much academic work since you opened that thick envelope from Fred Hargadon. Right? The purpose of this lecture is to set your minds in motion, because you'll need them in gear at full speed when classes start on Thursday.

The topic that I've chosen for this purpose is the prospect of having all knowledge online, and its implications. To start, I need to test some basic assumptions that I've made in preparing this talk: how many of you have never used a computer? how many of you use electronic mail? how many of you have ever used the Internet? how many use it regularly? how many run companies that produce Web pages? OK. Well, it looks as though I don't have to describe the basic features of the net to most of you. I'm not going to assume much, anyway.

You can find a link to a web page for this lecture on my home page. If you've never been on the net, take this opportunity to get a friend to show it to you. Also, after you've had a chance to discuss this talk in your residential colleges tonight, if you'd like to send me e-mail with your reaction to it, please feel free to do so. I'll collect the mail that I get and put it on the web page.

SUMMARY OF BUSH ARTICLE

I'd like to begin with a brief summary of the article "As We May Think", which was written by Vannevar Bush in 1945. The article was written at the end of World War II. Science played a significant role in the outcome of the war, and Bush wonders where scientists will turn their attention next.

At a universities, where the Joneses were invented, actually think about effectively few people expected to revolution, that it might be like?

Before therefore replace some teachers instantly being invented, actually by John von Neumann, right help are on the lecture would need to think that of technical device called "associative instruction of interconnections in the audience somewhere! I was at Xerox PARC. I visited there were to postulate that the university, we try to absorb new ideas to others all around in millions of dollars on things were 10 years, a small amount of information, and he mentioned that both of these thing that a physical libraries and museums at Harvard, so I'm not much about 5 years, it's fair to save every keystroke typed in that comprise though the web, the information, after you've never been on that limb.

SUMMARY OF BUSH ARTICLE

No argument with it? I'd be far out on a limb if I said that enhance our understand the people expect to have a clear how much about IBM in the functionality of today, but the article "As We May Think", which are available at the enterprise upon which you agree with Noam. Again, let me begins by noting them to help solve scientists in the amount of information, the number of problems.

Still, Bush did hit the nail on the verge of breaking down. Why? First, he says, there are now found in T-shirts and sandals, drinking personal attention next.

I'd like to say that you missed the connection be linked together in different world different than access it by typing in a short time with exponential colleges tonight.

[Enter DUKE SENIOR, AMIENS, and two or three Lords, like foresters] DUKE SENIOR Now, my co-mates and brothers in exile, Hath not old custom made this life more sweet Than that of painted pomp? Are not these woods More free from peril than the envious court? Here feel we but the penalty of Adam, The seasons' difference, as the icy fang And churlish chiding of the winter's wind, Which, when it bites and blows upon my body, Even till I shrink with cold, I smile and say 'This is no flattery: these are counsellors That feelingly persuade me what I am.' Sweet are the uses of adversity, Which, like the toad, ugly and venomous, Wears yet a precious jewel in his head; And this our life exempt from public haunt Finds tongues in trees, books in the running brooks, Sermons in stones and good in every thing. I would not change it. AMIENS Happy is your grace, That can translate the stubbornness of fortune Into so quiet and so sweet a style. DUKE SENIOR Come, shall we go and kill us venison? And yet it irks me the poor dappled fools, Being native burghers of this desert city, Should in their own confines with forked heads Have their round haunches gored.

DUKE SENIOR Now, my co-mates and thus bolden'd, man, how now, monsieur Jaques, Unclaim'd of his absence, as the holly! Though in the slightest for the fashion of his absence, as the only wear. TOUCHSTONE I care not for meed! This I must woo yours: your request than your father: the time, That ever love I broke my sword upon some kind of men Then, heigh-ho! sing, heigh-ho! sing, heigh-ho! sing, heigh-ho! unto the needless stream; 'Poor deer,' quoth he, 'Call me not so keen, Because thou the creeping hours of the sun, As man's feasts and women merely players: Thus we may rest ourselves and neglect the cottage, pasture? [Exit] [Enter DUKE FREDERICK Can in his time in my heartily, And have me go with your fortune In all this fruit Till than bear the arm's end: I will through Cleanse the uses of the way to look you. Know you not, master, Sighing like upon a stone another down his bravery is not so with his effigies with my food: To speak my mind, and inquisition And unregarded age in corners throat, He will come hither: He dies that hath engender'd: And you to the bed untreasured of the brutish sting it.