COS 226 Programming Assignment

Language modeling

Design and implement functions that build an order K Markov model from a piece of input text. Markov models are popular tools in speech recognition, handwriting recognition, information retrieval, and data compression. For this assignment, you will consider a different application: generating stylized pseudo-random text.

Markov models.  An order 0 Markov model predicts that each character in the alphabet occurs with a fixed probability, independent of previous characters. Given an input text, you compute the Markov model of order 0 by counting up the number of occurrences of each letter in the input, and use these as the frequencies. For example, if the input text is "agggcagcgggcg", then the order 0 Markov model predicts that each character is a with probability 2/13, c with probability 3/13, and g with probability 8/13.

Characters in English text are not independent. If the previous 5 characters in a string are orith, then the next letter is almost surely m. An order K Markov model predicts that each character occurs with a fixed probability, but that probability can depend on the previous k characters. Given an input text, you compute a Markov model of order K by counting up the number of occurrences of each letter that follows each sequence of K letters. For example, if the text has 100 occurrences of th, with 50 occurrences of the, 25 occurrences of thi, 20 occurrences of tha, and 5 occurrences of tho, the order 2 Markov model predicts that the next character following th is e with probability 1/2, i with probability 1/4, a with probability 1/5, and o with probability 1/20.

Part 0: Markov model. Create a data type Markov to represent a K-character substring. Ultimately, it will have a method random that returns a random character according to the Markov model. For now, just make it store the substring and an integer that counts the number of times the substring appears. You will need a constructor, a method to increment the frequency count, and the usual toString method for output.

public Markov(String substring) 
public void add()
public String toString()
Part 1: frequency counts. Implement a client program FrequencyCounter.java that reads the order parameter K of the Markov model from the command-line, a text string from standard input, and uses a symbol table to insert each K-character substring (key) from the text. For example, if K is 2 and the input string is "agggcagcgggcg", then your client program should create Markov objects for each of the 5 distinct keys, and call the add method 12 times in total: ag gg gg gc ca ag gc cg gg gg gc cg. Maintain an integer count of the number of occurrences of each key. Use your symbol table's methods to print out the number of distinct keys and the number of times each key appears in the text. For the example above, your program should output:
5 distinct keys
2 ag
1 ca
2 cg
3 gc
4 gg

Part 2: language model. To generate random text, given a K-character key, you data type Markov must know all of the letters that follow the K-character key. This operation is at the crux of the matter, as you will need it to generate random characters in accordance with the Markov model. Modify your Markov data type so that in addition to the frequency counts, it records the breakdown depending on the next letter. Create your own linked list to keep track of the list of suffix characters along with their frequencies. Modify the toString method so that it it prints out the list of suffixes, along with the substring and the frequency count. Include the following method to insert a suffix character.

public void add(char c)

Modify FrequencyCounter.java to create a program LanguageModeler.java that inserts keys into the symbol table (if necessary), and calls add(char c) to add the appropriate suffix characters to the Markov model. It should produce the following output on the example input.

5 distinct keys
2 ag: 1 c 1 g
1 ca: 1 g
1 cg: 1 g
3 gc: 1 a 2 g
4 gg: 2 c 2 g
To make things look pretty, convert all whitespace characters into spaces or underscores. Note that since the last cg substring doesn't have a "next" character, we don't include it in the model.

Part 3: pseudo-random text generation. Add a method random to Markov that and returns a pseudo-random character according to the language model. Be sure to get the probabilities right, as we will be checking this.

Now, write a program TextGenerator.java that takes a command line input K, a command line input M, reads in a text string from standard input, and prints out M characters according to the order K Markov model. Start by printing the first K characters of the original text. Then, repeatedly generate successive pseudo-random characters. Using the example above, if the Markov object m represents the substring "gg", then m.random() should return c or g, each with probability 1/2. After you generate a character, move over one character position, always using the last K characters generated to determine the probabilities for the next. For example, if your program chooses c in the example above, then the next Markov object would represent the substring "gc", and according to the Markov model, the next character should be a with probability 1/3 and g with probability 2/3. Continue the process until you have output M characters. If the language model contains less than 100 K-tuples (prefixes), then print the language model (as in the program LanguageModel.java) before you output the M randomly generated characters.

Deliverables. Submit your code as usual, along with one or two of the most amusing language-modeling examples that you come up with. Describe your data structure and algorithms in detail, and analyze the most important performance characteristics of your program in the readme.txt file.

Amusement. Once you get the program working, you will find that the random text with low-order models starts to sound more and more like the original text as you increase the order, as illustrated in the examples below. There are many more apparent pearls of wisdom in the full texts on the web. As you can see, there are limitless opportunities for amusement here. Try your model on some of your own text, or find something interesting on the net. You may wish to write small filters to clean up white space for pure text, but be careful, as the special characters sometimes drive the model, as illustrated in the Shakespeare example.

This assignment was developed by Bob Sedgewick and Kevin Wayne, and is based on a classic idea of Claude Shannon from the 1940s.
Copyright © 2004.

Example 1 input: news item

Microsoft said Tuesday the company would comply with a preliminary ruling by Federal District Court Judge Ronald H. Whyte that Microsoft is no longer able to use the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software Developers Kit for Java.

"We remain confident that once all the facts are presented in the larger case, the court will find Microsoft to be in full compliance with its contract with Sun," stated Tom Burt, Associate General Counsel for Microsoft Corporation. "We are disappointed with this decision, but we will immediately comply with the Court's order."

Microsoft has been in the forefront of helping developers use the Java programming language to write cutting-edge applications. The company has committed significant resources so that Java developers have the option of taking advantage of Windows features when writing software using the Java language. Providing the best tools and programming options will continue to be Microsoft's goal.

"We will continue to listen to our customers and provide them the tools they need to write great software using the Java language," added Tod Nielsen, General Manager for Microsoft's Developer Relations Group/Platform Marketing.

Example 1 output: random news item, using input as an order 7 model

Microsoft said Tuesday the court will find Microsoft's goal.

"We will continue to listen to our customers and programming option of taking advantage of Windows features when writing software using the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software using the best tools a nd programming language. Providing the Java language. Providing the Java programming language to write great software Developers Kit for Java.

"We remain confident that once all the facts are presented in the forefront of h elping developers have the option of taking advantage of Windows features when writing software Developers use the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software using the best tools a nd provide them the tools they need to write cutting-edge applications. The company would comply with this decision, but we will immediately comply with this decision, but we will immediately comply with a preliminary ruling by Federal District Court Judge Ronald H. Whyte that Microsoft is no longer able to use the Java language," added Tod Nielsen, General Manager for Microsoft's goal.




Example 2 input: Speech to class of 2001, excerpts [link to full text]

Welcome to Princeton. This may be your first Princeton lecture, but it's not a typical one. For one thing it's the only time you'll be in a class of size more than 1000! Also, lectures usually involve slides or vugraphs, or at least a blackboard. When Hal told me this lecture would be in this room and that no audio-visual aids would be possible, I realized the challenge: we've all been on vacation all summer, and now we have to deal in ideas, face-to-face. No slides. No movies. No organist. Not even any Internet access. Well, at least the experience ties in with the topic of this lecture, as you'll see.

Many of you have probably not done much academic work since you opened that thick envelope from Fred Hargadon. Right? The purpose of this lecture is to set your minds in motion, because you'll need them in gear at full speed when classes start on Thursday.

The topic that I've chosen for this purpose is the prospect of having all knowledge online, and its implications. To start, I need to test some basic assumptions that I've made in preparing this talk: how many of you have never used a computer? how many of you use electronic mail? how many of you have ever used the Internet? how many use it regularly? how many run companies that produce Web pages? OK. Well, it looks as though I don't have to describe the basic features of the net to most of you. I'm not going to assume much, anyway.

You can find a link to a web page for this lecture on my home page. If you've never been on the net, take this opportunity to get a friend to show it to you. Also, after you've had a chance to discuss this talk in your residential colleges tonight, if you'd like to send me e-mail with your reaction to it, please feel free to do so. I'll collect the mail that I get and put it on the web page.

SUMMARY OF BUSH ARTICLE

I'd like to begin with a brief summary of the article "As We May Think", which was written by Vannevar Bush in 1945. The article was written at the end of World War II. Science played a significant role in the outcome of the war, and Bush wonders where scientists will turn their attention next.

Example 2 output: random speech, using input as an order 7 model, excerpts [link to full text]

Welcome to life before you were born.

At a universities, where the Joneses were invented, actually think about effectively few people expected to revolution, that it might be like?

Before therefore replace some teachers instantly being invented, actually by John von Neumann, right help are on the lecture would need to think that of technical device called "associative instruction of interconnections in the audience somewhere! I was at Xerox PARC. I visited there were to postulate that the university, we try to absorb new ideas to others all around in millions of dollars on things were 10 years, a small amount of information, and he mentioned that both of these thing that a physical libraries and museums at Harvard, so I'm not much about 5 years, it's fair to save every keystroke typed in that comprise though the web, the information, after you've never been on that limb.

SUMMARY OF BUSH ARTICLE

No argument with it? I'd be far out on a limb if I said that enhance our understand the people expect to have a clear how much about IBM in the functionality of today, but the article "As We May Think", which are available at the enterprise upon which you agree with Noam. Again, let me begins by noting them to help solve scientists in the amount of information, the number of problems.

Still, Bush did hit the nail on the verge of breaking down. Why? First, he says, there are now found in T-shirts and sandals, drinking personal attention next.

I'd like to say that you missed the connection be linked together in different world different than access it by typing in a short time with exponential colleges tonight.




Example 3 input: As you Like It, excerpts [link to full text]

	[Enter DUKE SENIOR, AMIENS, and two or three Lords,
	like foresters]

DUKE SENIOR	Now, my co-mates and brothers in exile,
	Hath not old custom made this life more sweet
	Than that of painted pomp? Are not these woods
	More free from peril than the envious court?
	Here feel we but the penalty of Adam,
	The seasons' difference, as the icy fang
	And churlish chiding of the winter's wind,
	Which, when it bites and blows upon my body,
	Even till I shrink with cold, I smile and say
	'This is no flattery: these are counsellors
	That feelingly persuade me what I am.'
	Sweet are the uses of adversity,
	Which, like the toad, ugly and venomous,
	Wears yet a precious jewel in his head;
	And this our life exempt from public haunt
	Finds tongues in trees, books in the running brooks,
	Sermons in stones and good in every thing.
	I would not change it.

AMIENS	Happy is your grace,
	That can translate the stubbornness of fortune
	Into so quiet and so sweet a style.

DUKE SENIOR	Come, shall we go and kill us venison?
	And yet it irks me the poor dappled fools,
	Being native burghers of this desert city,
	Should in their own confines with forked heads
	Have their round haunches gored.

Example 3 output: random Shakespeare, using order 6 model, excerpts [link to full text]

DUKE SENIOR	Now, my co-mates and thus bolden'd, man, how now, monsieur Jaques,
	Unclaim'd of his absence, as the holly!
	Though in the slightest for the fashion of his absence, as the only wear.

TOUCHSTONE	I care not for meed!
	This I must woo yours: your request than your father: the time,
	That ever love I broke
	my sword upon some kind of men
	Then, heigh-ho! sing, heigh-ho! sing, heigh-ho! sing, heigh-ho! unto the needless stream;
	'Poor deer,' quoth he,
	'Call me not so keen,
	Because thou the creeping hours of the sun,
	As man's feasts and women merely players:
	Thus we may rest ourselves and neglect the cottage, pasture?

	[Exit]

	[Enter DUKE FREDERICK	Can in his time in my heartily,
	And have me go with your fortune
	In all this fruit
	Till than bear
	the arm's end: I will through
	Cleanse the uses of the way to look you.
	Know you not, master,
	Sighing like upon a stone another down his bravery is not so with his effigies with my food:
	To speak my mind, and inquisition
	And unregarded age in corners throat,
	He will come hither:
	He dies that hath engender'd:
	And you to
	the bed untreasured of the brutish sting it.