COS 226, Spring 2001. Program 6

COS 226 Programming Assignment 6 - Checklist

Checklist 6: Language Modeling.

The following is a checklist which provides a summary of the assignment. This is meant only as a supplement; please read the original assignment description.

Frequently Asked Questions

Requirements

Possible Progress Steps

Sample Input

Performance

Frequently Asked Questions:
If you feel there is good reason to be storing the entire input string for your program, you are free to do so. Most methods need only the most recent k characters, so explain why you need to save the rest in your readme).

For this assignment, it is helpful to be able to generate different random trials, by changing the seed used for the random number generator.
One way to do so is to use random(), and thus have the following two lines at the top:

#include <sys/types.h> #include <sys/timeb.h>

and the following two lines in the main routine:

long seed=time(NULL); srandom(seed);

Requirements:

Functionality:

What is most important is that your program correctly generates each output character based on the correct probability, as described in the original assignment.
Submit only your final, most efficient, program. (all of the other progress steps in the assignment are to be considered only as suggested steps of progress)

In particular, please note the original comments on the use of space in this assignment.

Input Format: We will test your program by typing a.out k < inputfile on a unix system, where command line argument k is the order of the language model. The alphabet for the inputfile may contain all possible characters. (Note: this is in contrast to the input files we used for the previous assignment which only contained characters and spaces).

Output Format: The output for your program should be text which is generated by your program.

In the original assignment, it mentions that the length of your output should be the same number of characters as appeared in the input. However, since we are using extremely large inputfiles, feel free to stop the output at any reasonable length (say 5000 characters?)

There is a possiblity that your random process will at some point generate a match for the final k characters of the inputfile. If that string does not appear anywhere else in the input, then this will force a dead end to the random generation. If this causes your output to be too short, rerun to get a better sample output to turn in (or put in code to protect against this case).

readme: Your readme file for this program should contain the following:

A high-level description for the design of your algorithm, and what influenced your decisions.

A (brief) discussion of your performance characteristics, especially that of space usage

Submit one or two of the most amusing examples that your program generated.

Possible Progress Steps: These are purely suggestions for how you might make progress. You do not have to follow these steps.

Do an array-based implementation where each k+1-letter combination has an entry which counts the number of occurences in the text, and use this to generate your language model.

Try to signficantly reduce the space usage by either modifying your original method or completely redesigning a new data structure.

Make sure to experiment as the value of k grows.

Sample Input Here are some sample input files which you may use, although you are free to find some other interesting and fun examples. (Note: unlike assignment 5, these files contain original spacing, punctuation and capitalization).

Inputfile	Source	N
princeton.txt	A Packet article about Princeotn	7959
aesopshort.txt	collection of Aesop's fables	10280
moby1.txt	Moby Dick - Chapter 1	12218
amendments.txt	Constitutional Amendments	18369
y2kintro.txt	Introduction of the recent Senate report on Y2K	21224
baby.txt	How baby's learn language	22200
manifesto.txt	Communist Party Manifesto	72955
muchado.txt	Much Ado about Nothing	123413
aesop.txt	collection of Aesop's fables	191945
starr.txt	The Starr Report narrative	234378	(warning: explicit language)
lilwomen.txt	Little Women	1042048
mobydick.txt	Moby Dick	1191463

[Note: all of these files are also located in ~/cs226/prog6_files/ on the phoenix machines, so there is no need to download them if you are running on those machines.]

Performance Take these numbers with a grain of salt. The following table lists the user CPU time (the first field if you use the 'time' command), when Michael Goldwasser's program program was run in 1999 on flagstaff, generating 3000 characters of output. Take these numbers with a grain of salt. If you easily beat them, they indicate the fast pace of advances in technology; if you don't, you may have a poor underlying algorithm!

Inputfile	k=3	k=7	k=12	k=20
princeton.txt	0.02	0.10	0.17	0.27
aesopshort.txt	0.05	0.15	0.22	0.34
moby1.txt	0.04	0.16	0.26	0.42
amendments.txt	0.05	0.18	0.29	0.50
y2kintro.txt	0.08	0.24	0.42	0.70
baby.txt	0.08	0.25	0.39	0.70
manifesto.txt	0.17	0.67	1.20	2.14
muchado.txt	0.28	1.18	2.20	3.70
aesop.txt	0.38	1.70	3.40	5.89
starr.txt	0.41	1.85	3.60	6.60
lilwomen.txt	1.86	9.45	18.75	31.03
mobydick.txt	2.38	11.69	22.04	37.34

cos226 Class Page
rs@cs.princeton.edu

Last modified: April 2, 2001