COS 226 Programming Assignment 6 - Checklist


Checklist 6: Language Modeling.

The following is a checklist which provides a summary of the assignment. This is meant only as a supplement; please read the original assignment description.

  • Frequently Asked Questions
  • Requirements
  • Possible Progress Steps
  • Sample Input
  • Performance

  • Frequently Asked Questions:
    If you feel there is good reason to be storing the entire input string for your program, you are free to do so. Most methods need only the most recent k characters, so explain why you need to save the rest in your readme).

    For this assignment, it is helpful to be able to generate different random trials, by changing the seed used for the random number generator.
    One way to do so is to use random(), and thus have the following two lines at the top:

    #include <sys/types.h>
    #include <sys/timeb.h>
    and the following two lines in the main routine:
    long seed=time(NULL);
    srandom(seed);

    Requirements:
  • Functionality:
  • What is most important is that your program correctly generates each output character based on the correct probability, as described in the original assignment.
    Submit only your final, most efficient, program. (all of the other progress steps in the assignment are to be considered only as suggested steps of progress)
  • In particular, please note the original comments on the use of space in this assignment.
  • Input Format: We will test your program by typing a.out k < inputfile on a unix system, where command line argument k is the order of the language model. The alphabet for the inputfile may contain all possible characters. (Note: this is in contrast to the input files we used for the previous assignment which only contained characters and spaces).
  • Output Format: The output for your program should be text which is generated by your program.
  • In the original assignment, it mentions that the length of your output should be the same number of characters as appeared in the input. However, since we are using extremely large inputfiles, feel free to stop the output at any reasonable length (say 5000 characters?)
  • There is a possiblity that your random process will at some point generate a match for the final k characters of the inputfile. If that string does not appear anywhere else in the input, then this will force a dead end to the random generation. If this causes your output to be too short, rerun to get a better sample output to turn in (or put in code to protect against this case).
  • readme: Your readme file for this program should contain the following:
  • A high-level description for the design of your algorithm, and what influenced your decisions.
  • A (brief) discussion of your performance characteristics, especially that of space usage
  • Submit one or two of the most amusing examples that your program generated.

  • Possible Progress Steps: These are purely suggestions for how you might make progress. You do not have to follow these steps.
  • Do an array-based implementation where each k+1-letter combination has an entry which counts the number of occurences in the text, and use this to generate your language model.
  • Try to signficantly reduce the space usage by either modifying your original method or completely redesigning a new data structure.
  • Make sure to experiment as the value of k grows.

  • Sample Input Here are some sample input files which you may use, although you are free to find some other interesting and fun examples. (Note: unlike assignment 5, these files contain original spacing, punctuation and capitalization).
    Inputfile Source N
    princeton.txt A Packet article about Princeotn 7959
    aesopshort.txt collection of Aesop's fables 10280
    moby1.txt Moby Dick - Chapter 1 12218
    amendments.txt Constitutional Amendments 18369
    y2kintro.txt Introduction of the recent Senate report on Y2K 21224
    baby.txt How baby's learn language 22200
    manifesto.txt Communist Party Manifesto 72955
    muchado.txt Much Ado about Nothing 123413
    aesop.txt collection of Aesop's fables 191945
    starr.txt The Starr Report narrative 234378 (warning: explicit language)
    lilwomen.txt Little Women 1042048
    mobydick.txt Moby Dick 1191463
    [Note: all of these files are also located in ~/cs226/prog6_files/ on the phoenix machines, so there is no need to download them if you are running on those machines.]

    Performance Take these numbers with a grain of salt. The following table lists the user CPU time (the first field if you use the 'time' command), when Michael Goldwasser's program program was run in 1999 on flagstaff, generating 3000 characters of output. Take these numbers with a grain of salt. If you easily beat them, they indicate the fast pace of advances in technology; if you don't, you may have a poor underlying algorithm!
    Inputfile k=3 k=7 k=12 k=20
    princeton.txt 0.02 0.10 0.17 0.27
    aesopshort.txt 0.05 0.15 0.22 0.34
    moby1.txt 0.04 0.16 0.26 0.42
    amendments.txt 0.05 0.18 0.29 0.50
    y2kintro.txt 0.08 0.24 0.42 0.70
    baby.txt 0.08 0.25 0.39 0.70
    manifesto.txt 0.17 0.67 1.20 2.14
    muchado.txt 0.28 1.18 2.20 3.70
    aesop.txt 0.38 1.70 3.40 5.89
    starr.txt 0.41 1.85 3.60 6.60
    lilwomen.txt 1.86 9.45 18.75 31.03
    mobydick.txt 2.38 11.69 22.04 37.34


    cos226 Class Page
    rs@cs.princeton.edu
    Last modified: April 2, 2001