COS 226 Programming Assignment Checklist: Language Modeling

Frequently Asked Questions

How do I parse the command-line inputs? The easiest solution is something like

int main(int argc, char *argv[]) {
    int K, M;
    if (argc != 3) exit(EXIT_FAILURE);    /* means that the command line did not contain all required argumnents. */
    K = atoi(argv[1]);                    /* Converts first command line argument (a string) to an integer */
    M = atoi(argv[2]);                    /* Converts second command line argu,ent to an integer */
As usual, our emphasis in 226 is not on bulletproof input validation.

What is the alphabet size? The inputs consist of ASCII text, not just lowercase letters and whitespace. Do not assume it is 26 or 96 or some other arbitrary number.

My random process dead ends because it generates a match for for the final k characters of the input file, but this string does not appear anywhere else in the input. That's OK. If you prefer, you can repeatedly restart your code from the beginning until you get the right number of characters.

How do I randomly choose from a set of symbols with associated probabilities? This is best explained using an example. Assume you want to choose "A" with probability 0.5, "B" with probability 0.25, "C" with probability 0.1 and "D" with probability 0.15. The rand() function (defined in <stdlib.h>) returns a random integer uniformly distributed in the range 0..RAND_MAX-1 (RAND_MAX is also defined in <stdlib.h>). The following hardwired piece of code does the trick for this example in a simple (but suboptimal) way:

r = rand();
if (r < RAND_MAX * 0.5) symbol = 'A';
  else if (r < RAND_MAX*(0.5+0.25)) symbol = 'B';
    else if (r < RAND_MAX*(0.5+0.25+0.1)) symbol = 'C';
      else symbol = 'D';
Of course, in the general case, when the number of symbols and their associated probabilities are arbitrary, something more general should be written... (Note that the solution described in the above example is suboptimal. It takes O(N) time to pick a symbol among N symbols. Can you do this in O(log(N)) time? Save this optimization for last, when everything else is working, for extra credit).

Should my program generate a different output each time I run it? Yes. You can do this by setting the random number generator seed according to the system clock. Include <stdlib.h> and <time.h> and the following line (once only) at the beginning of the main routine:

unsigned int seed = time(NULL);
srand(seed);
Note that you might want to disable this feature when debugging.

Do I need to print the distinct keys or the frequency lists in lexicographic order like in the reference solutions? No, you are free to print them out in whatever order is most convenient. However, if you want to compare your solutions with our reference solutions, the lexicographic ordering might make things easier.

Input, Output, and Testing

Input and output. Here are a number of sample input files.

Our program ignores the last character read in because this is always a newline character in our sample files. This helps prevent the dead-ending on some small test files.

Reference solutions.   For reference, we provide executable code for our solution in Windows, Solaris, and OS X. We encourage you to develop a more efficient program! The executables can be found in the ftp site. It is certainly possible that you may be able to beat our solution, and it is also possible that you will get most/all of the credit even if your method is not quite as efficient.

Submission and readme

Submit the three clients described in the assignment page, freqcount.c, langmod.c and textgen.c, together with any additional .c or .h files you need in your project. Also submit the readme.txt file, using the skeleton found in the ftp site. We will compile your program with "gcc226" 3 times, each time with one of the clients and the rest of the .c files. You should practice sound design principals.

Here is the template readme.txt file. you are required to use it. In this readme.txt template file, the term "program" refers to the third client (textgen). It should contain the following information:

  • A high-level description for the design of your algorithm, and what influenced your decisions.
  • An analysis of your performance characteristics, especially that of space usage.
  • An argument why your code generates each output with the correct probability.
  • Submit one or two of the most amusing excerpts that your program generated.
  • Unix users may find the shell script timeit.csh useful for computing a table of CPU times.

    Getting started

  • Download the directories model and textfiles These directories are mirrored on arizona at /u/cs226/pub/. They contain reference solutions and sample input files.

  • Do the brute-force hashing solution, where each K+1-letter combination has an array entry recording the number of times it occurs in the text.

  • Try to significantly reduce the space usage by either modifying your original method or completely redesigning a new data structure. Many of the techniques you have learned in this class can lead to efficient solutions (trees, tries, hashing, sorting).

  • Make sure to experiment for small and large values of K. Note that the order 0 model computes the frequency counts of each individual letter in the input text. A large value of K would be something like 20.

  • COS 226 Home Page
    rs@cs.princeton.edu
    Last modified: March 10, 2002