COS 226 Programming Assignment Checklist: Burrows-Wheeler Data Compression

Frequently Asked Questions

What files should I submit? Submit only bwte.c, bwtd.c, mtfe.c and mtfd.c. Do not submit other files or use other filenames. Note that we originally misnamed the reference solutions bwe226 instead of bwte226.

My Burrows-Wheeler decoder works on small files but not on aesop.txt. Any ideas? Did you read in the index with scanf("%d\n", &first);. Despite the obvious intent of reading in an integer and ignoring all whitespace up to and including the first newline character, this is not how scanf works. In fact, it eats up all whitespace up until the first non-whitespace character (see King p. 493). Thus, it may delete whitespace characters beyond the newline character and interfere with decoding if the first sorted suffix ends with a whitespace character as in aesop.txt. Here are two fixes.

scanf("%d", &first);          fgets(buffer, sizeof(buffer), stdin);
getchar();                    sscanf(buffer, "%d", &first);

How should I determine the size of the compressed file? On Unix, compare "wc -c input.txt" with "bwte < input.txt | mtfe | hufe | wc -c". On Windows, type "bwte < input.txt | mtfe | hufe > compressed.txt" and use the File Manager to inspect the file sizes.

The system seems to include an extra carriage return and/or newline character at the end of my input. Is this a problem? No, this is normal, but whether you get a trailing "\r\n" or "\n" depends on your operating system. Feel free to chop them off for debugging, but leave them in for your final submission.

My Burrows-Wheeler encoder gives a different output than the assignment for the file abra.txt. On Unix, you should get the following output

bwte < abra.txt
3
ard
rcaaaabb

The reason for the difference is that Unix puts a '\n' at the end of every file. We used the string "abracadabra" in the assignment as an example instead of "abracadabra\n".

On my system getchar() and scanf() return EOF when they read in character number 26. Is this a problem? It seems to be a bug with your system (I experienced the same annoyance with Windows XP / lcc). I don't know of a good fix. For now, debug your move-to-front algorithms with integers, change it to char before you submit, and expect it to work on other systems.

Should I use char, signed char or unsigned char? As long as you use them consistently for the encoder and decoder, it shouldn't matter. Nevertheless, we strongly recommend unsigned char since you may want to use the characters as indices into an array. Do not assume char is an unsigned type since it depends on the operating system and compiler. However, when reading in a character with the idiom while((c = getchar()) != EOF), be sure that c is of type int.

I'm curious. What compression algorithm is used in PKZIP? In gzip? In bzip2? PKZIP uses LZW compression followed by the Shannon-Fano trees algorithm (an entropy encoder similar to Huffman). The Unix utility gzip combines a variation of LZ77 (similar to LZW) and Huffman coding. The program bzip2 combines the Burrows-Wheeler transform and Huffman coding.

Input, Output, and Testing

Input. Here are some sample input files. To fully test your program, you should also try to compress binary files (e.g., executables like bwte or jpg files). But don't kill yourself over this if it doesn't work - you will receive only a very minor deduction if your program works on text only.

Reference solutions. For reference, we provide our executable code on Windows, Solaris, and OS X. Due to bugs with either the Windows XP/NT command prompt (or both the lcc and gcc compilers), we were unable to create a Windows reference solution for the move-to-front heuristics. If anyone figures out how to fix it, please let us know.

Compression script. Unix users may wish to use the scripts compressit.csh to automate the process of determining the compressed file sizes. The following command tests your implementation against cat, Huffman encoding alone, our reference solution, gzip, and bzip2.

compressit.csh burrows cat hufe226 burrows226 gzip226 bzip226

Note that burrows and burrows226 are scripts that combine the three phases of the Burrow Wheeler data compression algorithm. The scripts gzip226 and bzip226 are scripts that run the standard Unix gzip and bzip programs and are only included for reference.

Timing script. The timeit.csh script measures execution times of the compression programs.

timeit.csh burrows wc hufe226 burrows226 gzip226 bzip226

Be careful when using /usr/bin/time with pipes - it only measures the running time of the first command. Gluing the commands together in the script burrows enables you to account for the total execution time.

readme.txt

Use the following readme file template. Besides providing details about your implementation which are not clear from the comments within your source code, your readme.txt file should also:

Report on the compression ratios for various test files.
Analyze your encoding and decoding algorithms empirically.

Possible Progress Steps

These are purely suggestions for how you might make progress. You do not have to follow these steps.

Download the directory burrows to your system. It contains an implementation of Huffman encoding and decoding and reference solutions.
Implement the move-to-front encoding and decoding algorithms. Not including declarations, our solutions are about 8 lines of code each. So, don't get carried away with fancy data structures unless you want to. When debugging, print out each character using "%d " so that you will see its value instead of a (potentially) unprintable character. Once you're convinced that your code works, use "%c" or putchar() to print it out as a byte.
For the Burrows-Wheeler transform, you should not explicitly form the N suffixes, or this part alone will take quadratic time. Instead, consider starting with the following suffix sorting code from the radix sorting lecture. Keep in mind that it does not assume the input string is cyclic, so you will need to make some minor modifications. Note that the driver program also filters whitespace and ignores non-printable characters - your final submission should not do this.
The Burrows-Wheeler decoding is the trickiest part, but it is very little code once you understand how it works. (Not including declarations and input, our solution is 9 lines of code.) You may find the key-indexed counting algorithm from the radix sorting lecture to be useful.

COS 226 Home Page
wayne@cs.princeton.edu

Last modified: March 24, 2003