COS 226 Programming Assignment Checklist: Burrows-Wheeler Data Compression

Special collaboration policy for Fall 2004: for this assignment, you are permitted to work jointly with one classmate. If you choose to do so, you and your partner should submit your code jointly (under one login name only) but each of you should submit your own readme.txt file.

Frequently Asked Questions

What program should I use for reading characters? You must use CharStdIn.java. It reads in 8-bit bytes, whereas StdIn.java and In.java read in Unicode characters. We've added the CharStdIn.readAll() and CharStdIn.readLine() methods to facilitate reading in chunks of bytes.

How do I read in the integer for the Burrows-Wheeler decoder? Use CharStdIn.java as follows.

int first = Integer.parseInt(CharStdIn.readLine());   // the integer
String s = CharStdIn.readAll();                       // the rest of the data

How should I determine the size of the compressed file? On OS X, Linux, or Unix, compare "wc -c input.txt" with "java BWTE < input.txt | java MTFE | hufe | wc -c". On Windows, type "java BWTE < input.txt | java MTFE | hufe > compressed.txt" and use the File Manager to inspect the file sizes.

Should I terminate my output with a newline so that it looks pretty? No. It is especially important that you not do this: changing even one character can turn the output into gibberish when it is piped to the next program. Use System.out.print instead of System.out.println to suppress the newline.
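
For example, here is a minimal sketch of printing the encoder's output, assuming the row index is stored in an int first and the transformed text in a char array t[] (both names are illustrative, not part of the provided code):

System.out.println(first);          // the row index, on its own line
for (int i = 0; i < t.length; i++)
    System.out.print(t[i]);         // the transformed text, with no trailing newline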

My Burrows-Wheeler encoder gives different output from the example in the assignment for the file abra.txt. Why? On Unix and Windows, you should get the following output:

% java BWTE < abra.txt
3
ard
rcaaaabb
The reason for the difference is that the file abra.txt ends with a newline character '\n', as Unix text files conventionally do. For simplicity, we used the string "abracadabra" in the assignment as an example instead of "abracadabra\n".
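
On Unix, you can confirm the trailing newline by dumping the bytes of the file:

% od -c abra.txt

The last character shown should be \n.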

Where can I get the Huffman encoder and decoder? Use the binary version for your platform in burrows/c. Unlike the version presented in class, these pack the bits 8 to the byte instead of printing them out as '0' and '1' characters. To execute, use hufe.exe < input.txt instead of java HUFE < input.txt since these are not Java .class files.

How much memory can my program consume? The Burrows-Wheeler encoder may use quite a bit of memory, so you may need to use the -Xmx option when executing. You should definitely aim to use space linear in the input size N. Industrial-strength Burrows-Wheeler compression algorithms typically use a fixed block size and encode the message in these smaller chunks. This reduces the memory requirements, at the expense of some loss in compression ratio.
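
For example, to allow the Java VM up to 512 megabytes of heap space (the 512m figure is only an illustration; pick a value appropriate for your input):

% java -Xmx512m BWTE < input.txt | java MTFE | hufe | wc -c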

I'm curious. My programs don't work properly with binary data. Why not? It's not required, but if you're interested, here's why. System.out.print(c) prints c as a Unicode character instead of an 8-bit byte. This matters for non-ASCII characters (128-255). To print a byte, use System.out.write(c) and flush the stream at the end of your program with System.out.flush().
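
Here is a minimal, self-contained sketch of the byte-oriented alternative (the class name and string are illustrative only):

public class ByteOutputExample {
    public static void main(String[] args) {
        String s = "abracadabra";
        for (int i = 0; i < s.length(); i++)
            System.out.write(s.charAt(i));   // writes the low-order 8 bits of each character
        System.out.flush();                  // without this, buffered output may never appear
    }
}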

How do I use gzip and bzip on Windows? It's fine to use pkzip instead.

I'm curious. What compression algorithm is used in PKZIP? In gzip? In bzip2? PKZIP uses LZ77-style dictionary compression followed by Shannon-Fano coding (an entropy encoder similar to Huffman). The Unix utility gzip combines a variation of LZ77 with Huffman coding. The program bzip2 combines the Burrows-Wheeler transform, a fancy move-to-front style rule, and Huffman coding.

Input, Output, and Testing

Input.   Here are some sample input files.

Reference solutions.   For reference, we provide our executable code on Windows, Solaris, and OS X. Due to a bug in the Windows XP/NT command prompt, we were unable to create a Windows reference solution in C for the move-to-front heuristics. If anyone figures out how to fix it, please let us know.

Compression script.   Unix users may wish to use the script compressit.csh to automate the process of determining the compressed file sizes. The following command tests your implementation against cat, Huffman encoding alone, gzip, and bzip2.

compressit.csh burrows cat HUFE gzip226 bzip226
Note that burrows is a script that combines the three phases of the Burrows-Wheeler data compression algorithm. The scripts gzip226 and bzip226 run the standard Unix gzip and bzip2 programs and are included only for reference. The script HUFE uses hufe226-sparc, but takes a command-line argument (for consistency with the other scripts) instead of reading from standard input.

Timing script.   The timeit.csh script measures execution times of the compression programs.

timeit.csh cat HUFE HUFD MTFE MTFD BWTD BWTE burrows gzip226 bzip226
The scripts MTFE, MTFD, BWTD, and BWTE each take a command-line argument and call the corresponding Java program. Be careful when using /usr/bin/time with pipes: it measures only the running time of the first command. Gluing the commands together in the script burrows enables you to account for the total execution time.
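
As a rough illustration of the gluing idea, a script like burrows might simply run the whole pipeline as one command, so that /usr/bin/time sees a single process to measure (this is only a hypothetical sketch; the actual burrows script provided with the assignment may differ):

#!/bin/csh
# run all three compression phases on the file named by the first argument
java BWTE < $1 | java MTFE | hufe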

Submission

readme.txt. Use the following readme file template. Besides providing details about your implementation which are not clear from the comments within your source code, your readme.txt file should also:

What to submit. Submit BWTE.java, BWTD.java, MTFE.java, MTFD.java, and any other files needed by your programs. You need not submit CharStdIn.java or the Huffman encoder / decoder.

Possible Progress Steps

These are purely suggestions for how you might make progress. You do not have to follow these steps.