COS 226 Programming Assignment Checklist: Burrows-Wheeler Data Compression

Frequently Asked Questions

What program should I use for reading and writing the data? You must use BinaryStdIn.java and BinaryStdOut.java, both of which are in stdlib.jar. These read and write sequences of bytes, whereas StdIn.java and StdOut.java (like Scanner and System.out.print()) read and write sequences of Unicode characters.

My programs don't work properly with binary data. Why not? Be absolutely sure that you use only BinaryStdIn.java and BinaryStdOut.java for input and output, and that you call BinaryStdOut.flush() or BinaryStdOut.close() after you are done writing (for an example, see Huffman.expand()).

Why does BinaryStdIn return the 8 bits as a (16-bit unsigned) char instead of as an (8-bit) byte? The primitive type byte is a bit annoying in Java: it is signed, and when you operate on a byte, it is typically promoted to an int (with sign extension). E.g., to convert a byte b to a char ch, you must write ch = (char) (b & 0xff) instead of ch = (char) b. By using char, we avoid the hassle.
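
To see the difference concretely, here is a minimal sketch. The class name ByteToChar and the two helper methods are made up for this illustration; they are not part of stdlib.jar.

```java
public class ByteToChar {
    // Hypothetical helper: direct cast. The byte is sign-extended to an int
    // first, so a negative byte becomes a char with its high bits set.
    static char widenNaive(byte b) {
        return (char) b;
    }

    // Hypothetical helper: mask off the sign-extended bits before casting,
    // which keeps only the low 8 bits.
    static char widenMasked(byte b) {
        return (char) (b & 0xff);
    }

    public static void main(String[] args) {
        byte b = (byte) 0xE9;                        // bit pattern 1110 1001, i.e. -23 as a signed byte
        System.out.println((int) widenNaive(b));     // prints 65513 (0xFFE9)
        System.out.println((int) widenMasked(b));    // prints 233   (0x00E9)
    }
}
```

Only the masked version recovers the value 0xE9 that was actually stored in the byte.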

How can I compare the contents of two files (to check that the decoded version equals the original)? On OS X and Linux, use the command diff file1 file2; on Windows use the command fc file1 file2.

How can I view the contents of a binary file? Use HexDump.java, as in the assignment. The command-line argument specifies the number of bytes per line to print.

How do I determine the sizes of the original and compressed files? Use HexDump.java, as in the assignment. Use a command-line argument of 0 to suppress all output except for the number of bytes.

How much memory can my program consume? The Burrows-Wheeler encoder may use quite a bit, so you may need to use the -Xmx option when executing. You must use space linear in the input size N. Consequently, depending on your operating system and configuration, there may be some very large files for which your program will not have enough memory even with the -Xmx option. (Industrial-strength Burrows-Wheeler compression algorithms typically use a fixed block size and encode the message in these smaller chunks. This reduces the memory requirements, at the expense of some loss in compression ratio.)

How do I use gzip and bzip2 on Windows? It's fine to use pkzip or 7-zip instead.

I'm curious. What compression algorithm is used in PKZIP? In gzip? In bzip2? PKZIP uses LZW compression followed by the Shannon-Fano trees algorithm (an entropy encoder similar to Huffman). The Unix utility gzip combines a variation of LZ77 (similar to LZW) and Huffman coding. The program bzip2 combines the Burrows-Wheeler transform, Huffman coding, and a (fancier) move-to-front style rule.

Input, Output, and Testing

Input. Here are some sample input files. To fully test your program, you should also try to compress and uncompress binary files (e.g., .class or .jpg files).

Reference solutions. For reference, we have provided the output of compressing aesop.txt and us.gif. We have also provided the results of applying each of the three encoding algorithms in isolation. Note that the GIF file is a binary file and is already compressed.

Possible Progress Steps

These are purely suggestions for how you might make progress. You do not have to follow these steps.

Download the directory burrows to your system. It contains some sample input files and reference solutions.
Implement the Burrows-Wheeler transform, using the method substring() in the string library to form the suffixes. Recall that it does not explicitly form the suffixes, so you can build the N suffixes in linear time. Hint: to make the strings appear as if they were cyclic, consider string concatenation.
The Burrows-Wheeler decoding is the trickiest part, but it is very little code once you understand how it works. (Not including declarations and input, our solution is about 10 lines of code.) You may find the key-indexed counting algorithm from the string sorting lecture to be useful.
Implement the move-to-front encoding and decoding algorithms. Not including comments and declarations, our solutions are about 10 lines of code each. If yours is significantly longer, try to simplify it.
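
To make the last two steps more concrete, here is a rough sketch of move-to-front coding and of Burrows-Wheeler decoding with key-indexed counting. It is one possible approach under simplifying assumptions, not the reference solution: it works on in-memory char arrays whose values lie in 0..255 rather than reading from BinaryStdIn and writing to BinaryStdOut, and the class name CompressionSketch and all method names are invented for this illustration.

```java
public class CompressionSketch {
    private static final int R = 256;   // assumed alphabet size (extended ASCII)

    // Builds the initial sequence 0, 1, ..., R-1.
    private static char[] initialSequence() {
        char[] seq = new char[R];
        for (int i = 0; i < R; i++) seq[i] = (char) i;
        return seq;
    }

    // Move-to-front encoding: output the current position of each character,
    // then move that character to the front of the sequence.
    static char[] mtfEncode(char[] input) {
        char[] seq = initialSequence();
        char[] out = new char[input.length];
        for (int i = 0; i < input.length; i++) {
            char c = input[i];
            int pos = 0;
            while (seq[pos] != c) pos++;
            out[i] = (char) pos;
            System.arraycopy(seq, 0, seq, 1, pos);   // shift seq[0..pos-1] right by one
            seq[0] = c;
        }
        return out;
    }

    // Move-to-front decoding: each code is a position in the sequence;
    // output the character found there, then move it to the front.
    static char[] mtfDecode(char[] codes) {
        char[] seq = initialSequence();
        char[] out = new char[codes.length];
        for (int i = 0; i < codes.length; i++) {
            int pos = codes[i];
            char c = seq[pos];
            out[i] = c;
            System.arraycopy(seq, 0, seq, 1, pos);
            seq[0] = c;
        }
        return out;
    }

    // Burrows-Wheeler decoding. t is the last column of the sorted rotations
    // and first is the row at which the original string appears.
    static char[] inverseTransform(int first, char[] t) {
        int n = t.length;
        // Key-indexed counting: count[c] becomes the first sorted position of character c.
        int[] count = new int[R + 1];
        for (int i = 0; i < n; i++) count[t[i] + 1]++;
        for (int r = 0; r < R; r++) count[r + 1] += count[r];
        // next[j] is the index in t of the character that lands at sorted position j
        // (ties are resolved in order of appearance, so the sort is stable).
        int[] next = new int[n];
        for (int i = 0; i < n; i++) next[count[t[i]]++] = i;
        // Follow the next[] pointers n times, starting from row first.
        char[] out = new char[n];
        int row = first;
        for (int k = 0; k < n; k++) {
            row = next[row];
            out[k] = t[row];
        }
        return out;
    }
}
```

For example, calling inverseTransform(3, "ARD!RCAAAABB".toCharArray()) on the transform of ABRACADABRA! recovers the original string, and mtfDecode(mtfEncode(x)) returns x for any array x of 8-bit values.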