COS 226 Programming Assignment Checklist: Burrows-Wheeler Data Compression

Frequently Asked Questions

What program should I use for reading characters? You must use CharStdIn.java. This reads in bytes, whereas StdIn.java and In.java read in Unicode characters.

My programs don't work properly with binary data. Why not? System.out.print(ch) translates the character ch into one or more bytes according to the platform's default character encoding. This matters for non-ASCII characters (128 - 255). To print a character as a single byte, use System.out.write(ch), and flush the stream at the end of your program with System.out.flush().
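
For example, here is a minimal sketch of a program that copies standard input to standard output one byte at a time; it uses raw System.in (rather than the course-provided CharStdIn) purely to illustrate the write/flush idiom.

public class ByteEcho {
    public static void main(String[] args) throws java.io.IOException {
        // Copy standard input to standard output byte by byte.
        // System.out.write(b) emits the raw byte; System.out.print(ch)
        // would re-encode it using the platform's character encoding.
        int b;
        while ((b = System.in.read()) != -1) {
            System.out.write(b);
        }
        System.out.flush();   // flush, or trailing bytes may never appear
    }
}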

How do I read and write data for the move-to-front encoder and decoder?

while (!CharStdIn.isEmpty()) {
   char ch = CharStdIn.readChar();
   ...
   System.out.write( /* a char */ );
}
System.out.flush();
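
If it helps to see the idea concretely, below is one possible sketch of the move-to-front bookkeeping that goes inside that loop. It assumes an alphabet of 256 extended-ASCII characters and the course-provided CharStdIn; the class name and variable names are illustrative only.

public class MTFESketch {
    public static void main(String[] args) {
        // Ordered alphabet of 256 extended-ASCII characters, initially 0, 1, ..., 255.
        char[] order = new char[256];
        for (int i = 0; i < 256; i++) order[i] = (char) i;

        while (!CharStdIn.isEmpty()) {
            char ch = CharStdIn.readChar();
            // Find the current position of ch, output that index as one byte,
            // then slide the preceding characters back and move ch to the front.
            int j = 0;
            while (order[j] != ch) j++;
            System.out.write(j);
            for (int k = j; k > 0; k--) order[k] = order[k - 1];
            order[0] = ch;
        }
        System.out.flush();
    }
}

The decoder is symmetric: read an index, output the character currently at that position in the ordering, and move it to the front.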

How do I read and write data for the Burrows-Wheeler encoder?

// read data
String s = CharStdIn.readAll();

...

// write data
System.out.println( /* an int */ );
for (int i = 0; i < N; i++) {
   ...
   System.out.write( /* a char */ );
}
System.out.flush();
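
As a rough sketch (not the required implementation), the transform can be computed by sorting all N circular rotations with Arrays.sort(). The version below builds each rotation as an explicit string, which is simple but memory-hungry; it is only meant to show where first and the last column come from. The class name is illustrative, and the course-provided CharStdIn is assumed.

import java.util.Arrays;

public class BWTESketch {
    public static void main(String[] args) {
        String s = CharStdIn.readAll();
        int N = s.length();

        // Form the N circular rotations of the input. (A real solution would
        // sort indices into s + s rather than materializing whole strings.)
        String ss = s + s;
        String[] rotations = new String[N];
        for (int i = 0; i < N; i++)
            rotations[i] = ss.substring(i, i + N);
        Arrays.sort(rotations);

        // first = the row of the sorted matrix that equals the original string.
        int first = -1;
        for (int i = 0; i < N; i++)
            if (rotations[i].equals(s)) { first = i; break; }

        // Output the row index, then the last column of the sorted rotations.
        System.out.println(first);
        for (int i = 0; i < N; i++)
            System.out.write(rotations[i].charAt(N - 1));
        System.out.flush();
    }
}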

How do I read and write data for the Burrows-Wheeler decoder?

// read data
int first = Integer.parseInt(CharStdIn.readLine());
String s  = CharStdIn.readAll();

...

// write data
for (int i = 0; i < N; i++) {
   ...
   System.out.write( /* a char */ );
}
System.out.flush();
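
If you want a starting point for the middle part, here is a sketch of the standard inverse transform using a next[] array built with counting sort. It assumes an alphabet of 256 extended-ASCII characters and the course-provided CharStdIn; names are illustrative.

public class BWTDSketch {
    public static void main(String[] args) {
        int first = Integer.parseInt(CharStdIn.readLine());
        String t  = CharStdIn.readAll();    // last column of the sorted rotations
        int N = t.length();

        // Counting sort of the last column. After this pass, next[i] is the row
        // of the sorted matrix whose rotation is row i's rotation shifted left
        // by one character.
        int[] count = new int[256 + 1];
        for (int i = 0; i < N; i++)
            count[t.charAt(i) + 1]++;
        for (int r = 0; r < 256; r++)
            count[r + 1] += count[r];
        int[] next = new int[N];
        for (int i = 0; i < N; i++)
            next[count[t.charAt(i)]++] = i;

        // Follow the chain starting at 'first'. The first character of row i
        // equals the last character of row next[i], so t.charAt(next[row])
        // reproduces the original string one character at a time.
        for (int i = 0, row = first; i < N; i++, row = next[row])
            System.out.write(t.charAt(next[row]));
        System.out.flush();
    }
}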

How do I read and write data for the Huffman encoder?

// read data
String s = CharStdIn.readAll();

...

// write preorder traversal by calling System.out.write(c)
System.out.println();

// write number of characters in input
System.out.println(numberOfChars);

// write sequence of bytes by calling System.out.write(c)
System.out.flush();
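
The exact trie format is up to you, as long as the decoder can read back exactly what the encoder writes. As one possible scheme (an assumption, not a requirement), a preorder traversal can mark each internal node with a '0' byte and each leaf with a '1' byte followed by the leaf's character; the Node fields below are modeled on the lecture notes.

public class HuffmanTrieSketch {
    private static class Node {
        char ch;              // character stored in a leaf
        Node left, right;     // subtries
        boolean isLeaf() { return left == null && right == null; }
    }

    // Write the trie in preorder: '0' for an internal node,
    // '1' followed by the character for a leaf.
    private static void writeTrie(Node x) {
        if (x.isLeaf()) {
            System.out.write('1');
            System.out.write(x.ch);
            return;
        }
        System.out.write('0');
        writeTrie(x.left);
        writeTrie(x.right);
    }

    public static void main(String[] args) {
        // Tiny demonstration: a two-leaf trie for 'a' and 'b'.
        Node a = new Node();  a.ch = 'a';
        Node b = new Node();  b.ch = 'b';
        Node root = new Node();  root.left = a;  root.right = b;
        writeTrie(root);          // writes the bytes: 0 1 a 1 b
        System.out.println();     // the newline that the decoder later eats
        System.out.flush();
    }
}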

How do I read and write data for the Huffman decoder?

// read preorder traversal by calling CharStdIn.readChar()
// see lecture notes on Huffman Decoder -- but you'll need to modify that code
CharStdIn.readLine();   // to eat up newline

// read number of characters to be printed
int numberOfChars = Integer.parseInt(CharStdIn.readLine());

// read bytes using CharStdIn.readChar() and write using System.out.write()
System.out.flush();
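
A matching sketch for rebuilding the trie, assuming the same '0'/'1' preorder format as the encoder sketch above (again, the Node class and method name are assumptions; this method would live inside your decoder):

public class HuffmanReadTrieSketch {
    private static class Node {
        char ch;
        Node left, right;
        Node(char ch) { this.ch = ch; }       // leaf
        Node(Node left, Node right) {         // internal node
            this.left  = left;
            this.right = right;
        }
    }

    // Rebuild the trie from its preorder encoding:
    // '1' means a leaf (the next byte is its character), '0' an internal node.
    private static Node readTrie() {
        char marker = CharStdIn.readChar();
        if (marker == '1')
            return new Node(CharStdIn.readChar());
        Node left  = readTrie();
        Node right = readTrie();
        return new Node(left, right);
    }
}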

When I run the Huffman encoding algorithm on the input abracadabra!, I get different output for the bytes. Why? Different operating systems may use different character encodings for extended ASCII symbols. As a result, the acute accent and the no-break space may be displayed differently on your system (or may appear as a ? if your terminal font does not support them). For example, using IBM DOS extended ASCII, the resulting symbols should be |⊣|á, i.e., the vertical bar, the left tack, the vertical bar, and a-acute.

How should I determine the size of the compressed file? On OS X, Linux, or Unix, compare "wc -c input.txt" with "java BWTE < input.txt | java MTFE | java HUFE | wc -c". On Windows, type "java BWTE < input.txt | java MTFE | java HUFE > compressed.txt" and use the File Manager to inspect the file sizes.

The program RadixQuicksort.java doesn't work with some inputs. Why not? It assumes that the strings are '\0'-terminated. Feel free to use Arrays.sort() or modify RadixQuicksort so that it sorts arbitrary strings.

I have a difference of one character when I combine all three encoders (even though when I run each encoder separately I have no differences). Is this a Windows curiosity? Yes, we think so. Try running on hats.

How much memory can my program consume? The Burrows-Wheeler encoder may use quite a bit, so you may need to use the -Xmx option when executing. You must use space linear in the input size N. (Industrial strength Burrows-Wheeler compression algorithms typically use a fixed blocksize, and encode the message in these smaller chunks. This reduces the memory requirements, at the expense of some loss in compression ratio.)
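
For example, something along the lines of "java -Xmx512m BWTE < input.txt" raises the maximum heap size to 512 megabytes (adjust the number as needed).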

Why not use byte instead of char to represent an 8-bit byte? You certainly can, but if you do, be careful to note that a byte is a signed 8-bit integer (whereas char is an unsigned 16-bit integer). Also, it's convenient to be able to manipulate a sequence of bytes using the built-in String data type.
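
The following toy program (illustrative only) shows the difference, and the standard trick of masking with 0xff to convert a signed byte back to its unsigned value:

public class ByteVsChar {
    public static void main(String[] args) {
        byte b  = (byte) 200;           // a byte is signed: this stores -56
        char ch = (char) (b & 0xff);    // mask with 0xff to recover the unsigned value
        System.out.println((int) b);    // prints -56
        System.out.println((int) ch);   // prints 200
    }
}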

Where can I learn about bit-whacking operators in Java for Huffman encoding and decoding? Here's one resource. If you haven't used bit-whacking operations before, take this as a good opportunity to learn. Alternatively, you can use Integer.parseInt(s, 2) and Integer.toBinaryString().
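
If you go the bit-whacking route, the core idiom is shifting bits into an integer with << and |=. The snippet below (illustrative only) packs the bits "01000001" into the character 'A', and shows the library alternatives mentioned above:

public class BitWhack {
    public static void main(String[] args) {
        String bits = "01000001";
        int packed = 0;
        for (int i = 0; i < 8; i++) {
            packed <<= 1;                            // make room for the next bit
            if (bits.charAt(i) == '1') packed |= 1;  // set the low-order bit
        }
        System.out.println((char) packed);                       // prints A (65)

        // The same conversions using the library methods.
        System.out.println((char) Integer.parseInt(bits, 2));    // prints A
        System.out.println(Integer.toBinaryString(packed));      // prints 1000001
    }
}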

How do I use gzip and bzip2 on Windows? It's fine to use pkzip or 7-zip instead.

How can I compare the contents of two files (to check that the decoded version equals the original)? On Linux or OS X, use the command diff file1 file2; on Windows use the command fc file1 file2.

I'm curious. What compression algorithm is used in PKZIP? In gzip? In bzip2? PKZIP uses LZW compression followed by the Shannon-Fano trees algorithm (an entropy encoder similar to Huffman). The Unix utility gzip combines a variation of LZ77 (similar to LZW) and Huffman coding. The program bzip2 combines the Burrows-Wheeler transform, Huffman coding, and a (fancier) move-to-front style rule.

When I try to run compressit.csh and timeit.csh, I get a "permission denied" message. What can I do to activate these scripts? Use chmod to add execute privileges:

   % chmod +x compressit.csh timeit.csh
   % chmod +x MTFE MTFD HUFE HUFD BWTE BWTD
   % chmod +x burrows  bzip226  gzip226

Input, Output, and Testing

Input. Here are some sample input files. To fully test your program, you should also try to compress and uncompress binary files (e.g., .class or .jpg files). But don't kill yourself over this if it doesn't work - you will receive only a minor deduction if your program works on text only.

Reference solutions. For reference, we have provided the output of compressing aesop.txt and us.gif. We have also provided the results of applying each of the three encoding algorithms in isolation. Note that the GIF file is a binary file and is already compressed.

Compression script.   Linux/OS X users may wish to use the script compressit.csh to automate the process of determining the compressed file sizes. The following command tests your implementation against cat, Huffman encoding alone, gzip, and bzip2.

% ./compressit.csh cat ./HUFE ./burrows ./gzip226 ./bzip226
Note that burrows is a script that combines the three phases of the Burrows-Wheeler data compression algorithm. The scripts gzip226 and bzip226 run the standard Unix gzip and bzip2 programs and are only included for reference.

Timing script.   The timeit.csh script measures execution times of the compression programs.

% ./timeit.csh cat ./HUFE ./MTFE ./BWTE ./burrows ./gzip226 ./bzip226
The scripts HUFE, MTFE, and BWTE each take a command line argument and call the corresponding Java program. Be careful when using /usr/bin/time with pipes - it only measures the running time of the first command. Gluing the commands together in the script burrows enables you to account for the total execution time.

Possible Progress Steps

These are purely suggestions for how you might make progress. You do not have to follow these steps.