COS 226 Programming Assignment #6

Checklist

Frequently Asked Questions

What program should I use for reading characters? You must use CharStdIn.java, which reads in bytes, whereas StdIn.java and In.java read in Unicode characters. We've included the CharStdIn.readAll and CharStdIn.readLine methods to facilitate reading in chunks of bytes.

How do I read in the integer for the Burrows-Wheeler decoder? Use CharStdIn.java as follows.

int first = Integer.parseInt(CharStdIn.readLine());   // the integer
String s = CharStdIn.readAll();                       // the rest of the data

How should I determine the size of the compressed file? On OS X, Linux, or Unix, compare "wc -c input.txt" with "java BWTE < input.txt | java MTFE | hufe | wc -c".  On Windows, type "java BWTE < input.txt | java MTFE | hufe > compressed.txt" and use the File Manager to inspect the file sizes.

Should I terminate my output with newlines so that it looks pretty? No. It is especially important that you don't do this, since changing even one character can result in gibberish when the output is piped to the next program. Use System.out.print instead of System.out.println to suppress the newline.
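As a minimal illustration (the class and method names here are our own, not part of the assignment), print() emits exactly the characters given, while println() would append a line terminator and corrupt the stream for the next stage:

```java
public class NewlineDemo {
    // Emits s exactly as given; with print() no newline is appended,
    // so piping this output to the next program is byte-exact.
    static void emit(String s) {
        System.out.print(s);   // no trailing '\n'
        // System.out.println(s) would append the platform line separator,
        // changing the data seen by the next program in the pipeline.
    }

    public static void main(String[] args) {
        emit("rcaaaabb");
    }
}
```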

My Burrows-Wheeler encoder gives different output from the example in the assignment for the file abra.txt. On Unix and Windows, you should get the following output:

% java BWTE < abra.txt
3
ard
rcaaaabb
The reason for the difference is that Unix puts a '\n' at the end of every file. For simplicity, we used the string "abracadabra" in the assignment as an example instead of "abracadabra\n".  Alternatively, you can use the (newly added) file abra-ncr.txt which is identical to abra.txt, but without the '\n' at the end.

Where can I get the Huffman encoder and decoder? Use the binary version for your platform in burrows/c. This packs the bits 8 to the byte instead of printing them out as '0' and '1' characters.  To execute, use hufe.exe < input.txt instead of java HUFE < input.txt since these are not Java .class files.

How much memory can my program consume?  The Burrows-Wheeler encoder may use quite a bit, so you may need to increase the maximum heap size with the -Xmx option when executing (e.g., java -Xmx512m BWTE). You should definitely aim to use space linear in the input size N. Industrial-strength Burrows-Wheeler compression algorithms typically use a fixed block size and encode the message in these smaller chunks. This reduces the memory requirements, at the expense of some loss in compression ratio.
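If you want to confirm what heap limit is actually in effect, a small diagnostic sketch (not part of the assignment; the class name is our own) using the standard Runtime API:

```java
public class HeapCheck {
    public static void main(String[] args) {
        // Runtime.maxMemory() reports the maximum heap the JVM will attempt
        // to use, in bytes; it reflects the -Xmx setting (approximately).
        long max = Runtime.getRuntime().maxMemory();
        System.out.println("max heap: " + (max >> 20) + " MB");
    }
}
```

Run it with different -Xmx values (e.g., java -Xmx512m HeapCheck) to see the limit change.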

My program runs really slowly on files like aesop-4copies.txt, which consist of a single text repeated multiple times.  Is this a problem?  No. As long as your program is reasonably fast on other large files (which do not contain duplicated text), you will not lose credit.  (We are also aware of a bug in the reference solutions that may cause them to crash on files of this kind.)

I'm curious. My programs don't work properly with binary data. Why not? It's not required, but if you're interested, here's why. System.out.print(c) prints c as a Unicode character instead of an 8-bit byte. This matters for non-ASCII characters (128-255). To print a byte, use System.out.write(c) and flush the stream at the end of your program with System.out.flush().
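A short sketch of the difference (the class and helper method are hypothetical, for illustration only): System.out.write(c) emits the low 8 bits of c as a raw byte, bypassing charset translation, and the stream must be flushed before the program exits because write() is buffered:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

public class ByteOutputDemo {
    // Writes each character of s as a single raw byte (its low 8 bits),
    // bypassing the Unicode/charset translation that print() performs.
    static void writeBytes(String s, PrintStream out) {
        for (int i = 0; i < s.length(); i++)
            out.write(s.charAt(i));  // PrintStream.write(int) emits one byte
        out.flush();                 // write() is buffered; flush before exiting
    }

    public static void main(String[] args) {
        // '\u00e9' (233) falls in the 128-255 range where print() and write() differ
        writeBytes("ab\u00e9", System.out);
    }
}
```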

How do I use gzip and bzip on Windows? It's fine to use pkzip instead.

I'm curious. What compression algorithm is used in PKZIP? In gzip? In bzip2? PKZIP uses LZW (a variant of Lempel-Ziv) compression followed by the Shannon-Fano trees algorithm (an entropy encoder similar to Huffman). The Unix utility gzip combines a variation of LZ77 (similar to LZW) and Huffman coding. The program bzip2 combines the Burrows-Wheeler transform, Huffman coding, and a fancy move-to-front style rule.

Input, Output, and Testing

Input.   Here are some sample input files.

Reference solutions.   For reference, we provide our executable code on Windows, Solaris, and OS X.  Due to a bug in the Windows XP/NT command prompt, we were unable to create a Windows reference solution in C for the move-to-front heuristics.  If anyone figures out how to fix it, please let us know.

Compression script.   Unix users may wish to use the script compressit.csh to automate the process of determining the compressed file sizes. The following command tests your implementation against cat, Huffman encoding alone, gzip, and bzip2.

compressit.csh burrows cat HUFE gzip226 bzip226
Note that burrows is a script that combines the three phases of the Burrows-Wheeler data compression algorithm. The scripts gzip226 and bzip226 run the standard Unix gzip and bzip2 programs and are included only for reference. The script HUFE uses hufe226-sparc, but takes a command-line argument (for consistency with the other scripts) instead of reading from standard input.

Timing script.   The timeit.csh script measures execution times of the compression programs.

timeit.csh cat HUFE HUFD MTFE MTFD BWTD BWTE burrows gzip226 bzip226
The scripts MTFE, MTFD, BWTD, and BWTE each take a command-line argument and call the corresponding Java program. Be careful when using /usr/bin/time with pipes: it only measures the running time of the first command.  Gluing the commands together in the script burrows enables you to account for the total execution time.

Submission

Program report.  As usual, you need to turn in an electronic program report.  Besides providing details about your implementation that are not clear from the comments within your source code, you also need to:

What to submit. Submit BWTE.java, BWTD.java, MTFE.java, MTFD.java, and any other files needed by your programs. You need not submit CharStdIn.java or the Huffman encoder / decoder.

Possible Progress Steps

These are purely suggestions for how you might make progress. You do not have to follow these steps.