Burrows–Wheeler Data Compression
How should I read and write the data? You must use BinaryStdIn and BinaryStdOut, which read and write sequences of bytes. Do not use either StdIn or StdOut, which read and write sequences of Unicode characters.
My programs don’t work properly with binary data. Why not? Be absolutely sure
that you use only
BinaryStdOut for reading
and writing data.
Also, be sure to call either
BinaryStdOut.close() after you are done writing; for an example, see
BinaryStdIn return the 8-bits as a (16-bit unsigned)
instead of as an (8-bit signed)
byte? The primitive type
byte is annoying
to use in Java. When you operate on a
byte, it is typically promoted to an
int and you must be careful because the type
byte is signed.
For example, to convert a
byte b to a
char c, you must write
c = (char) (b & 0xff) instead of
c = (char) b. By using
we avoid the hassle.
How do I use gzip and bzip2 on Windows? It's fine to use pkzip or 7-zip instead.
I'm curious. Which compression algorithm is used in PKZIP? In gzip? In bzip2? PKZIP uses LZW compression followed by Shannon–Fano (an entropy encoder similar to Huffman). The Unix utility gzip uses a variation of LZ77 (similar to LZW) followed by Huffman coding. The program bzip2 combines the Burrows–Wheeler transform, Huffman coding, and a (fancier) move-to-front style rule.
How can I view the contents of a binary file and determine its size? Use HexDump.java, as in the assignment. The command-line argument specifies the number of bytes per line to print; if the argument is 0, all output except for the number of bits will be suppressed.
Must I use
The constructor and methods neither write to standard output nor read from standard input,
so there is no need for either
You are free to use
StdOut.println() when testing in
How should i read the input in
The input is a sequence of extended ASCII characters (00 to FF). You can read it using
For the Burrows–Wheeler transform, which order do I use to sort the suffixes?
Use lexicographic order to sort the suffixes,
which is the natural order of the
String data type.
For the Burrows–Wheeler inverse transform, does
next always equal
And wouldn’t this mean that the index
first is redundant?
No, this is just a coincidence with the input string
Consider any two input strings that are cyclic rotations of one another,
"CADABRA!ABRA". They will have the same
sorted suffixes and
t array—their only difference will be in
Can I assume that the
inverseTransform() method in BurrowsWheeler receives only
valid inputs (that were created by a call to the
How much memory can my program consume? The Burrows–Wheeler transform may use
quite a bit, so you may need to use the
-Xmx option when executing. You must use
space linear in the input size n and alphabet size R.
(Industrial strength Burrows–Wheeler compression
algorithms typically use a fixed block size, and encode the message in these smaller
chunks. This reduces the memory requirements, at the expense of some loss in compression
Therefore, depending on your operating system and configuration
there may be some very large files for which your program will not have
enough memory even with the
I’m running out of memory in the
transform() method in Burrows–Wheeler. Any ideas?
Be sure not to create a new
String object for each circular suffix
It is OK to have multiple references to the same
String (for example if you use a
CircularSuffix nested class).
What is meant by “typical English text inputs”?
Inputs such as Aesop’s Fables, Moby Dick, or your most recent essay.
We do not mean inputs with very long repeated substrings
aesop-2copies.txt or an input will
1 million consecutive As) or random inputs.
To fully test your programs, you should use not only text files but also
binary files (such as
Reference solutions. For reference, we have provided the output of compressing
us.gif. We have also provided the results of applying each
of the three encoding algorithms in isolation. Note that
us.gif file is a binary file and
is already compressed.
To compare the contents of two files, you can use the following commands:
cmp file1 file2
fc file1 file2
Timing your program.
You want to record the “real” value.% time java-algs4 BurrowsWheeler - < mobydick.txt | java-algs4 MoveToFront - | java-algs4 edu.princeton.cs.algs4.Huffman - > mobyDickOutputFileName
timeTest.batin the same directory as your Burrows wheeler assignment. This is known as a "batch file".
% echo %time% % java-algs4 BurrowsWheeler - < mobydick.txt | java-algs4 MoveToFront - | java-algs4 edu.princeton.cs.algs4.Huffman - > mobyDickOutputFileName % echo %time%
timeTest.batby navigating a terminal window so that it's in the same directory as
timeTest.batand your .java files and type
Make sure you name the batch file
timeTest.bat instead of
Note that you can test multiple files by adding more lines to the batch file.
Timing using gzip, bzip2, 7zip, etc.
timecommand as above, but with
.bat) with the following text:
% echo %time% % 7za a -tzip mobyDickOutputFileName.zip mobydick.txt % echo %time%
This creates a file in .zip format (the same used natively by Windows for compression). To test unzipping time, use the following:
% echo %time% % 7za e mobyDickOutputFileName.zip % echo %time%
If you like, you can test other compression formats.
CircularSuffixArray. Be sure not to create new
Stringobjects when you sort the suffixes. That would take quadratic space. A natural approach is to define a nested class
CircularSuffixthat represents a circular suffix implicitly (via a reference to the input string and a pointer to the first character in the circular suffix). The constructor of
CircularSuffixshould take constant time and use constant space. You might also consider making
Comparable<CircularSuffix>interface. Note, that while this is, perhaps, the cleanest solution, it is not the fastest.
A video on compressing and another on expanding are provided for those wishing additional assistance. Be forewarned that video was made in early 2014 and is somewhat out of date. For example the API has changed.