COS 226 Burrows-Wheeler Data Compression Algorithm

COS 226 Programming Assignment

Burrows-Wheeler Data Compression Algorithm

Implement the Burrows-Wheeler data compression algorithm. This revolutionary algorithm outcompresses gzip and PKZIP, is relatively easy to implement, and is not protected by any patents. It forms the basis of the Unix compression utililty bzip2.

The Burrows-Wheeler compression algorithm consists of three different algorithmic components, which are applied in succession:

Burrows-Wheeler transform. Given a typical English text file, transform it into a text file in which sequences of the same character occur near each other many times.
Move-to-front encoding. Given a text file in which sequences of the same character occur near each other many times, convert it into a text file in which certain characters appear more frequently than others.
Huffman encoding. Given a text file in which certain characters appear more frequently than others, compress it by encoding freqently occuring characters with short codewords and rare ones with long codewords.

The only step that compresses the message is the final step. The final step is particularly effective because the first two steps result in a text file in which certain characters appear much more frequently than others. To decode a message, apply the inverse operations in reverse order: first apply the Huffman decoding, then the move-to-front decoding, and finally the inverse Burrows-Wheeler transform.

Huffman encoding and encoding. Implement the classic Huffman encoding and decoding algorithm for arbitrary data. As a starting point, you can use the implementations from lecture. In lecture, we represented the Huffman tree by its preorder traversal (using the sentinel character '*' to denote an internal node) and we represented the bits as sequence of '0' and '1' characters. For example, the Huffman tree corresponding to the input abracadabra! is

and the implementations from lecture produce the following output:

% more abra.txt
abracadabra!

% java HUFE < abra.txt  (version from lecture)
*a**d*!c*rb                       // preorder traversal of trie
12                                // number of characters to decode
0111110010110100011111001010      // the 28 bits

% java HUFE < abra.txt | java HUFD  (versions from lecture)
abracadabra!

Your main task is to pack the bits 8 to the byte when encoding (and unpack them when decoding). Also, your program should handle any input sequence of bytes, whereas the implementation in lecture assumed the input did not contain the '*' character. To do so, represent the Huffman tree by its preorder traversal, but precede each node by the character '0' if it is an internal node, and by the character '1' if it is an external node. On the next line, write the the number of characters you encoded. Finally, on the next line, write the bits, packed 8 to the byte.

% java HUFE < abra.txt  (version to turn in)
0*1a0*0*1d0*1!1c0*1r1b            // preorder traversal of trie
12                                // number of characters to decode
|´|                               // the 4 bytes of data

% java HUFE < abra.txt | java HUFD  (versions to turn in)
abracadabra!

The vertical line corresponds to hex 7C (01111100), the acute accent corresponds to hex B4 (10110100), and the no-break space corresponds to hex A0 (10100000). Since the number of bits was not a multiple of 8, we pad (four) 0s to the end to fill out the last byte.

Name your programs HUFE.java and HUFD.java. Check that HUFD decompresses any message compressed with HUFE.

Move-to-front encoding. The main idea of move-to-front encoding is to maintain an ordered list of all of the characters, and repeatedly read in characters from the input message, print out the position in which that character appears, and move that character to the front of the list. As a simple example, if the initial ordering over a 6-character alphabet is a b c d e f, and we want to encode the input caaabcccaccf, then we would update the move-to-front lists as follows

move-to-front    in   out
-------------    ---  ---
 a b c d e f      c    2 
 c a b d e f      a    1
 a c b d e f      a    0
 a c b d e f      a    0
 a c b d e f      b    2
 b a c d e f      c    2
 c b a d e f      c    0
 c b a d e f      c    0
 c b a d e f      a    2
 a c b d e f      c    1
 c a b d e f      f    5
 f c a b d e

If the same character occurs next to each other many times in the input, then many of the output values will be small integers, such as 0, 1, and 2. The extremely high frequency of certain characters makes an ideal scenario for Huffman coding.

Your task is to maintain an ordered list of the 256 extended ASCII characters. Initialize the list by making the ith character in the list equal to the ith extended ASCII character. Now, read in each character ch from standard input one at a time, output the index in the array where ch appears, and move ch to the front of the list. As an example, the move-to-front encoding of

 a  b  b  b  a  a  b  b  b  b  a  c  c  a  b  b  a  a  a  b  c

is given by the following sequence of integers:

97 98  0  0  1  0  1  0  0  0  1 99  0  1  2  0  1  0  0  1  2

Note that 'a' is 97 in ASCII and that we have printed out the indices as type int with separating whitespace using System.out.print() (for debugging only) rather than as type char without separating whitespace using System.out.write() (for the version to submit). Name your program MTFE.java.

Move-to-front decoding. Initialize an ordered list of 256 characters, where extended ASCII character i appears ith in the list. Read in each character i (but treat it as an integer between 0 and 255) from standard input one at a time, print the ith character in the list, and move that character to the front of the list. Name your program MTFD.java and check that it recovers any message encoded with MTFE.java.

Burrows-Wheeler transform. The goal of the Burrows-Wheeler transform is not to compress a message, but rather to transform it into a form that is more amenable to compression. The transform rearranges the characters in the input so that that there are lots of clusters with repeated characters, but in such a way that it is still possible to recover the original input. It relies on the following intuition: if you see the letters hen in English text, then most of the time the letter preceding it is t or w. If you could somehow group all such preceding letters together (mostly t's and some w's), then you would have an easy opportunity for data compression.

First treat the input string as a cyclic string and sort the N suffixes of length N. Here is how it works for the text message "abracadabra!". The 12 original suffixes are abracadabra!, bracadabra!a, ..., !abracadabra, and appear in rows 0 through 11 of the table below. Sorting these 12 strings yields the sorted suffixes. Ignore the next[] array for now - you will only need it for decoding.

 i     Original Suffixes          Sorted Suffixes       t      next
--    -----------------------     -----------------------     ----
 0    a b r a c a d a b r a !     ! a b r a c a d a b r a      3
 1    b r a c a d a b r a ! a     a ! a b r a c a d a b r      0
 2    r a c a d a b r a ! a b     a b r a ! a b r a c a d      6
*3    a c a d a b r a ! a b r     a b r a c a d a b r a !      7
 4    c a d a b r a ! a b r a     a c a d a b r a ! a b r      8
 5    a d a b r a ! a b r a c     a d a b r a ! a b r a c      9
 6    d a b r a ! a b r a c a     b r a ! a b r a c a d a     10
 7    a b r a ! a b r a c a d     b r a c a d a b r a ! a     11
 8    b r a ! a b r a c a d a     c a d a b r a ! a b r a      5
 9    r a ! a b r a c a d a b     d a b r a ! a b r a c a      2
10    a ! a b r a c a d a b r     r a ! a b r a c a d a b      1
11    ! a b r a c a d a b r a     r a c a d a b r a ! a b      4

The Burrows Wheeler transform t[] is the last column in the suffix sorted list, preceded by the row number where the original string abracadabra ends up.

% java BWTE < abra.txt
3
ard!rcaaaabb

Notice how there are 4 consecutive a's and 2 consecutive b's - this makes the file easier to compress. Write a program BWTE.java to read in a string and output the Burrows-Wheeler transform.

Inverting the Burrows-Wheeler transform. Now we describe how to undo the Burrows-Wheeler transform and recover the original message. If the jth original suffix (original string, shifted j characters to the left) is the ith row in the sorted order, then next[i] records the row in the sorted order where the (j+1)st original suffix appears. For example, the 0th original suffix abracadabra! is row 3 of the sorted order; since next[3] = 7, the next original suffix bracadabra!a is row 7 of the sorted order. Knowing the array next[] makes decoding easy, as with the following Java code:

int N = 12;
int[] next = { 3, 0, 6, 7, 8, 9, 10, 11, 5, 2, 1, 4 };
String t = "ard!rcaaaabb";
int x = 3;
for (int i = 0; i < N; i++) {
   x = next[x];
   System.out.write(t.charAt(x));
}

Amazingly, the information contained in the Burrows-Wheeler transform is enough to reconstruct next[], and hence the original message! Here's how. First, we know all of the characters in the original message, even if they're permuted in the wrong order. This enables us to reconstruct the first column in the suffix sorted list by sorting the characters. Since 'c' only occurs once in the message and the suffixes are formed using cyclic wrap-around, we can deduce that next[8] = 5. Similarly, 'd' only occurs once, so we can deduce that next[9] = 2.

 i      Sorted Suffixes     t      next
--    -----------------------      ----
 0    ! ? ? ? ? ? ? ? ? ? ? a
 1    a ? ? ? ? ? ? ? ? ? ? r
 2    a ? ? ? ? ? ? ? ? ? ? d
*3    a ? ? ? ? ? ? ? ? ? ? !
 4    a ? ? ? ? ? ? ? ? ? ? r 
 5    a ? ? ? ? ? ? ? ? ? ? c
 6    b ? ? ? ? ? ? ? ? ? ? a
 7    b ? ? ? ? ? ? ? ? ? ? a
 8    c ? ? ? ? ? ? ? ? ? ? a        5
 9    d ? ? ? ? ? ? ? ? ? ? a        2
10    r ? ? ? ? ? ? ? ? ? ? b
11    r ? ? ? ? ? ? ? ? ? ? b

However, since 'r' appears twice, it may seem ambiguous whether next[10] = 1 and next[11] = 4, or whether next[10] = 4 and next[11] = 1. Here's the key rule that resolves the ambiguity:

If sorted row i and j both start with the same character and i < j, then next[i] < next[j].

This rule implies next[10] = 1 and next[11] = 4. Why is this rule valid? The rows are sorted so row 10 is lexicographically less than row 11. Thus the ten unknown characters in row 10 must be less than the the ten unknown characters in row 11 (since both start with r). We also know that between the two rows that end with r, row 1 is less than row 4. But, the ten unknown characters in row 10 and 11 are precisely the first ten characters in rows 1 and 4. Thus, next[10] = 1 and next[11] = 4 or this would contradict the fact that the suffixes were sorted.

Name your program BWTD.java and check that it recovers any message encoded with BWTE.java.

% java BWTE < abra.txt | java BWTD
abracadabra!

Analysis. Once you have all 6 programs working, compress some of these text files. Calculate the compression ratio achieved for each file. Also, report the time to compress and decompress each file.

Input and output. The library StdIn is designed to read in strings composed of Unicode characters, and it is unsuitable for reading in binary data. Instead use the library CharStdIn. It has the following interface.

public static boolean isEmpty()   // is stdin empty?
public static char    readChar()  // read one byte and return it as a char
public static String  readLine()  // read one line of bytes and return as a String
public static String  readAll()   // read the rest of the input and return as a String

Similarly, use System.out.write(ch) to write out an extended ASCII character ch as a byte instead of System.out.print(ch).

Deliverables. Submit HUFE.java, HUFD.java, MTFE.java, MTFD.java, BWTE.java and BWTD.java, along with any other helper files needed to run your program (excluding CharStdIn.java and those in adt.jar and stdlib.jar). Also submit a readme.txt and answer all questions.