COS 226 Programming Assignment

Burrows-Wheeler Data Compression Algorithm

Implement the Burrows-Wheeler data compression algorithm. This revolutionary algorithm outcompresses gzip and PKZIP, is relatively easy to implement, and is not protected by any patents. It forms the basis of the Unix compression utililty bzip2. The Burrows-Wheeler compression algorithm consists of three different algorithmic components, which are applied in succession:

  1. Burrows-Wheeler transform. Given a normal English text file, transform it into a text file in which series of the same letter occur near each other many times.

  2. Move-to-front encoding. Given a text file in which series of the same letter occur near each other many times, convert it into a text file in which small integers appear more frequently than large ones.

  3. Huffman encoding. Given a text file in which certain symbols appear more frequently than others, compress it by encoding common letters with short codewords and rare letters with long codewords.
The only step that compresses the message is the final step. It is particularly effective because the first two steps result in a message in which certain symbols appear more frequently than others. To decode a message, apply the inverse operations in reverse order: first apply the Huffman decoding, then the move-to-front decoding, and finally the inverse Burrows-Wheeler transform.

Huffman encoding. We implemented the Huffman encoding algorithm in class, so we'll provide this part for you.

Huffman decoding. You did this in COS 126, so we'll provide the decoder for you too.

Move-to-front encoding. Initialize an ordered list of 256 characters, where extended ASCII character i appears ith in the list. Read in each character c from standard input one at a time, output the index in the array where c appears, and move c to the front of the list. As an example, the move-to-front encoding of

a b b b a a b b b b a c c a b b a a a b c
is given by the following sequence of integers:
97 98 0 0 1 0 1 0 0 0 1 99 0 1 2 0 1 0 0 1 2
Note that 'a' is 97 in ASCII and that we have printed out the indices as type int with separating whitespace (for debugging only) rather than as type char without separating whitespace (for the version to submit). Name your program mtfe.c.

Move-to-front decoding. Initialize an ordered list of 256 characters, where extended ASCII character i appears ith in the list. Read in each character i (but think of it as an integer between 0 and 255) from standard input one at a time, print the ith character in the list, and move that character to the front of the list. Name your program mtfd.c and check that it recovers any message encoded with mtde.c.

Burrows-Wheeler transform. The goal of the Burrows-Wheeler transform is not to compress a message, but rather to transform it into a form that is more amenable to compression. The transform rearranges the characters in the input so that that there are lots of clusters with repeated characters, but in such a way that it is still possible to recover the original input. It relies on the following intuition: if you see the letters hen in English text, then most of the time the letter preceding it is t or w. If you could somehow group all such preceding letters together (mostly t's and some w's), then you would have an easy opportunity for data compression.

First treat the input string as a cyclic string and sort the N suffixes of length N. Here is how it works for the text message "abracadabra". Ignore the next column for now - you will need it for decoding only.

 i     Original Suffixes          Sorted Suffixes   t     next
--    ---------------------     ---------------------     ----
 0    a b r a c a d a b r a     a a b r a c a d a b r        2
 1    b r a c a d a b r a a     a b r a a b r a c a d        5
*2    r a c a d a b r a a b     a b r a c a d a b r a        6
 3    a c a d a b r a a b r     a c a d a b r a a b r        7
 4    c a d a b r a a b r a     a d a b r a a b r a c        8
 5    a d a b r a a b r a c     b r a a b r a c a d a        9
 6    d a b r a a b r a c a     b r a c a d a b r a a       10
 7    a b r a a b r a c a d     c a d a b r a a b r a        4
 8    b r a a b r a c a d a     d a b r a a b r a c a        1
 9    r a a b r a c a d a b     r a a b r a c a d a b        0
10    a a b r a c a d a b r     r a c a d a b r a a b        3
The Burrows Wheeler transform t[] is the last column in the suffix sorted list, preceded by the the index into the suffix sorted row in which the original string appears.
2
rdarcaaaabb
Notice how there are 4 a's in a row and 2 consecutive b's - this makes the file easier to compress. Write a program bwte.c to read in a text string and output the Burrows-Wheeler transform.

Inverting the Burrows-Wheeler transform. Now we describe how to undo the Burrows-Wheeler transform and recover the original message. If the jth suffix appears in row i above, then next[i] records the row where the j+1st suffix (jth suffix, shifted one character to the right) appears. Knowing the array next[] makes decoding easy, as with the following C code:

int x = 2;
int next[11] = { 2, 5, 6, 7, 8, 9, 10, 4, 1, 0, 3 };
char t[11] = "rdarcaaaabb";
for (i = 0; i < 11; i++)
   putchar(t[x = next[x]]);
Amazingly, the information contained in the Burrows-Wheeler transform is enough to reconstruct next[], and hence the original message! First, we know all of the characters in the original message, even if they're permuted in the wrong order. This enables us to reconstruct the first column in the suffix sorted list by sorting the characters. Since c only occurs once in the message and the suffixes are formed using cyclic wrap-around, we can deduce that next[7] = 4. Similarly, d only occurs once, so we can deduce that next[8] = 1.
 i      Sorted Suffixes   t      next
--    ---------------------      ----
 0    a ? ? ? ? ? ? ? ? ? r
 1    a ? ? ? ? ? ? ? ? ? d
*2    a ? ? ? ? ? ? ? ? ? a
 3    a ? ? ? ? ? ? ? ? ? r 
 4    a ? ? ? ? ? ? ? ? ? c
 5    b ? ? ? ? ? ? ? ? ? a
 6    b ? ? ? ? ? ? ? ? ? a
 7    c ? ? ? ? ? ? ? ? ? a        4
 8    d ? ? ? ? ? ? ? ? ? a        1
 9    r ? ? ? ? ? ? ? ? ? b
10    r ? ? ? ? ? ? ? ? ? b
However, since r appears twice, it may seem ambiguous whether next[9] = 0 and next[10] = 3, or whether next[9] = 3 and next[10] = 0. Here's the key rule that resolves the ambiguity:
If row i and j both start with the same letter and i < j, then next[i] < next[j].
This rule implies next[9] = 0 and next[10] = 3. But why is this rule valid? The rows are sorted so row 9 is lexicographically less than row 10. Thus the nine unknown characters in row 9 must be less than the the nine unknown characters in row 10 (since both start with r). We also know that between the two rows that end with r, row 0 is less than row 3. But, the nine unknown characters in row 9 and 10 are precisely the first nine characters in rows 0 and 3. Thus, next[9] = 0 and next[10] = 3 or this would contradict the fact that the suffixes were sorted.

Name your program bwtd.c and check that it recovers any message encoded with bwte.c.

Analysis. Once you have all 6 programs working, compress some of these text files. Calculate the compression ratio achieved for each file. Also, report the time to compress and decompress each file.