COS 126
Prefix Codes
Due: Wed. 4/16, 11:59pm

Write a program, decipher.c, to decipher messages encoded using a prefix code, given the encoding tree. Such codes are widely used in applications that must compress data into as few bits as possible.

A prefix code is most easily represented by a binary tree in which the external nodes labelled with single characters or short strings that can be combined to for the message. The "code" for each string is a binary encoding of the path down from the root of the tree to the external node that holds string. A 0 bit identifies left branches in the path and a 1 bit identifies right branches. For example, in the following tree, in which shaded circles are internal nodes and transparent ovals are external nodes,

the code for the is 000, because the external node holding the is reached from the root by taking 3 left branches. The other codes in this tree are

the	000
was	00100
f	001010
c	001011
h	0011
at	010
in	011
_	1

The underscore, "_", denotes a space character. Note that the encodings have variable numbers of bits. This is a fundamental property of prefix codes, and it ensures that bits can be saved by using short encodings for frequently used message fragments.

A second fundamental property of prefix codes is that messages can be formed by simply stringing together the code bits, without keeping track of which bit corresponds to which message fragment. For example, the bitstring

000100101101010111000100110101001001001010010

encodes the message "the cat in the hat was fat". The first three 0's must encode the, then the 1 bit must encode a space, then 001011 must encode c, and so on as follows ("_" denotes a space):

000	the
1	_
001011	c
010	at
1	_
011	in
1	_
000	the
1	_
0011	h
010	at
1	_
00100	was
1	_
001010	f
010	at

The codes can be run together because no encoding is a prefix of another one. This property defines a prefix code, and it allows us to represent the encoding with a tree, as shown above. Thus, to decode a given bit string: Start at the root of the tree, take one bit at a time, and go left when the bit is 0 and right when it's 1, until you reach an external node. Then, print the string in that external node, and start over at the root. Implementing this algorithm is your primary task in this assignment.

To decode a bit string, you need an encoding tree, and to represent the tree, we need a linear encoding of the tree itself. One method is to do a preorder walk of the tree, printing, for each external node, the number of left links travelled to get to that node (since the last node) followed by the node's label. The linear representation of the tree above using this method is

3 the 2 was 1 f 0 c 0 h 1 at 0 in 0 _

where "_" represents the space character. To read an encoding tree, you can call buildcode, declared in /u/cs126/examples/buildcode.h:

struct node {
	char *label;
	struct node *left, *right;
};

extern struct node *buildcode(void); /*
	reads a linear encoding of a prefix code tree
	from the standard input, builds the tree,
	and returns the root of that tree. */

buildcode uses node structures for both internal and external nodes; internal nodes have null label fields and external nodes have null left and right fields. The source code for buildcode is in /u/cs126/examples/buildcode.c.

Write decipher.c, a program that reads an encoding tree and a encoded message from the standard input and writes the decoded message on the standard output. Here's an outline for decipher.c:

#include <stdio.h>
#include "buildcode.h"

void decipher(struct node *tree) { /* decode and print the standard input */ }

int main(void) {
	struct node *codetree = buildcode();

	decipher(codetree);
	return 0;
}

Your task is to write decipher. For simplicity, assume that the input message appears as 0 and 1 characters following the encoding tree in the standard input. In an actual application, these bits would be packed eight to the byte, thus using 1/8th the space. Test your program on the data in /u/cs126/data/cat, which is the message "the cat in the hat was fat". You'll need to compile and load buildcode along with your program, and load /u/cs126/lib/libmisc.a, because buildcode calls emalloc and strsave. All of this can be done with the command

% /u/cs126/bin/lcc -n decipher.c buildcode.c

/u/cs126/bin/lcc will find buildcode.h and buildcode.c in /u/cs126/examples, unless you have your own copies. Assignment 8 explains more about /u/cs126/bin/lcc.

Run your program on the file /u/cs126/data/code, which is a message encoded using a single-character encoding built according to frequency of usage of letters in English. /u/cs126/data/code contains a 289-character message, and the code uses 1267 bits, which would normally be stored in 159 characters, or 56% of the space required by normal ASCII. This kind of savings is typical.

Turn in your code and your documentation with the command

/u/cs126/bin/submit 9 readme decipher.c

Extra Credit

Add an option "-d" to print out the code table. That is, for each external node in the tree, print the node label followed by its bit encoding. For example:

% a.out -d </u/cs126/data/cat
the	000
was	00100
f	001010
c	001011
h	0011
at	010
in	011
_	1
the cat in the hat was fat

Copyright © 1996 David R. Hanson / drh@cs.princeton.edu
$Revision: 1.3 $ $Date: 1996/11/14 21:08:15 $