First, implement a separate routine to count the frequency of occurrence of each character in the text, and put them out on standard output. Second, implement your compression routine (call it encode.c). It read in the counts, then the file, from standard input and put the counts, then the compressed file on standard output. Third, implement your decompression routine (call it decode.c). It reads the counts, then the compressed file from standard input, the produces the original source on standard output. For example, the commands
cc count.c -o count; cc encode.c -o encode (cat data | count; cat data) | encodeshould produce the compressed file on standard output. Note that this arrangement allows your routines to be filters: that is, you should not allocate storage for the file being compressed or for the result. You only need a few small arrays, and no strings.
As usual, assume that the string contains only lower-case letters and spaces, with upper- and lower-case letters equivalent and strings of nonletters collapsed to a single space. Use the decimal digits 0-9 for the compressed output. This is for ease in debugging. We could easily change this to a bitstream or a bytestream (even with a separate filter), but will not bother to do so for this assignment. To figure out how many bytes would be in the code for comparison with other methods, you can simply multipy the number of decimal digits by log(256)/log(10)=2.408... .
First, check your understanding of the method by implementing it for tiny files like abracadabra using floating point numbers for the program variables. You will find that precision is a problem for encoding long strings, but you should be able to get a feel for the method for short strings.
Next, work on making your encode routine effective for long strings, by rescaling as code digits are output to avoid carrying leading digits that are known to be equal for the boundaries of the interval. Check that your new implementation produces the same codes as the old one for tiny files, then try longer files. You still need to be aware of some degenerate conditions (does your program work for aaaaaaaaa?), and you may find some quirky conditions difficult or impossible to handle (does your program work for a file with a million a`s and one b?). Make note of such conditions and move on.
The most difficult part of the assignment is rescaling for the decoding routine. Think about this very carefully before trying to implement it. You might find it useful to use integers instead of floating point numbers for certain parts of the calculation, particularly for long messages.
Submit encode.c, decode.c, and count.c for the best implementation that you come up with, along with a README file describing its limitations. Do not spend an excessive amount of time on this program! Not many people will be able to develop a bulletproof implementation for giant files in the short amount of time available, and we will give decent credit for implementations that work for tiny files. It is more important that your documentation show evidence of some understanding about the types of files for which your implementation fails than it is for your program to be able to compress Moby Dick.
Extra Credit. Make your program adaptive (start with uniform counts, then readjust them occasionally on the basis of characters seen so far) and make it produce binary files. Then compare it with pack, gzip, and compress.