COS 333 Assignment 3: Size Matters (Spring 2008)

Due midnight, Friday, Feb 29

Tue Feb 19 08:30:09 EST 2008

This assignment is an exercise in using scripting languages to explore and manipulate two largish text files. Your job is to compress these files (separately) as much as you can, using standard Unix tools and scripting languages like Awk, Perl, Python, PHP, and so on; you can't use compiled languages like C, C++, or Java. Here are help files for Awk, Perl, Python and PHP that might get you started.

There are many Unix tools that will help too: gzip and bzip2 do general-purpose compression; grep, sed, sort, uniq and wc let you find, count and rearrange things; and of course shells and scripting languages let you do your own processing and combine other pieces.

Your compression must be lossless, so the original data can be reconstituted exactly. If you remove something redundant, you have to be able to recreate it. Similarly, if you rearrange any part of the data, for example by sorting, you have to be able to recover what you started with. Your techniques should also be expected to achieve similar compression on an input file that is similar.

1. What's in a Name?

The file names.txt (3.1 MB) contains nearly 89,000 surnames from the 1990 census, sorted by frequency of occurrence. The original data and background information can be found at the Census Bureau.

Compress this file as much as you can, using only standard Unix tools and scripting languages. You have to also provide the procedure for uncompressing. For calibration, I squeezed it by a factor of nearly 10 with only 6-8 lines of Awk, and a factor of 12 is well within range.

It's probably easiest to do your experiments by focusing first on a compression script, but always keeping in mind the process by which you will uncompress. Once you think the compression is good enough, collect all the compression operations into a script that can be run by a single command, and all the uncompression operations into another script. The compression script creates a single output file; the decompression script reads a single input file. We will test your work by running just those two scripts.

You must submit two scripts c1 and u1 that compress and uncompress respectively, as specified below. We will run them on the census data. The output after compressing and then uncompressing must match the input exactly.

2. Rank Hath Its Privileges

The file tpop.txt (2.1 MB) contains the hourly Amazon rank of The Practice of Programming from June 14, 2001 to February 13, 2008. Each line gives the rank and the date in standard Unix form:
	5,693	Thu Jun 14 12:09:38 EDT 2001

Write a script c2 to compress this file as much as you can. Again, c2 must be accompanied by an uncompression script u2 that will restore the file to its original state: each script reads stdin and writes stdout. For calibration, I got a factor of over 23 without much work, so you should be able to do at least that well.

3. Advice

There are no right answers for this assignment. We're looking for a combination of straightforward ideas and an occasional flash of insight, expressed as an orderly set of programs that do the job automatically. There are many possible approaches that you can mix and match, so play with it over a period of time rather than doing it in a single sitting; this gives your subconscious a chance to contribute. The basic idea of compression is to remove redundancy -- things that are repeated or similar to others -- and to use fewer bits per item for items that occur often. Bzip2 does a great job of that. What you can do that bzip2 can't do is to take advantage of semantic relationships, and you can rearrange the input as well so long as you do it reversibly.

A hint for combining output files: the ar command is similar to the tar command but has much lower space overhead; you might consider using it to collect pieces into a single file if necessary.

The credit will be divided more or less evenly among the two files. You should be able to get a factor of 11-12 for the first file and at least 23-24 for the second. Try to do better than that; there won't be enormous credit for doing worse than my sleazy hacks.

There's no credit for gaming the system, for example by writing a program that hides the file contents somewhere and then magically finds them, since we might well try them on analogous but different data. You are trying to compress the file to send it to someone else. You have to include in the transmission all the information necessary to recreate it at the other end. So if you write decompression scripts or build data tables of your own, those must be included in the size. Standard tools like sort, bzip2, Awk, Perl, etc., can be assumed to exist everywhere and are not included in any size computation.

Along with fleeting fame and a good grade, there will be some utterly useless prizes for the best couple of compressions. Don't kill yourself, however -- this assignment is meant to be a bit of a change of pace after the first few. Enjoy!

4. Submission

You can mix and match Unix commands and scripts in scripting languages as you like; the result must be runnable by invoking a single command called c1 or c2. This command might be self-contained or it might invoke other files; that's your choice. It doesn't have to be robust either; it just has to work right if it's used properly. It should also run at a reasonable speed -- if it takes more than a couple of minutes, something's wrong.

The compression script for names, c1, must compress its standard input and write the result to stdout. The matching uncompression script u1 must uncompress its standard input to stdout, so the result of the two steps should be identical to the original file. We will test by

	c1 <names.txt >temp
	u1 <temp >newnames
	cmp names.txt newnames
and use the combined size of temp, c1 and u1 as the basic measure of how good your compression is.

The scripts for tpop.txt are to behave the same, but must be named c2 and u2. Your scripts must not alter or remove names.txt or tpop.txt (so we can run multiple tests without having to recreate the inputs).

To make this more concrete, here are working versions of c1 and u1, to show the general format. This achieves a factor of 5.25 merely by discarding multiple blanks; for contrast, bzip2 compresses this file by about 4.95.

	#!/bin/sh
	# c1: compression script for names.txt

	awk '{print $1, $2, $3, $4}' | bzip2

	#!/bin/sh
	# u1: uncompression script for names.txt

	bunzip2 | awk '{printf("%-14s %5s %6s %6d\n", $1, $2, $3, $4)}'

Create and submit a single file a3.tar that contains the four scripts named above, and any other programs that your process requires:

	tar cf a3.tar c1 u1 c2 u2 other programs

We will test it by running a sequence like this in a directory that contains only names.txt and tpop.txt:

	tar xf a3.tar
	c1 <names.txt >temp1; u1 <temp1 >newnames; cmp names.txt newnames
	c2 <tpop.txt >temp2;  u2 <temp2 >newtpop;  cmp tpop.txt newtpop
You have to submit all the pieces necessary to make this just plain work when we run it. Submit with
	~cos333/bin/submit 3 a3.tar

Please follow the rules on what to submit. Don't include standard Unix tools or the data files. It will help your grade if you get the filenames right and submit exactly what's asked for. Thanks.