Due midnight, Friday, March 8
Tue Feb 26 10:05:35 EST 2013
This assignment is an exercise in using scripting languages to explore and manipulate two largish text files. Your job is to compress these files (separately) as much as you can, using standard Unix tools and scripting languages like Awk, Perl, Python, PHP, Ruby, and so on; you can't use general-purpose languages like C, C++, or Java. Here are help files for Awk, Perl, and Python that might get you started.
There are many Unix tools that will help too: gzip and bzip2 do general-purpose compression; grep, sed, sort, uniq and wc let you find, count and rearrange things; and of course shells and scripting languages let you do your own processing and combine other pieces.
Your compression must be lossless, so the original data can be reconstituted exactly. If you remove something redundant, you have to be able to recreate it. Similarly, if you rearrange any part of the data, for example by sorting, you have to be able to recover what you started with. Your techniques should also be expected to achieve similar compression on an input file that has the same structure but where some of the data values might be different.
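To make the reversibility point concrete, here is a minimal sketch (the sample lines are hypothetical stand-ins): a sort becomes reversible if each line is first tagged with its original line number, and the tags are used to put everything back.

```shell
# A tiny sample file standing in for real input (contents are hypothetical)
printf 'SMITH 1.006\nJOHNSON 0.810\nWILLIAMS 0.699\n' > sample.txt

# Compress side: tag each line with its line number, then sort by content
# so that similar lines become adjacent for a later pass of bzip2.
awk '{print NR, $0}' sample.txt | sort -k2 > rearranged.txt

# Uncompress side: sort numerically on the tag, then strip it off.
sort -k1,1n rearranged.txt | sed 's/^[0-9]* //' > restored.txt

cmp sample.txt restored.txt && echo recovered exactly
```

The same tag-sort-untag pattern works for any rearrangement, at the cost of the tags themselves, which bzip2 will squeeze well because they are so regular.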
The file names.txt (3.1 MB) contains nearly 89,000 surnames from the 1990 census, sorted by frequency of occurrence. The original data and background information can be found at the Census Bureau.
Compress this file as much as you can, using only standard Unix tools and scripting languages. You must also provide the procedure for uncompressing. For calibration, I squeezed it by a factor of 12.5 with a dozen lines of Awk, so a factor of 12 should be within range; the best ever was about 14.2 if I remember right.
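As one example of the kind of Awk trick involved (a sketch, not the intended solution; the sample lines, and the assumption that the last field is simply the rank, i.e. the line number, are illustrative): any field that can be recomputed from the rest of the line or from its position can be deleted on the way out and regenerated on the way back.

```shell
# Hypothetical three-line excerpt: name, frequency, cumulative frequency,
# and rank, where the rank is assumed to equal the line number.
printf 'SMITH 1.006 1.006 1\nJOHNSON 0.810 1.816 2\nWILLIAMS 0.699 2.515 3\n' > sample.txt

# Compress side: drop the rank entirely.
awk '{print $1, $2, $3}' sample.txt > stripped.txt

# Uncompress side: regenerate the rank from the line number.
awk '{print $1, $2, $3, NR}' stripped.txt > rebuilt.txt

cmp sample.txt rebuilt.txt && echo rank recovered
```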
It's probably easiest to do your experiments by focusing first on a compression script, but always keeping in mind the process by which you will uncompress. Once you think the compression is good enough, collect all the compression operations into a script that can be run by a single command, and all the uncompression operations into another script.
You must submit two scripts c1 and u1 that compress and uncompress respectively: c1 creates a single compressed file from a single input file, and u1 creates an uncompressed file from a single compressed input file. We will test your work by running just those two scripts on the census data. After
c1 names.txt >compressed.form
u1 compressed.form >newnames.txt
newnames.txt must match names.txt exactly.
4,736 Sat Feb 23 11:30:05 EST 2013
Write a script c2 to compress the second file, crank.txt, as much as you can. Again, c2 must be accompanied by an uncompression script u2 that will restore the file to its original state; each script reads a file and writes to stdout. For calibration, I got a factor of 18.6 without too much work.
I have weeded out some glitches in the raw data (like the time a couple of years ago when the rank went to 4,130,947 for a single hour). You can assume that any minor changes we make for testing will preserve the important aspects of the data like order, number of fields, range of data, format of lines, and so on. But if you find something really weird in the data, like time going backwards, let us know; we're not trying to trick you with flaky data.
There are no right answers for this assignment. We're looking for a combination of straightforward ideas and an occasional flash of insight, expressed as an orderly set of programs that do the job automatically. There are many possible approaches that you can mix and match, so play with it over a period of time rather than doing it in a single sitting; this gives your subconscious a chance to contribute. The basic idea of compression is to remove redundancy -- things that are repeated or similar to others -- and to use fewer bits per item for items that occur often. Bzip2 does a great job of that. What you can do that bzip2 can't is take advantage of semantic relationships, and you can also rearrange the input so long as you do it reversibly.
We will compute the compression factor like this:
c1 names.txt >compressed.form
factor = bytes in names.txt / (bytes in compressed.form + bytes in c1 + bytes in u1)
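A sketch of how to compute that factor yourself, assuming names.txt, compressed.form, c1, and u1 all sit in the current directory:

```shell
# Sizes in bytes: the first line of wc output is names.txt, the next
# three are compressed.form, c1, and u1 (the fifth is wc's total line).
wc -c names.txt compressed.form c1 u1 |
awk 'NR==1 {orig=$1} NR>=2 && NR<=4 {sum+=$1}
     END {printf "factor = %.2f\n", orig/sum}'
```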
Standard tools like sort, bzip2, Awk, Python, etc., can be assumed to exist everywhere and are not included in any size computation.
It's probably unkind of us, but comments, blank lines, etc., in your programs count as part of the size. It won't be a big effect, however. The "Check" button in the submission scripts at Dropbox will report your compression factors if your scripts work.
Your scripts should work on data that is quite similar to this, for example, the same data set a year ago or a year from now.
You do not have to use Awk or any other particular tool.
The credit will be divided more or less evenly between the two files. You should be able to get a factor of 12 for names.txt and 18 for crank.txt without undue strain. Try to do better than that; there won't be enormous credit for doing worse than my sleazy hacks.
Details are not final, but roughly you will get 50% for matching my factors of 12 and 18. We will then give you an additional score that is proportional to how well you do compared to the best. Thus if you get the best compression, your score is 100% on that part. If you get halfway between me and the best, you get 75%.
If on the other hand you don't do as well as I do, we will scale down your score in proportion, using bzip2 as the baseline. If you get 8 where the nominal midpoint is 12, and bzip2 gets 5, you'll get (8-5) / (12-5), or about 43%. (These details are subject to change as we see how things go, so don't get too wrapped up in them.)
Along with fleeting fame and a good grade, there will be some utterly useless prizes for the best couple of compressions on each data set. Don't kill yourself, however -- this assignment is meant to be a bit of a change of pace. Enjoy!
You can mix and match Unix commands and scripts in scripting languages as you like; the result must be runnable by invoking a single command. This command must be self-contained. It doesn't have to be robust; it just has to work if it's used properly. It should run at a reasonable speed, however -- if it takes more than a couple of minutes, something's wrong, and your script will be killed.
c1 and u1 each take one input file and write to stdout. We will test correctness by
c1 names.txt >compressed.form
u1 compressed.form >newnames
cmp names.txt newnames
The scripts for crank.txt are to behave the same, but must be named c2 and u2.
To make this more concrete, here are working versions of c1 and u1, to show the general format. This achieves a factor of 5.25 merely by discarding multiple blanks; for contrast, bzip2 compresses this file by about 4.95. Thus this would earn a score of about 4%.
# c1: compress names.txt
awk '{print $1,$2,$3,$4}' "$1" | bzip2 -c
# u1: uncompress for names.txt
bzip2 -dc "$1" | awk '{printf("%-14s %5s %6s %6d\n", $1,$2,$3,$4)}'
We will test your code by running in a directory that contains only your code, names.txt and crank.txt. Submit using this Dropbox link.
Please be careful that your scripts do not alter the input test files.