Christopher Moretti
Department of Computer Science
Princeton University

Assignment 5: ... and Serial Number

Due: 23:59 Friday 3/25/2016

Preface

This assignment is an exercise in using scripting languages to explore and manipulate two rather large text files. You will compress these files (separately) as much as you can, while managing one important criterion: your compression must be lossless, so the original data can be reconstituted exactly.

Your tools for this job are scripting languages like AWK, Perl, Python, PHP, Ruby, and so on. Importantly, you can't use compiled languages like C, C++, or Java. As a reminder, here are the help files for AWK and Python from previous assignments. You might also reference the lecture notes on Bash, AWK, and Python.

There are many Unix tools that might also help: gzip, bzip2 and xz do general-purpose compression; grep, sed, sort, uniq and wc let you find, count and rearrange things; and of course shells and scripting languages let you do your own processing and combine other pieces.
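For instance, a couple of quick pipelines like these (a sketch -- the column number is arbitrary) can reveal structure worth exploiting, such as whether every line has the same number of fields or how skewed a column's values are:

# do all lines have the same number of fields?
awk '{print NF}' names.txt | sort | uniq -c
# which values of field 3 occur most often?
awk '{print $3}' names.txt | sort | uniq -c | sort -rn | head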

 

Part 1: In Accordance with the Truth of Things (or, "Art Thou Not 3984 and 3101?")

The file names.txt (5.5 MB) contains nearly 152,000 surnames from the 2010 census, sorted by frequency of occurrence. Background information about the data can be found at the Census Bureau.

Your task is to provide a pair of tools, one for compression and one for uncompression: the former should compress this file as much as possible, while the latter must still recover the original file exactly. You should use only standard Unix tools and scripting languages.

 

For calibration of what "as much as possible" entails, consider the following information. I have a very simplistic 15-line script, mostly AWK, that compresses this file by a factor of 11.50, and a 23-line script, also mostly AWK, that recovers the original. The highest compression factor obtained last year for this file was 12.50, while the 90th percentile was 11.70. (This year's will be different, of course -- hopefully better, as you have a record to aim for!) I will set the midline (used in the grading calculations below) at 11.00; this corresponds to just under the 50th percentile last year.

 

You must submit two scripts c1 and u1 that compress and uncompress respectively: c1 creates a single compressed file, written to standard output, from a single input file given as a command-line argument, and u1 creates an uncompressed file, written to standard output, from a single compressed input file given as a command-line argument. We will test your work by running just those two scripts. The result of u1 must match the original input exactly (for instance, as judged by the cmp command).

To make this more concrete, here are working versions of c1 and u1, to show the general format. This c1 achieves a factor of 5.48 merely by discarding runs of blanks; for contrast, bzip2 alone compresses this file by a factor of about 5.17.

# c1: compress names.txt
# collapse runs of blanks to single spaces, then let bzip2 do the rest
awk '{print $1, $2, $3, $4}' "$1" | bzip2 -c

# u1: uncompress for names.txt
# read via stdin so bunzip2 accepts any filename (and the input is untouched),
# then restore the original fixed-width column layout
bunzip2 -c <"$1" | awk '{printf("%-15s %5s %6s %6d\n", $1, $2, $3, $4)}'
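Whatever transformations you try, a quick round trip through cmp (the same check used in grading, described below) confirms they are lossless; the temporary filenames here are arbitrary:

sh c1 names.txt >tmp.compressed
sh u1 tmp.compressed >tmp.restored
cmp names.txt tmp.restored && echo lossless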

 

Part 2: Victory; A Matter of Staying Power (or, "Rank Hath Its Privileges")

The file crank.txt (4.2 MB) contains the hourly Amazon rank of The C Programming Language from June 14, 2001 to February 18, 2015. Each line gives the rank and the date in standard Unix date form. There is a tab between the rank and the date, and the date looks like this, with exactly one space between adjacent date fields:

4,287	Wed Feb 18 14:30:06 EST 2015
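Because every date is in this one fixed format, each one can be mapped reversibly to a single number. For instance, with GNU date (a sketch -- it assumes GNU coreutils is available on the test machine, and the reverse format string is a guess that would need checking against the real data, especially for single-digit days):

# forward: fixed-format date string -> seconds since the epoch (GNU date)
date -d 'Wed Feb 18 14:30:06 EST 2015' +%s        # 1424287806
# backward: epoch seconds -> the same string, assuming TZ matches the data
TZ=EST5EDT date -d @1424287806 +'%a %b %e %H:%M:%S %Z %Y'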

 

You must submit two scripts c2 and u2 that compress and uncompress respectively: c2 creates a single compressed file, written to standard output, from a single input file given as a command-line argument, and u2 creates an uncompressed file, written to standard output, from a single compressed input file given as a command-line argument. We will test your work by running just those two scripts. The result of u2 must match the original input exactly.

 

For calibration of what "as much as possible" entails, consider the following information. I have a very simplistic 16-line script, mostly AWK, that compresses this file by a factor of 19.37, and a 21-line script, also mostly AWK, that recovers the original. The highest compression factor obtained last year for this file was 22.62, while the 90th percentile was 20.93. (This year's will be different, of course -- hopefully better, as you have a record to aim for!) I will set the midline (used in the grading calculations below) at 19.00; this corresponds to just under the 50th percentile last year.

 

General Requirements for Both Parts

You can mix and match Unix commands and scripts in scripting languages any way you like; the result must be runnable by invoking a single command (your four scripts, {c,u}{1,2}). Each command must be self-contained. It doesn't have to be robust; it just has to work if it's used properly. It should run at a reasonable speed, however -- we will judge this by the Dropbox timeouts: if it takes more than a minute, something's wrong, and your script will be killed. If a tool doesn't run on the CS Dropbox, you can't use it, no matter how useful it might be; the same goes for Python modules and the like. Use the Dropbox as a reasonable test of whether your tools are valid in this environment.

 

Your compression and uncompression techniques should work correctly, and should achieve roughly similar compression, on an input file that has the same structure but where some of the exact data values differ. Examples might be the equivalent census data from 2000 or 2020, or the equivalent Amazon data updated through this year.

 

Advice

The first piece of advice is one that is often heart-wrenching to Princeton students: there are no "right answers" for this assignment! Instead, seek a combination of straightforward ideas, iteration and perseverance across experiments with multiple techniques, and perhaps an occasional flash of insight. There are many possible approaches that you can mix and match, so play with it over a period of time rather than in a single sitting; this gives your subconscious a chance to contribute.

 

The basic idea of compression is to remove redundancy -- things that are repeated or similar to others -- and to use fewer bits per item for items that occur often. Bzip2 does a great job of that, and it is a waste of your time to try to invent special mechanisms that are subsumed by what bzip2 already does. What you can do that bzip2 can't is take advantage of semantic relationships, and you can also rearrange the input so long as you do it reversibly.
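As one concrete instance of such a semantic relationship (a sketch, not a prescription): if a column is a slowly-changing number, storing the differences between consecutive values leaves mostly small, repetitive numbers for bzip2 to chew on, and a running sum undoes the transformation exactly. Assuming a file of one plain number per line:

# encode: replace each number with its difference from the previous one
awk '{print $1 - prev; prev = $1}' numbers.txt >deltas.txt
# decode: a running sum restores the original numbers exactly
awk '{prev += $1; print prev}' deltas.txt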

 

It is probably easiest to do your experiments by focusing first on a compression script, but always keeping in mind the process by which you will uncompress. (Do keep in mind that if you remove something redundant, you have to be able to recreate it. Similarly, if you rearrange any part of the data, for example by sorting, you have to be able to recover what you started with.) Once you think the compression is good enough, collect all the compression operations into a script that can be run by a single command, and all the uncompression operations into another script.
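For instance, a sort can be made reversible by tagging each line with its original position first (a sketch; nl numbers the lines and cut strips the tags back off):

# compress side: remember each line's position, then sort by content
nl -ba input.txt | sort -k2 >tagged.txt
# uncompress side: numeric sort on the tag restores order; cut drops the tag
sort -n tagged.txt | cut -f2-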

 

The data has been slightly massaged to remove some real glitches, like the time a few years ago when the rank went to 4,130,947 for a single hour. If you find something really weird in the data, like time going backwards, let us know; we're not trying to trick you with flaky data (though real data is often very flaky). And before you ask: the fact that the cumulative percentage column in names.txt doesn't match a running sum exactly is not a glitch; it's a result of rounding, and you must deal with it.

 

Submission and Evaluation

Submit using the dropbox link https://dropbox.cs.princeton.edu/COS333_S2016/Five.

The dropbox script will do a relatively straightforward check of your code when you submit. We will test your code by running in a directory that contains only your code, names.txt and crank.txt. Please be careful that your scripts do not alter the input test files. The "Check All Submitted Files" button on Dropbox will report your compression factors if your scripts are correct (lossless).

We will test correctness with a sequence of commands such as these (but do not assume particular filenames for either the input or the output files):

c1 names.txt >compressed.form
u1 compressed.form >newnames
cmp names.txt newnames
c2 crank.txt >compressed.form
u2 compressed.form >newcrank
cmp crank.txt newcrank

 

We will compute the compression factor like this:

c1 names.txt >compressed.form
factor = bytes in names.txt / (bytes in compressed.form + bytes in c1 + bytes in u1)
c2 crank.txt >compressed.form
factor = bytes in crank.txt / (bytes in compressed.form + bytes in c2 + bytes in u2)

Note that, unfortunately, this means that the comments, whitespace, and so on in your scripts count as part of the size. In practice this won't have a big effect. If you would like to gun for absolute minimization, we strongly suggest developing well-commented, readable code and minifying only as a separate, final step.
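While developing, you can mirror the formula above with a one-liner along these lines (a sketch; wc -c counts bytes and bc -l does the division):

sh c1 names.txt >compressed.form
echo "$(wc -c <names.txt) / ($(wc -c <compressed.form) + $(wc -c <c1) + $(wc -c <u1))" | bc -l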

 

The credit will be divided evenly between the two parts of the assignment. For each part, the score breakdown is as follows, with exact details described below:

          40% (20 points): Submission and Correctness
          30% (15 points): Baseline-to-Midline Performance
          30% (15 points): Midline-to-Highline Performance

 

Incorrect programs receive no performance points. If you do not reach the baseline (bzip2), you will receive no performance points.

 

If you reach the midline specified for each part above, you will receive all 15 points in the first set of performance points. If you do not reach the midline, we will scale your score linearly between the baseline (bzip2) and the midline. So, for example, if your factor were 7 where the baseline was 5 and the midline was 11, you would earn ((7-5)/(11-5))*15 = 5 of the 15 points in the first set.

 

Beyond the midline, the second set of performance points will be judged relative to the best compression factors among this semester's submissions. To avoid any perception (or reality) of fostering cutthroat behavior, while still giving you an incentive to maximize your factor, we define "best" as "among the top 10%". Thus, if your factor is in the top 10% for the class, you will earn all the points in the second set. Otherwise, your score will be scaled linearly between the midline and the lowest factor in the top 10%. So, for example, if you got 20 when the midline was 19 and the 90th percentile was 21 (even if the highest factor in the class were 50!), you would earn ((20-19)/(21-19))*15 = 7.5 of the 15 points in the second set.