Princeton University
COS 217: Introduction to Programming Systems

Assignment 1: A File Compression Program

Purpose

The purpose of this assignment is to help you learn or review (1) the fundamentals of the C programming language, and (2) how to use the GNU/UNIX programming tools, especially bash, xemacs, and gcc.

Background

File compression is a common computer application. Often it is necessary to compress files for convenient distribution to users, for fast communication across a network, or to save disk space. So computer scientists have developed many file compression algorithms.

Run-length compression is a simple, but often effective, compression algorithm for text files. The run-length compression algorithm designates some special character, typically the character with all bits on (that is, 25510, alias FF16, alias 3778, alias 111111112) as the "escape" character. If a non-escape character occurs consecutively more than twice in the input steam, the algorithm compresses the occurrences of the character into a three-character "escape sequence" escapechar char count, where escapechar is the escape character, char is the character in the input, and count is the number of consecutive occurrences of the character. Thus the algorithm is effective at compressing files that contain many or long sequences of identical characters.

Your First Task

Your first task is to write a file compression program named squash. Your squash program should use the run-length compression algorithm to compress a given stream of characters.

Your squash program should be a UNIX filter. That is, your squash program should read from standard input and write to standard output. A typical command-line execution of your program might look like this:

squash < file1 > compressedfile

Your squash program should work correctly when an input character is escapechar itself. Hint: Use an escape sequence to encode escape characters.

Note that the maximum value of count is 255. Nevertheless, your squash program should work correctly when an input character occurs consecutively more than 255 times. Hint: Use multiple consecutive escape sequences.

Your program should work for standard input lines of any length.

Your Second Task

Your second task is to write a UNIX filter named unsquash.  Your unsquash program should be the inverse of your squash program. It should read from standard input a character stream that, presumably, has been compressed by your squash program. It should write the original character stream to standard output. A typical command-line execution of your unsquash program might look like this:

unsquash < comressedfile > file2

If your unsquash program detects that standard input contains a malformed squashed stream, then it should stop writing to standard output, write the message "Malformed squashed stream" to standard error, and exit. The stream is malformed if unsquash encounters an escapechar that is not followed by both a char and a count.

Design

You should design your squash program as a deterministic finite state automaton (FSA). The FSA concept will be described in lectures. The FSA concept also is described in Section 7.3 of the book "Introduction to Computer Science," by Drs. Sedgewick and Wayne. That book section is available through the web at http://www.cs.princeton.edu/introcs/73fsa/.

Generally, a program should not consist of one large main() function. Instead a program should consist of multiple small functions, each of which performs a single well-defined task. However, for this assignment properly splitting your programs into multiple functions would involve using call-by-reference -- a technique that we will not have covered completely prior to this assignment's due date. So, you need not split your squash and unsquash programs into multiple functions. (Nevertheless, we encourage you to do so.)

Generally, a (large) C program should consist of of multiple source code (.c) files.  For this assignment, you need not split your source code into multiple files. That is, you may place all source code that comprises the squash program in a single file (named squash.c), and all source code that comprises the unsquash program in a single file (named unsquash.c). Subsequent assignments will ask you to write programs which should consist of multiple source code files.

Limit line lengths in your source code to 78 characters. Doing so allows us to print your code in two columns, thus saving paper.

We suggest that your programs read characters from standard input using the standard C getchar() function, and write characters to standard output using the standard C putchar() function..

Logistics

You should create your program on hats using bash, xemacs and gcc.

Step 1: Create Source Code

Use xemacs to create source code in files named squash.c and unsquash.c.

Step 2: Preprocess, Compile, Assemble, and Link

Use the gcc command with the -Wall, -ansi, and -pedantic options to preprocess, compile, assemble, and link your programs.

Step 3: Execute

Execute your programs multiple times on various input files that test all logical paths through your code.

We have provided several files in hats directory /u/cos217/Assignment1:

chmod 700 grade1
chmod 700 grade1diff
chmod 700 grade1differr

to make the scripts executable.

You should copy those file to your project directory and use them to help you test your programs. You should create additional test files as you see fit.

Note: The bash scripts use the UNIX diff (file difference) command.  The command:

diff somefile1 somefile2

writes to standard output an indication of the differences, at the line level, between somefile1 and somefile2. If somefile1 and somefile2 are identical, then diff produces no output.

Note: The bash scripts also use the UNIX od (octal dump) command. The command:

od -vc somefile

writes to standard output the characters that comprise somefile. It writes non-printable characters as octal numbers.

Step 4: Create a readme File

Use xemacs to create a text file named "readme" that contains:

Descriptions of your code should not be in the readme file. Instead they should be integrated into your code as comments.

Step 5: Submit

Submit your work electronically on hats via the command:

/u/cos217/bin/i686/submit 1 squash.c unsquash.c readme

If the directory /u/cos217/bin/i686 is in your PATH environment variable, then you can abbreviate that command as:

submit 1 squash.c unsquash.c readme

If you are using the bash shell and have copied files .bashrc and .bash_profile from the /u/cos217 directory to your HOME directory, then directory /u/cos217/bin/i686 indeed is in your PATH environment variable. You can examine your PATH environment variable by executing the command "printenv PATH".

Grading

We will grade your work on functionality and design. We will consider understandability to be an important aspect of good design. See the next section for guidelines concerning program understandability. To encourage good coding practices, we will build using "gcc -Wall -ansi -pedantic" and take off points based on warning messages.

Program Understandability

An understandable program: