Princeton University
COS 217: Introduction to Programming Systems

Assignment 1: A "De-Comment" Program


Purpose

The purpose of this assignment is to help you learn or review (1) the fundamentals of the C programming language, (2) the details of the "de-commenting" task of the C preprocessor, and (3) how to use the GNU/Unix programming tools, especially bash, emacs, and gcc217.


Rules

Make sure you study the course Policies web page before doing this assignment or any of the COS 217 assignments. In particular, note that you may consult with the course instructors, lab TAs, listserv, etc. while doing assignments, as prescribed by that web page.

However, there is one exception: most assignments have an "on your own" part. You must do the "on your own" part of the assignment completely on your own, without consulting with the course instructors, lab TAs, Piazza, etc., except for clarification of requirements. You might think of the "on your own" part of each assignment as a small open-book take-home exam.

Sometimes the "own your own" part is more challenging than the rest of the assignment. Sometimes it involves learning material that hasn't been covered in the course, or working ahead of the pace of the course.

For this assignment, avoiding the use of global variables (as described below) is the "on your own" part. That part is worth 5 percent of this assignment. In subsequent assignments the "on your own" part will be worth more.


Background

The C preprocessor is an important part of the C programming system. Given a C source code file, the C preprocessor performs three jobs:

The second of those jobs -- the de-comment job -- is more substantial than one might think. For example, when de-commenting a program the C preprocessor must be sensitive to:


The Task

Your task is to compose a C program named decomment that performs a subset of the de-comment job of the C preprocessor, as defined below.


Functionality

Your program should be a Unix filter. A filter is a program that reads characters from the standard input stream, and writes characters to the standard output stream and possibly to the standard error stream. Specifically, your program should (1) read text, presumably a C program, from the standard input stream, (2) write that same text to the standard output stream with each comment replaced by a space, and (3) write error and warning messages as appropriate to the standard error stream. A typical execution of your program from the shell might look like this:

decomment < somefile.c > somefileWithoutComments.c 2> errorsAndWarnings

In the following examples a space character is shown as "s" and a newline character as "n".

Your program should replace each comment with a space. Examples:

Standard Input Stream Standard Output Stream Standard Error Stream
abc/*def*/ghin abcsghin
abc/*def*/sghin abcssghin
abcs/*def*/ghin abcssghin

Your program should define "comment" as in the C90 standard. In particular, your program should consider text of the form (/*...*/) to be a comment. It should not consider text of the form (//...) to be a comment. Example:

Standard Input Stream Standard Output Stream Standard Error Stream
abc//defn abc//defn

Your program should allow a comment to span multiple lines. That is, your program should allow a comment to contain newline characters. Your program should add blank lines as necessary to preserve the original line numbering. Examples:

Standard Input Stream Standard Output Stream Standard Error Stream
abc/*defnghi*/jklnmnon abcsnjklnmnon
abc/*defnghinjkl*/mnonpqrn abcsnnmnonpqrn

Your program should not recognize nested comments. Example:

Standard Input Stream Standard Output Stream Standard Error Stream
abc/*def/*ghi*/jkl*/mnon abcsjkl*/mnon

Your program should handle C string literals. In particular, your program should not consider text of the form (/*...*/) that occurs within a string literal ("...") to be a comment. Examples:

Standard Input Stream Standard Output Stream Standard Error Stream
abc"def/*ghi*/jkl"mnon abc"def/*ghi*/jkl"mnon
abc/*def"ghi"jkl*/mnon abcsmnon
abc/*def"ghijkl*/mnon abcsmnon

Similarly, your program should handle C character literals. In particular, your program should not consider text of the form (/*...*/) that occurs within a character literal ('...') to be a comment. Examples:

Standard Input Stream Standard Output Stream Standard Error Stream
abc'def/*ghi*/jkl'mnon abc'def/*ghi*/jkl'mnon
abc/*def'ghi'jkl*/mnon abcsmnon
abc/*def'ghijkl*/mnon abcsmnon

Note that the C compiler would consider the first of those examples to be erroneous (multiple characters in a character literal). But many C preprocessors would not, and your program should not.

Your program should handle escaped characters within string literals. That is, when your program reads a backslash (\) while processing a string literal, your program should consider the next character to be an ordinary character that is devoid of any special meaning. In particular, your program should consider text of the form ("...\"...") to be a valid string literal which happens to contain the double quote character. Examples:

Standard Input Stream Standard Output Stream Standard Error Stream
abc"def\"ghi"jkln abc"def\"ghi"jkln
abc"def\'ghi"jkln abc"def\'ghi"jkln

Similarly, your program should handle escaped characters within character literals. That is, when your program reads a backslash (\) while processing a character literal, your program should consider the next character to be an ordinary character that is devoid of any special meaning. In particular, your program should consider text of the form ('...\'...') to be a valid character literal which happens to contain the quote character. Examples:

Standard Input Stream Standard Output Stream Standard Error Stream
abc'def\'ghi'jkln abc'def\'ghi'jkln
abc'def\"ghi'jkln abc'def\"ghi'jkln

Note that the C compiler would consider both of those examples to be erroneous (multiple characters in a character literal). But many C preprocessors would not, and your program should not.

Your program should handle newline characters in C string literals without generating errors or warnings. Examples:

Standard Input Stream Standard Output Stream Standard Error Stream
abc"defnghi"jkln abc"defnghi"jkln
abc"defnghinjkl"mno/*pqr*/stun abc"defnghinjkl"mnosstun

Note that a C compiler would consider those examples to be erroneous (newline character in a string literal). But many C preprocessors would not, and your program should not.

Similarly, your program should handle newline characters in C character literals without generating errors or warnings. Examples:

Standard Input Stream Standard Output Stream Standard Error Stream
abc'defnghi'jkln abc'defnghi'jkln
abc'defnghinjkl'mno/*pqr*/stun abc'defnghinjkl'mnosstun

Note that a C compiler would consider those examples to be erroneous (multiple characters in a character literal, newline character in a character literal). But many C preprocessors would not, and your program should not.

Your program should handle unterminated string and character literals without generating errors or warnings. Examples:

Standard Input Stream Standard Output Stream Standard Error Stream
abc"def/*ghi*/jkln abc"def/*ghi*/jkln
abc'def/*ghi*/jkln abc'def/*ghi*/jkln

Note that a C compiler would consider those examples to be erroneous (unterminated string literal, unterminated character literal, multiple characters in a character literal). But many C preprocessors would not, and your program should not.

Your program should detect an unterminated comment. If your program detects end-of-file before a comment is terminated, it should write the message "Error: line X: unterminated comment" to the standard error stream. "X" should be the number of the line on which the unterminated comment begins. Examples:

Standard Input Stream Standard Output Stream Standard Error Stream
abc/*defnghin abcsnn Error:slines1:sunterminatedscommentn
abcdefnghi/*n abcdefnghisn Error:slines2:sunterminatedscommentn
abc/*def/ghinjkln abcsnn Error:slines1:sunterminatedscommentn
abc/*def*ghinjkln abcsnn Error:slines1:sunterminatedscommentn
abc/*defnghi*n abcsnn Error:slines1:sunterminatedscommentn
abc/*defnghi/n abcsnn Error:slines1:sunterminatedscommentn

Your program (more precisely, its main function) should return EXIT_FAILURE if it was unsuccessful, that is, if it detects an unterminated comment and so was unable to remove comments properly. Otherwise it should return EXIT_SUCCESS or, equivalently 0.

Your program should work for standard input lines of any length.

You may assume that the final character in the standard input stream is the newline character. That is, if the standard input stream contains any characters at all, then you may assume that the last one of them is the newline character.

Your program may assume that the backslash-newline character sequence does not occur in the standard input stream. That is, your program may assume that logical lines are identical to physical lines in the standard input stream. So your de-comment program need not perform the first of the three jobs described above in the "Background" section.


Design

Design your program as a deterministic finite state automaton (DFA, alias FSA). The DFA concept is described in lectures, and in this Web page written by Professors Sedgewick and Wayne: http://www.cs.princeton.edu/introcs/java/73dfa/.

Your program should not consist of one large main function. Instead your program should consist of multiple small functions, each of which performs a single well-defined task. In this program you should create one function to implement each state of your DFA, as described in lectures.

Generally, a (large) C program should consist of of multiple source code files. For this assignment, you need not split your source code into multiple files. Instead you may place all source code in a single source code file. Subsequent assignments will ask you to write programs consisting of multiple source code files.

We suggest that your program use the standard C getchar function to read characters from the standard input stream.


Logistics

You should create your program on the hats cluster using bash, emacs, and gcc217.

Step 1: Design a DFA

Express your DFA using the traditional "labeled ovals and labeled arrows" notation. More precisely, use the same notation as is used in the examples from Section 7.3 of the Sedgewick and Wayne book. Let each oval represent a state. Give each state a descriptive name. Let each arrow represent a transition from one state to another. Label each arrow with the character, or class of characters, that causes the transition to occur. We encourage (but do not require) you also to label each arrow with action(s) that should occur (e.g. "print the character") when the corresponding transition occurs.

Express as much of the program's logic as you can within your DFA. The more logic you express in your DFA, the better your grade on the DFA will be.

To properly report unterminated comments, your program must contain logic to keep track of the current line number of the standard input stream. You need not show that logic in your DFA.

Create a textual representation of your DFA in a file named dfa. For example, the last DFA given in Section 7.3 of the Sedgewick and Wayne book could be expressed as this Textual DFA. Use that textual notation.

Step 2: Create Source Code

Use emacs to create source code in a file named decomment.c that implements your DFA.

Step 3: Preprocess, Compile, Assemble, and Link

Use the gcc217 command to preprocess, compile, assemble, and link your program. Perform each step individually, and examine the intermediate results to the extent possible.

Step 4: Execute

Execute your program multiple times on various input files that test all logical paths through your code.

We have provided several files in hats directory /u/cos217/Assignment1:

Copy those files to your project directory, and use them to help you test your program.

You also should test your program against its own source code using a command sequence such as this:

sampledecomment < decomment.c > output1 2> errors1
decomment < decomment.c > output2 2> errors2
diff -c output1 output2
diff -c errors1 errors2
rm output1 errors1 output2 errors2

Step 5: Create a readme File

Use emacs to create a text file named readme (not readme.txt, or README, or Readme, etc.) that contains:

Descriptions of your code should not be in the readme file. Instead they should be integrated into your code as comments.

Your readme file should be a plain text file. Don't create your readme file using Microsoft Word or any other word processor.

Step 6: Submit

Submit your work. Submit your dfa file, your decomment.c file, the files that gcc217 generated from it, and your readme file electronically by issuing these commands on hats:

submit 1 dfa
submit 1 decomment.c
submit 1 decomment.i decomment.s decomment.o decomment
submit 1 readme

We cannot accept your DFA via e-mail. We cannot accept your DFA electronically in any form other than plain text.


Grading

We will grade your work on two kinds of quality: quality from the user's point of view, and quality from the programmer's point of view. From the user's point of view, a program has quality if it behaves as it should. The correct behavior of your program is defined by the previous sections of this assignment specification, and by the behavior of the given sampledecomment program.

From the programmer's point of view, a program has quality if it is well styled and thereby easy to maintain. In part, good style is defined by the rules given in The Practice of Programming (Kernighan and Pike), as summarized by the Rules of Programming Style document. For this assignment we will pay particular attention to rules 1-24.

These additional more course-specific rules apply:

To encourage good coding practices, we will deduct points if gcc217 generates warning messages.


Avoiding Global Variables

Suppose a program contains functions f and g, and that f calls g. Further suppose that f wishes to pass some values to g. f can do that using ordinary parameters.

Now suppose g wishes to pass some values back to f, its caller. g can pass the first value back via its return value. But how can g pass additional values back to f?

One approach is to use global variables, where a global variable is one which is defined outside of all functions. g could assign the additional values to global variables; f then could fetch those values by accessing the global variables.

However, as prescribed by Kernighan and Pike style rule 25, and for reasons that we will discuss later in the course, generally you should avoid using global variables. Instead all communication of values into and out of a function should occur via the function's parameters and its return value.

Indeed in your decomment program you should avoid using global variables. You can do that using either of these two approaches:

Some notes:


This assignment was written by Robert M. Dondero, Jr. with input from many other faculty members