Princeton University
COS 217: Introduction to Programming Systems

Assignment 1: A "De-Comment" Program

Purpose

The purpose of this assignment is to help you learn or review (1) the fundamentals of the C programming language, (2) the details of the "de-commenting" task of the C preprocessor, and (3) how to use the GNU/UNIX programming tools, especially bash, xemacs, and gcc.

Background

The C preprocessor is an important part of the C programming system. Given a C source code file, the C preprocessor performs three jobs:

  1. Merge "physical" lines of source code into "logical" lines. That is, when the preprocessor detects a line that ends with the backslash character, it merges that physical line with the next physical line to form one logical line.
  2. Remove comments from ("de-comment") the source code.
  3. Handle preprocessor directives (#define, #include, etc.) that reside in the source code.

The de-comment job is substantial. For example, the C preprocessor must be sensitive to:

Your Task

Your task is to compose a C program named "decomment" that performs a subset of the de-comment job of the C preprocessor, as defined below.

Functionality

Your program should be a UNIX "filter." That is, your program should read characters from the standard input stream, and write characters to the standard output stream and possibly to the standard error stream. Specifically, your program should (1) read text, presumably a C program, from the standard input stream, (2) write that same text to the standard output stream with each comment replaced by a space, and (3) write error and warning messages as appropriate to the standard error stream. A typical execution of your program from the shell might look like this:

decomment < somefile.c > somefilewithoutcomments.c 2> errorandwarningmessages

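In the following examples, a space character is shown as "s" and a newline character as "n". Those markers appear only between words and at the ends of lines; the letters "s" and "n" inside words such as "mno" and "inside" are ordinary letters.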

Your program should replace each comment with a space. Examples:

Standard Input Stream       Standard Output Stream  Standard Error Stream
abc/*def*/ghin              abcsghin
abc/*def*/sghin             abcssghin
abcs/*def*/ghin             abcssghin

Your program should define "comment" as in the C89 standard. In particular, your program should consider text of the form (/* ... */) to be a comment. It should not consider text of the form (// ... ) to be a comment. Example:

Standard Input Stream       Standard Output Stream  Standard Error Stream
abc//defn                   abc//defn

Your program should allow a comment to span multiple lines. That is, your program should allow a comment to contain newline characters. Your program should add blank lines as necessary to preserve the original line numbering. Examples:

Standard Input Stream       Standard Output Stream  Standard Error Stream
abc/*defnghi*/jklnmnon      abcsnjklnmnon
abc/*defnghinjkl*/mnonpqrn  abcsnnmnonpqrn

Your program should detect nested comments, and generate warning messages when they occur. Specifically, your program should write the message "Warning: line X: comment end outside of comment" or "Warning: line X: comment start inside of comment" as appropriate to the standard error stream. "X" is the number of the line containing the offense. Examples:

Standard Input Stream       Standard Output Stream  Standard Error Stream
abc/*def*/ghi*/jkln         abcsghi*/jkln           Warning:slines1:scommentsendsoutsidesofscommentn
abc/*def/*ghi*/jkln         abcsjkln                Warning:slines1:scommentsstartsinsidesofscommentn
abc/*denf/*ghi*/jnkl*/mnon  abcsnjnkl*/mnon         Warning:slines2:scommentsstartsinsidesofscommentn
                                                    Warning:slines3:scommentsendsoutsidesofscommentn
abc*/*def*/ghin             abc*sghin               Warning:slines1:scommentsendsoutsidesofscommentn
abc/*def/*/ghin             abcsghin                Warning:slines1:scommentsstartsinsidesofscommentn

Your program should handle C string literals. In particular, your program should not consider text of the form (/* ... */) that occurs within a string literal ("...") to be a comment. Examples:

Standard Input Stream       Standard Output Stream  Standard Error Stream
abc"def/*ghi*/jkl"mnon      abc"def/*ghi*/jkl"mnon
abc/*def"ghi"jkl*/mnon      abcsmnon
abc/*def"ghijkl*/mnon       abcsmnon

Similarly, your program should handle C character literals. In particular, your program should not consider text of the form (/* ... */) that occurs within a character literal ('...') to be a comment. Examples:

Standard Input Stream       Standard Output Stream  Standard Error Stream
abc'def/*ghi*/jkl'mnon      abc'def/*ghi*/jkl'mnon
abc/*def'ghi'jkl*/mnon      abcsmnon
abc/*def'ghijkl*/mnon       abcsmnon

Note that the C compiler would consider the first of those examples to be erroneous (multiple characters in a character literal). But many C preprocessors would not, and your program should not.

Your program should handle newline characters in C string literals without generating errors or warnings.

Standard Input Stream       Standard Output Stream  Standard Error Stream
abc"defnghi"jkln            abc"defnghi"jkln
abc"defnghinjkl"mnon        abc"defnghinjkl"mnon

Note that a C compiler would consider those examples to be erroneous (newline character in a string literal).  But many C preprocessors would not, and your program should not.

Similarly, your program should handle newline characters in C character literals without generating errors or warnings.

Standard Input Stream       Standard Output Stream  Standard Error Stream
abc'defnghi'jkln            abc'defnghi'jkln
abc'defnghinjkl'mnon        abc'defnghinjkl'mnon

Note that a C compiler would consider those examples to be erroneous (multiple characters in a character literal, newline character in a character literal). But many C preprocessors would not, and your program should not.

Your program should handle unterminated string and character literals without generating errors or warnings. Examples:

Standard Input Stream       Standard Output Stream  Standard Error Stream
abc"def/*ghi*/jkln          abc"def/*ghi*/jkln
abc'def/*ghi*/jkln          abc'def/*ghi*/jkln

Note that a C compiler would consider those examples to be erroneous (unterminated string literal, unterminated character literal, multiple characters in a character literal). But many C preprocessors would not, and your program should not.

Your program should detect an unterminated comment. If your program detects end-of-file before a comment is terminated, it should write the message "Error: line X: unterminated comment" to the standard error stream. "X" should be the number of the line on which the unterminated comment begins.

Standard Input Stream       Standard Output Stream  Standard Error Stream
abc/*defnghin               abcsnn                  Error:slines1:sunterminatedscommentn
abcdefnghi/*n               abcdefnghisn            Error:slines2:sunterminatedscommentn
abc/*def/ghinjkln           abcsnn                  Error:slines1:sunterminatedscommentn
abc/*def*ghinjkln           abcsnn                  Error:slines1:sunterminatedscommentn
abc/*defnghi*n              abcsnn                  Error:slines1:sunterminatedscommentn
abc/*defnghi/n              abcsnn                  Error:slines1:sunterminatedscommentn
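
Reporting the line on which an unterminated comment begins, rather than the line on which end-of-file is reached, means remembering the starting line for as long as the comment stays open. The following is a minimal sketch of that bookkeeping only. The function name unterminatedCommentLine and its string-based interface are hypothetical, chosen so the idea can be shown in isolation. The sketch deliberately ignores string and character literals and does not report nested comments.

```c
#include <assert.h>
#include <stddef.h>

/* Return the number of the line on which an unterminated comment
   begins in the string pcText, or 0 if every comment in pcText is
   terminated. Read from and write to no stream. Use no global
   variables. Literals are ignored in this sketch. */
int unterminatedCommentLine(const char *pcText)
{
   int iLine = 1;       /* number of the line being scanned */
   int iStartLine = 0;  /* line of the open comment; 0 if none */
   size_t i;

   assert(pcText != NULL);

   for (i = 0; pcText[i] != '\0'; i++)
   {
      if (pcText[i] == '\n')
         iLine++;
      else if (iStartLine == 0
               && pcText[i] == '/' && pcText[i + 1] == '*')
      {
         iStartLine = iLine;  /* remember where the comment began */
         i++;                 /* consume the star as well */
      }
      else if (iStartLine != 0
               && pcText[i] == '*' && pcText[i + 1] == '/')
      {
         iStartLine = 0;      /* the comment is now terminated */
         i++;                 /* consume the slash as well */
      }
   }
   return iStartLine;
}
```

In a character-at-a-time filter the same two counters would be updated as each character is read, and the starting line would be printed if end-of-file arrives while a comment is still open.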

Your program should work for standard input lines of any length.

Your program may assume that the backslash-newline character sequence does not occur in the standard input stream. That is, your program may assume that logical lines are identical to physical lines in the standard input stream.

Your program may assume that the backslash-doublequote character sequence does not occur within string literals. That is, your program may assume that text of the form ("...\" ...") does not appear in the standard input stream. Similarly, your program may assume that the backslash-quote character sequence does not occur within character literals.  That is, your program may assume that text of the form ('...\'...') does not appear in the standard input stream.

You may assume that the final line of the standard input stream ends with the newline character, as files created with xemacs typically do.

Design

We strongly recommend that you design your program as a deterministic finite state automaton (DFA). The DFA concept is described in Section 7.3 of the book Introduction to CS (Sedgewick and Wayne). That book section is available through the web at http://www.cs.princeton.edu/introcs/73fsa/. Your grade may suffer if you do not use the DFA concept.
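
To make the DFA idea concrete, here is a deliberately partial sketch with four states and one function per state. It is not a solution to the assignment: it handles only the replacement of comments by a space and the preservation of newlines inside comments, and it omits string literals, character literals, line counting, warnings, and the unterminated-comment error. The state names, the function names, and the use of stream parameters are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* The four states of this partial DFA. */
enum State {NORMAL, SLASH, IN_COMMENT, STAR};

/* Handle character c read in the NORMAL state. Write c to psOut
   unless it is a slash, which is held back because it might begin
   a comment. Return the next state. */
static enum State handleNormal(int c, FILE *psOut)
{
   if (c == '/')
      return SLASH;
   putc(c, psOut);
   return NORMAL;
}

/* Handle character c read just after a held-back slash. Write to
   psOut either the space that replaces a new comment or the
   held-back slash. Return the next state. */
static enum State handleSlash(int c, FILE *psOut)
{
   if (c == '*')
   {
      putc(' ', psOut);
      return IN_COMMENT;
   }
   putc('/', psOut);
   if (c == '/')
      return SLASH;
   putc(c, psOut);
   return NORMAL;
}

/* Handle character c read inside a comment. Write only newlines
   to psOut, to preserve line numbering. Return the next state. */
static enum State handleInComment(int c, FILE *psOut)
{
   if (c == '\n')
      putc('\n', psOut);
   return (c == '*') ? STAR : IN_COMMENT;
}

/* Handle character c read just after a star inside a comment.
   Write only newlines to psOut. Return the next state. */
static enum State handleStar(int c, FILE *psOut)
{
   if (c == '/')
      return NORMAL;
   if (c == '\n')
      putc('\n', psOut);
   return (c == '*') ? STAR : IN_COMMENT;
}

/* Read characters from psIn until end-of-file and write them to
   psOut with each comment replaced by a space. Use no global
   variables. */
void decommentSketch(FILE *psIn, FILE *psOut)
{
   enum State eState = NORMAL;
   int c;

   assert(psIn != NULL);
   assert(psOut != NULL);

   while ((c = getc(psIn)) != EOF)
   {
      switch (eState)
      {
         case NORMAL:
            eState = handleNormal(c, psOut);
            break;
         case SLASH:
            eState = handleSlash(c, psOut);
            break;
         case IN_COMMENT:
            eState = handleInComment(c, psOut);
            break;
         case STAR:
            eState = handleStar(c, psOut);
            break;
      }
   }

   /* A slash held back at end-of-file was ordinary text. */
   if (eState == SLASH)
      putc('/', psOut);
}
```

A main() function could call decommentSketch(stdin, stdout). A complete design would need additional states for the literal forms and extra bookkeeping for the messages described in the Functionality section.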

Generally, a (large) C program should consist of multiple source code files. For this assignment, you need not split your source code into multiple files. Instead you may place all source code in a single source code file. Subsequent assignments will ask you to write programs consisting of multiple source code files.

We suggest that your program use the standard C getchar() function to read characters from the standard input stream.
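
One detail of reading characters is easy to get wrong: getchar() returns an int, not a char, so that the special value EOF can be told apart from every real character. The sketch below shows the usual read loop. Because getchar() is equivalent to getc(stdin) and putchar(c) is equivalent to putc(c, stdout), the sketch uses getc and putc with stream parameters so that it can be exercised on any stream. The function name copyChars is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Copy characters from psIn to psOut until end-of-file. Return
   the number of characters copied. Use no global variables. */
long copyChars(FILE *psIn, FILE *psOut)
{
   int c;            /* int, not char, so EOF is distinguishable */
   long lCount = 0;

   assert(psIn != NULL);
   assert(psOut != NULL);

   while ((c = getc(psIn)) != EOF)
   {
      putc(c, psOut);
      lCount++;
   }
   return lCount;
}
```

Calling copyChars(stdin, stdout) from main() gives the simplest possible filter, one that writes its input unchanged.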

Logistics

You should create your program on hats using bash, xemacs and gcc.

Step 1: Create Source Code

Use xemacs to create source code in a file named decomment.c.

Step 2: Preprocess, Compile, Assemble, and Link

Use the gcc command with the -Wall, -ansi, and -pedantic options to preprocess, compile, assemble, and link your program. At least once, perform each step individually, and examine the intermediate results to the extent possible.

Step 3: Execute

Execute your program multiple times on various input files that test all logical paths through your code.

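We have provided several files in hats directory /u/cos217/Assignment1. You should copy those files to your project directory, and use them to help you test your decomment program. In particular, you can compare the behavior of your program with that of the given sampledecomment program using commands similar to these: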

sampledecomment < somefile.c > output1 2> errors1
decomment < somefile.c > output2 2> errors2
diff output1 output2
diff errors1 errors2
rm output1 errors1 output2 errors2

The UNIX diff command finds differences between two given files. The executions of the diff command shown above should produce no output. If the command "diff output1 output2" produces output, then sampledecomment and your program have written different characters to the standard output stream. Similarly, if the command "diff errors1 errors2" produces output, then sampledecomment and your program have written different characters to the standard error stream.

Step 4: Create a readme File

Use xemacs to create a text file named "readme" that contains:

Descriptions of your code should not be in the readme file. Instead they should be integrated into your code as comments.

Step 5: Submit

Submit your work electronically on hats via the command:

/u/cos217/bin/i686/submit 1 decomment.c readme

If the directory /u/cos217/bin/i686 is in your PATH environment variable, then you can abbreviate that command as:

submit 1 decomment.c readme

If you are using the bash shell and have copied files .bashrc and .bash_profile from the /u/cos217 directory to your HOME directory, then directory /u/cos217/bin/i686 indeed is in your PATH environment variable. You can examine your PATH environment variable by executing the command "printenv PATH".

Grading

We will grade your work on two kinds of quality: quality from the user's point of view, and quality from the programmer's point of view. To encourage good coding practices, we will compile using "gcc -Wall -ansi -pedantic" and take off points based on warning messages.

From the user's point of view, a program has quality if it behaves as it should. The correct behavior of the decomment program is defined by the previous sections of this assignment specification, and by the behavior of the given sampledecomment program.

From the programmer's point of view, a program has quality if it is well styled and thereby simple to maintain. In part, style is defined by the rules given in The Practice of Programming (Kernighan and Pike), as summarized by the Rules of Programming Style document. For this assignment we will pay particular attention to rules 1-24. These additional rules apply:

Names: You should use a clear and consistent style for variable and function names.  One example of such a style is to prefix each variable name with characters that indicate its type. For example, the prefix "c" might indicate that the variable is of  type char, "i" might indicate int, "pc" might mean pointer to char, "ui" might mean unsigned int, etc.  But there are other clear and readable styles you could use, which don't necessarily include the type of a variable in its name, as long as they result in clear and readable programs.

Comments: Each source code file should begin with a comment that includes your name, the number of the assignment, and the name of the file.

Comments: Each function -- especially the main() function -- should begin with a comment that describes what the function does. (The comment should not describe how the function works.) It should do so by explicitly referring to the function's parameters and return value. The comment also should state what, if anything, the function reads from the standard input stream or any other stream, and what, if anything, the function writes to the standard output stream, the standard error stream, or any other stream. Finally, the function's comment should state which global variables the function uses or affects. In short, a function's comments should describe the flow of data into and out of the function.
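
As an illustration of such a function comment, consider the sketch below. The function name reportNestedStart and its parameters are hypothetical; only the message text comes from the Functionality section. The comment says what the function does in terms of its parameters, return value, streams, and global variables, and says nothing about how it works.

```c
#include <assert.h>
#include <stdio.h>

/* Write to psErr a warning that a comment start occurs inside a
   comment on line iLine. Return the number of characters written,
   or a negative value if the write fails. Read from no stream.
   Use and affect no global variables. */
int reportNestedStart(FILE *psErr, int iLine)
{
   assert(psErr != NULL);
   assert(iLine >= 1);

   return fprintf(psErr,
      "Warning: line %d: comment start inside of comment\n",
      iLine);
}
```

A call such as reportNestedStart(stderr, 2) would write the warning for line 2 to the standard error stream.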

Function modularity: Your program should not consist of one large main() function. Instead your program should consist of multiple small functions, each of which performs a single well-defined task. For example, you might create one function to implement each state of your DFA.

Line lengths: Limit line lengths in your source code to 72 characters. Doing so allows us to print your work in two columns, thus saving paper.

Special Note

As prescribed by Kernighan and Pike style rule 25, generally you should avoid using global variables. Instead all communication of data into and out of a function should occur via the function's parameters and its return value. You should use ordinary "call-by-value" parameters to communicate data from a calling function to your function. You should use your function's return value to communicate data from your function back to its calling function. You should use "call-by-reference" parameters to communicate additional data from your function back to its calling function, or as bi-directional channels of communication.

However, call-by-reference involves using pointer variables, which we have not discussed yet. So for this assignment you may use global variables instead of call-by-reference parameters. (But we encourage you to use call-by-reference parameters.)

In short, you should use ordinary call-by-value function parameters and function return values in your program as appropriate. But you need not use call-by-reference parameters; instead you may use global variables. In subsequent assignments you should use global variables only when there is no reasonable alternative.