Hints for Assignment 2

Mon Feb 13 12:06:27 EST 2017

The Awk interpreter reads an awk program, breaks it into syntactic tokens like ++ and while with lex.c, parses it according to the grammar in awkgram.y, generating a parse tree that is then interpreted by functions in run.c, using supporting functions in other source files.

Spend a bit of time figuring out the basic structure of the program. Use grep and the search functions of your editor to find out where things are defined and used. Compile first without any changes to make sure that works. Then make small changes and see what happens.

Adding the new comment convention requires straightforward fiddling of the lexical analyzer. The lexer has a lookahead mechanism that you can use to test what single character is coming next before actually reading it, so there's no need to back up. That will help you identify // after you see /.

trim will be a new built-in function, so you have to figure out how existing built-in functions are handled. You might look for a vaguely similar string-handling function first, and see how it works. You have to add a new reserved word (where are reserved words defined?), a new defined constant that corresponds to that word (where are existing constants defined?), and new semantics in the same place where the other built-in functions are handled (what function is that?).

For until and the variant for statement, you have to find all the places where the new statements require a change.

Lexical analysis is in lex.c; you're adding a new keyword, so it goes into the table. Make sure you heed the comment at the top of the table!

The AWK grammar needs new rules that specify the syntax of an until statement and creates the right kind of node in the parse tree. There are two places in the grammar proper where something has to be added, plus some declarations at the front. The keyword identified by the lexer, say until, is handled as a symbolic constant with a name like UNTIL.

In the grammar, the components of the right hand side of a rule are referred to as $1, $2, etc. Each has a type that is set at the top of the grammar; the possible types are defined in the %union declaration and used later in the grammar to indicate the proper type of an element on the parsing stack. These types keep the C compiler happy. The result of a rule is called $$. By default, $$ is set to $1, but if you want to pass some different value, you have to assign it explicitly, e.g., $$ = $3. In this assignment, you shouldn't have to do anything beyond careful copying.

You need to write new functions that define the semantics. Find the code that implements the while statement, then copy, paste, and edit it carefully. Functions are all declared in proto.h, so you will need to add the name of your new function to that. Similarly, find the function that handles the right for loop and modify it to process either case.

AWK parses an awk program and creates a parse tree made up of Nodes in the interior and Cells at the leaves; Cells correspond to values, but they hold a variety of things, identified by type fields and often lied about with casts, which is certainly not a model of good software design. The values in Cells are usually set by setsval and setfval, and retrieved by getsval and getfval, but these ostensible interfaces are often bypassed in an effort to go faster (misplaced efficiency) or to do an operation that is more complicated (an incomplete API).

During execution, AWK starts at the root of the parse tree and calls the appropriate semantic routine from run.c for each Node encountered; this is highly recursive since most Nodes have children. The function execute(Node*) is the focal point; it decides what function to call by indexing into an array of function pointers created by maketab in proctab.c and compiled into proctab. Return values from functions determine the results, including some control flow constructs like break that require early termination of loops.

The link between the type of syntactic object (e.g., UNTIL) stored in a node and the function that is called when that node is to be executed is set by a table. The tricky bit is that the table is defined in a C program in maketab.c, and created by compiling and running maketab to make another C file (proctab.c) that is then compiled. This is all handled by the makefile, but you have to remember to update maketab.c or your function won't get called.

Parse tree Nodes are created during parsing by functions named stat1(), stat2(), etc., distinguished by the number of children they have. An until is really the same as a while: it has two children, the statements to be repeated and the condition that terminates the loop. The new for statement differs only by having one extra child, so it's easiest to simply create a bigger Node, where one child might sometimes be empty.

AWK uses an intricate and fragile scheme of temporary Cells to hold variables, constants, and the like during computation; managing those correctly is a pain, so you must follow the existing patterns precisely or you will create a memory leak or worse. These temporary cells are returned by most semantic functions. Find a built-in that is semantically close to trim and modify its code.

Somewhat the same comments apply to string management. Although there was originally a string abstraction, it's too often violated. But roughly, getsval(Cell*) returns the string stored in the Cell, but it's not your copy, so if you want to mutate it, you have to make a copy (allocating space with tostring()) and work on that.

All numbers in Awk are of type Awkfloat, which is a double. The functions getfval() and setfval() handle numeric values for Cells. You probably won't need to use this fact.