Hints for Assignment 2

Mon Feb 18 09:32:20 EST 2019

The Awk interpreter reads an awk program, breaks it into syntactic tokens like ++ and while with lex.c, parses it according to the grammar in awkgram.y, generating a parse tree that is then interpreted by functions in run.c, using supporting functions in other source files.

Spend a bit of time figuring out the basic structure of the program. Use grep and the search functions of your editor to find out where things are defined and used. Make small changes and see what happens.
 

trim is a built-in function, so you have to figure out how existing built-in functions are handled. You might look for a math or string-handling function first, and see how those work. You have to add a new reserved word (where are reserved words defined?), a new defined constant that corresponds to that word (where are existing constants defined?), and new semantics in the same place where the other built-in functions are handled (what function is that?).
 

For new flavor of for, you have to find all the places where the new statement requires a change.

Lexical analysis is in lex.c; you're adding three new keywords, so they go into the table. Make sure you heed the comment at the top of the table!

The AWK grammar needs new rules that specify the syntax of the new for statement and creates the right kind of node in the parse tree. This means two additions in the grammar proper, plus some declarations at the front. A keyword identified by the lexer, say from, is handled as a symbolic constant with a name like FROM.

In the grammar, the components of the right hand side of a rule are referred to as $1, $2, etc. Each has a type that is set at the top of the grammar; the possible types are defined in the %union declaration and used later in the grammar to indicate the proper type of an element on the parsing stack. These types keep the C compiler happy. The result of a rule is called $$. By default, $$ is set to $1, but if you want to pass some different value, you have to assign it explicitly, e.g., $$ = $3. In this assignment, you shouldn't have to do anything beyond careful copying.

You need to write a new function that defines the semantics. Find the code that implements the for statement, then copy, paste, and edit it carefully; you will find it helpful to look at the adjacent instat function as well as the existing standard for. Functions are all declared in proto.h, so you will need to add the name of your new function to that.

AWK parses an awk program and creates a parse tree made up of Nodes in the interior and Cells at the leaves; Cells correspond to values, but they hold a variety of things, identified by type fields and often lied about with casts, which is certainly not a model of good software design. The values in Cells are usually set by setsval and setfval, and retrieved by getsval and getfval, but these ostensible interfaces are often bypassed in an effort to go faster (misplaced efficiency) or to do an operation that is more complicated.

During execution, AWK starts at the root of the parse tree and calls the appropriate semantic routine from run.c for each Node encountered; this is highly recursive since most Nodes have children. The function execute(Node*) is the focal point; it decides what function to call by indexing into an array of function pointers created by maketab in proctab.c and compiled into proctab. Return values from functions determine the results, including some control flow constructs like break that require early termination of loops.

The link between the type of syntactic object (e.g., FOR) stored in a node and the function that is called when that node is to be executed is set by a table. The tricky bit is that the table is defined in a C program in maketab.c, and created by compiling and running maketab to make another C file (proctab.c) that is then compiled. This is all handled by the makefile, but you have to remember to update maketab.c or your function won't get called.

Parse tree Nodes are created during parsing by functions named stat1(), stat2(), etc., distinguished by the number of children they have. The new for is fairly parallel to the existing ones, but it has more children; the parse tree nodes for the new for statements are bigger than any others, so a couple of new auxiliary functions have to be added as well.

AWK uses an intricate and fragile scheme of temporary Cells to hold variables, constants, and the like during computation; managing those correctly is a pain, so you must follow the existing patterns precisely or you will create a memory leak or worse. These temporary cells are returned by most semantic functions. Find a built-in that is semantically similar to trim and modify its code.

Somewhat the same comments apply to string management. Although there was originally a string abstraction, it's too often violated. But roughly, getsval(Cell*) returns the string stored in the Cell, but it's not your copy, so if you want to mutate it, you have to make a copy (allocating space with tostring()) and work on that.

All numbers in Awk are of type Awkfloat, which is a double. The functions getfval() and setfval() handle numeric values for Cells.