Last update: Sat Feb 10 21:12:38 EST 2001

AWK Testing

Brian Kernighan
Princeton University and Bell Labs
bwk@bell-labs.com

Introduction

This page describes the testing strategies that have evolved over the past 15 years for maintaining The One True AWK. In an attempt to keep the program working, and to maintain our sanity as bugs are fixed and the language changes slowly, we have developed a large number of ad hoc and systematic tests, and tools for running them. At the moment, there are somewhat over 1000 tests, which can be run automatically by a single command.

This description is mainly for a software engineering course, to illustrate one pragmatic approach to testing a small but important real program over a very long time; we also hope that the tests themselves will be helpful to developers of other versions of AWK (all five of them), and perhaps of interest to others. And of course if you would like to contribute new and better tests, we'd be happy to hear from you.

Test Cases

This section describes general classes of test cases, with a few illustrations of each. This is not meant as a complete taxonomy. In total, there are nearly 7000 lines of tests, in more than 350 files. A few of these tests have been borrowed or adapted from others, notably from gawk via Arnold Robbins and Nelson Beebe. We are very grateful to Arnold and Nelson for their contributions.

  • Language features in isolation:
    This includes a wide variety of simple but arbitrary tests of field splitting, input and output, built-in variables and functions, control flow, and so on.
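
    For instance, a minimal field-splitting test might look like this (a sketch in the style of the real test scripts; the file names foo1 and foo2 follow the conventions used below):
    echo '3 b' >foo1
    echo 'a b c' | $awk '{ print NF, $2 }' >foo2
    cmp -s foo1 foo2 || echo 'BAD: NF and field splitting'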

  • Representative small programs,
    such as the very short programs in the first two chapters of the AWK book. For example, the first test is just
    	{ print }
    
    which prints each input line; the second example is
    	{ print $1, $3 }
    
    which prints the first and third fields.

  • Bigger complete programs:
    Examples include the chem preprocessor and the text formatter of Chapter 5 of the AWK book.

  • Specific language areas:
    Some aspects of AWK are themselves almost complete languages, for instance regular expressions, substitution with sub and gsub, and expression evaluation. These can be tested more completely, as we will see below.

  • Boundaries:
    One of the most fruitful places to look for errors is at "boundary conditions". Instances include creating fields past the last one, setting nominally read-only variables like NR or NF, and so on.
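
    To sketch the flavor (a made-up example, not one of the actual test files): assigning past the last field should create the intervening empty fields and update NF.
    echo 'a b' | $awk '{ $5 = "x"; print NF }'
    
    This should print 5.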

  • Timing of basic operations for performance monitoring.
    There are somewhat over a dozen tests that exercise the most fundamental AWK actions: input, field splitting, loops, regular expressions, etc., on largish inputs. The times for old and new versions of the program are compared; this provides a rough check to ensure that no performance bug is inadvertently introduced.
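
    A typical timing test is little more than this, run for both the old and the new version (a sketch; big.data stands for whatever large input the tests generate):
    time $awk '{ n += NF }' big.data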

  • Tests that provoke each error message except for those that "can't happen".

  • Tests for each command-line option.

  • Test cases for identified bugs.
    These are the tests that should have been there all along -- the ones that would have exposed the bug if they had existed. Each time a bug is found, a new set of tests is added to check that it has been fixed.

  • New tests for new features.
    AWK does not change very fast, but new features are added occasionally. Each of these is accompanied by a set of tests that attempt to verify that the feature works properly. For example, command-line variable setting was added and then refined as part of the POSIX standardization process; there are now about 20 tests that exercise this single feature of the language.
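
    A sketch of two such tests (file1 and file2 are hypothetical input files):
    $awk -v x=hello 'BEGIN { print x }'
    $awk '{ print x, $0 }' x=1 file1 x=2 file2
    
    The first should print hello; the second should prefix each line of file1 with 1 and each line of file2 with 2.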

  • Stress tests:
    Very large strings, very long lines, huge numbers of fields and the like are all places where implementations are likely to break. In theory, AWK has no fixed size limits on anything, so this is an attempt to verify proper behavior in this area.
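
    For example, a line with 100000 fields should split properly (a sketch; the real stress tests are bigger and more varied):
    $awk 'BEGIN { for (i = 1; i <= 100000; i++) printf("%d ", i); printf("\n") }' |
    $awk '{ print NF }'
    
    This should print 100000.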

  • Coverage tests:
    We have attempted to create tests that will cause every statement of the program to be executed at least once. Unfortunately, this ideal is extremely hard to achieve, and we have never gotten much past about 80% coverage.

Test Data

The other side of the coin is the data used as input to test cases.

  • Straightforward, orderly, realistic data,
    of the kind that real users provide. Examples include the "countries" file from Chapter 2, or the password file from a Unix system, or the output of commands like who, ls -l, etc., or big text files like the Bible and dictionaries.

  • Boundary-condition data:
    Null inputs, empty files, empty fields, files without newlines at the end or anywhere, files with CRLF, CR only, etc. A recent productive test involved trying all AWK "programs" consisting of a single ASCII character. Some of these are legal (letter, digit, comment, semicolon) and possibly even meaningful (non-zero digit), but most are not. This uncovered two bugs in the lexical analyzer.
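
    A rough sketch of that experiment (shell quoting makes a few characters awkward, so this is only an approximation of what was actually run):
    $awk 'BEGIN { for (i = 33; i < 127; i++) printf("%c\n", i) }' |
    while read c
    do
    	echo "=== $c"
    	$awk "$c" </dev/null
    done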

  • Random input, usually generated by a program.
    A small AWK program generates files with lots of lines containing random numbers of fields of random contents; these can be used for a variety of tests.
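
    A sketch of such a generator (the sizes and ranges here are arbitrary):
    $awk 'BEGIN {
    	srand(1)	# fixed seed, so a failing test can be reproduced
    	for (i = 0; i < 10000; i++) {
    		nf = int(rand() * 20)	# 0 to 19 fields per line
    		for (j = 0; j < nf; j++)
    			printf("%d ", int(rand() * 1000000))
    		printf("\n")
    	}
    }' >random.data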

  • High volume input.
    Big files, big strings, and huge fields all stress a program. Generating them with a program is easiest, but sometimes they can be generated internally; this example creates million-character strings in an attempt to break printf:
    echo 4000004 >foo1
    $awk '
    BEGIN {
    	x1 = sprintf("%1000000s\n", "hello")
    	x2 = sprintf("%-1000000s\n", "world")
    	x3 = sprintf("%1000000.1000000s\n", "goodbye")
    	x4 = sprintf("%-1000000.1000000s\n", "goodbye")
    	print length(x1 x2 x3 x4)
    }' >foo2
    cmp -s foo1 foo2 || echo '^GBAD: T.overflow huge sprintfs'
    

  • Illegal input.
    A standard example is binary data where text is expected. AWK seems to be fairly robust against this kind of assault, though it would be rash to claim that it always is.

Mechanisms

The main goal here is automation: let the machine do the work. There are separate shell scripts for different types of tests, all run from a single master script.

  • Regression tests:
    Run the old and new versions of the program on the same inputs, and compare the outputs:
    oawk=${oawk-awk}
    awk=${awk-../a.out}
    
    echo oawk=$oawk, awk=$awk
    for i
    do
    	echo "$i:"
    	$oawk -f $i test.data >foo1 
    	$awk -f $i test.data >foo2 
    	if cmp -s foo1 foo2
    	then true
    	else echo -n "$i:	BAD^G ..."
    	fi
    	diff -b foo1 foo2 | sed -e 's/^/	/' -e 10q
    done
    

  • Independent implementations:
    This is the same as regression testing, except that the outputs of two independent implementations are compared. For AWK this is easy, since there are several other versions around, including gawk and mawk; the regression script above does the job when oawk is set to one of them (for instance, oawk=gawk).

  • Independent computation of the right answer.
    This is used a lot in the AWK tests: a shell script echoes the right answer to a file, runs the test, compares the results, and prints a diagnostic in case of error. For example, this is one of the tests for I/O redirection:
    awk=${awk-../a.out}
    $awk 'NR%2 == 1 { print >>"foo" }
          NR%2 == 0 { print >"foo" }' /etc/passwd
    diff foo /etc/passwd || echo 'BAD: T.redir (print > and >>"foo")'
    
    This applies the ">" and ">>" output operators to alternate input lines; the result at the end should be that the input file has been copied.

    This example is an extreme test of the function call mechanism; it computes Ackermann's function for several pairs of argument values and compares the results to values computed earlier by a C program:

    $awk '
    function ack(m,n) {
    	k = k+1
    	if (m == 0) return n+1
    	if (n == 0) return ack(m-1, 1)
    	return ack(m-1, ack(m, n-1))
    }
    { k = 0; print ack($1,$2), "(" k " calls)" }
    ' <<! >foo2
    0 0
    1 1
    2 2
    3 3
    3 4
    3 5
    !
    cat <<! >foo1
    1 (1 calls)
    3 (4 calls)
    7 (27 calls)
    61 (2432 calls)
    125 (10307 calls)
    253 (42438 calls)
    !
    diff foo1 foo2 || echo 'BAD: T.func (ackermann)'
    

    Although this kind of test is the most useful, since it is the most portable and least dependent on other things, it is among the hardest to create, especially for large volumes, since each test has to be carefully written out by hand.

  • Use shell scripts or a scripting language (AWK or Perl) to control tests

  • Use specialized languages to generate test cases and assess their results.
    This is the most interesting kind of test. A program can convert a compact specification into a set of tests, each with its own data and answer or other standard of correctness, and run them. AWK tests regular expressions and substitute commands this way. For regular expressions, an AWK program (naturally) converts a sequence of lines like this:
    ^a.$	~	ax
    		aa
    	!~	xa
    		aaa
    		axy
    		""
    
    into a sequence of test cases. In effect, this is a simple language for regular expression tests: it reads
    ^a.$	~	ax	"the pattern ^a.$ matches ax"
    		aa	"and matches aa"
    	!~	xa	"but does not match xa"
    		aaa	"and does not match aaa"
    		axy	"and does not match axy"
    		""	"and does not match the empty string"
    
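    The driver can be sketched like this: with tab as the field separator, a nonempty first field starts a new pattern, a nonempty second field switches between ~ and !~, and each third field is a subject string. (This sketch checks the host AWK directly rather than generating separate test files; regexp.tests is a hypothetical file name.)
    $awk -F'\t' '
    $1 != ""	{ pat = $1 }	# a new pattern starts a group
    $2 != ""	{ op = $2 }	# ~ or !~ carries forward to later lines
    NF >= 3	{
    	s = $3
    	if (s == "\"\"") s = ""	# "" stands for the empty string
    	if ((s ~ pat) != (op == "~"))
    		printf("BAD: %s %s %s\n", pat, op, $3)
    }' regexp.tests
    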

    Another such language describes substitute commands, and a third language describes input and output relations for expressions. The test expression follows the word "try", and after that are inputs and correct outputs; an AWK program generates and runs the tests.

    try { print ($1 == 1) ? "yes" : "no" }
    1	yes
    1.0	yes
    1E0	yes
    0.1E1	yes
    10E-1	yes
    01	yes
    10	no
    10E-2	no
    
    There are nearly 300 regular expression tests, 130 substitution tests, and over 100 expression tests; more are easily added.
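
    A driver for the "try" language can be sketched the same way: turn each specification into shell commands and compare actual with expected output (expr.tests is a hypothetical file name; the quoting is the delicate part):
    $awk '
    /^try/	{ prog = substr($0, 5); next }	# the program under test
    NF == 2	{
    	q = sprintf("%c", 39)	# a single-quote character
    	printf("echo %s | %s %s%s%s >foo2\n", $1, AWK, q, prog, q)
    	printf("echo %s | cmp -s - foo2 || echo BAD: expr test %s\n", $2, $1)
    }' AWK=$awk expr.tests | sh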

  • Use consistency checks within a test case.
    For example,
    	{ i++ }
    	END { if (i != NR) print "error" }
    
    Splitting an input line into fields should produce NF fields:
    	{ if (split($0, x) != NF) print "error" }
    
    Deleting all elements of an array should leave no elements in the array, so this code should print 0 at the end.
    BEGIN {
    	for (i = 0; i < 100000; i++)
    		x[i] = i
    	for (i in x)
    		delete x[i]
    	n = 0
    	for (i in x)
    		n++
    	print n
    }
    
Advice

This section summarizes some of the lessons learned. Most of them are completely obvious and everyone knows them, but in spite of that, they are easy to forget. Further advice may be found in Chapter 6 of The Practice of Programming.

Mechanize. This is the main lesson. The more automated your testing process, the more likely it is that you will run it routinely and often. And the more that tests and test data are generated automatically from compact specifications, the easier it will be to extend them. For AWK, the single command REGRESS runs all the tests. The process takes a couple of minutes. It produces several hundred lines of output, but most of it consists just of the filenames that are printed as tests progress. Having this large and easy-to-run set of tests has saved us from much embarrassment -- it's all too easy to think that some fix to the program is benign, when in fact something has been broken. The tests find such problems with high probability.

Make test output self-identifying. You have to know what tests ran and especially which ones caused error messages, core dumps, etc.

Make sure you can reproduce a test that fails. Reset random number generators and files and anything else that might inadvertently preserve state from one test to the next. Each test should start with a clean slate.

Add a test for each bug. Better tests would have caught the bug in the first place; at the least, a new test should keep you from having to find the same bug again.

Add tests for each new feature or change. While the new thing is fresh is a good time to figure out how to test whether it works correctly; presumably there was some testing anyway, so make sure it's preserved.

Never throw away a test. A corollary to the previous point.

Check your tests and scaffolding. It's easy to get into a rut and assume that your tests are working because they produce the expected (i.e., mostly empty) output. Go back from time to time and take a fresh look -- data files may no longer be appropriate, or may have changed underfoot. (In preparing to write this note, we found that the "big" data set we thought we were using had somehow mutated into a tiny one.) Paths to programs and data may have changed, and you could be testing the wrong things.

Make your tests portable. Tests should run on more than one system; otherwise, it's too easy to miss errors in both your tests and your programs. Commands like the shell, built-ins (or not) like echo, search paths for commands, and the like are all potentially different on different machines, and just because something works in one place is no assurance that it will work elsewhere.

Make sure that your tester reports progress. Too much output is bad, but there has to be some. The AWK tests report the name of each file being tested; if something seems to be taking too long, this is a clue about where the problem is.

Watch out for things that break. Make the test framework robust against the many things that can go wrong: infinite loops, tests that prompt for user input, tests that print spurious output, and tests that don't really distinguish success from failure.

Downloading Test Cases and Data

We'll put something here eventually: what's in the file(s), how to run things, how to deal with portability issues on different systems.