Due midnight, Friday February 15
Sat Feb 9 15:47:43 EST 2008
This assignment has two parts. The first is to extend the regular expression code in Section 9.2 of The Practice of Programming; the code itself can be found here. The second part is to develop a set of good test cases, and a program to run them.
The detailed structure of the RE code in TPOP and the interface it presents to the rest of the program is quite specialized -- the goal was to present the essence as compactly as possible, not to make it easy to generalize. So although the algorithms are fine for our purposes here, the packaging and some of the details have to be different. Basically, we are going to separate RE processing into two pieces: compilation, which checks the RE for valid syntax and creates a suitable internal representation for subsequent matching operations, and execution, which uses that representation to match against a target string.
As much as possible, we want to hide the implementation details, so they can be readily changed without affecting the rest of the program. In an object-oriented language like C++ or Java, one would create a class whose public functions or methods define the operations and whose private functions and variables implement them. With a little care, it's possible to write C in this fashion as well, using an opaque type, which amounts to a pointer to a structure whose definition is hidden. All access to objects of this type is through functions that use the pointer; no other part of the program has access to the implementation. This is the mechanism used in the FILE * pointers of the standard I/O library for C.
The assignment is to add the + and ? operators, shorthands for character classes, a quoting mechanism, and leftmost longest matching that identifies the start and end of a match:
Specifically, you have to
The TPOP code interprets the input regular expression as matching proceeds. For these extensions, it is probably easier to separate compilation from execution. That is, first make a pass over the regular expression to create a representation that is convenient for subsequent matching. This will necessitate creating a data structure to store a compiled regular expression, and modifying the code to use that data structure as it looks for matches. That data structure is the real structure behind your RE type.
Of course the purpose of defining an interface is to make it possible to have different implementations behind and change them at will. Whatever choice you make, it must not affect user code. Your code must work if there is more than one RE in use at the same time. It does not have to be re-entrant (only one caller will be active at one time) but it does have to be able to handle interleaved calls involving multiple regular expressions, which restricts your use of global and static variables.
(
Small update in this paragraph to clarify when
characters are or are not metacharacters.) Use ctype.h
functions like isdigit to determine character types. The six
shorthands listed above are the only special ones. If a \ is
followed by any other character, that character is taken literally, so,
for example, \++ matches a sequence of one or more plus signs.
A ^ not at the beginning and a $ not at the end are
also taken literally, and so are *, + and ?
if they appear in a context where they are not metacharacters. (Think
about the regular expression ***: the first and third
*'s are not metacharacters. The TPOP code works this way
already; if you don't change this part, it will still work.)
We will try to be reasonable about borderline syntax cases, so be careful but not compulsive; interfaces and data structures are the important part of this assignment.
Modified this part 2/7/08 4pm to clarify what functions you can use.
For this part, you have to write a single C file RE.c that contains only your structure and function definitions. The only function names that should be visible outside the file are the five RE functions defined above; no variable names should be externally visible either, and you must not call any printing functions; it's fine to use things like the functions in string.h and ctype.h, and of course malloc and free in your code. (You can verify this by running the Unix command nm on RE.o.) We will test your code by compiling and linking your RE.c with our own driver, using these declarations only:
typedef struct RE RE; RE* RE_compile(char *regexp); int RE_match(RE *re, char *target); char *RE_start(RE *re); char *RE_end(RE *re); void RE_free(RE *re);These come from the header file RE.h that you must downloand; you must #include RE.h in your RE.c and your driver.
Your code must be written in C and compile without any warning messages using the command
gcc -c -pedantic -std=c99 -Wall RE.c
For calibration, my version of RE.c is roughly 175 lines long; if your solution is a lot longer than that, you may be off on the wrong track.
Use the code from TPOP. Do not (repeat: do not) start from scratch. The assignment is to modify existing code, not to start off with your own implementation of a regular expression parser. You will indeed make modest changes within some of the functions and you will add some new functions, but the end result should be structurally very similar.
For the second part of this assignment, we will use a simple testing language. Each test case consists of one line with two or three tab-separated fields. The first field is a regular expression. The second field is a string. If the regular expression is supposed to match the string, the third field is the second field with brackets that mark the matching substring. If there are only two fields, the regular expression does not match the string. There are two special cases: a field consisting of "" is to be treated as empty, and the character @ is to be treated as if it were a single blank (' ').
Thus a typical set of lines might read
x xxx [x]xx x axb a[x]b x abc x* "" [] x*a abxab [a]bxab x+a abxab ab[xa]b \d 123 [1]23 \S @ \s @@@ [@]@@which means "x matches the first x in xxx", "x matches the x in axb", "x does not match abc", "x*" matches the empty string", "x*a matches the initial a in abxab", "x+a matches the xa in abxab", "\d matches the first digit in 123", "\S does not match a single blank", and "\s matches the first of three blanks". (Be sure you understand these examples.)
Your job is to create a bunch of tests and use them to mechanically verify that your program works properly. To simplify life, don't use [ and ] anywhere in your test cases except in the third field. Blank lines are ok. The Awk program check.tests will check whether you've made some careless error in your tests, though it's not very rigorous. Do not submit test files that have not passed this test.
It's up to you how you arrange to run all your tests. I wrote a program called tester.c (a variant of the grep from TPOP) that reads through a file of tests and reports when something unexpected happens; it prints no output unless something goes wrong. It's not very general nor is it a model of good programming but it's enough to run a bunch of tests mechanically. You're welcome to adapt it or just use it as is.
We will test your code by compiling and linking your RE.c with our own tester, using only the functions definitions above, and using our test cases as well as yours.
It is almost always best to attack such a task in stages, making sure each one works before the next one is started. I began with leftmost longest, then added + and ?, then start and end, then converted to a suitable data structure, then added shorthands; your mileage may differ.
There are at least three approaches to "suitable data structure". Your "compilation" phase could just save the regular expression and then interpret metacharacters during matching. You could turn on the 8th bit for each metacharacter and quoted character to distinguish it from a regular character (this works for 7-bit ASCII input, which is fine). You could represent each regular expression character by a structure that holds both the character and an indication of what it is. I've done the last two variants; they seem about equally clean or dirty. The first option will make compilation trivial at the price of probably complicating matching. Your choice.
No matter how you proceed, define suitable functions rather than writing everything in one routine. When you add code for + and ?, use separate functions but call on ones that already exist. Keep it simple: this program does not need to be fast, or save memory, or be the slightest bit clever. As is true with all programming, trying to be fast and/or clever is often a recipe for disaster. When things go wrong, use printf statements, or try a debugger, or both. Here's my one-page help file for gdb.
Talking over the precise meaning of the assignment with friends and the TAs is strongly encouraged. You must write your code yourself, however, so once you start coding, you may no longer discuss programming issues with other students. You may test your code against a friend's implementation of RE and vice versa, but only using his or her object file (.o) and you have to write your own tests, not copy theirs. It's also OK to tell someone that a specific test appears to fail on their implementation.
That is, this is OK:
This is meant to be a sort of arm's-length interoperability check, not a collaboration. We're not going to be compulsive about this, but code and test cases really have to be fundamentally your own work.
Prepare good test cases ahead of time. Test early, test often, know what you are testing. Automate the testing as much as you can; a shell script might make this easier. Steve McConnell's Code Complete has a good chapter on testing. (Advertisement: so does TPOP.) I also wrote a short ;login: article on how Awk is tested; you might skim it to see if it suggests anything useful.
When you are finished, submit your source file RE.c and
your file of test cases re.tests, using the Arizona or Hats command
~cos333/bin/submit 1 RE.c re.tests
You can submit multiple times; we'll grade the latest one.
We'll probably grade this out of 100 somewhat along these lines:
As you can see, core dumps are discouraged, so be sure you don't walk off the end of an array or dereference a null pointer. We also care about style; you must follow the guidelines and general format of Chapter 1 of The Practice of Programming, and your code should look more or less like that.
Please follow the rules on what to submit. It will be a help if you get the filenames right and submit exactly what's asked for. Thanks.