COS 333 Assignment 1 (Spring 2018): Regular Expression Matching

Due midnight, Friday, February 16

Thu Feb 8 17:06:45 EST 2018

This assignment is a warmup to make sure you're comfortable with C and Java, that you know how to compile and run programs without using the training wheels of earlier courses, and that you can submit code via the COS 333 submission mechanism.

The assignments in this course are not just graded exercises, but are also meant to teach important aspects of programming and software engineering. For this assignment, those include:

And before we forget, there's one other thing -- your code has to work. Since the class is so large, grading will be entirely automatic, your code won't likely be looked at by humans at all, and your grade will be largely determined by how well you do on our tests. However, we most strongly recommend that you remember and put into practice all that you learned about good style in previous courses. It will help you greatly when your code for this assignment doesn't work.

There's one exception to the hands-off approach. We will routinely run all submissions through MOSS, which is extremely effective in detecting programs that are similar. If your program is deemed interesting by MOSS, we will look at it very carefully, and if appropriate will send our opinions to the Committee on Discipline. Please do not copy anyone else's work or let anyone copy yours; it's simply not worth it.

This assignment has three parts:

  1. Extend the regular expression code in Section 9.2 of The Practice of Programming
  2. Create a comprehensive test suite
  3. Convert your C code to Java

1. Extend regular expressions

The first step is to extend the regular expression code in The Practice of Programming, as found here.

To begin, read the description in the book carefully and be sure you understand it. You might find it helpful to read this paper too. Then download the code and modify match and the other functions so they match this slightly extended class of regular expressions.

The six shorthands listed above are the only special ones. If a \ is followed by any other character, that character is taken literally, so, for example, \\ matches a single backslash and x\+\+ matches a literal x followed by two plus signs. If the last character of the regular expression is a bare \, the regular expression is illegal; thus the regular expression x\\ matches x followed by \, but x\ is illegal. (Further explanation here.)

A ^ not at the beginning and a $ not at the end are taken literally, and so are *, + and ? if they appear in a context where they are not metacharacters. Think about the regular expression ***: the first and third *'s are not metacharacters. The TPOP code works this way already; if you don't change this part, it will still work.

You have to submit a single C file regexp.c that contains only your function definitions. The only name that should be visible outside your regexp.c is match, so all other names must be declared static; that limits their scope to the file where they are defined. The function signature for match is

	int match(const char *re, const char *s); 
match(re, s) returns 1 if re matches s, 0 if it does not, or -1 if there is an error. The only RE error is one ending in a bare \, but if you allocate memory, running out of memory would also return -1.

Your regexp.c must not print anything ever, and it must never exit or abort; it is a grievous error to do any of these.

We will test your code by compiling and linking your regexp.c with our own driver (a straightforward modification of grep.c), using this declaration only:

	int match(const char *re, const char *s); 
and this compile command:
	gcc -pedantic -std=c99 -Wall [our]grep.c [your]regexp.c 

Your code must compile and link without change and without warning messages. Note that match's arguments are const; your code must not change their contents.

For calibration, my regexp.c is about 140 lines long with ordinary formatting, blank lines, etc.; if your solution is a lot longer than this, you may be off on the wrong track. (See the advice section below for more information about data structures.)

To head off endless Piazza questions, your regexp.c must not call things like printf nor include non-standard header files, and you do not need to use eprintf and similar routines from the original in your driver.

Don't get wrapped up in language lawyering; your task is to convert the original code to our specification, not to invent your own definition. Follow the original code. Do not create your own matching algorithm or reinterpret what a regular expression is!

2. Create tests and run them

The only way to maintain your sanity and have any hope of producing correct code is to create a comprehensive set of test cases that you can use easily.

For this part of the assignment, we will use a simple testing language. Each test case consists of one line with three fields. The fields are separated by tab characters and the fields themselves do not contain tabs. The first field is a regular expression. The second field is a string. The third field is 1 if the regular expression matches the string, 0 if not, and -1 if the RE is illegal. Blank lines in the test file are ignored.

Thus a typical set of lines might be:

	\	\	-1	(illegal RE)
	x\	x\	-1	(illegal RE)
	x	xx	1
	x	axb	1
	x	ab	0
	x*		1
	x*	y	1
	x+		0
	x+a	abxab	1
	x?x?x	xx	1
	*	*	1
	*	?	0

Be sure you understand these!

Your job is to create at least 50 tests that probe every aspect of the regular expression matching code and use them to verify that your program works properly. Put them in a file called tests.txt.

It's not required for the assignment, but it is strongly recommended that you automate the process of running all the tests from something like a program or a shell script. I wrote simple C and Java programs that read tests.txt and call match, but there are plenty of other ways to do this.

Steve McConnell's Code Complete has a good chapter on testing. (Advertisement: so does TPOP.) I also wrote a short article on how Awk is tested; you might skim it to see if it suggests anything useful.

3. Translate your code to Java

You have to implement a class Regexp with one public method:

	int match(String re, String s)

	String regexp, str;
        Regexp re = new Regexp();
        int m = re.match(regexp, str);
re.match(re, s) returns 1 if re matches s, 0 if not, and -1 if there is an error.

As with the C version, your class must not expose any other members, must never print anything, and must never throw exceptions.

The C version makes good use of pointers; if there is a match between the current text character and the current regular expression character, it recursively checks the rest of the text against the rest of the regular expression, using pointers to refer to them. Java has no pointers (references are pointers in disguise, but you can't do very much with them), so you will have to use string indices (String.charAt) and/or substrings (String.substring). I have tried it both ways; most recently I used substrings everywhere except for looking at individual characters.

Use Character.isDigit, .isWhitespace and .isLetterOrDigit. [clarification added 2/7]

For calibration, my Regexp.java is about 100 lines long with ordinary formatting, blank lines, etc.; if your solution is a lot longer than this, you may be off on the wrong track. (See the advice section below for more information about data structures.)

4. Advice

It is always best to attack a complex task in stages, making sure each one works before the next one is started. Do the C version first, completely, adding tests as you go along. I started by isolating matching into regexp.c, then added the ? metacharacter, then +, then the shorthands, testing after each stage.

Define suitable functions rather than writing everything in one routine. When you add code for + and ?, use separate functions but call on ones that already exist where possible. Localize shorthand matching in one function.

There are several ways to represent and interpret shorthands in the regular expression:

  1. Interpret the RE verbatim.
  2. Process the shorthands first into a better representation before doing any matching. Since this program only has to work with 7-bit ASCII, you can use the 8th bit to encode metacharacterness.
  3. Alternatively, define a struct (C) or class (Java) that encodes a literal or shorthand character of the RE; the two members are a type and a character appropriate to that type.

(1) is how the original works. The shorthands make it trickier because it's hard to keep the indexing straight as you deal with backslashes. My C version uses (3) and my Java version uses (2). Generally, (2) is less code (that's why my Java version is shorter than my C version) and I think it's easier than (3) but your mileage may vary.

No matter what path you choose, keep it simple: this program does not need to be fast, or save memory, or be the slightest bit clever. As is true with all programming, trying to be fast and/or clever is a recipe for disaster. If things go wrong, use printf or master a debugger, or both. I personally prefer printf, but it will never hurt to know gdb. Here's my one-page cheat sheet on gdb, and the much more extensive GDB reference and tutorial from COS 217.

You are strongly advised to create good test cases and figure out a way to run your tests automatically. If you do this early, your program will work sooner and your life will be much better.

5. Submission

Submit your regexp.c, Regexp.java, and tests.txt using the CS dropbox at http://dropbox.cs.princeton.edu/COS333_S2018/asgn1. You can submit multiple times; each submission is checked and we'll grade the last one that arrives. When you submit, a check script runs your tests and our tests through your RE code, so it will give you some indication of how well your program appears to work, but it's a sanity check, not a comprehensive test. The grading scripts are significantly more rigorous.

As fair warning, you can expect significant deductions for failing to meet the specification, on the order of 25-50% penalties for failure to compile, infinite loops, spurious output, or failure to follow the API (using the same standards as in COS 126 and COS 226). Aside from that, you will lose some small number of points for each of our tests that your code fails, each of your tests that is incorrect, and each case we catch that your tests do not cover. You can be assured, however, that you cannot earn a negative score.

The usual lateness penalties will then be assessed if applicable.