COS 333 Assignment 1: Regular Expression Matching

Due Thursday February 14 at 10:00 pm

Mon Feb 4 14:27:14 EST 2019

Introduction

This assignment is a warmup to make sure that you haven't forgotten C programming since COS 217 (you did take 217, didn't you?), that you know how to compile and run C and Java programs, and that you can submit code via the COS 333 submission mechanism.

The assignments in this course are not just graded exercises, however, but are also meant to teach important aspects of programming and software engineering. For this assignment, those include:

And there's one other minor thing -- your code has to work. Since the class is large, grading for assignments will be entirely automatic, your code won't likely be looked at by humans at all, and your grade will be determined by how well you do on our automated tests. However, we strongly recommend that you remember and put into practice all that you learned about good style in previous courses. It will help you when your code doesn't work.

There's one exception to the hands-off approach. We will routinely run all submissions through systems like MOSS, which are extremely effective in detecting programs that are similar. If your program is deemed interesting by MOSS or a similar tool, we will look at it very carefully, and if appropriate will send our opinions to the Committee on Discipline.

Please do not copy anyone else's work or let anyone copy yours; it's simply not worth it.

Assignment

glob   /ɡläb/     noun     a lump of a semiliquid substance

In ancient Unix times, filename regular expression matching ("wildcards") was done not in the shell itself, but in a separate program called glob. This lives on in various nooks and crannies -- csh and tcsh have a shell built-in command called glob that does this operation -- and a variety of programming languages have functions or operators that match regular expressions of this class, for instance, the glob modules in Python and Javascript, and Like in Visual Basic and SQL, each with its own syntax. (See the Wikipedia article on Glob.)

This assignment has three parts.:

  1. Modify the regular expression code in Section 9.2 of The Practice of Programming to process regular expressions of this class.
  2. Translate your C version into Java that does exactly the same thing.
  3. Develop a compact specification of a set of test cases, and a program that will run the tests from the specification.

The detailed structure of the RE code in TPOP and the interface it presents to the rest of the program is quite specialized -- the goal for the book was to present the essence of pattern-matching as compactly as possible, not to make it easy to generalize. So although the basic algorithm is fine for our purposes here, the packaging and some significant details have to be different. As always, we want to hide the implementation details, so they can be readily changed without affecting the rest of the program.

1. Modify the regular expression code

To begin, read the description in the book carefully and be sure you understand it. Then download the code and modify match and the other functions as necessary so they match the new class of regular expressions:

  1. A ? matches any single character; thus a?b matches aab, axb, etc., but not ab or axxb.
  2. A * matches any sequence of zero or more characters; thus a*b matches ab, axb, etc., but not a or b.
  3. A backslash \ quotes the next character of the RE, turning off any special meaning it might have had. Thus \*\\ matches *\ and *\? matches anything that ends with ?. If there is a bare backslash at the end of the RE (that is, not followed by another character), that is an illegal RE. For example, \ and \*\ are both illegal.
  4. No other characters have any special meaning; in particular, characters like ^ and $ that are special in grep are not special in glob.
  5. Matches are "anchored": they must match the entire input string as if the regular expression had been surrounded by ^ and $ in grep. Thus a*z matches any string that begins with a and ends with z, including az, but it does not match aza or zaz.

You have to modify match to handle this class of regular expressions. Your int match(char *re, char *s) must return 1 if the regular expression re matches the string s, and 0 if not; it returns −1 if there is any kind of error, such as an illegal regular expression. match does not print anything ever, nor does it ever exit or abort. It must not modify its input arguments, and it must free any memory it allocates.

The only illegal REs are ones that end with a bare \; that is, \ and \\\ are illegal, but \\ is legal. The empty regular expression is legal and matches (only) the empty string. If you make only straightforward modifications to the existing code, odd cases like a**b will just work.

For this part, you have to submit a single C file regexp.c that contains only your function definitions. The only name that should be visible outside the file is match, so everything else must be declared static. (You can use the nm command to verify visibility.) We will test your code by compiling and linking your regexp.c with our own driver, using this declaration only:

int match(char *regexp, char *text); 
and this compile command:
gcc -Wall -Wextra our_grep.c your_regexp.c 

Your code must compile and link without change and without warning messages.

For calibration, my regexp.c is 43 lines long including comments and blank lines. If your solution is much longer than this, you are probably off on the wrong track. My grep.c for testing is about 15 lines long.

To head off Piazza questions, note that your regexp.c must not call functions like printf nor include non-standard header files, and your driver does not need to use eprintf and similar routines from the original.

Don't get wrapped up in language lawyering; the task is to convert the original code to our definition, not to invent your own definition. Follow the original code. Do not create your own matching algorithm or reinterpret what a regular expression is!

2. Translate your C version into Java

Self-explanatory. Your Java code must behave exactly the same way as your C code does. Naturally both must be correct.

You must submit a file Regexp.java that contains a public class Regexp with a single public static method called match that has this signature:

public static int match(String regexp, String text);
As with the C version, your Java code must not print anything, it must not throw exceptions, and the only public member is the match function. (Note the different meanings of static in C and Java.) We will compile it with
javac our_java.java your_Regexp.java 

The C version makes good use of pointers; if there is a match between the current text character and the current regular expression character, it recursively checks the rest of the text against the rest of the regular expression, using pointers to refer to them. Java has no pointers (references are pointers in disguise, but you can't do very much with them), so you will have to use string indices (String.charAt) and/or substrings (String.substring). I used substrings everywhere except for looking at individual characters, but indices work fine too. My code has just under 40 lines in the RE part itself and my tester is 20 lines.

3. Create tests and run them

For the third part of this assignment, we will use a variant of a testing language for Awk that will be described in class. Each test case consists of one line with three fields separated by @ characters. The first field is a regular expression. The second field is a text string. The third field is 1 if the regular expression should match the string; it is 0 if the regular expression should not match the string; and it is −1 if the first field is an invalid RE. Blank lines and fields beyond the third will be ignored so you can comment your tests and group related ones together.

Thus a typical set of tests might read

\@\@-1@illegal RE
\\\@\\\@-1@illegal RE

x@x@1@easy ones
x@y@0
x@@0
x@xx@0
x@xy@0

\x@x@1@quotes
\*@*@1
\*@x@0
\\@\@1
\\@\\@0
x?y@xxy@1@? tests
x?y@xy@0
x??y@xxyy@1
x**y@xy@1@* tests
x**y@x***y@1
x**y@xyz@0
x***@y***@0
Be sure you understand these tests. Then create a bunch of tests of your own and use them to mechanically verify that your program works properly.

It's up to you how you arrange to run all your tests so long as it's automated. I wrote a short Awk script that reads through a file of tests and creates a shell script that runs all the tests through a C program. The shell script reports when something unexpected happens but otherwise prints no output. Here's the essence of the C program, which you could adapt or ignore utterly.

int main(int argc, char *argv[])
{
	char *re = argv[1];
	char *str = argv[2];
	int expect = atoi(argv[3]);
	int result = match(re, str);
	if (result != expect) {
		printf("error %s %s %d %d\n", re, str, expect, result);
	}
	return 0;
}

We will test your code by compiling and linking your regexp.c and regexp.java with our own drivers, using only the definitions of match above, and using our test cases in addition to yours. Don't forget the ancillary requirements like no visible names, no memory leaks, no exceptions, etc.

To head off other problems, make sure that your files follow the Unix convention of terminating lines with \n only; no carriage returns. And watch out for so-called "smart" quotes that are not ASCII. Learn to use tools like od or xxd that will display such weirdnesses.

4. Advice

It is always best to attack such a task in stages, making sure each one works before the next one is started. For example, you might split the original grep.c into two files, one of which contains match() and its dependencies, and one that uses match() but does not know about the implementation. Then handle the basics of the glob algorithm itself. Write some tests and automate running them so that it's easy to check that your basic code is right. Do easy tests before harder ones. Then add backslash processing and add some more tests. Get the C program working perfectly before undertaking the Java translation.

I started with matching of vanilla characters, then added metacharacters * and ?, then figured out backslashes, then isolated matching into regexp.c, testing after each stage. Your approach may differ, but do not try to do it all at once.

Define suitable functions rather than writing everything in one routine. Keep it simple: this program does not need to be fast, or save memory, or be the slightest bit clever. As is true with all programming, trying to be fast and/or clever is a recipe for disaster. If things go wrong, use printf or master a debugger, or both. Here's my one-page cheat sheet on gdb.

Talking over the precise meaning of the assignment with friends and the TAs is encouraged. You must write your code yourself, however, so once you start coding, you may no longer discuss programming issues with other students. You may run your test cases through anyone else's implementation, but only by using his or her object file (regexp.o) and you have to write your own tests, not copy theirs. It's also OK to tell someone that a specific test appears to fail on their implementation. That is, this is OK:

  • Jack compiles his regexp.c and sends his regexp.o to Jill.
  • Jill runs her tests using her driver and Jack's regexp.o.
  • If one of Jill's tests fails, and she is convinced that it should have worked, she can tell Jack "This apparently valid test case doesn't work right with your implementation."
  • Jack and Jill together decide whether the test is valid. Both can submit it.

    This is meant to be a sort of arm's-length interoperability check, not a collaboration. We're not going to be up tight about this, but code and test cases really have to be fundamentally your own work.

    Prepare good test cases ahead of time; your tests ought to cover essentially all the lines of code in your program. You are strongly advised to set up your testing process early, so that you can easily run all your tests after each change you make. Test early, test often, know what you are testing. Steve McConnell's Code Complete has a good chapter on testing. (Advertisement: so does TPOP.) I also wrote a short note on how AWK is tested; you might skim it to see if it suggests anything useful.

    5. Submission

    Submit your regexp.c, Regexp.java, and retests.txt using the CS TigerFile system at http://tigerfile.cs.princeton.edu/COS333_S2019/asgn1. You can submit multiple times; each submission is checked and we'll grade the last one that arrives. When you submit, a check script runs your tests and some of our tests through your RE code, so it will give you an indication of how well your program appears to work, but it's a sanity check, not a comprehensive test. The grading scripts are significantly more rigorous.

    (To anticipate another genre of Piazza questions, no, we will not tell you specifically what test(s) you are failing. If you create minimal and orthogonal tests and use binary search, you can narrow things down yourself quite well.)

    Fair warning: you can expect significant deductions for failing to meet the specification, on the order of 25-50% penalties for failure to compile, infinite loops, spurious output, or failure to follow the API (using the same standards as in COS 126 and COS 226). Aside from that, you will lose some small number of points for each of our tests that your code fails, each of your tests that is incorrect, and each case we catch that your tests do not cover. You can be assured, however, that you cannot earn a negative score.

    The usual lateness penalties will then be assessed if applicable.