COS 333 Assignment 1: Regular Expression Matching in Java

Due midnight, Friday February 15

Sun Feb 10 10:28:03 EST 2013

This assignment has two parts. The first part is to convert the regular expression code in Section 9.2 of The Practice of Programming from C into Java, and add some new features. The second part is to develop a set of good test cases and write a program to run them automatically.

The detailed structure of the RE code in TPOP and the interface it presents to the rest of the program is quite specialized -- the goal was to present the essence as compactly as possible, not to make it easy to generalize. So although the algorithms are fine for our purposes here, the packaging and some of the details have to be different.

As much as possible, we want to hide the implementation details, so they can be readily changed without affecting the rest of the program. In a C library, one uses opaque types: the implementation reveals only a pointer to an anonymous data type and a set of functions; everything else is declared static in the implementation files and is thus hidden. (Think of the FILE * pointer in the standard I/O library.) In Java and other object-oriented languages, one creates a class whose public functions or methods define the legal operations. In this exercise, you'll convert the C code into a Java class Regexp with a suitable constructor and methods for matching and reporting on where a match occurred.

1. Add regular expression extensions

You have to implement a class Regexp with these public methods:

Note that for a match of the empty string both start and end will have the same value. If there is no match, both start and end are -1.

To complicate life, you also have to implement two extensions, as in Assignment 0:

Any string is a legal regular expression. The empty regular expression matches all inputs. As in the original, ^ and $ are not special except at the beginning and end respectively. A repetition like *, + or ? is not special at the beginning nor directly after another repetition; that is, x+?y matches one or more x's followed by a literal ? followed by y.

Don't get wrapped up in language lawyering; the task is to convert the code from C to Java, not to invent your own regular expression definition. Follow the original code. Do not under any circumstances create your own matching algorithm or reinterpret what a regular expression is!

The TPOP code interprets the regular expression as matching proceeds, and that is likely to work best here too. But you do have to keep track of where a match begins and ends, and record that in a data structure so that subsequent calls to start() and end() work. It's up to you how you do this, but your code must work if there is more than one Regexp in use at the same time. It does not have to be re-entrant (only one caller will be active at one time) but it does have to be able to handle interleaved calls involving multiple regular expressions, which means no static variables. Finally, you must use Java's String and char data types so that the code will work with Unicode inputs.

For calibration, my version of Regexp.java is about 110 lines in the Regexp class proper, compared to about 45 in the original C version. If your solution is a lot longer than that, you are likely off on the wrong track.

2. Create tests and run them

For this part of the assignment, we will use a simple testing language. Each test case consists of one line with three fields. The fields are separated by @ characters and the fields themselves do not contain @ characters. The first field is a regular expression. The second field is a string. If the regular expression matches the string, the third field is the second field with brackets that mark the matching substring. If the regular expression does not match the string, then the third field contains no brackets, that is, the third field is identical to the second field. Blank lines in the test file are ignored.

Thus a typical set of lines might be:

   x@xxx@[x]xx
   x@axb@a[x]b
   x@abc@abc
   x*@@[]
   x*a@abxab@[a]bxab
   x+a@abxab@ab[xa]b
   x?x?x@xx@[xx]
which means:
x matches the first x in xxx
x matches the x in axb
x does not match abc
x* matches the empty string
x*a matches the initial a in abxab
x+a matches the xa in abxab
x?x?x matches the xx in xx

Be sure you understand why!

Your job is to create a bunch of tests and use them to verify mechanically that your program works properly. Blank lines are ok. The Awk program check.tests will check whether you've made some careless error in your tests, though it's not very rigorous. Do not submit test files that have not passed this test.

The REtest program reads a file of tests from stdin. For each input line:

Thus if all tests are correct and the RE algorithm is right, REtest will not print any output. My REtest.java is about 50 lines, a Java variant of the grep from TPOP.

Steve McConnell's Code Complete has a good chapter on testing. (Advertisement: so does TPOP.) I also wrote a short article on how Awk is tested; you might skim it to see if it suggests anything useful.

3. Advice

The C version makes good use of pointers; if there is a match between the current text character and the current regular expression character, it recursively checks the rest of the text against the rest of the regular expression, using pointers to refer to them. Java has no pointers (references are pointers in disguise, but you can't do very much with them), so you will have to use string indices (String.charAt) and/or substrings (String.substring). The two approaches are almost identical in code size; each makes some aspects easier and others harder. Your choice, but be systematic.

Start with the original C source, and convert to Java one step at a time; make sure that each step works before the next one is started. Keep it simple: this program does not need to be fast, or save memory, or be the slightest bit clever.

Talking over the precise meaning of the assignment with friends, the TAs, and Piazza is encouraged. You must write your code yourself, however, so once you start coding, you may no longer discuss programming issues with other students. You may run your test cases through anyone else's implementation, but only using his or her Regexp.class file and you have to write your own tests, not copy theirs. It's also OK to tell someone that a specific test appears to fail on their implementation.

That is, this is OK:

  • Jack compiles his Regexp.java and sends Regexp.class to Jill.
  • Jill runs her tests using her driver and Jack's Regexp.class.
  • If one of Jill's tests fails, and she is convinced that it should have worked, she can tell Jack "This apparently valid test case doesn't work right with your implementation."
  • Jack and Jill together decide whether the test is valid. Both can submit it.

    This is meant to be a sort of arm's-length interoperability check, not a collaboration. We're not going to be compulsive about this, but code and test cases really have to be fundamentally your own work.

    4. Submission

    When you are finished, submit your source and tests using the CS Dropbox. You have to submit 3 files: Regexp.java, REtest.java and REtests.txt (containing at least 50 tests). You can submit multiple times; we'll grade the last one that arrives. We will give you some indication of how well your program appears to work when you submit.

    Please follow the rules on what to submit. It will be a big help to us and to your grade if you get the filenames and format right and submit exactly what's asked for. Thanks.

    5. Grading

    We will compile your Regexp.java with our own driver, and run our tests through that. We will also compile your REtest.java and run our tests through that. We will also assess your tests and your tester.

    It won't be possible for us to inspect your code manually. Instead we will grade your code based upon its behavior only. The good news: the code that you submit need not be well-styled -- although we strongly suggest that you use good style for your sake. The bad news: if your code fails tests we won't inspect your code manually to assign partial credit. So please make sure your code compiles and behaves properly. The only exceptions should be I/O (e.g., file not found in REtest.java), and those should be caught with a message.