Christopher Moretti
Department of Computer Science
Princeton University

Assignment 1: Regular Java Expressions

Due: 23:59 Friday 2/12/2016

Preface

This assignment is a warmup to make sure you're still comfortable with C and Java, that you know how to compile and run programs, and that you can submit code via Dropbox. Oh, there's one other thing -- your code has to work. Since the class is so large, grading for assignments will be almost entirely automatic: your code won't likely be looked at by humans at all, and your grade will be determined by how well you do on our automated tests. Even so, we strongly recommend that you remember and put into practice all that you learned about good style in previous courses. It will help you greatly in debugging, and in understanding your code when you return to it in a {day, week, year}.

There are two exceptions to the hands-off approach. First, we're not completely cruel: we will do our best to debug disastrous errors that prevent our automated testing from running, but only at considerable cost in terms of points ... thus, you should do everything you can to ensure that your code works. Second, we will routinely run all submissions through systems like MOSS, which are extremely effective in detecting programs that are similar. If your program is deemed interesting by MOSS or a similar tool, we will look at it very carefully, and if appropriate will send our opinions to the Committee on Discipline. Come to office hours, post plaintive cries for help on Piazza, or anything else within reason and acceptable conduct, but please do not copy anyone else's work or let anyone copy yours -- 20+ hours of our time to prepare a case, and a year of your life serving an academic suspension simply aren't worth it. On that note, be sure to examine the Collaboration Policy on the Assignments page posthaste!

 

Specification

This assignment has two parts. The first part is to convert the regular expression code in Section 9.2 of The Practice of Programming from C into Java, and add a few new features. The second part is to develop a set of good test cases and an infrastructure to test your program on these cases automatically. (Automated testing will be a theme in this course -- Steve McConnell's Code Complete has a good chapter on testing, as does TPoP.)

The detailed structure of the regular expression code in TPoP and the interface it presents to the rest of the program are quite specialized -- the goal was to present the essence as compactly as possible, not to make it easy to generalize. So although the algorithms are fine for our purposes here, the packaging and some of the details have to be different.

As much as possible, we want to hide the implementation details, so they can be readily changed without affecting the rest of the program. In a C library, one uses opaque types: the implementation reveals only a pointer to an anonymous data type and a set of functions; everything else is declared static in the implementation files and is thus hidden. In Java and other object-oriented languages, one creates a class whose only public functions or methods are those that define the legal operations, with all other components (functions, methods, member and instance variables) are private. In this exercise, you'll convert the C code into a Java class Regexp with a suitable constructor and methods for matching and reporting on where a match occurred.

You have to implement a public class Regexp with these (and only these) public methods:

          Regexp (String re)       // the constructor: stores the regular expression re for subsequent matching operations.
          boolean match (String s) // returns true if the regular expression matches s, otherwise false.
          int start ()             // returns the index of the first character of the matched substring.
          int end ()               // returns the index one beyond the last character of the matched substring.

Note that for a match of an empty string both start and end will have the same value. If there is no match (either because match has not yet been called or because it returned false) both start and end should return -1.

To begin, read the description in the book carefully and be sure you understand it. You might find it helpful to read this paper too. Download the code from here.

Now convert the code from C to Java and modify match and the other functions so they match the same regular expressions, with these two additions:

Do not change anything else in the matching algorithm beyond what is needed for this assignment.

Any string is a legal regular expression. There are no errors, so the constructor is likely to do little more than record the regular expression string for later use. In a more complex design, it might prepare some efficient representation so that subsequent matching would be more efficient. (You might find it instructive to look at the official Java Pattern and Matcher classes to see how Sun factored the job, but don't use those here!)

The empty regular expression matches all inputs. As in the original, ^ and $ are not "activated" except at the beginning and end respectively. A quantifier like *, +, or ? is not "activated" at the beginning nor directly after another "activated" quantifier; that is, a+?b matches one or more a's followed by a literal ? followed by b. On the other hand, a quantifier is "activated" after a quantifier character that is not "activated". In other words, in isolation or at the beginning of an RE, ** would match 0 or more *s, ?? would match 0 or 1 ?, and *+ would match 1 or more *s.

The TPoP code interprets the regular expression as matching proceeds. But you have to keep track of where a match begins and ends, and record that in a data structure so that subsequent calls to start() and end() work. It's up to you how you do this, but your code must work if there is more than one regular expression in use at the same time. It does not have to be re-entrant (only one caller will be active at one time) but it does have to be able to handle interleaved calls involving multiple regular expressions, which almost surely means no static variables in the class.

 

For the testing part of the assignment, we will use a simple testing language. Each test case consists of one line with three fields. The fields are separated by @ characters and the fields themselves do not contain @ characters. Thus, a test with anything other than exactly two @ characters is invalid. The first field is a regular expression. The second field is a string. If the regular expression matches the string, the third field is the second field with brackets that mark the matching substring. If the regular expression does not match the string, then the third field contains no brackets, that is, the third field is identical to the second field. Blank lines in the test file are ignored. Thus a typical set of valid lines might be:

            x@xxx@[x]xx
            x@axb@a[x]b
            x@abc@abc
            x*@@[]
            x*a@abxab@[a]bxab
            x+a@abxab@ab[xa]b
            x?x?x@xx@[xx]

indicating:

Be sure you understand these examples, especially where they involve empty strings!

Your job is to create a comprehensive of tests and use them to verify mechanically that your program works properly. Blank lines are ok. The shell script program checktests.sh will execute an AWK program to check whether you've made some careless error in your tests, though it's not very rigorous. Do not submit test files that have not passed this minimal test.

The REtest program reads a sequence of tests from stdin (NB: that's the standard input stream, not the 126/226 infrastructure; you should be using java.io.*, System.{in,out}.*, or something similar to do your I/O). For each input line:

Thus if all tests are correct and the RE algorithm is right, REtest will not print any output. This facilitates easy verification of automatic testing.

 

Advice

The C version makes good use of pointers; if there is a match between the current text character and the current regular expression character, it recursively checks the rest of the text against the rest of the regular expression, using pointers to refer to them. Java has no pointers (references are pointers in disguise, but you can't do very much with them), so you will have to use string indices (String.charAt) and/or substrings (String.substring).

It is up to you as to the order of attack -- whether to implement the new functionality in C and then port the entire thing, or to port the extant code and then implement the new functionality in Java only. As for the new functionality, it is likely easiest to start with + and ?, then move on to start and end, but YMMV. But no matter what, one step at a time is always a good strategy: make sure that each step works before the next one is started.

Modularity is your friend: define suitable functions rather than writing everything in one routine. When you add code for + and ?, use separate functions, but where possible call on ones that already exist (wide code cloning is bad practice for innumerable reasons). Do not change anything in the matching algorithm beyond what is needed for this assignment. Keep it simple: this program does not need to be fast, or save memory, or be the slightest bit clever. As is true with all programming, trying to be fast and/or clever is a recipe for disaster.

 

Submission and Evaluation

When you are finished, submit your source and tests using the CS Dropbox link dropbox.cs.princeton.edu/COS333_S2016/One. You have to submit three files: Regexp.java, REtest.java, and REtests.txt (containing at least 50 tests, hopefully of high quality). You can submit multiple times; we'll grade the last one that arrives. We will give you some indication of how well your program appears to work when you submit, by running some tests of our own, but they are not a complete test. Do your own testing; don't rely on our tests to validate your code.

We will compile your Regexp.java with our own driver, and run our tests through that. We will compile your REtest.java and run our tests through that. We will also compile your REtest.java with our Regexp.java. We will test properties of your tests to measure their completeness.

As fair warning, you can expect significant deductions for failing to meet the specification: on the order of 25-50% penalties for failure to compile, infinite loops, spurious output, or failure to follow the API (using the same standards as in COS126 and COS226). Aside from that, you will lose some small number of points for each of our tests that fails, each of your tests that is incorrect, and each case we catch that your tests do not cover. You can be assured that you cannot earn a negative score.