Princeton University
COS 333: Advanced Programming Techniques

Assignment 1: Regular Expressions in Java and Python

Purpose

The purpose of this assignment is to help you learn or review (1) the Java programming language, (2) the Python programming language, and (3) regular expressions. It also will give you some experience with composing tests and test drivers.

Rules

Make sure you study the COS 333 "Policies" Web page before doing this assignment or any of the course's assignments.

You must work with one partner on this assignment. If the count of students working on the assignment is an odd number, then exactly one team may consist of three students; we will ask that team to implement some features in addition to those that are described in this document.

Only one of the partners should submit files. Your readme file and your source code files should contain your name and your partner's name.

Your Tasks

The assignment has four parts. The first part is to convert the regular expression code in Section 9.2 of Practice of Programming (TPOP) from C into Java, and add a few new features. The second part is to convert that regular expression code to Python and add the same new features. The third part is to develop a set of good test cases. The fourth part is to create drivers that run your regular expression code against your test cases.

The detailed structure of the regular expression code in TPOP and the interface it presents to the rest of the program is quite specialized -- the goal was to present the essence as compactly as possible, not to make it easy to generalize. So although the algorithms are fine for our purposes here, the packaging and some of the details have to be different.

As much as possible, one wants to hide the implementation details, so they can be readily changed without affecting the rest of the program. In a C library, one uses opaque types: the implementation reveals only a pointer to an anonymous data type and a set of functions; everything else is declared static in the implementation files and is thus hidden. In Java and Python, one creates a class whose (public) methods define the legal operations. In this assignment, you'll convert the C code into a Java class with a suitable constructor and a method for matching and reporting on where a match occurred. You also will do the same using Python.

Part 1. Implement Regular Expressions in Java

Define a class named RegExp in a file named, of course, RegExp.java. The class should have this structure:


public class RegExp
{
   // Create a new RegExp object using regular expression re.
   public RegExp(String re) {...}

   // Return null if this RegExp object does not match txt. Otherwise
   // return an array of two ints. The first int is the (zero-based)
   // index within txt at which the match begins. The second int is the
   // (zero-based) index within txt that is one-beyond the index at
   // which the match ends. If this RegExp object was created from the
   // empty string, then return the array {0, 0}.
   public int[] match(String txt) {...}

   ...
}

The class also can (and should) define fields and private methods.

Also implement two extensions to the basic TPOP mechanism:

Any string is a legal regular expression; there are no errors, so the RegExp constructor never detects an error. In your implementation, the constructor is likely to do nothing more than store the regular expression string for later use; in a more complex design, the constructor might compile the given regular expression into some representation so that subsequent matching would be more efficient. (You might also find it instructive to look at the official Java Pattern and Matcher classes to see how Sun factored the job, but don't use those here!)

The empty regular expression matches the beginning of any string, even the empty string. As in the original, ^ and $ are not special except at the beginning and end respectively. A repetition like *, + or ? is not special at the beginning nor directly after another repetition; that is, a+?b matches any number of a's followed by a literal ? followed by b.

Don't get wrapped up in language lawyering; the task is to convert the code from C to Java, not to invent your own regular expression definition. Follow the original code. Do not under any circumstances create your own matching algorithm or reinterpret what a regular expression is!

The TPOP code interprets the regular expression as matching proceeds, and that is likely to work best here too. But you do have to keep track of where a match begins and ends. It's up to you how you do this, but your code must work if there is more than one RegExp object in use at the same time -- which means no static fields. Finally, you must use Java's String and char data types; that means using BufferedReader for reading input.

For calibration, without comments my version of RegExp.java consists of about 100 lines, compared to about 45 in the original C version. If your solution is a lot longer than that, you are likely off on the wrong track.

Part 2. Implement Regular Expressions in Python

Define a class named RegExp in a file named regexp.py. The class should have this structure:


class RegExp (object):

   # Create a new RegExp object using regular expression re.
   def __init__(self, re):
       ...

   # Return None if this RegExp object does not match txt. Otherwise
   # return a tuple of two ints. The first int is the (zero-based)
   # index within txt at which the match begins. The second int is the
   # (zero-based) index within txt that is one-beyond the index at
   # which the match ends.  If this RegExp object was created from the
   # empty string, then return the tuple (0, 0).
   def match(self, txt):
       ...

   ...

The class also can (and should) define fields and additional methods.

The Python RegExp class should have essentially the same functionality as the Java RegExp class, as the above comments indicate.

For calibration, without comments my version of regexp.py consists of about 80 lines.

Part 3. Create Tests

For the third part of this assignment, we will use a simple testing language. Each test case consists of one line with three fields. Each field is surrounded by '<' and '>' characters. Assume that the fields themselves contain neither '<' nor '>' characters. The first field is a regular expression. The second field is a string. If the regular expression matches the string, the third field is the second field with brackets that mark the matching substring. If the regular expression does not match the string, then the third field contains no brackets, that is, the third field is identical to the second field.

Thus a typical set of lines might be:

<x><xxx><[x]xx>
<x><axb><a[x]b>
<x><abc><abc>
<x*><><[]>
<x*a><abxab><[a]bxab>
<x+a><abxab><ab[xa]b>
<x?x?x><xx><[xx]>
which means:

Be sure you understand why!

Your job is to create a bunch of tests to verify that your regular expression classes work properly. More precisely, create a test file containing a significant number of tests that probe every aspect of the regular expression matching code. Blank lines are OK in the test file.

Part 4: Create Test Drivers

Create a Java class named TestRegExp, in a file named (of course) TestRegExp.java, that tests your Java RegExp class. Your TestRegExp class's main method should accept one command-line argument, which should be the name of a file containing tests. It's OK to assume that each line of the test file is valid, that is, contains exactly three fields. Then the main method should print to stdout those lines of the test file, and only those lines, that contain tests that your RegExp class fails to handle properly. More precisely, it should:

Similarly, create a Python file named testregexp.py that test your Python RegExp class. Your testregexp.py file should accept one command-line argument, which should be the name of a file containing tests. It should behave the same as your Java TestRegExp class.

TPOP has an excellent chapter on testing; make sure you read it. Steve McConnell's Code Complete also has a good chapter on testing. Dr. Kernighan wrote a short article on how Awk is tested; you might skim it to see if it suggests anything useful. You might also think about why the words verify and every above are misused.

Advice

The C version makes good use of pointers; if there is a match between the current text character and the current regular expression character, it recursively checks the rest of the text against the rest of the regular expression, using pointers to refer to them. Java and Python have no pointers (references are pointers in disguise, but you can't do very much with them), so you will have to use string indices (str.charAt(i) in Java, str[i] in Python) and/or substrings (str.substring(i,j) in Java, str[i:j] in Python). The two approaches are almost identical in code size; each makes some aspects easier and others harder. Your choice, but be systematic.

It's up to you to decide whether it's easier to do the modifications on the C version and then translate to Java and Python, or do the translations first. I used this sequence:

  1. Translate the C code to Java and test.
  2. Add the enhancements to the Java code: leftmost longest, then +, then ?, finally the start and end indices, testing after each enhancement.
  3. Translate the Java code to Python and test.

Either way, one step at a time is always a good strategy: make sure that each step works before you start the next one.

Define suitable subordinate methods rather than writing everything in one method. When you add code for + and ?, use separate methods but call on ones that already exist where possible. Keep it simple: your code does not need to be fast, or save memory, or be the slightest bit clever. As is true with all programming, trying to be fast and/or clever is a recipe for disaster.

Talking over the precise meaning of the assignment with friends and the TAs is encouraged. You must write your code yourself, however, so once you start coding, you may no longer discuss programming issues with other students.

However you may swap test files (but not code) with other teams. If you suspect that some of the other teams' tests are invalid, then you can discuss those tests with the other team. This is meant to be a sort of arm's-length interoperability check, not a collaboration.

The penguins directory /u/cos333/Assignment1 contains the file GivenRegExp.class containing a definition of the Java class GivenRegExp. Your RegExp classes should have the same behavior as GivenRegExp. Feel free to use the GivenRegExp class to resolve any questions about the desired behavior of your RegExp classes.

It might be possible to use a Java decompiler, such as JAD or Mocha, to generate Java source code from the given GiveRegExp.class file. Then it would be possible to use that source code, directly or indirectly, to write your RegExp classes. DOING THAT IS ILLEGAL! Don't use a Java decompiler, directly or indirectly, to generate any Java code for this assignment. Instead, write your Java (and Python) code manually.

Submission

When you are finished, submit your source code files on penguins using these commands:

/u/cos333/bin/i686/submit 1 RegExp.java TestRegExp.java
/u/cos333/bin/i686/submit 1 regexp.py testregexp.py
/u/cos333/bin/i686/submit 1 readme

As noted above in the "Rules" section, either you or your partner, but not both, should submit files. You can submit multiple times; we'll grade the latest one.

Your readme file should contain:

Your readme file should be a plain text file. Don't create your readme file using Microsoft Word or any other word processor.

Please follow the rules on what to submit. It will be a big help to us and to your grade if you get the filenames right and submit exactly what's asked for. Thanks.

Grading

We will use this procedure to grade your RegExp.java file:

javac [your]RegExp.java
javac [our]TestRegExp.java
java TestRegExp [our]tests

We will use this procedure to grade your TestRegExp.java file:

javac [our]RegExp.java
javac [your]TestRegExp.java
java TestRegExp [our]tests
We will use this procedure to grade your regexp.py file:
python [our]testregexp.py [our]tests
where [our]testregexp.py file uses [your]regexp.py file. We will use this procedure to grade your testregexp.py file:
python [your]testregexp.py [our]tests
where [your]testregexp.py file uses [our]regexp.py file.

It won't be possible for us to inspect your code manually. Instead we will grade your code based upon its behavior only. The good news: the code that you submit need not be well-styled -- although we strongly suggest that you use good style for your sake. The bad news: if your code fails tests on some particular functionality, then we won't be able to inspect your code manually to try to assign partial credit for that functionality. So please make sure your code behaves properly.

This assignment was adapted by Robert M. Dondero, Ph.D.
from a similar assignment designed by Brian Kernighan, Ph.D.