COS 333 Assignment 1: Regular Expression Matching

Due midnight, Friday, February 17

Sat Feb 4 09:43:31 EST 2017

This assignment is a warmup to make sure you're still comfortable with C and Java, that you know how to compile and run programs, and that you can submit code via the COS 333 submission mechanism. And there's one other purpose -- we want to make sure that you can write code that actually works. Since the class is so large, grading for assignments will be entirely automatic, your code won't likely be looked at by humans at all, and your grade will be determined by how well you do on our automated tests. However, we strongly recommend that you remember and put into practice everything you learned about good style in previous courses. It will help you greatly in debugging, and in understanding your code when you return to it a {day, week, year} later.

There are two exceptions to the hands-off approach. First, we're not completely cruel: we will do our best to help you to debug disastrous errors that prevent our automated testing from running, but only at considerable cost in terms of points. Thus you should do everything you can to ensure that your code works. Second, we will routinely run all submissions through systems like MOSS, which are extremely effective in detecting programs that are similar. If your program is deemed interesting by MOSS or a similar tool, we will look at it very carefully, and if appropriate will send our opinions to the Committee on Discipline. Come to office hours, post plaintive cries for help on Piazza, or do anything else within reason and acceptable conduct, but please do not copy anyone else's work or let anyone copy yours -- 20+ hours of our time to prepare a case, and a year of your life serving an academic suspension simply aren't worth it. On that note, be sure to examine the collaboration policy on the course web page page posthaste!

Specification

In ancient Unix times, filename regular expression matching ("wildcards") was done not in the shell itself, but in a separate program called glob. This lives on in various nooks and crannies -- csh and tcsh have a shell built-in command called glob that does this operation -- and a variety of programming languages have functions or operators that match regular expressions of this class, for instance, Python's glob module, and Like in Visual Basic and SQL, each with its own syntax. (See the Wikipedia article on Glob.)

This assignment has three parts.:

  1. Modify the regular expression code in Section 9.2 of The Practice of Programming to process regular expressions of this class.
  2. Translate your C version into Java that does exactly the same thing.
  3. Develop a compact specification of a set of test cases, and a program that will run the tests from the specification.

1. Modify the regular expression code

To begin, read the description in the book carefully and be sure you understand it. Then download the code and modify match and the other functions as necessary so they match the new class of regular expressions:

  1. A ? matches any single character; thus a?b matches aab, axb, etc., but not ab or axxb.
  2. A * matches any sequence of zero or more characters; thus a*b matches ab, axb, etc., but not a or b.
  3. A backslash \ quotes the next character of the RE, turning off any special meaning it might have had. Thus \*\\ matches *\ and *\? matches anything that ends with ?. If there is a bare backslash at the end of the RE (that is, not followed by another character), that is an illegal RE. For example, \ and \*\ are both illegal.
  4. Matches are "anchored": they must match the entire input string as if the regular expression had been surrounded by ^ and $ in grep. Thus a*z matches any string that begins with a and ends with z, including az, but it does not match aza or zaz.

You have to modify match to handle this class of regular expressions. Your int match(char *re, char *s) must return 1 if the regular expression re matches the string s, and 0 if not; it returns -1 if there is any kind of error, such as an illegal regular expression. match does not print anything ever, nor does it ever exit or abort. It must not modify its input arguments, and it must free any memory it allocates.

The only illegal REs are ones that end with a bare \; that is, \ and \\\ are illegal, but \\ is legal. The empty regular expression is legal and matches (only) the empty string. If you make only straightforward modifications to the existing code, odd cases like a**b will just work.

For this part, you will have to submit a single C file regexp.c that contains only your function definitions. The only name that should be visible outside the file is match, so everything else must be declared static. (You can use the nm command to verify visibility.) We will test your code by compiling and linking your regexp.c with our own driver, using this declaration only:

	int match(char *regexp, char *text); 
and this compile command:
	gcc -Wall -Wextra our_grep.c your_regexp.c 

Your code must compile and link without change and without warning messages.

For calibration, my regexp.c is 43 lines long including comments and blank lines. If your solution is much longer than this, you are probably off on the wrong track. My grep.c for testing is about 15 lines long.

To head off questions on Piazza, note that your regexp.c must not call things like printf nor include non-standard header files, and your driver does not need to use eprintf and similar routines from the original.

Don't get wrapped up in language lawyering; the task is to convert the original code to our definition, not to invent your own definition. Follow the original code. Do not create your own matching algorithm or reinterpret what a regular expression is!

2. Translate your C code into Java

Self-explanatory. Your Java code must behave exactly the same way as your C code does. Naturally both must be correct.

You must submit a file Regexp.java that contains a public class Regexp with a single public static method called match that has this signature:

	public static int match(String regexp, String text);
As with the C version, your Java code must not print anything, it must not throw exceptions, and the only public member is the match function. (Note the different meanings of static in C and Java.) We will compile it with
	javac our_java.java your_Regexp.java 

The C version makes good use of pointers; if there is a match between the current text character and the current regular expression character, it recursively checks the rest of the text against the rest of the regular expression, using pointers to refer to them. Java has no pointers (references are pointers in disguise, but you can't do very much with them), so you will have to use string indices (String.charAt) and/or substrings (String.substring). I used substrings everywhere except for looking at individual characters, but indices work fine too. My code has just under 40 lines in the RE part itself and my tester is 20 lines.

3. Create tests and run them

For the third part of this assignment, we will use a simple variant of the testing language for Awk that will be described in class. Each test case consists of one line with three tab-separated fields. The first field is a regular expression. The second field is a text string. The third field is 1 if the regular expression should match the string; it is 0 if the regular expression should not match the string; and it is -1 if the first field is an invalid RE. Blank lines and fields beyond the third will be ignored so you can comment your tests and group related ones together.

Thus a typical set of tests might read

\	\	-1	illegal RE
\\\	\\\	-1	illegal RE

x       x       1	easy ones
x       y       0
x               0
x       xx      0
x       xy      0

\x      x       1	quotes
\*      *       1
\*      x       0
x?y     xxy     1	? tests
x?y     xy      0
x??y	xxyy	1
x**y	xy	1	* tests
x**y	x***y	1
x**y	xyz	0
x***	y***	0
Be sure you understand these tests. Then create a bunch of tests of your own and use them to mechanically verify that your program works properly.

It's up to you how you arrange to run all your tests so long as it's automated. I wrote a short Awk script that reads through a file of tests and creates a shell script. The shell script reports when something unexpected happens but otherwise prints no output.

We will test your code by compiling and linking your regexp.c and regexp.java with our own drivers, using only the definitions of match above, and using our test cases in addition to yours. Don't forget the ancillary requirements like no visible names, no memory leaks, no exceptions, etc.

Advice

It is always best to attack such a task in stages, making sure each one works before the next one is started. For example, you might split the original grep.c into two files, one of which contains match() and its dependencies, and one that uses match() but does not know about the implementation. Then handle the basics of glob. Write some tests and automate running them so that it's easy to check that your basic code is right. Do easy tests before harder ones. Then add backslash processing and add some more tests. Get the C program working perfectly before undertaking the Java translation.

Talking over the precise meaning of the assignment with friends and the TAs is encouraged. You must write your code yourself, however, so once you start coding, you may no longer discuss programming issues with other students.

Prepare good test cases ahead of time; your tests ought to cover essentially all the lines of code in your program. Test early, test often, know what you are testing. Automate the testing as much as you can. Steve McConnell's Code Complete has a good chapter on testing. (Advertisement: so does TPOP.) I also wrote a short note on how AWK is tested; you might skim it to see if it suggests anything useful.

Submission

Submit your regexp.c, Regexp.java, and retests.txt using the CS dropbox at http://dropbox.cs.princeton.edu/COS333_S2017/asgn1. You can submit multiple times; each submission is checked and we'll grade the last one that arrives. The check script will give you some indication of how well your program appears to work when you submit so you can try again, but it's a sanity check, not a comprehensive test. The grading scripts are significantly more rigorous.

As fair warning, you can expect significant deductions for failing to meet the specification, on the order of 25-50% penalties for failure to compile, infinite loops, spurious output, or failure to follow the API (using the same standards as in COS126 and COS226). Aside from that, you will lose some small number of points for each of our tests that your code fails, each of your tests that is incorrect, and each case we catch that your tests do not cover. You can be assured that you cannot earn a negative score.

The usual lateness penalties will then be assessed if applicable.