COS 333 Assignment 4: The REST of the Courses

Due midnight, Friday, March 10

Wed Mar 1 10:36:03 EST 2017

There are a fair number of potential edge cases in this assignment. Don't get too wrapped up in them; the goal is to learn more Python and see how a web service works.

Preface

JSON (Javascript Object Notation) is a very widely used format for information interchange that has gone from its roots in Javascript to almost universal applicability, with processing libraries available in all languages. Not surprisingly, it is often used to send information to and from web pages.

This assignment is partly to show in minimal form how web services operate, and partly to learn to use Python to manipulate JSON, using a dataset that is of particular interest to Princeton students: the Registrar's data on courses. As with the other assignments, there is a testing component as well.

The data for this assignment is available through Course Offerings but not in a form that is convenient for further processing. A Python program (originally written by Alex Ogier '13 and kept in service by Brian Kernighan and Christopher Moretti) scrapes that web site and produces the information as JSON.

The JSON file is a large array of objects, each of which is the information for a single course. Here are a couple of example courses (formatted and lightly edited to remove extranea):

[
  {
    "area": "",
    "classes": [
      {
        "bldg": "Friend Center", "classnum": "41383",
        "days": "TTh", "endtime": "12:20 pm",
        "enroll": "209", "limit": "230",
        "roomnum": "101", "section": "L01",
        "starttime": "11:00 am"
      }
    ],
    "courseid": "002065",
    "descrip": "This is a course about the practice of programming. [snip]",
    "listings": [{ "dept": "COS", "number": "333" }],
    "prereqs": "COS 217 and COS 226..",
    "profs": [{ "name": "Brian W. Kernighan", "uid": "010043181" },
              { "name": "Christopher M. Moretti", "uid": "960638964" }
    ],
    "title": "Advanced Programming Techniques"
  },
  {
    "area": "STN",
    "classes": [
      {
        "bldg": "Carl C. Icahn Laboratory", "classnum": "43735",
        "days": "T", "endtime": "4:20 pm",
        "enroll": "14", "limit": "15",
        "roomnum": "280", "section": "S01",
        "starttime": "1:30 pm"
      }
    ],
    "courseid": "014106", "descrip": "See website",
    "listings": [{ "dept": "FRS", "number": "146" }], "prereqs": "",
    "profs": [{ "name": "Shirley M. Tilghman", "uid": "010000886" }],
    "title": "What Makes a Great Experiment?"
  }
]

Specification

Your task in this assignment is to make this JSON information easily searchable from a browser. The file courses.json contains the registrar's data for 1228 courses for this semester, as scraped by the program above on February 17. You must write a Python web server that provides a RESTful search interface: queries are encoded in the path components of the URL that requests the information and the server parses the URL and generates its response. These queries have the form:

str1/str2/str3/...

where each str# is a partial query, and the result is all records that satisfy the intersection of the partial queries (that is, the logical AND of the partial queries, or in other words, all records that match all of the queries). This means that order of the sub-queries does not matter in a query. Case also does not matter. For example, the query in the first line below should generate the result in the line below it:

stn/TILGH/frs
FRS 146 STN S01 T 1:30-4:20 What Makes a Great Experiment? Shirley M. Tilghman Carl C. Icahn Laboratory 280

The example above demonstrates the format in which your program must display the information for each matching record. Specifically, on a single line:

CourseNumber Area Section Day Time Title Professors Building Room
There must be at least one space between adjacent result fields. The CourseNumber result field is created by joining the dept and number JSON fields with exactly one space. The Section field is the lecture, class or seminar number, like L02, C02A, S01 or U02. The Time result field is created by joining the starttime and endtime JSON fields with exactly one dash (-), after removing their AM/PM designations and whitespace.

This is a very simple example. Many courses are cross-listed, have multiple professors and/or multiple sections, including lectures, classes, precepts, labs, and more. Many courses have no distribution area, no location, or even no sections at all.

Here are some additional courses demonstrating these complications:

$ curl localhost:33333/cos/498
COS 498    Senior Independent Work (B.S.E. candidates only) Robert S. Fish/Margaret R. Martonosi
$ curl localhost:33333/ent/201
EGR 200/ENT 201  L01 TTh 12:30-1:20 Creativity, Innovation, and Design Derek B. Lidow/Sheila V. Pontis
EGR 201/ENT 200  L01 MW 11:00-12:20 Foundations of Entrepreneurship Rakefet S. Kasdin Friend Center 004
EGR 200/ENT 201  L02 TTh 10:00-10:50 Creativity, Innovation, and Design Derek B. Lidow/Sheila V. Pontis

Although the ordering of the sub-queries does not matter, the order of evaluation of each individual sub-query is important. Each sub-query must be evaluated in this order:

  1. If a sub-query is one of the standard distribution codes (la, sa, ha, em, ec, qr, stl, stn), it selects courses that satisfy this distribution area.
  2. If a sub-query is a set of days corresponding to a common course offering schedule, such as mwf, it selects courses offered on exactly those days. To make this concrete, consider exactly these patterns and no others: m, t, w, th, f, mw, mwf, tth, twth, mtwth and mtwthf. This will miss a few courses but do not try to handle other cases. Note that a course that meets only on Mondays does not match mw, nor mwf, and in the other direction, a course that meets Mondays, Wednesdays, and Fridays does not match m, nor w, nor mw.
  3. If a sub-query consists of exactly 3 alphabetic characters (other than stl, stn, mwf or tth, which should already have been handled above), such as cos or eco, it selects courses from that department or program. You need not check for validity: if the 3-letter combination is not a valid department, it simply will not match any records.
  4. If a sub-query consists of exactly 3 digits, such as 217, it selects courses with that number:
    $ curl localhost:33333/262
    CEE 262/ARC 262/EGR 262/URB 262/ART 262 LA L01 MW 10:00-10:50 Structures and the Urban Environment Maria E. Garlock Thomas Laboratory 003
    CEE 262/ARC 262/EGR 262/URB 262 STL L01 MW 10:00-10:50 Structures and the Urban Environment Maria E. Garlock Thomas Laboratory 003
    VIS 262 LA C01 T 1:30-4:20 How to Make a Film Yaara Sumeruk Nassau Street, 185 304
    VIS 262 LA C01 T 7:30-10:20 How to Make a Film Yaara Sumeruk Nassau Street, 185 304
    
  5. Other sub-query components that are longer than 3 characters are interpreted as regular expressions. A regular expression selects courses where the RE matches the time (in hh:mm-hh:mm format), or the title, or a professor, or the building, or the room. The RE does not match text across fields (e.g. 'Senior.*Robert' in the example above) or across different entries within the Professors field (e.g. 'Fish.*Martonosi'). Nor does it match department, course number, area, section, or days. You must use the Python regular expression module re for this part; do not roll your own matching.

There are certainly categories of sub-queries that do not fit within any of these categories, for example, one-letter queries, two-letter queries that do not match a distribution area, three-letter queries that contain a mix of letters and digits. This is fine -- queries of these types simply should not match any courses, and thus should return no results.

All queries must be case-insensitive. For examples: "mUsIc" matches "music", "Music", and "MUSIC"; and "QR", "qr", "Qr", and "qR" must all be accepted as the quantitative reasoning area.

Your program must be called reg.py. It implements a simple web server using this template:

import SocketServer
import SimpleHTTPServer
class Reply(SimpleHTTPServer.SimpleHTTPRequestHandler):
  def do_GET(self):
    # query arrives in self.path; return anything, e.g.,
    self.wfile.write("query was %s\n" % self.path)

def main():
  # read and parse courses.json
  SocketServer.ForkingTCPServer(('', 8080), Reply).serve_forever()

main()

This starts a server listening on port 8080 and runs it forever. (Note that this call does not return, so it must be the last line of your main function.)

When the server receives a request, it is handled by the Reply class. The version above just prints the value of the path instance variable, which is the query string. Your job is to replace that line with your code to print the search results. Do this one step at a time until you figure out what's going on. You will also have to handle the optional commandline argument.

With the server running on your own computer, you can run tests with the curl command, or by typing that same URL into a browser on your computer:

$ curl localhost:8080/tilgh
FRS 146 STN S01 T 1:30-4:20 What Makes a Great Experiment? Shirley M. Tilghman Carl C. Icahn Laboratory 280

Some systems (e.g., nobel) won't let you open port 8080, but will let you open ports with high numbers, say 30000-60000. Accordingly, your reg.py must accept an optional command line argument for the port number. You can set the default to whatever you like, but we will use the optional argument so we can test without editing your code. The image above comes from port 33333.

The testing component of this assignment is similar to previous assignments: create at least 25 high quality test queries in files named test00, test01, ... , test24, etc. Each file should contain the query on the first line, followed by the matching results, in the same format as the server response, on the following lines. For example:

lin/tth
LIN 303 EC C01 TTh 1:30-2:50 Linguistic Semantics Edwin S. Williams Green Hall 0-S-9
LIN 308/TRA 303 EC C01 TTh 3:00-4:20 Bilingualism Christiane D. Fellbaum Green Hall 0-S-9
LIN 355 SA L01 TTh 12:30-1:20 Field Methods in Linguistics Florian Lionnet Green Hall 1-S-4A

These queries should thoroughly explore critical boundary conditions and other potential trouble spots of the specification. Again, you might find it helpful to read Chapter 6 of The Practice of Programming on testing. It is typically preferable to return a few interesting courses rather than a large pile of results. Please do not submit any queries that would result in more than 100 courses in the results.

Advice

The CS servers use Python 2 by default. Since there are significant incompatibilities between Python 2 and Python 3, you must use Python 2.

$ python
Python 2.7.5 (default, Nov  8 2016, 19:11:39) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2

The courses.json file will be present in the directory where your reg.py is (and from where it is executed) so you may hardcode the filename in the program. You should not make any assumptions about the content (though the field names and format will not change) -- we could give you a courses.json with last year's classes or with only COS 333 and your program should still work.

You will have to write the code in reg.py that reads and parses courses.json (importing the module json will provide you with some useful functions to do this), accepts search requests, and sends responses. Start with the server template and add code to read and evaluate the JSON file. Then parse each user query, search the JSON for matching items, then format and return the selected ones. Keep it simple, this program does not need to be fast or the slightest bit clever. As is true with all programming, trying to be fast and/or clever is often a recipe for disaster. My version has about 155 lines, so if your solution is a lot longer, you may be off on the wrong track or doing something the hard way. Take a look at the Python library functions for manipulating lists in section 5.1.

The main complexity in this assignment is handling multiple sections. You will find that the input data contains all the sections for a course in a single record. Thus if a class has multiple sections, you will have to in some way make separate copies of its record, then go through the "classes" part to identify matching sections. It will be easiest to make one pass through the data to identify such sections. If a course has N sections, make N copies of the record, then select only the relevant data for each one. Note that Python objects are references, which means that a statement like

obj2 = obj1
does not make a copy of obj1. You are very likely to need a deep copy; look at the deepcopy member of the copy module.

Talking over the precise meaning of the specification with friends is strongly encouraged, as always, but in particular with this assignment due to its large number of potential corner cases. Use Piazza to garner official interpretations.

We will attempt to maintain a reference server on Davisson, one of the OIT Nobel servers. To use it, you must log in to Davisson:

$ ssh davisson.princeton.edu
$ curl localhost:33333/your/query/strings
There's no guarantee that we can keep this running all the time, so please do your own testing; don't rely on this.

The JSON file contains a modest number of accented or otherwise non-Latin alphabet characters, for example, FRE 245 is a course about Africa and culture taught by Professor André Benhaïm. These characters are sometimes rendered in text as Unicode, sometimes as \u escapes, and sometimes as HTML escapes, and may not print cleanly without special effort, but you do not need to worry about doing anything special with them. We will not focus on such special characters in our testing.

Submission and Evaluation

When you are finished, submit your reg.py and tests.tar using the CS Dropbox link dropbox.cs.princeton.edu/COS333_S2017/asgn4. Create your tests tarball using the same command as from Assignment 3:

tar cf tests.tar test??

We will give you some indication that you have not drastically misinterpreted the specification, by running some tests of our own when you submit. These are not a complete test. Do your own testing; don't rely on our tests to validate your code.

We will test your code primarily by running the same queries through your version and our version and comparing the results. We will sort the output lines and ignore empty lines and whitespace differences, so don't get hung up on minutiae of line formatting aside from the minimal requirements mentioned above. As with prior assignments, we will test your tests for reasonable coverage of expected simple and corner cases.