Christopher Moretti
Department of Computer Science
Princeton University

Assignment 4: The REST of the Courses

Due: 23:59 Friday 3/04/2016

Preface

JSON (JavaScript Object Notation) is a very widely used format for information interchange that has grown from its roots in JavaScript to almost universal applicability, with processing libraries available in virtually all languages. Not surprisingly, it is often used to send information to and from web pages.

This assignment is meant partly to show, in minimal form, how web services operate, and partly to give you practice using Python to manipulate JSON, with a dataset of particular interest to Princeton students: the Registrar's data on courses. As with the other assignments, it also has a testing component.

The data for this assignment is available at Course Offerings, but not in a form that is convenient for further processing. A Python program (originally written by Alex Ogier '13 and kept in service by Brian Kernighan and Christopher Moretti) scrapes that web site and produces the information as JSON.

The JSON file is a large array, each of whose elements is the information for a single course. Here are a couple of example courses (formatted and lightly edited to remove extranea):

{"profs": [{"uid": "960638964", "name": "Christopher M. Moretti"}],
 "title": "Advanced Programming Techniques",
 "courseid": "002065",
 "listings": [{"dept": "COS", "number": "333"}],
 "area": "",
 "descrip": "This is a course about the practice of programming ...",
 "classes": [
  {"classnum": "40798", "enroll": "138", "limit": "160",
   "starttime": "11:00 am", "section": "L01", "endtime": "12:20 pm",
   "roomnum": "101", "days": "TTh", "bldg": "Friend Center"
  }
 ]
},
{"profs": [
  {"uid": "960423023", "name": "Bridget A. Alsdorf"},
  {"uid": "910106245", "name": "Denis Feeney"},
  {"uid": "010022721", "name": "Simon A. Morrison"},
  {"uid": "960039380", "name": "Efthymia Rentzou"},
  {"uid": "010000769", "name": "Esther H. Schor"},
  {"uid": "960275842", "name": "Mira L. Siegelberg"}
 ],
 "title": "Interdisciplinary Approaches to Western Culture II: Literature and the Arts",
 "courseid": "003780",
 "listings": [{"dept": "HUM", "number": "218"}],
 "area": "LA",
 "descrip": "... examines European texts, works of art and music from the Renaissance...",
 "classes": [
  {"classnum": "40007", "enroll": "41", "limit": "60",
   "starttime": "10:00 am", "section": "L01", "endtime": "10:50 am",
   "roomnum": "010", "days": "TWTh", "bldg": "East Pyne Building"},
  {"classnum": "40008", "enroll": "15", "limit": "15",
   "starttime": "1:30 pm", "section": "C01", "endtime": "2:50 pm",
   "roomnum": "15", "days": "TTh", "bldg": "Henry House"},
  {"classnum": "40009", "enroll": "14", "limit": "15",
   "starttime": "1:30 pm", "section": "C02", "endtime": "2:50 pm",
   "roomnum": "204", "days": "TTh", "bldg": "Friend Center"},
  ...
 ]  
} 
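As a minimal sketch of how such records look once parsed, the snippet below loads a one-course array with the standard json module and picks out a few fields. The SAMPLE string is a stand-in for the contents of courses.json, copied from the first example above; the variable names are illustrative only.

```python
import json

# A tiny stand-in for courses.json: an array holding one course record.
SAMPLE = '''
[{"profs": [{"uid": "960638964", "name": "Christopher M. Moretti"}],
  "title": "Advanced Programming Techniques",
  "courseid": "002065",
  "listings": [{"dept": "COS", "number": "333"}],
  "area": "",
  "descrip": "This is a course about the practice of programming ...",
  "classes": [
    {"classnum": "40798", "enroll": "138", "limit": "160",
     "starttime": "11:00 am", "section": "L01", "endtime": "12:20 pm",
     "roomnum": "101", "days": "TTh", "bldg": "Friend Center"}
  ]}]
'''

courses = json.loads(SAMPLE)        # a list of dicts, one per course
course = courses[0]
listing = course["listings"][0]     # cross-listed courses have several
section = course["classes"][0]      # each section is itself a dict

print("%s %s" % (listing["dept"], listing["number"]))
print(", ".join(p["name"] for p in course["profs"]))
print(section["enroll"])            # note: numbers are stored as strings
```

Every value in these records is a string, a list, or a nested object, so numeric fields such as enroll need an explicit conversion before they can be compared as numbers.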

 

Specification

Your task in this assignment is to make this JSON information easily searchable from a browser. The file courses.json contains the registrar's data for this semester, as scraped by the program above. You must write a Python web server that provides a RESTful search interface: queries are encoded in the path components of the URL that requests the information and the server parses the URL and generates its response. These queries have the form:

str1/str2/str3/...

where each str# above is a partial query, and the result is all records that satisfy the intersection of the partial queries (that is: the logical AND of the partial queries, or in yet other words: all of them). This means order does not matter in the queries. Case also does not matter in the queries. For example, the query in the first line below should generate the result in the line below it:

stn/TILGH/frs
FRS 144 STN T 1:30-4:20 How the Tabby Cat Got Her Stripes Shirley M. Tilghman Blair Hall T5

 

The example above demonstrates the format in which your program must display the information for each matching record. Specifically, on a single line:
CourseNumber Area Day Time Title Professors Building Room
There must be at least one space between adjacent result fields. The CourseNumber result field is created by joining the dept and number JSON fields with exactly one space. The Time result field is created by joining the starttime and endtime JSON fields with exactly one dash (-), after removing their AM/PM designations and whitespace.

 

This was perhaps the simplest possible example. Many courses are cross-listed, have multiple professors, and/or have multiple sections, including lectures, classes, precepts, labs, and more. Many courses have no distribution area, no location, or even no sections at all. For cross-listed courses, you must join the CourseNumber result fields with exactly one slash (/). For courses with multiple professors, you must join the Professors result fields with exactly one slash (/). For courses with more than one section, you should match only on the section with the largest enrollment, and also print only the information for this section, ignoring all others; and in the event of a tie for section with largest enrollment, use the first tied entry (that is, the one appearing first in the list). The intent is to grab the main lecture of the course: this is pretty sketchy and doesn't always do exactly that -- for instance, it only shows one of the two approximately-equal lecture sections in COS126 -- but is sufficient for our purposes. For courses with no data in a field, that field shall be left blank in output. Here are some additional courses demonstrating these complications:

WWS 315/POL 393 SA MW 10:00-10:50 Grand Strategy Aaron L. Friedberg/G. John Ikenberry Robertson Hall 016
ELE 498    Senior Independent Work Paul R. Prucnal
ENG 563/MOD 527  M 1:30-4:20 Poetics - 19thC English and American Poetry: New Tools, New Archives Meredith A. Martin/Meredith L. McGill  
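The formatting rules above could be sketched roughly as follows. This is one possible reading of the specification, not a reference solution: the function names are hypothetical, it assumes that enroll always holds a numeric string (or is empty), and it leaves the whole Time field blank when a course has no times.

```python
import re

def main_section(classes):
    """Return the section with the largest enrollment, or None if there
    are no sections. A strict '>' keeps the first entry on a tie."""
    best = None
    for sec in classes:
        # Assumption: a missing or empty enroll counts as 0.
        if best is None or int(sec.get("enroll") or 0) > int(best.get("enroll") or 0):
            best = sec
    return best

def clean_time(t):
    """Drop the am/pm designation and whitespace: '1:30 pm' -> '1:30'."""
    return re.sub(r"\s|[ap]m", "", t, flags=re.IGNORECASE)

def format_course(course):
    """Build one output line:
    CourseNumber Area Day Time Title Professors Building Room"""
    number = "/".join(l["dept"] + " " + l["number"] for l in course["listings"])
    profs = "/".join(p["name"] for p in course["profs"])
    sec = main_section(course["classes"]) or {}   # no sections: all blank
    start = clean_time(sec.get("starttime", ""))
    end = clean_time(sec.get("endtime", ""))
    time = start + "-" + end if (start or end) else ""
    fields = [number, course["area"], sec.get("days", ""), time,
              course["title"], profs,
              sec.get("bldg", ""), sec.get("roomnum", "")]
    return " ".join(fields)
```

Joining with a single space means that blank fields show up as runs of spaces, as in the ELE 498 line above.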

 

Although the ordering of the queries does not matter, the order of evaluation of a given query is important. A query must be evaluated in this order:

  1. If a query is one of the standard distribution codes (la, sa, ha, em, ec, qr, stl, stn), it selects courses that satisfy this distribution area.
  2. If a query is a set of days corresponding to a common course offering schedule, such as mwf, it selects courses offered on exactly those days. To make this concrete, consider exactly these patterns and no others: m, t, w, th, f, mw, mwf, tth, mtwth and mtwthf. This will miss some courses (e.g. HUM 218 meets TWTh), but do not try to handle other cases. Note that a course that meets only on Mondays does not match mw, nor mwf, and in the other direction, a course that meets Mondays, Wednesdays, and Fridays does not match m, nor w, nor mw.
  3. If a query consists of exactly 3 alphabetic characters (other than stl, stn, mwf or tth, which should already have been handled above), such as cos or eco, it selects courses from that department. You need not check for validity: if the 3-letter combination is not a valid department, it simply will not match any records.
  4. If a query consists of exactly 3 digits, such as 217, it selects courses with that number.
  5. Other query components that are longer than 3 characters are interpreted as regular expressions. A regular expression selects courses where the RE matches the title, a professor, the time, the building, or the room number. The RE does not match text across fields (e.g. 'Strategy.*Aaron' in the example above) or across different entries within a combined field (e.g. 'Aaron.*Ikenberry'). Nor does it match area, department, course number, or days. You must use the Python regular expression module re for this part; do not roll your own.

 

Some queries do not fit any of these categories; for example: one- and two-letter queries that match neither a distribution area nor a day pattern, and three-character queries that contain a mix of letters and digits. This is fine -- queries of these types simply should not match any courses, and thus should return no results.
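The evaluation order in the numbered list above might be expressed as a small classifier like the one below. It is only a sketch: the function name classify and the category labels it returns are hypothetical, and what to do with each category is left out.

```python
import re

AREAS = {"la", "sa", "ha", "em", "ec", "qr", "stl", "stn"}
DAY_PATTERNS = {"m", "t", "w", "th", "f", "mw", "mwf", "tth", "mtwth", "mtwthf"}

def classify(query):
    """Decide how one partial query is interpreted, checking the rules
    in the order the specification lists them."""
    q = query.lower()                  # queries are case-insensitive
    if q in AREAS:
        return "area"
    if q in DAY_PATTERNS:
        return "days"
    if re.match(r"[a-z]{3}\Z", q):     # exactly 3 letters
        return "dept"
    if re.match(r"[0-9]{3}\Z", q):     # exactly 3 digits
        return "number"
    if len(q) > 3:
        return "regex"
    return "none"                      # matches no courses
```

Because the area and day checks come first, stl, stn, mwf, and tth never reach the department test, and a string such as twth, which is not one of the listed day patterns, falls through to the regular-expression case.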

 

All queries must be case-insensitive. For example: "mUsIc" matches "music", "Music", and "MUSIC"; and "QR", "qr", "Qr", and "qR" must all be accepted as the quantitative reasoning area. Again, you must use the Python regular expression module re for this.
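One way to get case-insensitive matching from the re module is the re.IGNORECASE flag. The sketch below, with a hypothetical helper name, tests the pattern against each field separately, so a match can never span two fields or two professors. It uses search rather than match because the example query TILGH matches in the middle of a professor's name.

```python
import re

def regex_matches(pattern, fields):
    """True if the pattern matches somewhere inside any single field.
    Each field is searched on its own, never joined with the others."""
    rx = re.compile(pattern, re.IGNORECASE)
    return any(rx.search(field) for field in fields)
```

Here fields would be a list such as the title, each professor's name, the time, the building, and the room number, passed in as separate strings.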

 

Your program must be called reg.py. It implements a simple web server using this template:

import SocketServer
import SimpleHTTPServer
class Reply(SimpleHTTPServer.SimpleHTTPRequestHandler):
  def do_GET(self):
    # query arrives in self.path; return anything, e.g.,
    self.wfile.write("query was %s\n" % self.path)

def main():
  # read and parse courses.json
  SocketServer.ForkingTCPServer(('', 8080), Reply).serve_forever()

main()

With the server running on your own computer, you can run tests with the curl command, or by typing that same URL into a browser on your computer:

$ curl localhost:8080/stn/TILGH/frs
FRS 144 STN T 1:30-4:20 How the Tabby Cat Got Her Stripes Shirley M. Tilghman Blair Hall T5
$

Some systems (e.g., nobel) won't let you open port 8080, but will let you open ports with high numbers, say 30000-60000. Accordingly, your reg.py must accept an optional command line argument for the port number. You can set the default to whatever you like, but we will use the optional argument so we can test without editing your code.
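A minimal sketch of the optional port argument, assuming the port is the first command line argument and that 8080 is the chosen default (both the helper name and the default are illustrative):

```python
import sys

DEFAULT_PORT = 8080   # any default is acceptable; this one is an assumption

def parse_port(argv):
    """Return the port given as the first argument, else the default."""
    if len(argv) > 1:
        return int(argv[1])
    return DEFAULT_PORT

# In main(), the result would replace the hardcoded 8080, e.g.:
#   port = parse_port(sys.argv)
```

With this in place, running python reg.py 31415 would listen on port 31415, while python reg.py alone would fall back to the default.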

 

The testing component of this assignment is similar to that from Assignments 1 and 3: create at least 25 high quality test queries in files named test00, test01, ... , test24, etc. Each file should contain the query on the first line, followed by the matching results, in the same format as the server response. For example:

stn/TILGH/frs
FRS 144 STN T 1:30-4:20 How the Tabby Cat Got Her Stripes Shirley M. Tilghman Blair Hall T5

These queries should thoroughly explore critical boundary conditions and other potential trouble spots of the specification. Again, you might find it helpful to read Chapter 6 of The Practice of Programming on testing. It is typically preferable to return a few interesting courses rather than a large glut of results. Please do not submit any queries that would result in more than 100 courses in the results.

 

Advice

The version of Python installed on the CS servers is shown below. Note that there are significant incompatibilities between Python 2.* and Python 3. We would strongly prefer that you use Python 2.*.

tux:~$ python
Python 2.7.5 (default, Jun 24 2015, 01:06:47) 
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2

 

The courses.json file will be present in the directory where your reg.py is (and from where it is executed). You may hardcode the filename. You should not make any assumptions about the content -- we could give you a courses.json with last year's classes, or with only COS333.

You will have to write the code in reg.py that reads and parses courses.json (importing the module json will provide you with some useful functions to do this), accepts search requests, and sends responses. Start with the server template and add code to read and evaluate the JSON file. Then parse each user query, search the JSON for matching items, then format and return the selected ones. Repeating the advice from all the previous assignments: keep it simple. This program does not need to be fast or the slightest bit clever. As is true with all programming, trying to be fast and/or clever is often a recipe for disaster. My version has only about 110 non-blank lines, so if your solution is a lot longer, you may be off on the wrong track or not working as surgically.
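The "parse each user query" step might begin by splitting the request path into its partial queries, as in the sketch below. The helper names are hypothetical, dropping empty path components is an assumption, and matches_one stands for whatever per-query test you write. The import fallback is there only so the snippet runs under either Python 2 or Python 3.

```python
try:
    from urllib.parse import unquote   # Python 3
except ImportError:
    from urllib import unquote         # Python 2

def split_query(path):
    """Split a request path such as '/stn/TILGH/frs' into partial
    queries, undoing any %XX escapes the client added."""
    return [unquote(part) for part in path.split("/") if part]

def matches_all(course, queries, matches_one):
    """True if the course satisfies every partial query (logical AND).
    matches_one(course, query) is a caller-supplied test."""
    return all(matches_one(course, q) for q in queries)
```

Since the result is an intersection, the order of the partial queries in the path has no effect on which courses are selected.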

Talking over the precise meaning of the specification with friends is strongly encouraged, as always, but in particular with this assignment due to its large number of potential corner cases. Use Piazza to garner official interpretations (which may just be my opinions, of course).

 

The JSON file contains a modest number of accented or otherwise non-ASCII characters; for example, FRE 367 is a course about Camus taught by Professor André Benhaïm. These characters are rendered in text as Unicode escapes, and may not print cleanly without special effort (though my implementation seems to handle them), but you do not need to worry about doing anything special with them. We will not focus on these special characters in our testing.
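For the curious, here is a small sketch of what happens to those escapes. The raw string is an illustrative fragment, not an actual record from courses.json. The json module decodes the \uXXXX escapes into real characters, and encoding the text as UTF-8 before writing it out is one common way to avoid encoding errors; whether your server needs that step is for you to determine.

```python
import json

# JSON text stores the accented letters as \uXXXX escapes.
raw = '{"name": "Andr\\u00e9 Benha\\u00efm"}'

record = json.loads(raw)
name = record["name"]          # text containing the actual accented letters
body = name.encode("utf-8")    # bytes, suitable for writing to a socket
```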

 

Submission and Evaluation

When you are finished, submit your source and tests tarball using the CS Dropbox link dropbox.cs.princeton.edu/COS333_S2016/Four.

Please create your tests tarball using the same command as from Assignment 3:

tar cf tests.tar test??

We will give you some indication that you have not drastically misinterpreted the specification, by running some tests of our own when you submit. These are not a complete test. Do your own testing; don't rely on our tests to validate your code.

We will test your code primarily by running the same queries through your version and our version and comparing the results; we will sort the output lines and ignore empty lines and whitespace differences, so don't get too hung up on minutiae of line formatting aside from the minimal requirements mentioned above. As with prior assignments, we will test your tests for reasonable coverage of expected simple and corner cases.
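To check your own output in the same spirit, a comparison along the following lines may help. It is a sketch of the idea described above (sorted lines, empty lines ignored, runs of whitespace collapsed), not the actual grading script, and the function names are hypothetical.

```python
def normalize(output):
    """Collapse whitespace within each line, drop empty lines, and sort."""
    lines = [" ".join(line.split()) for line in output.splitlines()]
    return sorted(line for line in lines if line)

def same_results(mine, expected):
    """True if two outputs agree up to line order and whitespace."""
    return normalize(mine) == normalize(expected)
```

For example, you could pass the text returned by curl and the expected lines from one of your test files to same_results.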