COS 333 Assignment 4 (Spring 2018): REST and be Thankful

Due midnight, Friday, March 9

Sat Mar 3 11:08:56 EST 2018

 

There are a fair number of potential edge cases in this assignment. Don't get wrapped up in them; the goal is to learn JSON and some Python, and see how web services and servers work. Grading will not probe very hard at edge cases, just basic functionality.

Preface

JSON (JavaScript Object Notation) is a very widely used format for information interchange that has gone from its roots in JavaScript to almost universal applicability, with processing libraries available in all languages. Not surprisingly, it is often used to send information to and from web pages.

This assignment is partly to show in minimal form how web services operate, and partly to help you learn to use Python to manipulate JSON, using a dataset that is of particular interest to Princeton students: the Registrar's data on courses. As with the other assignments, there is a testing component as well.

The data for this assignment is available through Course Offerings but not in a form that is convenient for further processing. A Python program scraper.py (originally written by Alex Ogier '13 and kept in service by Brian Kernighan and Christopher Moretti) scrapes that web site and produces the information as JSON. (You do not need scraper.py; it's included so you can see how it works if you're interested.)

The JSON file is a large array of objects, each of which is the information for a single course. Here are a couple of example courses (formatted and lightly edited to remove extranea):

{   "profs": [
        { "uid": "010043181", "name": "Brian W. Kernighan" },
        { "uid": "961139415", "name": "Jeremie Lumbroso" }
    ],
    "title": "Advanced Programming Techniques", "courseid": "002065",
    "listings": [ { "dept": "COS", "number": "333" } ],
    "area": "", "prereqs": "COS 217 and COS 226..",
    "descrip": "This is a course about the practice of programming [...]",
    "classes": [
        {
            "classnum": "40160", "enroll": "132", "limit": "160",
            "starttime": "11:00:00 am", "section": "L01", "endtime": "12:20:00 pm",
            "roomnum": "003", "days": "TTh", "bldg": "Thomas Laboratory"
        }
    ]
},
{  "profs": [
        { "uid": "010000886", "name": "Shirley M. Tilghman" }
    ],
    "title": "What Makes a Great Experiment?", "courseid": "014106",
    "listings": [ { "dept": "FRS", "number": "146" } ],
    "area": "STN", "prereqs": "",
    "descrip": "See website",
    "classes": [
        {
            "classnum": "43183", "enroll": "14", "limit": "15",
            "starttime": "01:30:00 pm", "section": "S01", "endtime": "04:20:00 pm",
            "roomnum": "200", "days": "T", "bldg": "Carl C. Icahn Laboratory"
        }
    ]
}
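
As a minimal sketch of how such records look once parsed in Python, the snippet below loads a trimmed copy of the first example with the standard json module and pulls out a few fields. The variable names (SAMPLE, prof_names, listing, section) are illustrative only and not part of the assignment.

```python
import json

# A trimmed copy of the first example record above, wrapped in an array
# the way courses.json is.
SAMPLE = """
[
  {
    "profs": [
      { "uid": "010043181", "name": "Brian W. Kernighan" },
      { "uid": "961139415", "name": "Jeremie Lumbroso" }
    ],
    "title": "Advanced Programming Techniques",
    "listings": [ { "dept": "COS", "number": "333" } ],
    "area": "",
    "classes": [
      { "section": "L01", "days": "TTh",
        "starttime": "11:00:00 am", "endtime": "12:20:00 pm",
        "bldg": "Thomas Laboratory", "roomnum": "003" }
    ]
  }
]
"""

courses = json.loads(SAMPLE)   # JSON array -> list, JSON object -> dict
course = courses[0]

prof_names = [p["name"] for p in course["profs"]]   # one entry per professor
listing = course["listings"][0]                     # dict with "dept" and "number"
section = course["classes"][0]                      # dict describing one section
```

To read the real file instead of a string, json.load takes an open file object in place of the text.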

Specification

Your task in this assignment is to make this JSON information easily searchable from a browser. The file courses.json contains the registrar's data for 1222 courses for this semester, as scraped by the program above on February 18. You must write a Python web server that provides a RESTful search interface: queries are encoded in the path components of the URL that requests the information, and the server parses the URL and generates its response. These queries have the form:

http://example.com/str1/str2/str3/...

where example.com is the server, each strn is a partial query, and the result is all records that satisfy the intersection of the partial queries (that is, the logical AND of the partial queries, or in other words, all records that match all of the queries). This means that the order of the sub-queries does not matter in a query. Case also does not matter. For example, the query in the first line below should generate the result in the line below it:

stn/TILGH/frs
FRS 146 STN S01 T 01:30-04:20 What Makes a Great Experiment? Shirley M. Tilghman Carl C. Icahn Laboratory 200
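
One way to picture the AND semantics is sketched below: split the path into sub-queries, then keep only the records that every sub-query accepts. The names split_query and select, and the caller-supplied matches function, are hypothetical helpers for illustration. The sketch does not decode percent-escapes in the URL.

```python
def split_query(path):
    """Split a request path like '/stn/TILGH/frs' into its sub-queries."""
    return [part for part in path.split("/") if part]


def select(records, subqueries, matches):
    """Keep the records that every sub-query matches (logical AND).

    `matches(record, subquery)` is supplied by the caller; it is a
    placeholder for the real per-sub-query test.
    """
    return [r for r in records
            if all(matches(r, q) for q in subqueries)]
```

Because all() does not care about the order of the sub-queries, neither does the result. In this sketch an empty query selects every record, which is a property of the sketch rather than a statement of the specification.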

The example above demonstrates the format in which your program must display the information for each matching record. Specifically, on a single line, your program must print the data in this order:

CourseNumber Area Section Day Time Title Professors Building Room

There must be at least one space between adjacent result fields. The CourseNumber result field is created by joining the dept and number JSON fields with exactly one space. The Section field is the lecture, class or seminar number, like L02, C02A, S01 or U02. The Time result field is created by joining the starttime and endtime JSON fields with exactly one dash (-), after removing seconds, AM/PM designations and whitespace.
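
The Time rule can be sketched as follows, assuming the times always start with hh:mm as in the sample records. The function names short_time and time_field are illustrative, and returning an empty string when both times are missing is an assumption, not something the specification states.

```python
import re


def short_time(t):
    """'01:30:00 pm' -> '01:30': keep only the leading hh:mm part."""
    m = re.match(r"\s*(\d{1,2}:\d{2})", t)
    return m.group(1) if m else ""


def time_field(starttime, endtime):
    """Join the shortened start and end times with exactly one dash."""
    start, end = short_time(starttime), short_time(endtime)
    if not start and not end:
        return ""   # assumption: a section with no times gets an empty field
    return start + "-" + end
```

For the Tilghman seminar above, time_field("01:30:00 pm", "04:20:00 pm") gives 01:30-04:20.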

This is a very simple example. Many courses are cross-listed, have multiple professors and/or multiple sections, including lectures, classes, precepts, labs, and more. Many courses have no distribution area, no location, or even no sections at all. The examples below, run with curl from the command line, show how to handle such cases:

$ curl localhost:33333/cos/498
COS 498     Senior Independent Work (B.S.E. candidates only) Robert S. Fish/Andrea S. LaPaugh
$ curl localhost:33333/ent/201
EGR 200/ENT 201  L01 Th 11:00-12:20 Creativity, Innovation, and Design Sheila V. Pontis/Rafe H. Steinhauer
EGR 201/ENT 200  L01 TTh 11:00-12:20 Foundations of Entrepreneurship Joy S. Marcus Architecture Building N101
EGR 200/ENT 201  L02 F 11:00-12:20 Creativity, Innovation, and Design Sheila V. Pontis/Rafe H. Steinhauer
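
Judging only from the sample output above, cross-listings and professors each appear joined with a slash in the order they occur in the JSON, and a missing field is simply left empty. The sketch below follows that reading; the helper names and the keyword parameters of result_line are hypothetical.

```python
def listings_field(course):
    """'EGR 200/ENT 201': every listing as 'DEPT NUM', joined with '/'."""
    return "/".join(l["dept"] + " " + l["number"] for l in course["listings"])


def profs_field(course):
    """'A. Prof/B. Prof': professor names joined with '/'."""
    return "/".join(p["name"] for p in course["profs"])


def result_line(course, section="", days="", time="", bldg="", room=""):
    """Assemble one output line; empty fields stay empty between spaces."""
    fields = [listings_field(course), course.get("area", ""), section, days,
              time, course["title"], profs_field(course), bldg, room]
    # Trailing spaces left by empty building/room fields are dropped.
    return " ".join(fields).rstrip()
```

Joining with single spaces reproduces the runs of spaces seen in the COS 498 line, where the area, section, days and time fields are all empty.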

Although the ordering of the sub-queries does not matter, the order of evaluation of each individual sub-query is important. Each sub-query must be evaluated in this order, and evaluation stops after the first match, if any:

  1. If a sub-query is one of the standard distribution codes (la, sa, ha, em, ec, qr, stl, stn), it selects courses that satisfy this distribution area.
  2. Else if a sub-query is a set of days corresponding to a common course offering schedule, such as mwf, it selects courses offered on exactly those days. To make this concrete, consider exactly these patterns and no others: m, t, w, th, f, mw, mwf, tth, twth, mtwth and mtwthf. This will miss a few courses but do not try to handle other cases. Note that a course that meets only on Mondays does not match mw or mwf, and in the other direction, a course that meets Mondays, Wednesdays, and Fridays does not match m or w or mw.
  3. Else if a sub-query consists of exactly 3 alphabetic characters (other than stl, stn, mwf or tth, which should already have been handled above), such as cos or eco, it selects courses from that department or program. You need not check for validity: if the 3-letter combination is not a valid department, it simply will not match any records.
  4. Else if a sub-query consists of exactly 3 digits, such as 333, it selects courses with that number:
    $ curl localhost:33333/333
    CHV 333/PHI 344 EM S01 W 01:30-04:20 Bioethics: Clinical and Population-Level Johann D. Frick/Daniel M. Putnam Marx Hall 301
    COS 333  L01 TTh 11:00-12:20 Advanced Programming Techniques Brian W. Kernighan/Jeremie Lumbroso Thomas Laboratory 003
    POL 334/SOC 333/LAO 334 SA L01 TTh 02:30-03:20 Immigration Politics and Policymaking in the U.S. Ali A. Valenzuela Robertson Hall 035
    WWS 333/SOC 326 SA L01 MW 11:00-11:50 Law, Institutions and Public Policy Paul E. Starr Louis A. Simpson International A71
    
  5. Else other sub-query components that are longer than 3 characters are interpreted as regular expressions that might match some field. Such a regular expression selects courses where the RE matches the time (in hh:mm-hh:mm format), or the title, or a professor, or the building, or the room. The RE does not match text across fields (e.g., 'level.*johann' in the example above) or across different entries within the Professors field (e.g., 'brian.*jeremie'). Nor does it match department, course number, area, section, or days. You must use the Python regular expression module re for this part; do not roll your own matching.

There are certainly sub-queries that do not fit within any of these categories: for example, one-letter queries that are not day patterns, two-letter queries that match neither a distribution area nor a day pattern, and three-character queries that contain a mix of letters and digits. This is fine -- queries of these types simply should not match any courses, and thus should return no results.
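
The evaluation order above can be sketched as a small classifier that returns the first category a sub-query falls into. The function names and the category labels ("area", "days", and so on) are made up for this illustration; only the code lists and the order of the tests come from the specification.

```python
AREAS = {"la", "sa", "ha", "em", "ec", "qr", "stl", "stn"}
DAYS = {"m", "t", "w", "th", "f", "mw", "mwf", "tth", "twth", "mtwth", "mtwthf"}


def classify(subquery):
    """Return the first category that applies, testing in the required order."""
    q = subquery.lower()
    if q in AREAS:
        return "area"
    if q in DAYS:
        return "days"
    if len(q) == 3 and q.isalpha():
        return "dept"
    if len(q) == 3 and q.isdigit():
        return "number"
    if len(q) > 3:
        return "regex"
    return "none"    # fits no category, so it matches no courses


def days_match(subquery, days):
    """Days must match exactly: 'm' does not match a course with days 'MWF'."""
    return days.lower() == subquery.lower()
```

Testing the area and day sets before the three-letter rule is what keeps stl, stn, mwf and tth from being treated as departments.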

All queries must be case-insensitive. For example: "mUsIc" matches "music", "Music", and "MUSIC"; and "QR", "qr", "Qr", and "qR" must all be accepted as the quantitative reasoning area.
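
For regular-expression sub-queries, one possible approach is to compile the pattern once with re.IGNORECASE and try it against each field separately, so that a match can never span two fields or two professors. The sketch below assumes that "matches" means a match anywhere within a field (re.search) and that an invalid pattern matches nothing; both are assumptions, and the function name and parameters are illustrative.

```python
import re


def regex_matches(pattern, time, title, profs, bldg, room):
    """True if `pattern` matches within any single field.

    `profs` is a list of names, each tested on its own.
    """
    try:
        rx = re.compile(pattern, re.IGNORECASE)
    except re.error:
        return False   # assumption: an invalid RE matches nothing
    fields = [time, title, bldg, room] + list(profs)
    return any(rx.search(field) for field in fields)
```

With the COS 333 record, 'brian.*jeremie' finds nothing because each professor's name is searched independently.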

Your program must be called reg.py. It implements a simple web server using this template:

import SocketServer
import SimpleHTTPServer

class Reply(SimpleHTTPServer.SimpleHTTPRequestHandler):
  def do_GET(self):

    # The query arrives in self.path.  You should prepare your 
    # response here, preferably by calling suitable functions.
    # The following line merely echoes the query; replace it
    # with code that generates the desired response.

    self.wfile.write("query was %s\n" % self.path) # replace this line


def main():

  # You must read and parse courses.json here,
  # before starting the server.

  SocketServer.ForkingTCPServer(('', 8080), Reply).serve_forever()

main()

This starts a server listening on port 8080 and runs it forever. (Note that the call to serve_forever does not return, so it must be the last line of your main function.)

When the server receives a request, it is handled by the function do_GET in the Reply class. The version above just prints the value of the path instance variable, which is the query string. Your job is to replace that line with your code to print the search results. Do this one step at a time until you figure out what's going on. You will also have to handle the optional command-line argument that specifies the port number, described below.

With the server running on your own computer, you can run tests with the curl command, or by typing that same URL into a browser on your computer:

$ curl localhost:33333/tilgh
FRS 146 STN S01 T 01:30-04:20 What Makes a Great Experiment? Shirley M. Tilghman Carl C. Icahn Laboratory 200

Some systems (e.g., nobel) won't let you open port 8080, but will let you open ports with high numbers, say 30000-60000. Accordingly, your reg.py must accept an optional command line argument for the port number. You can set the default to whatever you like, but we will use the optional argument so we can test without editing your code. The example above comes from port 33333. Do not use port 33333 in your own code.
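
A minimal sketch of reading the optional port argument is shown below. DEFAULT_PORT and port_from_args are hypothetical names, and the default of 8080 is just the template's value; the assignment lets you choose any default.

```python
import sys

DEFAULT_PORT = 8080   # hypothetical default; pick whatever you like


def port_from_args(argv, default=DEFAULT_PORT):
    """Return the port given as the first argument, or the default."""
    if len(argv) > 1:
        return int(argv[1])
    return default
```

In main this would be called as port_from_args(sys.argv), and the result passed in place of the hard-coded 8080.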

The testing component of this assignment is similar to previous assignments: create at least 25 test queries in files named test00, test01, ... , test24, etc. Each file should contain the query on the first line, followed by the matching results, in the same format as the server response, on the following lines. For example:

lin/tth
LIN 205/TRA 205 EC L01 TTh 12:30-01:20 Beginning American Sign Language Noah A. Buchholz Robertson Hall 016
LIN 301 EC L01 TTh 11:00-12:20 Phonetics and Phonology Florian Lionnet Green Hall 1-C-4C
LIN 308/TRA 303 EC C01 TTh 03:00-04:20 Bilingualism Christiane D. Fellbaum Green Hall 0-S-9
LIN 312/TRA 312 EC L01 TTh 01:30-02:50 Linguistics of American Sign Language Eileen M. Forestal
LIN 355 SA L01 TTh 01:30-02:20 Field Methods in Linguistics Florian Lionnet Green Hall 1-S-5

These queries should thoroughly explore critical boundary conditions and other potential trouble spots of the specification. Again, you might find it helpful to read Chapter 6 of The Practice of Programming on testing. It is typically preferable to return a few interesting courses rather than a large pile of results. (You can include the tests above in your collection.) Please do not submit any queries that would result in more than 25 courses in the results.

Advice

The CS servers use Python 2 by default. Since there are significant incompatibilities between Python 2 and Python 3, you must use Python 2.

$ python
Python 2.7.5 (default, Aug  9 2017, 01:27:27) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux2

The courses.json file will be present in the directory where your reg.py is executed, so you may hardcode the filename in the program. You should not make any assumptions about the content (though the field names and format will not change) -- we could give you a courses.json with last year's classes or with only COS 333 and your program should still work.

You will have to write the code in reg.py that reads and parses courses.json (importing the module json will provide you with some useful functions to do this), accepts search requests, and sends responses. Start with the server template and add code to read and evaluate the JSON file. Then parse each user query, search the JSON for matching items, then format and return the selected ones. Keep it simple; this program does not need to be fast or the slightest bit clever. As is true with all programming, trying to be fast and/or clever is often a recipe for disaster. My version has about 155 lines, so if your solution is a great deal longer, you may be off on the wrong track or doing something the hard way.

The main complexity in this assignment is handling multiple sections. You will find that the input data contains all the sections for a course in a single record. Thus if a course has multiple sections, you will have to in some way make separate copies of its record, then go through the "classes" part to identify matching sections. It will be easiest to make one pass through the data to identify such sections. If a course has N sections, make N copies of the record, then select only the relevant data for each one. Note that Python objects are references, which means that a statement like

obj2 = obj1

does not make a copy of obj1. You are very likely to need a deep copy; look at the deepcopy member of the copy module.
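
The difference between an alias and a deep copy, and one possible way to turn a course with N sections into N independent records, can be sketched as follows. The record contents are made-up sample data, expand_sections is a hypothetical helper, and keeping a course with no sections as a single record is an assumption based on the COS 498 example.

```python
import copy

# Plain assignment only adds a second name for the same object.
obj1 = {"title": "Example", "classes": [{"section": "L01"}, {"section": "P01"}]}
obj2 = obj1
clone = copy.deepcopy(obj1)
obj2["title"] = "Changed"   # visible through obj1 too, but not through clone


def expand_sections(course):
    """Return one independent record per section of `course`."""
    classes = course.get("classes", [])
    if not classes:
        # Assumption: a course with no sections still yields one record.
        return [copy.deepcopy(course)]
    records = []
    for i in range(len(classes)):
        record = copy.deepcopy(course)
        record["classes"] = [record["classes"][i]]   # keep only section i
        records.append(record)
    return records
```

Each returned record can then be matched and formatted on its own without disturbing the others.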

Talking over the precise meaning of the specification with friends is strongly encouraged, as always, but in particular with this assignment due to its large number of potential corner cases. Use Piazza to garner official interpretations.

We will attempt to maintain reference servers on the Nobel servers at OIT. To use them, you must log in to Nobel:

$ ssh your_netid@nobel.princeton.edu
$ curl localhost:33333/your/query/strings

There's no guarantee that we can keep this running all the time, so please do your own testing; don't rely on this. And please don't use port 33333 on nobel yourself, since it will prevent us from running our servers.

The JSON file contains a number of accented or otherwise non-Latin alphabet characters, for example, FRE 317 is a course about Paris taught by Professor André Benhaïm. These characters are sometimes rendered in text as Unicode, sometimes as \u escapes, and sometimes as HTML escapes, and may not print cleanly without special effort, but you do not need to worry about doing anything special with them. We will not focus on such special characters in our testing.

Submission and Evaluation

When you are finished, submit your reg.py and tests.tar using the CS Dropbox link dropbox.cs.princeton.edu/COS333_S2018/asgn4. Create your tests tarball using the same command as in Assignment 3:

tar cf tests.tar test??

We will give you some indication that you have not drastically misinterpreted the specification, by running some tests of our own when you submit. These are not a complete test. Do your own testing; don't rely on our tests to validate your code.

We will test your code primarily by running the same queries through your version and our version and comparing the results. We will sort the output lines and ignore empty lines and whitespace differences, so don't get hung up on minutiae of line formatting aside from the minimal requirements mentioned above. As with prior assignments, we will test your tests for reasonable coverage of expected simple and corner cases.
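
To compare your own output with the reference server in roughly the same spirit, you could normalize both outputs before comparing them. The sketch below is one reading of "sort the output lines and ignore empty lines and whitespace differences"; it is not the graders' actual script, and the function names are hypothetical.

```python
def normalize(output):
    """Collapse runs of whitespace, drop empty lines, and sort the rest."""
    lines = [" ".join(line.split()) for line in output.splitlines()]
    return sorted(line for line in lines if line)


def same_results(a, b):
    """True if two server outputs agree after normalization."""
    return normalize(a) == normalize(b)
```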