COS-461 Assignments: HTTP Proxy


Assignment 3: HTTP Proxy (Caching & Multi-threaded)

The Basics

As in the last assignment, your task is to build a basic web proxy capable of accepting HTTP requests, making requests to remote servers, and returning data to a client. Unlike before, your proxy should be able to accept multiple client requests concurrently. It should achieve concurrency by using the pthread library to spawn a new thread for each new client request, with a reasonable cap on the number of threads it can create (e.g., 100). Your proxy should also cache webpages on disk, and all subsequent requests for the same page should be served from the cache instead. When storing webpages on disk, you should ensure that a) the file name does not contain "/", b) the file name is less than 255 characters long, and c) the cache files are created within the working directory of your proxy (e.g., in some subfolder). You are allowed to use external libraries (e.g., openssl for hashing) in this assignment, but you should ensure that you only link libraries available on the Friend 010 machines, and you should include your makefiles in your submission.
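One possible way to enforce the thread cap is a counting semaphore that is decremented before each pthread_create and incremented when a client thread finishes. The sketch below is only an illustration under assumptions: the names MAX_THREADS, limiter_init, spawn_client_thread, and handle_client are hypothetical, and the handler is a stub that just closes the connection.

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdlib.h>
#include <unistd.h>

#define MAX_THREADS 100      /* assumed cap; the text suggests e.g. 100 */

static sem_t thread_slots;   /* number of threads that may still be started */

/* Hypothetical per-client handler (stub). A real proxy would parse the
 * request, consult the cache, and relay the response here. */
static void handle_client(int client_fd) {
    close(client_fd);
}

static void *client_thread(void *arg) {
    int client_fd = *(int *)arg;
    free(arg);
    handle_client(client_fd);
    sem_post(&thread_slots);           /* give the slot back */
    return NULL;
}

/* Call once before accepting connections. Returns 0 on success. */
int limiter_init(void) {
    return sem_init(&thread_slots, 0, MAX_THREADS);
}

/* Blocks until a slot is free, then starts a detached thread for client_fd.
 * Returns 0 on success, -1 on failure (the caller still owns client_fd). */
int spawn_client_thread(int client_fd) {
    pthread_t tid;
    int *arg = (int *)malloc(sizeof *arg);
    if (arg == NULL) return -1;
    *arg = client_fd;

    /* Retry if the wait is interrupted by a signal. */
    while (sem_wait(&thread_slots) == -1 && errno == EINTR) { }

    if (pthread_create(&tid, NULL, client_thread, arg) != 0) {
        sem_post(&thread_slots);
        free(arg);
        return -1;
    }
    pthread_detach(tid);               /* no join needed; resources freed on exit */
    return 0;
}
```

In an accept loop, each descriptor returned by accept() would be passed to spawn_client_thread. Because the call blocks while all slots are in use, excess connections simply wait in the listen backlog.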

If you want, you can implement other optimizations, such as handling persistent connections from a client (see HTTP's Keep-Alive instructions) or creating a thread pool for faster processing. A thread pool starts up by creating some fixed number of threads on bootup (say, 20). Then, when receiving a new request, it hands off the request to one of the existing threads, removing that thread from the pool. (If none are available, which indicates a higher degree of concurrency, it can create a new one.) Upon completing a request, the thread is returned to the pool for future requests. Apache and most servers that adopt a multi-threaded style use such pools for lower latency and system load. But again, these optimizations are optional.
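A thread pool could be sketched roughly as follows. This is a simplified variant, not the exact scheme described above: instead of creating extra threads when none are idle, it places pending connections in a bounded queue that the fixed set of workers drains. All names here (thread_pool, pool_init, pool_submit, close_handler, POOL_SIZE, QUEUE_CAP) are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

#define POOL_SIZE 20     /* assumed, matching the "say, 20" in the text */
#define QUEUE_CAP 64     /* assumed bound on pending connections */

typedef void (*job_fn)(int client_fd);

typedef struct {
    int             queue[QUEUE_CAP];   /* ring buffer of pending client fds */
    size_t          head, count;
    job_fn          handler;
    pthread_mutex_t lock;
    pthread_cond_t  not_empty, not_full;
    pthread_t       workers[POOL_SIZE];
} thread_pool;

/* Stub handler used for illustration: it only closes the connection. */
void close_handler(int client_fd) {
    close(client_fd);
}

static void *worker_main(void *arg) {
    thread_pool *p = (thread_pool *)arg;
    for (;;) {
        int fd;
        pthread_mutex_lock(&p->lock);
        while (p->count == 0)                       /* sleep until work arrives */
            pthread_cond_wait(&p->not_empty, &p->lock);
        fd = p->queue[p->head];
        p->head = (p->head + 1) % QUEUE_CAP;
        p->count--;
        pthread_cond_signal(&p->not_full);
        pthread_mutex_unlock(&p->lock);
        p->handler(fd);                             /* run outside the lock */
    }
    return NULL;                                    /* not reached */
}

/* Starts POOL_SIZE detached workers. Returns 0 on success, -1 on failure. */
int pool_init(thread_pool *p, job_fn handler) {
    int i;
    p->head = p->count = 0;
    p->handler = handler;
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->not_empty, NULL);
    pthread_cond_init(&p->not_full, NULL);
    for (i = 0; i < POOL_SIZE; i++) {
        if (pthread_create(&p->workers[i], NULL, worker_main, p) != 0)
            return -1;
        pthread_detach(p->workers[i]);
    }
    return 0;
}

/* Queues a connection; blocks while the queue is full. */
void pool_submit(thread_pool *p, int client_fd) {
    pthread_mutex_lock(&p->lock);
    while (p->count == QUEUE_CAP)
        pthread_cond_wait(&p->not_full, &p->lock);
    p->queue[(p->head + p->count) % QUEUE_CAP] = client_fd;
    p->count++;
    pthread_cond_signal(&p->not_empty);
    pthread_mutex_unlock(&p->lock);
}
```

The pool object must outlive its workers, so it would typically be a global or a static in main. The queue-based design trades the "grow on demand" behavior for a fixed upper bound on threads, which also satisfies the thread cap.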

This assignment can be completed in either C or C++. It should compile and run (using g++) without errors or warnings on the Friend 010 machines, producing a binary called proxy that takes as an argument a port to listen on. Don't use a hard-coded port number (e.g., port 80). As before, you shouldn't assume that your server will be running on a particular IP address, or that clients will be coming from a pre-determined IP.
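Reading the port from the command line might look like the sketch below. The helper name parse_port is hypothetical; it uses strtol so that trailing garbage and out-of-range values are rejected rather than silently truncated.

```c
#include <errno.h>
#include <stdlib.h>

/* Parses a decimal TCP port from text.
 * Returns the port (1-65535) on success, or -1 if the text is not a valid port. */
int parse_port(const char *text) {
    char *end = NULL;
    long value;

    if (text == NULL || *text == '\0') return -1;
    errno = 0;
    value = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || value < 1 || value > 65535) return -1;
    return (int)value;
}
```

In main, the result of parse_port on the port argument would be passed through htons() into the sockaddr_in used for bind(), with INADDR_ANY as the address so that no particular IP is assumed.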

Checking the Cache

After determining which web object is being requested (as named by the object's full URL), you should check to see if this object is already cached on the server. If so, you should return the content from the cache. For simplicity, you do not need to implement proper HTTP expiry: you can simply clear your cache on bootup but cache objects indefinitely while the server is alive. You similarly do not need to support conditional GETs (e.g., "If-Modified-Since") to the remote origin server. If desired, however, you can support real cache expiry.
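One way to map a full URL to a cache file that meets the file-name constraints (no "/", under 255 characters, inside a subfolder of the working directory) is to hash the URL and use the hex digest as the name. The sketch below uses a 64-bit FNV-1a hash purely to stay dependency-free; it is a stand-in, and a cryptographic hash (such as one from openssl) would make collisions far less of a concern. CACHE_DIR, cache_path_for_url, and cache_has are hypothetical names, and the code assumes the cache directory already exists.

```c
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define CACHE_DIR "cache"   /* assumed subfolder of the working directory */

/* 64-bit FNV-1a hash of a NUL-terminated string. */
static uint64_t fnv1a64(const char *s) {
    uint64_t h = UINT64_C(0xcbf29ce484222325);   /* FNV offset basis */
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= UINT64_C(0x100000001b3);            /* FNV prime */
    }
    return h;
}

/* Writes "cache/<16 hex digits>" into out.
 * Returns 0 on success, -1 if out is too small. */
int cache_path_for_url(const char *url, char *out, size_t out_len) {
    int n = snprintf(out, out_len, "%s/%016llx", CACHE_DIR,
                     (unsigned long long)fnv1a64(url));
    return (n > 0 && (size_t)n < out_len) ? 0 : -1;
}

/* Returns 1 if a readable cached copy exists for url, 0 otherwise. */
int cache_has(const char *url) {
    char path[64];
    if (cache_path_for_url(url, path, sizeof path) != 0) return 0;
    return access(path, R_OK) == 0;
}
```

Because two different URLs could in principle hash to the same name, a more careful design might store the URL on the first line of the cache file and compare it on lookup.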

Writing the Cache

After downloading a web object successfully, you should cache the object to disk so that subsequent fetches can use the local copy as opposed to fetching it again remotely. You should not cache the item if it is marked as "no-cache" or "private"; see the RFC. For this assignment, you only need to cache objects for requests that return status 200 (OK); you do not need to worry about other cacheable status codes such as 410 (Gone). Reading from cache need not be thread-safe, but writing to cache should be. If multiple threads simultaneously detect a cache miss and fetch the same content from the Internet, then it's OK if only one thread writes to cache and the others serve the content from the Internet.
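The two rules above (cache only 200 responses not marked "no-cache" or "private", and make writes thread-safe) could be sketched as follows. The function names is_cacheable and cache_store are hypothetical. The Cache-Control check is a deliberately conservative substring match, and the write path assumes a single global mutex plus a temporary file that is renamed into place, so that unsynchronized readers never observe a partially written entry.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>   /* strncasecmp */

static pthread_mutex_t cache_write_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns 1 if the response may be cached under the assignment's simplified
 * rules. cache_control is the Cache-Control header value, or NULL if absent. */
int is_cacheable(int status, const char *cache_control) {
    const char *p;
    if (status != 200) return 0;
    if (cache_control == NULL) return 1;
    for (p = cache_control; *p != '\0'; p++) {
        if (strncasecmp(p, "no-cache", 8) == 0 ||
            strncasecmp(p, "private", 7) == 0)
            return 0;
    }
    return 1;
}

/* Stores len bytes at path unless an entry already exists.
 * Returns 1 if this call created the entry, 0 if it already existed
 * (e.g. another thread won the race), -1 on I/O error. */
int cache_store(const char *path, const void *data, size_t len) {
    char tmp[300];
    int result = -1;
    int n;
    FILE *f;

    n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp) return -1;

    pthread_mutex_lock(&cache_write_lock);
    f = fopen(path, "rb");
    if (f != NULL) {                      /* someone cached it first */
        fclose(f);
        result = 0;
    } else if ((f = fopen(tmp, "wb")) != NULL) {
        size_t written = fwrite(data, 1, len, f);
        int closed = fclose(f);
        if (written == len && closed == 0 && rename(tmp, path) == 0)
            result = 1;                   /* entry becomes visible all at once */
        else
            remove(tmp);
    }
    pthread_mutex_unlock(&cache_write_lock);
    return result;
}
```

A thread that gets 0 back simply serves the bytes it already fetched, which matches the allowance that only one thread needs to write the cache. Holding one lock across the whole write serializes all cache writes; a per-entry lock would allow more parallelism at the cost of extra bookkeeping.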

Testing Your Proxy

Run your proxy with the following command:

./proxy -t <port>, where port is the port number that the proxy should listen on. The argument -t specifies that the proxy should run in multi-threaded mode. As a basic test of functionality, try requesting a page using telnet concurrently from two different shells.

Instructions for setting up your browser to access your proxy can be found in the instructions of the previous assignment.

Multi-Thread Programming

In addition to the Berkeley sockets library, there are some functions you will need to use for this assignment

You can find the details of these functions in the Unix man pages:

Links:


Tutorial on caching & threads (03/22/2010)
threads_caching.pdf

Grading

You should submit your completed proxy to Blackboard by the date posted on the course website. Remember to submit after uploading. You will need to submit a tarball file containing the following:

You can complete the assignment in either C or C++. Your tarball should be named cos461_ass3_USERNAME.tgz where USERNAME is your username. The sample Makefile in the skeleton zip file we provide will make this tarball for you with the make tar command.

Your proxy will be graded out of ten points, with the following criteria:

  1. When running make on your assignment, it should compile without errors or warnings on the Friend 010 cluster machines and produce a binary named proxy.
  2. Your proxy should run silently: any status messages or diagnostic output should be off by default.
  3. Your proxy should work with both Firefox and Internet Explorer.
  4. We'll first check that your proxy works correctly with a small number of major web pages, using the same script that we've given you to test your proxy. If your proxy passes all of these 'public' tests, you will get 6 of the possible points.
  5. We'll then check a number of additional URLs testing caching. If your proxy passes all of these tests, you will get 1 additional point. These tests will check if you properly handle "no-cache" and "private" cache-control in server response. Also, we'll check large and/or dynamic URLs.
  6. We'll then check the implementation of multi-threading, e.g., a reasonable cap on the maximum number of threads, thread-safe cache writes, and avoidance of deadlock conditions. This will earn you another 1 point.
  7. Well-written (good abstraction, error checking, readability) and well-commented code, combined with a good design of your proxy (resulting in acceptable raw performance), will get 2 additional points, for a total of 10.

Extra Credit

  1. The first student to submit a proxy that scores a perfect 10 will get extra credit of 1 point.
  2. Also, the student whose proxy gives the best raw performance will get 1 point extra credit.

A Note on Network Programming

Writing code that will interact with other programs on the Internet is a little different than just writing something for your own use. The general guideline often given for network programs is: be lenient about what you accept, but strict about what you send. This is often referred to as Postel's Law. That is, even if a client doesn't do exactly the right thing, you should make a best effort to process their request if it is possible to easily figure out their intent. On the other hand, you should ensure that anything that you send out conforms to the published protocols as closely as possible. If an incoming request has a single field out of whack (such as sending you a request using HTTP 0.9 or 1.1), uses non-standard line terminators (some clients only send \r instead of the standard \r\n), or does something you don't quite expect with HTTP headers, you should still handle the request rather than dropping the request. Pay attention to parts of the RFC that specify areas where not all clients may conform exactly to what you expect. We'll be looking for this kind of interoperability in both the second round of tests that we run and in the style portion of your grade.
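As an illustration of being lenient in what you accept, the sketch below parses a request line while tolerating \r\n, bare \n, or bare \r as the terminator, runs of spaces between fields, and a missing version field (as in an HTTP/0.9-style request). The type request_line, the function parse_request_line, and the buffer sizes are assumptions made for this example.

```c
#include <stdio.h>
#include <string.h>

typedef struct {
    char method[16];
    char url[2048];
    char version[16];   /* empty string if the client sent no version */
} request_line;

/* Parses the first line of buf into out.
 * Returns 0 on success, -1 if no method and URL could be found. */
int parse_request_line(const char *buf, request_line *out) {
    char line[2100];
    size_t n = strcspn(buf, "\r\n");   /* stop at CR, LF, or end of input */
    int fields;

    if (n >= sizeof line) return -1;   /* line too long for this sketch */
    memcpy(line, buf, n);
    line[n] = '\0';

    out->version[0] = '\0';
    /* %s skips any amount of leading whitespace, so extra spaces are fine. */
    fields = sscanf(line, "%15s %2047s %15s",
                    out->method, out->url, out->version);
    return fields >= 2 ? 0 : -1;
}
```

The proxy can then decide separately how to treat each version string, while anything it sends upstream would still be written out in strictly conforming form.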

When in doubt, try to follow the behavior specified in RFC 1945. Also, check the FAQ for more specific guidelines.


Last updated: Mon Apr 26 13:04:41 -0400 2010