In this assignment, you will implement a simple web proxy that passes requests and data between a web client and a web server. This will give you a chance to get to know one of the most popular application protocols on the Internet, the Hypertext Transfer Protocol (HTTP) version 1.0, and give you an introduction to the Berkeley sockets API. When you're done with the assignment, you should be able to configure your web browser to use your personal proxy server as a web proxy.
The Hypertext Transfer Protocol (HTTP) is the protocol used for communication on the web. That is, it is the protocol which defines how your web browser requests resources from a web server and how the server responds. For simplicity, in this assignment we will be dealing only with version 1.0 of the HTTP protocol, defined in detail in RFC 1945. You should read through this RFC and refer back to it when deciding on the behavior of your proxy.
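HTTP communications happen in the form of transactions. A transaction consists of a client sending a request to a server and then reading the response. Request and response messages share a common basic format: an initial line (a request line or a status line), zero or more header lines, a blank line (CRLF), and an optional message body.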
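For most common HTTP transactions, the protocol boils down to a relatively simple series of steps (important sections of RFC 1945 are in parentheses):
1. The client opens a connection to the server.
2. The client sends a request: a request line giving the method (most often GET), the request URI, and the protocol version, followed by zero or more header lines and a blank line (Section 5).
3. The server sends a response: a status line giving the protocol version, a numeric status code, and a reason phrase, followed by header lines, a blank line, and the message body (Section 6).
4. The server closes the connection.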
It's fairly easy to see this process in action without using a web browser. From a Unix prompt, type:
telnet www.yahoo.com 80
This opens a TCP connection to the server at www.yahoo.com listening on port 80, the default HTTP port. You should see something like this:
Trying 188.8.131.52...
Connected to www.yahoo.com (184.108.40.206).
Escape character is '^]'.
Type the following:
GET / HTTP/1.0
and hit enter twice. You should see something like the following:
HTTP/1.1 200 OK
Date: Fri, 10 Nov 2006 20:31:19 GMT
Connection: close
Content-Type: text/html; charset=utf-8

<html><head>
<title>Yahoo!</title>
(More HTML follows)
There may be some additional pieces of header information as well, such as cookies or instructions to the browser or proxy on caching behavior. What you are seeing is exactly what your web browser sees when it goes to the Yahoo home page: the HTTP status line, the header fields, and finally the HTTP message body, which consists of the HTML that your browser interprets to create a web page. You may notice here that the server responds with HTTP 1.1 even though you requested 1.0. Some web servers refuse to serve HTTP 1.0 content.
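The three parts described above (status line, header fields, message body) can be pulled apart with a small helper. The following is only a sketch: the `HttpMessage` struct and `split_message` function are hypothetical names, and it assumes the peer uses standard CRLF line endings, which not every client or server does.

```cpp
#include <string>
#include <vector>

// Hypothetical container for the three parts of an HTTP message.
struct HttpMessage {
    std::string start_line;            // request line or status line
    std::vector<std::string> headers;  // raw "Name: value" lines
    std::string body;                  // everything after the blank line
};

// Split a raw message at the first blank line (CRLF CRLF).
// Returns false if the blank line has not been received yet.
bool split_message(const std::string& raw, HttpMessage& out) {
    const std::string::size_type end = raw.find("\r\n\r\n");
    if (end == std::string::npos) return false;

    out = HttpMessage{};
    out.body = raw.substr(end + 4);

    // Everything before the blank line: start line plus header lines.
    const std::string head = raw.substr(0, end);
    std::string::size_type pos = 0;
    bool first = true;
    while (pos <= head.size()) {
        std::string::size_type eol = head.find("\r\n", pos);
        if (eol == std::string::npos) eol = head.size();
        const std::string line = head.substr(pos, eol - pos);
        if (first) {
            out.start_line = line;
            first = false;
        } else {
            out.headers.push_back(line);
        }
        pos = eol + 2;
    }
    return true;
}
```

A `false` return simply means more data needs to be read before the message can be split.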
Ordinarily, HTTP is a client-server protocol. The client (usually your web browser) communicates directly with the server (the web server software). However, in some circumstances it may be useful to introduce an intermediate entity called a proxy. Conceptually, the proxy sits between the client and the server. In the simplest case, instead of sending requests directly to the server, the client sends all its requests to the proxy. The proxy then opens a connection to the server and passes on the client's request. The proxy receives the reply from the server, and then sends that reply back to the client. Notice that the proxy is essentially acting like both an HTTP client (to the remote server) and an HTTP server (to the initial client).
Why use a proxy? There are a few possible reasons:
Your task is to build a basic web proxy capable of accepting HTTP requests, making requests from remote servers, caching results, and returning data to a client.
This assignment can be completed in either C or C++. It should compile and run (using g++) without errors or warnings from the penguin servers, producing a binary called proxy that takes as its first argument a port to listen on. Don't use a hard-coded port number (e.g., port 80).
You shouldn't assume that your server will be running on a particular IP address, or that clients will be coming from a pre-determined IP.
When your proxy starts, the first thing that it will need to do is establish a socket connection that it can use to listen for incoming connections. Your proxy should listen on the port specified from the command line, and wait for incoming client connections.
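A minimal sketch of this startup step is shown below. The function name `open_listener` and the backlog value of 16 are illustrative assumptions, not requirements of the assignment; the socket calls themselves are the standard Berkeley ones.

```cpp
#include <arpa/inet.h>
#include <cstring>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Create a TCP socket bound to every local address on `port` and start
// listening. Returns the listening descriptor, or -1 on failure.
// (Hypothetical helper; the backlog of 16 is an arbitrary choice.)
int open_listener(unsigned short port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    // Allow quick restarts while old connections sit in TIME_WAIT.
    int yes = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes));

    sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);  // no fixed local IP address
    addr.sin_port = htons(port);               // port from the command line

    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0 ||
        listen(fd, 16) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The proxy would then call `accept(fd, nullptr, nullptr)` in a loop, handling one client connection per iteration.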
Once a client has connected, the proxy should read data from the client and then check for a properly-formatted HTTP request. An invalid request from the client should be answered with an appropriate error code.
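One possible shape for this check is sketched below. The names `RequestLine`, `parse_request_line`, and `error_response` are hypothetical. This version is deliberately strict and treats a missing version token as invalid; a more lenient proxy might instead accept such a line as an HTTP/0.9 simple request.

```cpp
#include <sstream>
#include <string>

// Hypothetical holder for the three tokens of a request line.
struct RequestLine {
    std::string method, uri, version;
};

// Parse "METHOD Request-URI HTTP-Version". Returns false unless the line
// has exactly three whitespace-separated tokens and the last one starts
// with "HTTP/".
bool parse_request_line(const std::string& line, RequestLine& out) {
    std::istringstream in(line);
    std::string extra;
    if (!(in >> out.method >> out.uri >> out.version)) return false;
    if (in >> extra) return false;  // more than three tokens
    return out.version.compare(0, 5, "HTTP/") == 0;
}

// Build a minimal HTTP/1.0 error response with a plain-text body.
std::string error_response(int code, const std::string& reason) {
    std::ostringstream out;
    out << "HTTP/1.0 " << code << ' ' << reason << "\r\n"
        << "Content-Type: text/plain\r\n"
        << "\r\n"
        << reason << "\n";
    return out.str();
}
```

For a malformed request line, the proxy could send `error_response(400, "Bad Request")` and close the connection.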
Once the proxy sees a valid HTTP request, it will need to parse the requested URL. The proxy needs at most three pieces of information: the requested host and port, and the requested path. See the (7) manual page for more info.
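As a rough illustration of extracting those three pieces, the sketch below handles both a full `http://` URL and a bare absolute path. The `Target` struct and `parse_target` function are hypothetical names, and the parser is simplified: it matches the scheme case-sensitively and ignores user-info and IPv6 literals.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical result type for a parsed request URI.
struct Target {
    std::string host;  // empty when the client sent only a path
    int port;          // 80 unless the URL names another port
    std::string path;  // always begins with '/'
};

bool parse_target(const std::string& uri, Target& out) {
    out.host.clear();
    out.port = 80;
    out.path = "/";

    if (!uri.empty() && uri[0] == '/') {  // absolute-path form
        out.path = uri;
        return true;
    }

    const std::string scheme = "http://";
    if (uri.compare(0, scheme.size(), scheme) != 0) return false;

    // The authority (host[:port]) runs up to the first '/' after the scheme.
    const std::string::size_type start = scheme.size();
    const std::string::size_type slash = uri.find('/', start);
    std::string authority =
        (slash == std::string::npos) ? uri.substr(start)
                                     : uri.substr(start, slash - start);
    if (slash != std::string::npos) out.path = uri.substr(slash);

    const std::string::size_type colon = authority.find(':');
    if (colon != std::string::npos) {
        const std::string digits = authority.substr(colon + 1);
        if (digits.empty() || digits.size() > 5 ||
            digits.find_first_not_of("0123456789") != std::string::npos)
            return false;
        out.port = std::atoi(digits.c_str());
        if (out.port < 1 || out.port > 65535) return false;
        authority.erase(colon);
    }
    if (authority.empty()) return false;
    out.host = authority;
    return true;
}
```

When `host` comes back empty, the proxy has to find the destination some other way (for example from a request header) or reject the request.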
After determining which web object is being requested (as named by the object's full URL), you should check to see if this object is already cached on the server. If so, you should return the content from the cache. For simplicity, you do not need to implement proper HTTP expiry: you can simply clear your cache on bootup but cache objects indefinitely while the server is alive. You similarly do not need to support conditional GETs (e.g., "If-Modified-Since") to the remote origin server. If desired, however, you can support real cache expiry.
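One simple way to organize such a cache is one file per URL. The sketch below is an assumption about layout rather than a prescribed design: `cache_path`, `cache_load`, and `cache_store` are hypothetical helpers, and the file name is derived from `std::hash`, which is not collision-free.

```cpp
#include <fstream>
#include <functional>
#include <sstream>
#include <string>

// Map a full URL to a file name inside `dir` (hypothetical layout).
// Two URLs could in principle hash to the same name, so a careful cache
// would also store the URL in the file and compare it on lookup.
std::string cache_path(const std::string& dir, const std::string& url) {
    std::ostringstream name;
    name << dir << '/' << std::hex << std::hash<std::string>{}(url);
    return name.str();
}

// Read a cached response. Returns false on a cache miss.
bool cache_load(const std::string& path, std::string& response) {
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in) return false;
    std::ostringstream buf;
    buf << in.rdbuf();
    response = buf.str();
    return true;
}

// Write a complete response to the cache. Returns false on I/O failure.
bool cache_store(const std::string& path, const std::string& response) {
    std::ofstream out(path.c_str(), std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(response.data(), static_cast<std::streamsize>(response.size()));
    out.close();
    return !out.fail();
}
```

Clearing the cache on startup then amounts to deleting the files in the cache directory before the proxy begins accepting connections.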
Once the proxy has parsed the URL, it can make a connection to the requested host (using the appropriate remote port, or the default of 80 if none is specified) and request the appropriate file by forwarding the HTTP request that it received from the client to the remote server.
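The connection step might look like the following sketch. It uses `getaddrinfo` for name resolution, which is a choice made here for illustration; other resolution functions would also work. The helper name `connect_to_host` is hypothetical.

```cpp
#include <cstring>
#include <netdb.h>
#include <string>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Resolve `host` and connect to it on `port`. Returns a connected
// descriptor, or -1 if resolution or every connection attempt fails.
int connect_to_host(const std::string& host, int port) {
    addrinfo hints;
    std::memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;      // IPv4 or IPv6
    hints.ai_socktype = SOCK_STREAM;  // TCP

    const std::string service = std::to_string(port);
    addrinfo* results = nullptr;
    if (getaddrinfo(host.c_str(), service.c_str(), &hints, &results) != 0)
        return -1;

    // Try each returned address until one accepts the connection.
    int fd = -1;
    for (addrinfo* ai = results; ai != nullptr; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0) continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0) break;  // success
        close(fd);
        fd = -1;
    }
    freeaddrinfo(results);
    return fd;
}
```

A return value of -1 is a natural point to answer the client with an error response instead of leaving the connection hanging.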
After downloading a web object successfully, you should cache the object to disk so that subsequent fetches can use the local copy as opposed to fetching it again remotely. You should not cache the item if it is marked as "no-cache" or "private"; see the RFC. For this assignment, you only need to cache objects for requests that return type 200 (OK); you do not need to worry about other cacheable status codes such as 410 (GONE).
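The caching rule above can be expressed as a small predicate. This is a sketch under assumptions: `is_cacheable` is a hypothetical name, and the choice to look for the "no-cache" and "private" markers in both `Pragma` and `Cache-Control` header lines is an interpretation, so check the RFC for where each directive may appear.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Lower-case a copy of `s` so header names compare case-insensitively.
static std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Decide whether a response may be stored, given its status code and its
// raw header lines. Only 200 responses without a "no-cache" or "private"
// marker are accepted.
bool is_cacheable(int status, const std::vector<std::string>& headers) {
    if (status != 200) return false;
    for (const std::string& raw : headers) {
        const std::string h = to_lower(raw);
        const bool relevant = h.compare(0, 7, "pragma:") == 0 ||
                              h.compare(0, 14, "cache-control:") == 0;
        if (!relevant) continue;
        if (h.find("no-cache") != std::string::npos ||
            h.find("private") != std::string::npos)
            return false;
    }
    return true;
}
```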
The proxy should send the response message to the client via the appropriate socket. Once the transaction is complete, the proxy should close the connection.
Run your proxy with the following command:
./proxy <port>
where port is the port number that the proxy should listen on. As a basic test of functionality, try requesting a page using telnet:
telnet localhost <port>
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
GET http://www.google.com/ HTTP/1.0
If your proxy is working correctly, the headers and HTML of the Google homepage should be displayed on your terminal screen. Notice here that we request the full URL (http://www.google.com/) instead of just the absolute path (/). Your proxy should support both of these formats.
For a slightly more complex test, you can configure your web browser to use your proxy server as its web proxy. See the section below for details.
If you write a single-threaded proxy server, you will probably see some problems when you use your proxy with a standard web browser. Because a web browser like Firefox or IE issues multiple HTTP requests for each URL you request (for instance, to download images and other embedded content), a single-threaded proxy will likely miss some requests, resulting in missing images or other minor errors. That's OK. You are not required to use threading in this assignment. As long as your proxy works correctly for a simple HTML document (like, for instance, this assignment page) and follows the RFC, you can still receive all the points for this assignment.
To stop using the proxy server, select 'No Proxy' in the connection settings dialog.
Because Firefox defaults to using HTTP/1.1 and your proxy speaks HTTP/1.0, there are a couple of minor changes that need to be made to Firefox's configuration. Fortunately, Firefox is smart enough to know when it is connecting through a proxy, and has a few special configuration keys that can be used to tweak the browser's behavior.
Set keepalive to false. Set version to 1.0. Make sure that pipelining is set to false.
Take a look at this page for complete instructions on enabling a proxy for various versions of Internet Explorer.
You should also do the following to make Internet Explorer work in an HTTP 1.0 compatible mode with your proxy:
In order to build your proxy you will need to learn and become comfortable programming sockets. The Berkeley sockets library is the standard method of creating network systems on Unix. There are a number of functions that you will need to use for this assignment:
You can find the details of these functions in the Unix man pages (most of them are in section 2) and in the Stevens Unix Network Programming book, particularly chapters 3 and 4. Other sections you may want to browse include the client-server example system in chapter 5 (you will need to write both client and server code for this assignment) and the name and address conversion functions in chapter 9.
You should submit your completed proxy by the date posted on the course website. You will need to submit a tarball file containing the following:
Your tarball should be named
USERNAME is your username. The sample Makefile in the skeleton zip file we provide will make this tarball for you with the make tar command.
Your proxy will be graded out of 8 points, with the following criteria:
When we run make on your assignment, it should compile without errors or warnings on the FC 010 cluster machines and produce a binary named proxy. The first command line argument should be the port that the proxy will listen on.
As mentioned above, in this first assignment, you are not to implement a multi-process/threaded proxy. Because this is a single-threaded proxy, you may see errors when using your proxy with a standard web browser, but that's OK. As long as your proxy works correctly for single HTTP transactions (for instance, try telnetting to the port the proxy is running on and requesting a single HTML document), you can still receive all the possible points for this assignment.
Writing code that will interact with other programs on the Internet is a little different than just writing something for your own use. The general guideline often given for network programs is: be lenient about what you accept, but strict about what you send. This is often referred to as Postel's Law. That is, even if a client doesn't do exactly the right thing, you should make a best effort to process their request if it is possible to easily figure out their intent. On the other hand, you should ensure that anything that you send out conforms to the published protocols as closely as possible. If an incoming request has a single field out of whack (such as sending you a request using HTTP 0.9 or 1.1), uses non-standard line terminators (some clients only send \r instead of the standard \r\n), or does something you don't quite expect with HTTP headers, you should still handle the request rather than dropping the request. Pay attention to parts of the RFC that specify areas where not all clients may conform exactly to what you expect. We'll be looking for this kind of interoperability in both the second round of tests that we run and in the style portion of your grade.
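As one concrete case of being lenient in what you accept, the sketch below splits incoming text into lines while tolerating all three terminators mentioned above. The function name `split_lines` is hypothetical, and the code assumes the whole header block is already in memory; when reading incrementally, a `\r` at the very end of a buffer might be the first half of a `\r\n` that has not fully arrived yet.

```cpp
#include <string>
#include <vector>

// Split text into lines, accepting "\r\n", a bare "\n", or a bare "\r"
// as the line terminator. An empty string in the result marks a blank
// line, such as the one that ends the header section.
std::vector<std::string> split_lines(const std::string& text) {
    std::vector<std::string> lines;
    std::string current;
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        const char c = text[i];
        if (c == '\r' || c == '\n') {
            // Treat "\r\n" as one terminator, not two.
            if (c == '\r' && i + 1 < text.size() && text[i + 1] == '\n') ++i;
            lines.push_back(current);
            current.clear();
        } else {
            current += c;
        }
    }
    if (!current.empty()) lines.push_back(current);  // unterminated last line
    return lines;
}
```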
When in doubt, try to follow the behavior specified in RFC 1945. Also, check the FAQ for more specific guidelines.