Proxy FAQ

1 Address in use Errors
2 Broken Images
3 HTTP/1.1 Requests
4 Bad Requests
5 Internal Error (500)?
6 Request/Response Line length?
7 Checking Headers
8 Checking Server Replies
9 External Libraries
10 Test Script Errors
- 10.1 An aside on the details of load balancing

Address in use Errors

Q) Why do I keep getting the error "Address already in use" when I try to run my proxy server?: A) This error typically means that there is already a socket bound to the same port that you are attempting to bind to. If you're testing your proxy on one of the cluster machines, this may just mean that another student is running a proxy on that port already. Otherwise, it could mean that a proxy shut down without properly closing the port that it was running on; if you bind to, say, port 8000 and then kill your proxy without closing the port, it may take a few minutes for the operating system to notice that the port is no longer in use and make it available again. Try running your proxy again, binding to a different port (this is one reason why running with hard-coded port numbers is a bad idea).

If your proxy loops forever serving requests and is stopped with Ctrl-C, consider registering a handler for the SIGINT signal to clean up after your proxy; it can greatly reduce instances of your favorite port number becoming 'stuck'. Take a look at the signal(2) man page for details. Another option is to set the SO_REUSEADDR socket option before calling bind, which allows the proxy to re-use an address that is still marked as in use; look at the setsockopt man page for more information.
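
Both suggestions can be combined in a few lines. The following is a minimal sketch, not a required design: the names stop_requested, install_sigint_handler and make_listener are made up for illustration, and the backlog of 16 is an arbitrary choice.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <netinet/in.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Set by the handler; the accept loop checks it and exits cleanly. */
static volatile sig_atomic_t stop_requested = 0;

static void on_sigint(int signo)
{
    (void)signo;
    stop_requested = 1;     /* only async-signal-safe work belongs here */
}

/* Install the SIGINT handler. SA_RESTART is deliberately left off so that
 * a blocked accept() fails with EINTR and the loop gets to see the flag. */
int install_sigint_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigint;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGINT, &sa, NULL);
}

/* Create a listening TCP socket on `port` (0 lets the OS pick a port).
 * Returns the descriptor, or -1 on failure. */
int make_listener(unsigned short port)
{
    struct sockaddr_in addr;
    int yes = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Must be set before bind() to have any effect on the bind. */
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof yes) < 0) {
        close(fd);
        return -1;
    }

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(fd, 16) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A serving loop written as "while (!stop_requested) { accept and handle one client }" can then fall through to close() on the listening descriptor when Ctrl-C is pressed.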

Broken Images

Q) My proxy seems to work for web pages, but all of the images on the page are coming up broken. What's going on?: A) Remember that your proxy has to handle both text data (like web pages and HTTP requests) and binary data (like image files and downloads). You can easily run into problems if you try to use functions like fputs to send data to the client: they treat the buffer as a NUL-terminated string, so they work for text data but stop at the first zero byte in binary data. Use read/write or fread/fwrite instead, which take an explicit byte count. Another possible cause for this problem could be that your proxy is running in a single-threaded mode, and it is missing subsequent HTTP requests while it is serving the request for the main page.
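
As an illustration of byte-count-based forwarding, here is a minimal sketch using read and write. The function names write_all and relay and the 4096-byte chunk size are assumptions made for this example, not part of the assignment.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Write exactly `len` bytes, retrying after short writes and EINTR.
 * Returns 0 on success, -1 on error. */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Copy bytes from `from` to `to` until end of file. The data is never
 * treated as a C string, so zero bytes pass through untouched.
 * Returns the number of bytes copied, or -1 on error. */
long relay(int from, int to)
{
    char buf[4096];
    long total = 0;
    for (;;) {
        ssize_t n = read(from, buf, sizeof buf);
        if (n == 0)
            return total;           /* peer closed the connection */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (write_all(to, buf, (size_t)n) < 0)
            return -1;
        total += n;
    }
}
```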

HTTP/1.1 Requests

Q) How should I handle requests from a client that uses HTTP/1.1 instead of HTTP/1.0?: A) Unless they are trying to use a method that isn't supported by HTTP/1.0, you should go ahead and process their request to the best of your ability. You're not required to support keeping the connection alive for HTTP/1.1 clients; they are required by the RFC to be able to handle their connection being closed by either end at any time. So parse their request, send out an HTTP/1.0 request to the remote server, and return their data using HTTP/1.0 as normal. You must return error 501 (Not Implemented) for valid HTTP/1.1 methods that are not covered by RFC 1945.
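
One possible way to turn the method into a status decision is sketched below. It rests on assumptions that you should check against the assignment handout: the supported set is taken to be GET, HEAD and POST (the methods defined in the body of RFC 1945), any other well-formed method token gets 501, and a method that is not a legal token at all gets 400. The name status_for_method is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* A token character is any printable ASCII character that is not a
 * space and not one of the HTTP separator characters. */
static int is_token_char(char c)
{
    return c > 32 && c < 127 && strchr("()<>@,;:\\\"/[]?={}", c) == NULL;
}

/* Returns 0 if the method should be processed, 501 if it is well formed
 * but not supported, and 400 if it is not a valid method token. */
int status_for_method(const char *method)
{
    /* Assumed supported set; method names are case-sensitive. */
    static const char *const supported[] = { "GET", "HEAD", "POST" };
    size_t i;

    if (method[0] == '\0')
        return 400;
    for (i = 0; method[i] != '\0'; i++)
        if (!is_token_char(method[i]))
            return 400;

    for (i = 0; i < sizeof supported / sizeof supported[0]; i++)
        if (strcmp(method, supported[i]) == 0)
            return 0;

    return 501;     /* e.g. OPTIONS, TRACE, CONNECT */
}
```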

Bad Requests

Q) What error should I return if I get a request from the client that I can't parse?: A) You should send status code 400 (Bad Request) for any malformed request that you are unable to process.

Internal Error (500)?

Q) When should I send status code 500 (Internal Error)?: A) You should only send code 500 for problems within the proxy that prevent you from replying to the client, for instance, if you are unable to allocate additional memory or open a file that you need in order to continue. For any error in the client's request, you should be returning 400; for problems reported by the remote server (such as 404s), you should pass the server's error through to the client.

Request/Response Line length?

Q) What is the maximum length for a request or response line that I have to support?: A) The RFC does not specify a maximum value for these elements; you should not simply set a fixed value and then fail for larger requests. At the same time, you don't want to grow the buffer to an unbounded size; doing so sets up your proxy for an easy denial-of-service attack. Instead, you should start with your buffer at a small size, and then grow the buffer as needed up to some reasonable limit, say 8-16 KB. Document your decision in your assignment's README file. Take a look at the realloc man page for information on dynamically growing a buffer.
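
The grow-up-to-a-limit idea might look like the sketch below. The struct linebuf type, the linebuf_append function, the 256-byte starting size and the 16 KB cap are all choices made for this example; pick and document your own.

```c
#include <stdlib.h>
#include <string.h>

#define LINE_INITIAL_CAP 256
#define LINE_MAX_CAP     (16 * 1024)

/* A growable byte buffer; initialize with { NULL, 0, 0 }. */
struct linebuf {
    char  *data;
    size_t len;     /* bytes in use */
    size_t cap;     /* bytes allocated */
};

/* Append `n` bytes, doubling the capacity as needed. Returns 0 on
 * success, or -1 if the result would exceed LINE_MAX_CAP or if
 * allocation fails; the buffer is left unchanged on failure. */
int linebuf_append(struct linebuf *b, const char *src, size_t n)
{
    size_t need;

    if (n == 0)
        return 0;
    if (n > LINE_MAX_CAP - b->len)
        return -1;                  /* over the documented limit */

    need = b->len + n;
    if (need > b->cap) {
        size_t cap = b->cap ? b->cap : LINE_INITIAL_CAP;
        char *p;
        while (cap < need)
            cap *= 2;
        p = realloc(b->data, cap);
        if (p == NULL)
            return -1;              /* the old block is still valid */
        b->data = p;
        b->cap = cap;
    }
    memcpy(b->data + b->len, src, n);
    b->len = need;
    return 0;
}
```

When linebuf_append returns -1 the caller can reject the request instead of continuing to read, and should release the buffer with free(b.data) when done.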

Checking Headers

Q) What sort of checking of HTTP headers in the request message needs to be performed?: A) You need to check that the headers are correctly formatted (e.g., that the delimiter is correct), but you do not need to do any checking of the header field values. The one exception is the Host header when the request line contains a relative URL that does not include the hostname. In this case, you should parse the Host header to extract the hostname value. For any other header, say a Date header sent by the client, you should check that the header line is properly delimited and terminated, but you do not need to check that the value of the field is a legal date.
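
A sketch of this level of checking is shown below. It assumes each header arrives as one complete line of known length ending in CRLF; the function names header_is_well_formed and host_from_header are hypothetical, and how strict to be about details such as a bare LF terminator is left to you.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if `line` looks like "name: value\r\n": it ends in CRLF and
 * has a non-empty name made of printable, non-space characters before
 * the first colon. The value itself is not inspected. */
int header_is_well_formed(const char *line, size_t len)
{
    const char *colon, *p;

    if (len < 4 || line[len - 2] != '\r' || line[len - 1] != '\n')
        return 0;
    colon = memchr(line, ':', len - 2);
    if (colon == NULL || colon == line)
        return 0;
    for (p = line; p < colon; p++) {
        unsigned char c = (unsigned char)*p;
        if (c <= 32 || c >= 127)
            return 0;
    }
    return 1;
}

/* Case-insensitive check that the first bytes of `s` match `prefix`. */
static int has_prefix_nocase(const char *s, size_t len, const char *prefix)
{
    size_t i;
    for (i = 0; prefix[i] != '\0'; i++)
        if (i >= len ||
            tolower((unsigned char)s[i]) != tolower((unsigned char)prefix[i]))
            return 0;
    return 1;
}

/* If `line` is a well-formed Host header, copy its value (which may
 * include a ":port" suffix) into `out` with surrounding blanks removed.
 * Returns 0 on success, -1 otherwise. */
int host_from_header(const char *line, size_t len, char *out, size_t out_size)
{
    const char *v, *end;
    size_t n;

    if (!header_is_well_formed(line, len) ||
        !has_prefix_nocase(line, len, "Host:"))
        return -1;

    v = line + 5;
    end = line + len - 2;
    while (v < end && (*v == ' ' || *v == '\t'))
        v++;
    while (end > v && (end[-1] == ' ' || end[-1] == '\t'))
        end--;

    n = (size_t)(end - v);
    if (n == 0 || n >= out_size)
        return -1;
    memcpy(out, v, n);
    out[n] = '\0';
    return 0;
}
```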

Checking Server Replies

Q) What sort of checking should I do for the reply from the web server?: A) You are not required to do any checking of the messages returned from the server. While you're free to implement whatever checking you feel is necessary, you should be very lenient about handing data back from the server. In particular, while you're free to do any fix up that you think would be a good idea for data returned from the server (such as eliminating extra whitespace in the response line, correcting partial line terminators, etc.), you shouldn't discard data intended for the client just because you think that the server is not replying in the correct way. It is preferable to let the client decide how to handle odd responses rather than denying it data from misbehaving servers. The client application can then decide when to make do with the response as-is, and when it is necessary to re-issue the request or discard the response data.

External Libraries

Q) Can we use any external libraries, or other third-party code?: A) For this assignment, the only additional library that you can use (outside of ANSI C) is the ustr string library (details here). However, we'd prefer if you just use the built-in C++ string libraries instead. You can't use any embedded scripting languages, free socket or HTTP libraries, etc., in your solution. Using ustr is optional and discouraged.

Test Script Errors

Q) On some sites (such as Google), the test script says that I failed the test and displays the difference between the proxy data and the direct server data as consisting of a single line containing some sort of header information. What's wrong?: A) In all likelihood, nothing. If the results differ by a single line that contains something like a Date or ETag header, then the difference is almost certainly due to the proxy request and the direct transfer request being directed to two different servers in a load-balanced cluster. In other words, your proxy is working correctly, but the data it retrieved is being compared to data from another machine that is also tasked with handling requests for that URL. The result is that there may be minor differences in the data: the specific machine that handles the request may insert its IP or hostname into the document header, there may be clock drift between the two servers, etc. Look at the returned data; if the difference is only a line or two, and it consists of machine-dependent information (like caching instructions or date/time fields), then you are likely passing the test, and your TAs will treat it as such.

An aside on the details of load balancing

Almost all large commercial websites implement some sort of load-balancing or distribution scheme in order to cope with large volumes of traffic and ensure availability in the face of hardware failure. When you send a request to a website like Google, rather than being directed to a single specific server, your request is sent to one of several (possibly several thousand!) machines selected by some combination of software and hardware that is looking at inbound requests. One strategy is to keep track of the load on the cluster of servers, and then direct the next incoming request to the machine with the lowest load. Simpler schemes (like round robin balancing) just direct traffic to each server in the cluster in sequence without paying attention to existing load information.
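
To make the two strategies concrete, here is a toy sketch of how a balancer might choose a server index. It is purely illustrative: real balancers are far more involved, and the function names and the idea of representing load as a simple array of counters are assumptions made for this example.

```c
#include <stddef.h>

/* Round robin: hand out 0, 1, ..., n-1, 0, 1, ... in sequence,
 * ignoring load entirely. `next` holds the balancer's position and
 * n_servers must be at least 1. */
size_t pick_round_robin(size_t *next, size_t n_servers)
{
    size_t chosen = *next % n_servers;
    *next = (chosen + 1) % n_servers;
    return chosen;
}

/* Least loaded: return the index of the server with the smallest load
 * value; ties go to the lowest index. n_servers must be at least 1. */
size_t pick_least_loaded(const unsigned *load, size_t n_servers)
{
    size_t best = 0, i;
    for (i = 1; i < n_servers; i++)
        if (load[i] < load[best])
            best = i;
    return best;
}
```

Under either scheme, two back-to-back requests for the same URL can easily land on different machines.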

The result is that two requests to the same URL, sent very close together, may be ultimately serviced by two different machines with different hardware and software configurations! In principle, this sort of load distribution is meant to be entirely transparent to the user; all the servers should return the exact same results and the same data. In practice, this is often not exactly the case. One cause of differences can be errors in the cluster setup. If a particular machine is misconfigured, or not properly updated, it may reply to requests with out-of-date data. A machine with a clock that drifts out of sync quickly may return timestamps that are very different from its siblings in the cluster.

Other differences are intentional; in order to debug problems with individual machines in the cluster, the machines may be configured to put an identifying marker in the headers or data of messages that they return, to ease debugging. Finally, there can sometimes be transient errors that occur as the cluster is updated; in order to improve redundancy, clustered machines may cache local copies of data rather than reading it from a single shared storage medium. Updates to the cached copy can either be pushed out to the cluster machines, or pulled off of a central server according to some schedule. It is possible to hit two different machines in a cluster while they are in the process of updating, in which case one machine will return the 'new' data, and another machine will still be serving the older, cached copy. Such windows are hopefully quite small, but can occasionally become visible to users who are engaging in unusual activities (such as making repeated, rapid web queries to a single URL in order to test the web proxy that they wrote!).
