Due midnight, Friday, March 1
Tue Feb 19 14:45:23 EST 2013
Base64 encoding provides a way to encode arbitrary binary data into printable ASCII for mailing or shipment to a web browser, then decode it back into exactly its original form. The encoding takes the input 6 bits at a time and maps each 6-bit chunk into a letter, digit or other character that is guaranteed to travel safely through all systems. This process expands the input by only 33%, so it is acceptably efficient. Decoding reverses the process, taking 8-bit character input four bytes at a time back into properly packed bytes that reproduce the original input exactly.
This assignment is an exercise in writing code to a published standard. The file base64.html contains an excerpt (most of Section 6.8) from RFC 2045, the standard for MIME (Multipurpose Internet Mail Extensions) encoding. Your task is to
Your programs must read from stdin and write to stdout. The encoder output must end with a newline. The programs must not throw exceptions or print any other output. In particular, note the part of the standard that says "Any characters outside of the base64 alphabet are to be ignored in base64-encoded data." This refers to the decoder. Your goal is an encoder that works on all inputs and a decoder that works on valid inputs like those produced by openssl. Your decoder should accept very long lines, ignore invalid characters, and stop when it encounters one or two = signs or the end of the input. Since there is arguably no true meaning to invalid decoder input, just make sure you decode valid input correctly.
There are lots of Base64 libraries around, including a Python module. Do not consult or use them; the purpose of this assignment is for you to write your own code. It is a violation of academic integrity rules to do otherwise.
Create at least 15 short test files named test01, test02, test03, etc., to be used as inputs. These tests should thoroughly explore critical boundary conditions and other potential trouble spots of the standard and your code. No submitted test file should be longer than about 1K bytes, though of course you should test your program on long files as well. You might find it helpful to read Chapter 6 of The Practice of Programming on testing.
How you generate the tests is up to you, but bear in mind that they must include arbitrary binary data, not just ASCII text. Binary data is easier to generate with a program than by hand, and of course there's a lot of binary data around. The program od ("octal dump") is useful for looking at binary data; try od -tx1 to see bytes in hex, or use xxd, where the -b flag shows you the output in bits, or use hexdump.
My Python encoder and decoder are about 50-60 lines each, so if your solutions are a lot bigger, you may be off on the wrong track. Here's a Python help file and the official Python documentation.
The main complexities of implementation are mapping 3 bytes of input to 4 bytes of output, and vice versa, with some tricky boundary conditions. Keep it simple: this program does not need to be fast or the slightest bit clever. As is true with all programming, trying to be fast and/or clever is often a recipe for disaster.
Talking over the precise meaning of the specification with friends is strongly encouraged; this standard is quite clear and compact, but still requires careful attention. You must write your code yourself, however, so once you start coding, you may no longer discuss programming issues with other students. Use the Piazza site for official opinions.
Prepare good test cases ahead of time. Test early, test often, know what you are testing. Automate the testing as much as you can, with a shell script or a makefile. Part of the job here is to mechanize testing but you will also find it helpful to validate your code on real MIME messages like the ones that encode attachments in mail. Grab those with an editor, put them in a file, decode, and see if the result is acceptable to the target application (Word, Excel, MP3 player, etc.).
You can use the program openssl as a reference implementation:
openssl enc -e -base64 <input >base64output # encode openssl enc -d -base64 <base64input >output # decodeThe openssl encoder outputs a newline after every 64 characters. This is not a requirement, but you might find it easier to do direct comparisons if your encoder does the same. Also be aware that in some cases, openssl does not put a newline after the final line of output. This seems like bad implementation; your encoder must put a newline at the end.
Finally, note that this assignment is also an exercise in bit-whacking. In general, avoid constructs that suggest that you do not understand how to manipulate bits as bits. Don't use decimal numbers for bit patterns or ASCII characters. Thus 0x3F is a bit pattern for selecting the bottom 6 bits, and it's not magic (by my definition); 63 is the decimal equivalent and it has to be mentally translated before you can see the bit patterns, so it's magic. Similarly, 97 is magic while 'A' is not. Shift values of 6, 12 and so on are also not magic in this program, since their meaning is clear in context. Don't get up tight about this, especially since we won't look at your code, but you really should know the proper idioms for manipulating bits.
tar cf tests.tar test??and submit your encode.py, decode.py, and tests.tar with this Dropbox link.
We will test your code by a process somewhat like this:
tar xf tests.tar for i in your test cases and ours do python encode.py <$i >x0 # your encoder python decode.py <x0 >y0 # your decoder cmp $i y0 # compare to original python encode.py <$i >x1 # your encoder openssl enc -d -base64 <x1 >y1 # our decoder cmp $i y1 # compare to original openssl enc -e -base64 <$i >x2 # our encoder python decode.py <x2 >y2 # your decoder cmp $i y2 # compare to original doneso you might try a similar script for your own testing. Note that it is not sufficient that your encoder and decoder work together; each must work independently.