COS 333 Assignment 3: All Your Base64 Are Belong To Us

Due Thursday February 28 at 10:00 pm

Sun Feb 17 13:55:02 EST 2019

Base64 encoding provides a way to encode arbitrary binary data into printable ASCII for mailing or shipment to a web browser, then decode it back into exactly its original form. A Base64 encoder reads an input stream of arbitrary 8-bit bytes. It processes 6 bits at a time, mapping each 6-bit chunk into an 8-bit printable letter, digit or other character that is guaranteed to travel end-to-end safely through all systems. This process expands the input by only 33%, so it is acceptably efficient. A decoder reverses the process, converting a stream of 8-bit ASCII characters back into properly packed bytes that reproduce the original input exactly.

This assignment is an exercise in writing code to a published standard; in theory, and we hope in practice, your implementation will interoperate properly with anyone else's implementation. It's also an exercise in bit-whacking: manipulating bits as raw bits, not numbers or characters. That kind of code is often necessary for writing device drivers, processing network packets, and other tasks that require packing bits into words or extracting them.

Keep clearly in mind that your programs must handle arbitrary bytes, not just printable ASCII characters.

Specification

The file base64.html contains an excerpt (most of Section 6.8) from RFC 2045, the standard for MIME (Multipurpose Internet Mail Extensions) encoding. Begin by taking a quick look at the standard to see how such things are organized and written, then carefully read the excerpt that describes Base64 excoding.

Your task is to:

  1. write encoders in Java and Python that encode any stream of bytes into Base64 as specified by the standard.
  2. write decoders in Java and Python that decode a valid Base64-encoded stream back into its original form.
  3. create test files that will exercise your code thoroughly.
  4. write a testing script that will run all your tests with a single command.

Your programs must read from stdin and write to stdout. The encoder output must end with a newline unless the input is empty. The programs must not throw exceptions or print any other output. Your encoder must work on all inputs. Your decoder must work on valid inputs like those produced by openssl, and should ignore whitespace in encoded Base64. Since there is no clear definition of what a decoder should do with invalid input, we will only test your decoder on invalid input to make sure it doesn't crash; it doesn't matter what its output is.

For testing, create at least 15 short test files named test01, test02, test03, etc., to be used as inputs. These tests should thoroughly explore critical boundary conditions and other potential trouble spots of the standard and your code. No submitted test file should be longer than about 1K bytes, though of course you should test your program on long files as well. You might find it helpful to read about testing in Chapter 6 of The Practice of Programming.

The testing script must be called tester.sh; running it as

tester.sh [list of test files]
runs all the tests. See below for a sample implementation.

Keep clearly in mind that your programs must handle arbitrary bytes, not just printable ASCII characters.

Advice

The main complexities of implementation are mapping 3 bytes of input to 4 bytes of output, and vice versa, with some tricky boundary conditions. You will probably want to do the encoder and decoder first in Java, presuming that you know it better. You should also complete your testing suite to help with your Java implementation.

Once all the kinks are worked out with the algorithm and its corner cases, it should be straightforward to transliterate your Java into Python. Here's a Python help file and the official Python 2 documentation.

This is the version of Python installed on the CS servers and the Tigerfile computer. There are significant incompatibilities between Python 2 and Python 3. You must use Python 2.

$ python -V
Python 2.7.5
$ python
Python 2.7.5 (default, Jul  3 2018, 19:30:05) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2

Keep it simple: this program does not need to be fast, or save memory, or be the slightest bit clever. As is true with all programming, trying to be fast and/or clever is a recipe for disaster. For calibration, my Python encoder and decoder are about 50-60 lines each, and my Java versions are about 80 lines each. If your solutions are a lot bigger, you are probably off on the wrong track.

Keep clearly in mind that your programs must handle arbitrary bytes, not just printable ASCII characters.

For both languages, find library functions that will let you read and write raw bytes, perhaps one at a time, analogous to getchar and putchar in C. The more levels of functions that are trying to interpret the input, the harder it is to see what the input really was. Use only standard Java libraries; do not use local Princeton libraries from earlier courses. For Python, you should not need to import anything beyond sys.

The Unix tool od ("octal dump") is useful for looking at binary data; try od -tx1 to see bytes in hex, or use xxd, where the -b flag shows you the output in bits.

You can use the program openssl as a reference implementation:

openssl enc -e -base64 <input >base64output   # encode
openssl enc -d -base64 <base64input >output   # decode

The standard says "The encoded output stream must be represented in lines of no more than 76 characters each." The openssl encoder outputs a newline more frequently, after every 64 characters. You will find it easier to do direct comparisons if your encoder also outputs a newline after every 64 characters. Also be aware that in some cases, openssl does not put a newline after the final line of output. This seems like a bad implementation; your encoder must put a newline at the end of its output unless the input is empty.

Prepare good test cases ahead of time: this standard is quite clear and compact, but still requires careful attention. Test early, test often, know what you are testing. Automate the testing as much as you can, with a shell script or a makefile. Part of the job here is to mechanize testing but you will also find it helpful to validate your code on real MIME messages like the ones that encode attachments in mail. Grab those with an editor, put them in a file, decode, and see if the result is acceptable to the target application (Word, Excel, MP3 player, etc.).

There are lots of Base64 libraries and implementations floating around. Do not use or consult them. The purpose of this assignment is for you to write your own code to implement a common standard, using only that standard as a reference. Referencing any other material or implementations related to that standard is a grievous breach of academic integrity.

Submission and Evaluation

When you are finished, submit your four source files encode.java, decode.java, encode.py, decode.py, your testing program tester.sh, and a tarball called tests.tar that contains all of your test files. Use the CS Dropbox link tigerfile.cs.princeton.edu/COS333_S2019/asgn3. You can put your properly-named test files in a tar file with this command:

tar cf tests.tar test??

We will test your code by a process somewhat like this:

tar xf tests.tar
javac encode.java
javac decode.java
for i in $(ls test?? ourtests*)
do
    python encode.py <$i >x0        # your Python encoder
    java decode <x0 >y0             # your Java decoder
    cmp $i y0                       # compare to original
    java encode <$i >x1             # your Java encoder
    openssl enc -d -base64 <x1 >y1  # reference decoder
    cmp $i y1                       # compare to original
    openssl enc -e -base64 <$i >x2  # reference encoder
    python decode.py <x2 >y2        # your Python decoder
    cmp $i y2                       # compare to original
    # etc. for other combinations
done

You can use this as a model for your own tester.sh if you wish.

Note that it is not sufficient that your encoders and decoders work together; each must work independently.