Christopher Moretti
Department of Computer Science
Princeton University

Assignment 3: Your Place or MIME?

Due: 23:59 Friday 2/26/2016

Preface

Base64 encoding provides a way to encode arbitrary binary data into printable ASCII for mailing or shipment to a web browser, then decode it back into exactly its original form. The encoding takes the input 6 bits at a time and maps each 6-bit chunk into a letter, digit or other simple character that is guaranteed to travel end-to-end safely through all systems. This process expands the input by only 33%, so it is acceptably efficient. Decoding reverses the process, taking 8-bit character input four bytes at a time back into properly packed bytes that reproduce the original input exactly.

 

This assignment is an exercise in writing code to a published standard; in theory, and we hope in practice, your implementation will interoperate properly with anyone else's implementation. The file base64.html contains an excerpt (most of Section 6.8) from RFC 2045, the standard for MIME (Multipurpose Internet Mail Extensions) encoding.

 

Specification

Your task is to:

  1. write encoders in Java and Python that encode any input stream into Base64 as specified by the standard.
  2. write decoders in Java and Python that decode valid Base64 back into its original form.
  3. create test files that will exercise your code thoroughly.

Your programs must read from stdin and write to stdout (NB: that's the standard input stream and standard output stream, not the specific classes provided with the 126/226 infrastructure). The encoder output must end with a newline unless the input is empty. The programs must not throw exceptions or print any other output. Your encoder must work on all inputs. Your decoder must work on valid inputs like those produced by openssl, and should ignore whitespace in encoded Base64. Since there is no clear definition of what a decoder should do with invalid input, we will only test your decoder on invalid input to make sure it doesn't crash; it doesn't matter what its output is.

 

The testing component of this assignment is to create at least 15 short test files named test01, test02, test03, etc., to be used as inputs. These tests should thoroughly explore critical boundary conditions and other potential trouble spots of the standard and your code. No submitted test file should be longer than about 1K bytes, though of course you should test your program on long files as well. You might find it helpful to read Chapter 6 of The Practice of Programming on testing.

 

Advice

The main complexities of implementation are mapping 3 bytes of input to 4 bytes of output, and vice versa, with some tricky boundary conditions. You will probably want to do the encoder and decoder first in Java, presuming that you know it better. You should also complete your testing suite to help with your Java implementation. Once all the kinks are worked out with the algorithm and its corner cases, it should be manageable to use the now-straighforward task as your first crack at Python. Here's a Python help file and the official documentation.

 

This is the version of python installed on the CS servers. Note that there are significant incompatibilities between Python 2.* and Python 3. We would strongly prefer that you use Python 2.*

tux:~$ python
Python 2.7.5 (default, Jun 24 2015, 01:06:47) 
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2

 

Keep it simple: this program does not need to be fast, or save memory, or be the slightest bit clever. As is true with all programming, trying to be fast and/or clever is a recipe for disaster. (This text is repeated from Assignment 1's Advice section, in the hopes that it will eventually sink in!)

For calibration, the reference Python encoder and decoder are about 50-60 lines each, and the reference Java versions are about 80 lines each. If your solutions are a lot bigger, you are probably off on the wrong track.

 

Keep clearly in mind that your programs must handle arbitrary bytes, not just ASCII characters. You have to use I/O routines that read and write a byte at a time. The *nix tool od ("octal dump") is useful for looking at binary data; try od -tx1 to see bytes in hex, or use xd, where the -b flag shows you the output in bits, or use hexdump.

You can use the program openssl as a reference implementation:

openssl enc -e -base64 <input >base64output   # encode
openssl enc -d -base64 <base64input >output   # decode

The openssl encoder outputs a newline after every 64 characters. This is not a requirement, but you might find it easier to do direct comparisons if your encoder does the same. Also be aware that in some cases, openssl does not put a newline after the final line of output. This seems like bad implementation; your encoder must put a newline at the end unless the input is empty.

 

Prepare good test cases ahead of time: this standard is quite clear and compact, but still requires careful attention. Test early, test often, know what you are testing. Automate the testing as much as you can, with a shell script or a makefile. Part of the job here is to mechanize testing but you will also find it helpful to validate your code on real MIME messages like the ones that encode attachments in mail. Grab those with an editor, put them in a file, decode, and see if the result is acceptable to the target application (Word, Excel, MP3 player, etc.).

 

There are lots of Base64 libraries and implementations floating around. Do not use or consult them. The purpose of this assignment is for you to write your own code to implement a common standard, using only that standard as a reference. Referencing any other material or implementations related to that standard would be a grievous breach of academic integrity.

 

Submission and Evaluation

When you are finished, submit your four source files and a tarball containing all of your test files tests using the CS Dropbox link dropbox.cs.princeton.edu/COS333_S2016/Three. You can put your properly-named test files in a tar file with this command:

tar cf tests.tar test??

We will test your code by a process somewhat like what is shown below. Note that it is not sufficient that your encoders and decoders work together; each must work independently.

	tar xf tests.tar
	for i in `ls test?? ourtests*`
	do
	    python encode.py <$i >x0		# your Python encoder
	    java decode <x0 >y0   		# your Java decoder
	    cmp $i y0				# compare to original
	    java encode <$i >x1   		# your Java encoder
	    openssl enc -d -base64 <x1 >y1	# reference decoder
	    cmp $i y1				# compare to original
	    openssl enc -e -base64 <$i >x2	# reference encoder
	    python decode.py <x2 >y2		# your Python decoder
	    cmp $i y2				# compare to original
	done