COS 333 Assignment 3: Your Place or MIME

Due midnight, Friday, March 3

Mon Feb 20 08:22:30 EST 2017

Base64 encoding provides a way to encode arbitrary binary data into printable ASCII for mailing or shipment to a web browser, then decode it back into exactly its original form. The encoding takes the input 6 bits at a time and maps each 6-bit chunk into an 8-bit letter, digit or other character that is guaranteed to travel end-to-end safely through all systems. This process expands the input by only 33%, so it is acceptably efficient. Decoding reverses the process, taking 8-bit character input back into properly packed bytes that reproduce the original input exactly.

This assignment is an exercise in writing code to a published standard; in theory, and we hope in practice, your implementation will interoperate properly with anyone else's implementation. The file base64.html contains an excerpt (most of Section 6.8) from RFC 2045, the standard for MIME (Multipurpose Internet Mail Extensions) encoding.

Specification

Your task is to:

  1. write encoders in C and Python that encode any input stream into Base64 as specified by the standard.
  2. write decoders in C and Python that decode valid Base64 back into its original form.
  3. create test files that will exercise your code thoroughly.

Your programs must read from stdin and write to stdout. The encoder output must end with a newline unless the input is empty. The programs must not throw exceptions or print any other output. Your encoder must work on all inputs. Your decoder must work on valid inputs like those produced by openssl, and should ignore whitespace in encoded Base64. Since there is no clear definition of what a decoder should do with invalid input, we will only test your decoder on invalid input to make sure it doesn't crash; it doesn't matter what its output is.

The testing component of this assignment is to create at least 15 short test files named test01, test02, test03, etc., to be used as inputs. These tests should thoroughly explore critical boundary conditions and other potential trouble spots of the standard and your code. No submitted test file should be longer than about 1K bytes, though of course you should test your program on long files as well. You might find it helpful to read Chapter 6 of The Practice of Programming on testing.

Advice

The main complexities of implementation are mapping 3 bytes of input to 4 bytes of output, and vice versa, with some tricky boundary conditions. You will probably want to do the encoder and decoder first in C, presuming that you know it better. You should also complete your testing suite to help with your C implementation. Once all the kinks are worked out with the algorithm and its corner cases, it should be manageable to use the now-straightforward task as your first crack at Python. Here's a Python help file and the official documentation.

This is the version of Python installed on the CS servers. Note that there are significant incompatibilities between Python 2 and Python 3. Please use Python 2.

	$ python
	Python 2.7.5 (default, Nov  8 2016, 19:11:39) 
	[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2

Keep it simple: this program does not need to be fast, or save memory, or be the slightest bit clever. As is true with all programming, trying to be fast and/or clever is a recipe for disaster.

For calibration, the reference Python encoder and decoder are about 50-60 lines each, and the reference C versions are about 80 lines each. If your solutions are a lot bigger, you are probably off on the wrong track.

Keep clearly in mind that your programs must handle arbitrary bytes, not just ASCII characters. You have to use I/O routines that read and write a byte at a time. The Unix tool od ("octal dump") is useful for looking at binary data; try od -tx1 to see bytes in hex, or use xxd, where the -b flag shows you the output in bits.

You can use the program openssl as a reference implementation:

	openssl enc -e -base64 <input >base64output   # encode
	openssl enc -d -base64 <base64input >output   # decode

The standard says "The encoded output stream must be represented in lines of no more than 76 characters each." The openssl encoder outputs a newline more frequently, after every 64 characters. You will find it easier to do direct comparisons if your encoder does the same. Also be aware that in some cases, openssl does not put a newline after the final line of output. This seems like a bad implementation; your encoder must put a newline at the end unless the input is empty.

Prepare good test cases ahead of time: this standard is quite clear and compact, but still requires careful attention. Test early, test often, know what you are testing. Automate the testing as much as you can, with a shell script or a makefile. Part of the job here is to mechanize testing but you will also find it helpful to validate your code on real MIME messages like the ones that encode attachments in mail. Grab those with an editor, put them in a file, decode, and see if the result is acceptable to the target application (Word, Excel, MP3 player, etc.).

There are lots of Base64 libraries and implementations floating around. Do not use or consult them. The purpose of this assignment is for you to write your own code to implement a common standard, using only that standard as a reference. Referencing any other material or implementations related to that standard is a grievous breach of academic integrity.

Submission and Evaluation

When you are finished, submit your four source files encode.c, decode.c, encode.py, decode.py, and a tarball called tests.tar that contains all of your test files. Use the CS Dropbox link dropbox.cs.princeton.edu/COS333_S2017/asgn3. You can put your properly-named test files in a tar file with this command:

	tar cf tests.tar test??

We will test your code by a process somewhat like this. Note that it is not sufficient that your encoders and decoders work together; each must work independently.

	tar xf tests.tar
	gcc -o encode encode.c
	gcc -o decode decode.c
	for i in $(ls test?? ourtests*)
	do
	    python encode.py <$i >x0		# your Python encoder
	    decode <x0 >y0   			# your C decoder
	    cmp $i y0				# compare to original
	    encode <$i >x1   			# your C encoder
	    openssl enc -d -base64 <x1 >y1	# reference decoder
	    cmp $i y1				# compare to original
	    openssl enc -e -base64 <$i >x2	# reference encoder
	    python decode.py <x2 >y2		# your Python decoder
	    cmp $i y2				# compare to original
	done