COS 333 Assignment 4: Base64 Encoding and Decoding (Spring 2008)

Due midnight, Friday, March 7

Mon Feb 25 10:25:42 EST 2008

Base64 encoding provides a way to encode arbitrary binary data into printable ASCII for mailing or shipment to a web browser, then decode it again into exactly its original form. The encoding takes the input 6 bits at a time and maps that into a letter, digit or other character that is guaranteed to travel safely through all systems. This process expands the input by only 33%, so it is acceptably efficient. Decoding reverses the process, taking 8-bit character input back into properly packed 6-bit fields that reproduce the original input exactly.

This assignment is an exercise in writing code to a published standard. The file base64.html contains an excerpt (Section 6.8) from RFC 2045, the standard for MIME (Multipurpose Internet Mail Extensions) encoding. Your task in this assignment is to

Your programs must read from stdin and write to stdout. The programs must not throw exceptions or print any other output. In particular, note the part of the standard that says "Any characters outside of the base64 alphabet are to be ignored in base64-encoded data."

Ignore the last paragraph of base64.html that talks about "canonical form"; simply encode your input, and decode it back into its original form. In the interests of uniformity, it would be good if your encoder produced lines of exactly 76 characters (not counting the newline at the end), but this is not required.

Create at least 10 short test files named test.0, test.1, test.2, etc. These tests should thoroughly explore critical boundary conditions and other potential trouble spots of the standard and in your code. You might find it helpful to read Chapter 6 of The Practice of Programming on testing.

How you generate the tests is up to you, but bear in mind that they must include arbitrary bit patterns, not just ASCII text. Binary data is easier to generate with a C or Java program than by hand.

You must use Java for both programs. My Java encoder and decoder are about 80 lines each, with the encoding and decoding part about 40 lines in each, so if your solutions are a lot bigger, you may be off on the wrong track. Here's a Java help file and the official Java documentation.

The main complexities of implementation are mapping 3 bytes of input to 4 bytes of output, and vice versa, with some tricky boundary conditions. It's probably better to define suitable functions rather than writing everything in a main routine. Keep it simple: this program does not need to be fast or the slightest bit clever. As is true with all programming, trying to be fast and/or clever is often a recipe for disaster.

Talking over the precise meaning of the specification with friends is strongly encouraged; this standard is quite clear and compact compared to most, but still requires careful attention. You must write your code yourself, however, so once you start coding, you may no longer discuss programming issues with other students.

You may exchange compiled Java class files with friends if you like, to ensure that your encoder works with someone else's decoder and vice versa. But only compiled classes; no source, and no test cases.

Prepare good test cases ahead of time. Test early, test often, know what you are testing. Automate the testing as much as you can, with a shell script or a makefile. Part of the job here is to mechanize testing but you will also find it helpful to validate your code on real MIME messages like the ones that encode attachments in mail. Grab those with an editor, put them in a file, decode, and see if the result is acceptable to the target application (Word, Excel, MP3 player, etc.).

The program od ("octal dump") is useful for looking at binary data; try od -b.

You can use the program openssl as a reference implementation:

	openssl enc -e -base64 <input >base64output   # encode
	openssl enc -d -base64 <base64input >output   # decode

Finally, please note that this assignment is also an exercise in bit-whacking. In previous years, a surprising number of people apparently did not understand bits, let alone how to whack them. To demonstrate that you do understand, your code MUST use bit operations like & and | and <<, and it MUST NOT use magic numbers like 64 and 192 as masks or 65 and 97 as letters, arrays of size 24, or other constructs that suggest a failure to understand how to manipulate bits as bits.

Submission

When you are finished, submit your two Java files and your test files with the command
	~cos333/bin/submit  4  encode.java decode.java test.*

We will test your code by a process somewhat like this:

	javac encode.java decode.java
	for i in your test cases and ours
	do
		java encode <$i >x0	# your encoder
		java decode <x0 >y0	# your decoder
		cmp $i y0		# compare to original
		java encode <$i >x1	# your encoder
		openssl -d <x1 >y1	# our decoder
		cmp $i y1		# compare to original
		openssl -e <$i >x2	# our encoder
		java decode <x2 >y2	# your decoder
		cmp $i y2		# compare to original
	done
so you might try a similar script for your own testing. Note that it is not sufficient that your encoder and decoder work together; each must work independently.

Please follow the rules on what to submit. As suggested, we will exercise your programs mechanically, so it's a big help if your submission arrives in the right form, and your programs do only what is asked for. Thanks.