Princeton University
COS 217: Introduction to Programming Systems

Assignment 7: A Spam Filter

Purpose

The purpose of this assignment is to help you learn about spam filters and naive Bayesian learning. It also will give you more experience with advanced C programming, especially creating and using ADTs. Finally, it will give you more experience with the xemacs, gdb, and make tools.

Background

A spam filter is a computer program that classifies e-mail messages as either spam (unwanted) or ham (wanted). Many spam filters work in this way...

The spam filter maintains a set of features. A feature is a characteristic of messages that the spam filter can use to distinguish ham from spam. A feature could be simply the presence of a word in the message, for example, "the message contains the word 'lottery'" or "the message contains the word 'winner'". A feature could be much more complex/powerful, for example, "the subject line is all capitals" or "the message body mentions many Internet domains".

Given (1) a set of features, (2) a set of messages known to be ham, and (3) a set of messages known to be spam, the spam filter learns how to probabilistically classify incoming messages as ham or spam. It uses naive Bayesian learning, in the manner described in lectures and precepts, to do the learning and classification.

Your Task

Your task in this assignment is to create a spam filter that uses naive Bayesian learning. More precisely, you should create two programs, one named learn and another named filter.

You should define a set of features. The features could be simply the presence of a word, but we encourage you to define more complex/powerful features.

Your learn program should learn how to distinguish ham from spam by using (1) the set of features that you have defined, (2) the example ham messages from the provided hamtrain file, and (3) the example spam messages from the provided spamtrain file. It should write its "knowledge" to a file named knowledge.

Your filter program should read that "knowledge" from the knowledge file, and thus should know how to distinguish ham from spam. Then it should should read messages from stdin until end-of-file. For each message that your filter program reads from stdin, it should compute the probability that the message is spam, and write that probability to stdout. Your filter program should write exactly one number to stdout for each message that it reads from stdin, one number per line.

You should run your filter program by redirecting its stdin to the provided hamtest file. The hamtest file contains only ham messages. Accordingly, your filter program should classify many of those messages as ham. That is, many of the spam probabilities that your filter program writes to stdout should be less than 0.5. The more small numbers your filter program writes to stdout, the higher the quality of your learn and filter programs.

You also should run your filter program by redirecting its stdin to the provided spamtest file. The spamtest file contains only spam messages. Accordingly, your filter program should classify many of those messages as spam.  That is, many of the spam probabilities that your filter program writes to stdout should be greater than 0.5. The more large numbers your filter program writes to stdout, the higher the quality of your learn and filter programs.

The Details

On hats you are given one code module:

Your job is to create four additional modules:

Your learn and filter programs will need to use dynamically allocated memory. They should contain no memory leaks. That is, your programs should explicitly free all dynamically allocated memory when that memory is no longer required. For each call of the malloc (or calloc) function in your programs, eventually there should be exactly one call of the free function.

If appropriate, you should use your SymTable ADT (from Assignment 3) within the FeatureFinder module. The document The SymTable ADT and the Spam Filter contains some suggestions about how to do that.  If your SymTable ADT is not working, we will provide you with working symtable.h and symtable.o files at your request.

Logistics

You may work with one partner on this assignment. If you do work with a partner, then only one of the partners should submit work.  The readme file should contain your name and your partner's name.

You should develop on hats. Use xemacs to create source code. Use make to automate the build process. Use gdb to debug.

The directory /u/cos217/Assignment7 contains files that are pertinent to the assignment:

All of the message files use the standard UNIX e-mail file format, as the Parser module requires.

The hamtrain, spamtrain, hamtest, and spamtest files are from spamassassin.com. They are so large that you may not have enough disk space to copy them to your project directory. So you should use the files that exist within the /u/cos217/Assignment7 directory instead of using your own copies of them.

The /u/cos217/Assignment7 directory also contains an executable binary file named spamstats (and its source code file, spamstats.c). The spamstats program accepts one command-line argument, which should be the name of a file containing the spam probabilities written by your spam filter. The spamstats program prints summary statistics regarding those probabilities. You might use spamstats in conjunction with your spam filter in this manner:

$ learn /u/cos217/Assignment7/hamtrain /u/cos217/Assignment7/spamtrain
$ filter < /u/cos217/Assignment7/hamtest > hamout
$ spamstats hamout
      2159 messages
      1887 messages considered ham
       272 messages considered spam
  0.874016 percentage of messages considered ham
  0.125984 percentage of messages considered spam
$ filter < /u/cos217/Assignment7/spamtest > spamout
$ spamstats spamout
       578 messages
       248 messages considered ham
       330 messages considered spam
  0.429066 percentage of messages considered ham
  0.570934 percentage of messages considered spam

You should submit:

Your readme file should contain:

You should not submit code for the given Parser module. You should not submit the large data files that contain example and test messages.

Submit your work electronically via the command:

/u/cos217/bin/i686/submit 7 sourcecodefiles datafile(s) makefile readme

Grading

As always, we will grade your work on functionality and design, and will consider understandability to be an important aspect of good design. Function-level comments in both .h and .c files are critical documentation. To encourage good coding practices, we will take off points based on warning messages during compilation.

Approximately 90% of the grade will be based upon design and "basic" functionality, that is, whether your spam filter distinguishes ham from spam. The remaining 10% will be based upon the quality of the results generated by your spam filter.