Princeton University
COS 217: Introduction to Programming Systems

Assignment 7: A Spam Filter

Purpose

The purpose of this assignment is to help you learn about spam filters and naive Bayesian learning. It also will give you more experience with advanced C programming, especially creating and using ADTs. Finally, it will give you more experience with the xemacs, gdb, and make tools.

Background

A spam filter is a computer program that classifies e-mail messages as either spam (unwanted) or ham (wanted). Many spam filters work in this way...

The spam filter maintains a set of features. A feature is a characteristic of messages that the spam filter can use to distinguish ham from spam. A feature could be simply the presence of a word in the message, for example, "the message contains the word 'lottery'" or "the message contains the word 'winner'". A feature could be much more complex/powerful, for example, "the subject line is all capitals" or "the message body mentions many Internet domains".

Given (1) a set of features, (2) a set of messages known to be ham, and (3) a set of messages known to be spam, the spam filter learns how to probabilistically classify incoming messages as ham or spam. It uses naive Bayesian learning, in the manner described in lectures and precepts, to do the learning and classification.

Your Task

Your task in this assignment is to create a spam filter that uses naive Bayesian learning. More precisely, you should create two programs, one named learn and another named filter.

You should define a set of features. The features could be simply the presence of a word, but we encourage you to define more complex/powerful features.

Your learn program should learn how to distinguish ham from spam by using (1) the set of features that you have defined, (2) the example ham messages from the provided hamtrain file, and (3) the example spam messages from the provided spamtrain file. It should write its "knowledge" to a file named knowledge.

Your filter program should read that "knowledge" from the knowledge file, and thus should know how to distinguish ham from spam. Then it should should read messages from stdin until end-of-file. For each message that your filter program reads from stdin, it should compute the probability that the message is spam, and write that probability to stdout. Your filter program should write exactly one number to stdout for each message that it reads from stdin, one number per line.

You should run your filter program by redirecting its stdin to the provided hamtest file. The hamtest file contains only ham messages. Accordingly, your filter program should classify many of those messages as ham. That is, many of the spam probabilities that your filter program writes to stdout should be less than 0.5. The more small numbers your filter program writes to stdout, the higher the quality of your learn and filter programs.

You also should run your filter program by redirecting its stdin to the provided spamtest file. The spamtest file contains only spam messages. Accordingly, your filter program should classify many of those messages as spam. That is, many of the spam probabilities that your filter program writes to stdout should be greater than 0.5. The more large numbers your filter program writes to stdout, the higher the quality of your learn and filter programs.

The Details

On hats you are given one code module:

Parser. Parser is an ADT that is a simple syntactic analyzer. A Parser object accepts a file and an array of characters. It reads the next message from that file into that array.

Your job is to create four additional modules:

FeatureFinder. FeatureFinder should be an ADT. In its simplest form, a FeatureFinder object should read words from a given file, thus defining "features" that the program will use to discriminate ham from spam. (We encourage you to create a FeatureFinder module that defines features in a more substantial way.) Then, given a message, a FeatureFinder object should return to its client an indication of which features that message contains.
Knowledge. Knowledge should be an ADT. The learn program should create a Knowledge object, giving it a file that contains example ham messages and a file that contains example spam messages. The Knowledge object should use the FeatureFinder module to learn how to distinguish ham from spam. It should write its knowledge to the knowledge file. The filter program also should create a Knowledge object, commanding it to read its knowledge from the knowledge file. The Knowledge object should read (presumably newly-arrived) messages from stdin, use its knowledge to compute the probability that each message is spam, and write that probability to stdout. The Knowledge object should use a Parser object to read messages. The Knowledge object should use a FeatureFinder object to learn the features of ham/spam and to classify messages as ham/spam.
A learn module, containing a main function and perhaps other functions. The learn module should call the functions of the Knowledge module to learn how to distinguish ham from spam.
A filter module, containing a main function and perhaps other functions. The filter module should call the functions of the Knowledge module to classify messages as ham or spam.

Your learn and filter programs will need to use dynamically allocated memory. They should contain no memory leaks. That is, your programs should explicitly free all dynamically allocated memory when that memory is no longer required. For each call of the malloc (or calloc) function in your programs, eventually there should be exactly one call of the free function.

If appropriate, you should use your SymTable ADT (from Assignment 3) within the FeatureFinder module. The document The SymTable ADT and the Spam Filter contains some suggestions about how to do that. If your SymTable ADT is not working, we will provide you with working symtable.h and symtable.o files at your request.

Logistics

You may work with one partner on this assignment. If you do work with a partner, then only one of the partners should submit work. The readme file should contain your name and your partner's name.

You should develop on hats. Use xemacs to create source code. Use make to automate the build process. Use gdb to debug.

The directory /u/cos217/Assignment7 contains files that are pertinent to the assignment:

parser.h and parser.c: the Parser module source code files.
hamtrain: A file containing many example ham messages.
spamtrain: A file containing many example spam messages.
hamtest: A file containing many (presumably newly-arrived) messages, each of which happens to be ham.
spamtest: A file containing many (presumably newly-arrived) messages, each of which happens to be spam.
hamtrainsmall: A file containing a few artificial ham messages.
spamtrainsmall: A file containing a few artificial spam messages.

All of the message files use the standard UNIX e-mail file format, as the Parser module requires.

The hamtrain, spamtrain, hamtest, and spamtest files are from spamassassin.com. They are so large that you may not have enough disk space to copy them to your project directory. So you should use the files that exist within the /u/cos217/Assignment7 directory instead of using your own copies of them.

The /u/cos217/Assignment7 directory also contains an executable binary file named spamstats (and its source code file, spamstats.c). The spamstats program accepts one command-line argument, which should be the name of a file containing the spam probabilities written by your spam filter. The spamstats program prints summary statistics regarding those probabilities. You might use spamstats in conjunction with your spam filter in this manner:

$ learn /u/cos217/Assignment7/hamtrain /u/cos217/Assignment7/spamtrain
$ filter < /u/cos217/Assignment7/hamtest > hamout
$ spamstats hamout
      2159 messages
      1887 messages considered ham
       272 messages considered spam
  0.874016 percentage of messages considered ham
  0.125984 percentage of messages considered spam
$ filter < /u/cos217/Assignment7/spamtest > spamout
$ spamstats spamout
       578 messages
       248 messages considered ham
       330 messages considered spam
  0.429066 percentage of messages considered ham
  0.570934 percentage of messages considered spam

You should submit:

The source code files defining your FeatureFinder, Knowledge, learn, and filter modules. You also should submit any additional modules (for example, SymTable) that those modules use.
A makefile. The first rule in the makefile should command make to build files named learn and filter containing the executable versions of your programs. Your makefile should use the "-Wall", "-ansi", and "-pedantic" compiler options. Your makefile should maintain object (.o) files to allow for partial builds. Your makefile should encode the dependencies among the files that comprise your program.
The data file(s) used by your FeatureFinder module, if any.
A readme file.

Your readme file should contain:

Your name. Also include the name of your partner, if any.
A description of whatever help (if any) you received from others while doing the assignment, and the names of any individuals with whom you collaborated, as prescribed by the course "Policies" web page.
If you worked with a partner, a description of the work performed by each partner.
(Optionally) An indication of how much time you spent doing the assignment.
(Optionally) Your assessment of the assignment.
(Optionally) Any information that will help us to grade your work in the most favorable light. In particular you should describe all known bugs.

You should not submit code for the given Parser module. You should not submit the large data files that contain example and test messages.

Submit your work electronically via the command:

/u/cos217/bin/i686/submit 7 sourcecodefiles datafile(s) makefile readme

Grading

As always, we will grade your work on functionality and design, and will consider understandability to be an important aspect of good design. Function-level comments in both .h and .c files are critical documentation. To encourage good coding practices, we will take off points based on warning messages during compilation.

Approximately 90% of the grade will be based upon design and "basic" functionality, that is, whether your spam filter distinguishes ham from spam. The remaining 10% will be based upon the quality of the results generated by your spam filter.

Princeton University COS 217: Introduction to Programming Systems