COS 511 Foundations of Machine Learning Theory: final project

COS 511 FOUNDATIONS OF MACHINE LEARNING

Final Project Information

Proposal due: April 4.
Final report due: May 13.

The final project for this class is completely open ended. You can pick just about any topic you wish so long as there is some connection to machine learning and its mathematical foundations. For your project, you can run an experiment, or you can think about a theoretical problem or algorithm, or you can do a blend of both. You can work individually or in small groups of two or three, although larger groups need to be justified by larger projects.

Please email me (in plain text) a paragraph or two outlining your project as soon as you know what you want to do, but no later than Friday, April 4.

I strongly advise starting early on your project. Running experiments takes time, as does thinking about theoretical problems.

Start by doing some reading on a topic. From there, you might run an experiment; try to simplify, improve or extend a result; apply an algorithm to a particular application; or think about how two different approaches or algorithms are related to each other. Or you can do something different from any of these.

In every case, the end result should be a 5-10 page report clearly describing what you did, what results you got and what the results mean. I would prefer to receive your report in hard copy. However, depending on the project, you might also find it appropriate to email me other materials. It is very important that you get everything to me on time so that I can turn in final grades on time.

You also may, at your option, choose to give a short (say around 10-minute) presentation to the class about what you did. (If you want to give a longer presentation, that may also be possible, depending on scheduling constraints.) Most likely, presentations will be scheduled for the last week of classes. Be sure to let me know soon if you want to give a presentation.

Examples of possible types of projects:

Run experiments with a learning algorithm on some datasets. Measure the error rates on separate test sets. Compare performance to another learning algorithm, or to a modification of the same algorithm. Or, compare performance to what is predicted by theory. For instance, for SVMs, try to measure whether fewer support vectors really do mean better performance. Or compare the learning curves observed on real data for your favorite learning algorithm to what is predicted by the theory -- see, for instance, Schuurmans' paper on "Characterizing rational versus exponential learning curves".
Read two papers coming from different communities or on different topics that seem related and try to relate them. For instance, you can look at how boosting and on-line learning have been unified using ideas from game theory in this paper, or at Lafferty and Lebanon's work on "Boosting and maximum likelihood for exponential models". Or pick two topics of your own.
Explore the connections between machine learning and another field, such as information theory, cryptography or neuroscience.
Think about what the consequences are of changing one of the standard learning models. Does learning become easier or harder? What are the general properties of the new model? Can you think of new algorithms for interesting problems in the new model?
Create an applet or other graphical program that interactively demonstrates the behavior of a learning algorithm that you have read about. Or, create an interactive game that uses learning, for instance a game that plays something simple like "rock, paper, scissors" or "matching pennies" against a human. On-line algorithms are especially well suited for this -- see, for instance, this paper. What insights about the learning algorithm did you gain by playing with your game or demo? (Be sure to keep the emphasis on machine learning rather than, for instance, the graphical interface. Make sure that it is clear that you have learned something about machine learning.)
Build or apply a machine learning program tailored to a particular application coming from some other field such as natural language processing, bioinformatics or information retrieval. See for instance some of the recent work that Michael Collins has been doing. Or look at how maximum entropy is used in natural language processing.
Come up with your own idea! Be original and creative.

Places to look to get ideas for topics:

There are several books on reserve at the library for this course, listed on the main course web page.
The main machine learning journals are Machine Learning (possibly not entirely available online, but the library should have it) and the Journal of Machine Learning Research.
The conference that comes closest to what we have covered in class is the conference on Computational Learning Theory (COLT). Follow links from www.learningtheory.org to find some of the proceedings on line. This web site has a lot of other great materials on it, too.
Other machine learning conferences include the International Conference on Machine Learning (ICML) and Neural Information Processing Systems (NIPS).
Other useful websites include www.kernel-machines.org (all about SVMs and related methods), and www.boosting.org. Tom Dietterich maintains a very nice web site with lots of pointers, tutorials and interesting projects.

You may use software that you find online. If you do, please note this in your report, and, as with any project, demonstrate in your report that you understand how the underlying learning algorithm works. If you implement code yourself, be aware that it can be tricky to be sure that a machine learning program is actually working properly. Be sure that it is carefully tested before running your experiments. For instance, check the output of the program carefully on tiny datasets where you know what the output should be, either because you have computed it by hand or because you have found or implemented another program (say, in another language or using a different technique) that computes it for you. Also keep an eye out for clues that your program might have problems, such as results that violate proven theorems.
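As one illustration of this kind of sanity check, here is a minimal sketch in Python. It uses the perceptron purely as an example learner (any algorithm would do), and the names perceptron and mistake_bound, as well as the four-point dataset, are made up for this sketch. The dataset is small enough to trace by hand, and the second check compares the observed number of mistakes against the classical perceptron mistake bound (R/gamma)^2, where R is the largest example norm and gamma is the margin achieved by a known separating direction.

```python
import math


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def perceptron(examples, max_epochs=100):
    """Perceptron without a bias term. Returns (weights, total mistakes)."""
    w = [0.0] * len(examples[0][0])
    mistakes = 0
    for _ in range(max_epochs):
        clean_pass = True
        for x, y in examples:
            if y * dot(w, x) <= 0:  # wrong (or undecided) prediction
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
                clean_pass = False
        if clean_pass:  # a full pass with no mistakes: converged
            break
    return w, mistakes


def mistake_bound(examples, u):
    """(R / gamma)^2 for a separating direction u (need not be unit length)."""
    norm_u = math.sqrt(dot(u, u))
    radius = max(math.sqrt(dot(x, x)) for x, _ in examples)
    margin = min(y * dot(u, x) / norm_u for x, y in examples)
    return (radius / margin) ** 2


if __name__ == "__main__":
    # Tiny hand-checkable dataset (hypothetical), separable by u = (1, 1).
    tiny = [((2.0, 1.0), 1), ((1.0, 2.0), 1),
            ((-1.0, -2.0), -1), ((-2.0, -1.0), -1)]
    w, mistakes = perceptron(tiny)
    # By hand: the first example triggers the only update, giving w = (2, 1).
    assert w == [2.0, 1.0] and mistakes == 1
    # Theory check: more mistakes than the bound would signal a bug.
    assert mistakes <= mistake_bound(tiny, (1.0, 1.0))
    print("sanity checks passed")
```

A failed assertion here would not say where the bug is, but it would show that something is wrong before any time is spent on full-scale experiments.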

One of the best places for obtaining "real" data is the repository at the University of California, Irvine (click on "summary page", or follow links to explore some of the other machine learning resources available from this site). Within this repository, the "statlog" datasets have been widely used, as has the "letter recognition" dataset, but there are many datasets to choose from. Some of the datasets have separate test sets. Others only provide a training set. In that case, you can randomly partition the dataset into a training set and a test set. If you end up with a rather small test set, you will probably want to repeat this partitioning many times and average the results to get reliable estimates. You can also use synthetic data of your own creation, in which case there is no problem generating a large test set. Usually, when evaluating a machine learning algorithm, you will want to see how it performs on several datasets. If you have access to more specialized data (for instance, as part of your regular research), feel free to use it.
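The repeated random partitioning described above might be sketched as follows. This is only an illustration, written in Python with the standard library: the function names, the 30% test fraction, the 100 repetitions, and the placeholder "majority label" learner are all assumptions made for the sketch, and you would substitute your own learning algorithm for train_majority.

```python
import random
import statistics
from collections import Counter


def random_split(examples, test_fraction, rng):
    """Shuffle a copy of the examples and split it into (train, test)."""
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = max(1, int(round(test_fraction * len(shuffled))))
    return shuffled[n_test:], shuffled[:n_test]


def train_majority(train):
    """Placeholder learner: always predicts the most common training label."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda x: majority


def repeated_holdout(examples, train_fn, n_repeats=100,
                     test_fraction=0.3, seed=0):
    """Average the test error over many random train/test partitions.

    Returns (mean error, standard deviation of the error across repeats).
    """
    rng = random.Random(seed)  # fixed seed so the experiment can be replicated
    errors = []
    for _ in range(n_repeats):
        train, test = random_split(examples, test_fraction, rng)
        predict = train_fn(train)
        wrong = sum(1 for x, y in test if predict(x) != y)
        errors.append(wrong / len(test))
    return statistics.mean(errors), statistics.stdev(errors)


if __name__ == "__main__":
    # Synthetic stand-in for a dataset that ships without a test set.
    data = [((i,), 1 if i % 3 == 0 else 0) for i in range(60)]
    mean_err, std_err = repeated_holdout(data, train_majority)
    print(f"test error: {mean_err:.3f} +/- {std_err:.3f} over 100 splits")
```

Reporting the spread across splits along with the mean gives the reader a sense of how much of an observed difference between two algorithms could be due to the particular partition.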

If you are doing a theoretical project, it may be that you read a paper, try improving it, and aren't able to make progress. In that case, it is okay to fall back on just explaining the paper as clearly as you can, in your own words.

Your report should follow the general format of a scholarly paper in this area. You should begin by describing the problem you are studying, a bit of background (what's been done before) and the motivation for the problem, i.e., why it's worth studying.

Next, you should clearly explain what you did, both from a high level, and then in more detail. For an experimental paper, you should explain the experiments in enough detail that there is a reasonable possibility that a motivated reader would be able to replicate them. You also should outline some of the theory underlying the algorithms you are studying. State your results clearly, and think about graphical tools you could use to make your results clearer (a table of numbers is usually less compelling than a graphical representation of the same data). Look through published papers for ideas. For a theoretical paper, the learning model and other mathematical details should be explained well enough for the results to be stated with mathematical precision and clarity.

In every case, be sure to explain the meaning of your results. Don't just give a table of results or a dry mathematical formula. Explain what the results mean, and what conclusions can be drawn from them. What did you expect to find? What did you find instead? What are the implications? If you found something surprising, can you think of how it might be explained?

As always, feel free to contact me anytime with questions or difficulties you encounter, or if you have trouble thinking of a topic or finding papers to read.