For the final project, you are asked to explore the application of data analysis techniques to a data problem of your choice. The project is quite open-ended. For instance, you can choose to study an algorithm and its variations in depth, making controlled experimental comparisons on various datasets. Or you can choose to study one particular data problem, giving special consideration to the unique properties of the problem domain, and testing one or more methods on it. Every project should involve the experimental application of at least one algorithm or technique to at least one dataset, although more than this minimal requirement will generally be expected. You may work individually or in groups of 2-3. We strongly encourage you to work in groups so as to be able to complete more ambitious projects.


  • By April 13th, please turn in a brief, written description of what you propose to do for your final project. The proposal should be submitted in hard copy just like a regular homework. If you are working as a group, only one proposal needs to be turned in for the group (be sure to include everyone's name and email address). Feel free to come talk to us about your ideas. We will notify you once your project has been approved. At that time, we will also assign one of the two course instructors to supervise your project, so that you will have a clear point of contact for getting assistance.
  • The final report is due on May 4th. The final report should be submitted in hard copy just like a regular homework by the beginning of class. If you are working as a group, only one report needs to be turned in for the group (be sure to include everyone's name and email address).
  • There will be a poster session on May 4th during the last lecture. Presenting a poster is a required part of this project, although the poster itself will not be graded.

Choosing a topic

Every project must involve at least one algorithm or technique, and at least one dataset; your analysis should go beyond what has already been explored in this course. Your project can focus either on the algorithm or on the data. Every project should have a clearly defined purpose.

Here are some examples of possible types of projects:

  • Select one or more algorithms for detailed study. Test the algorithms on various datasets. Try to determine their relative strengths and weaknesses, and how the algorithms perform under varying conditions, such as a varying number of features. Come up with ways of extending or combining the algorithms, and test these ideas experimentally. The goal of such a project might be to find an “off-the-shelf” algorithm giving the best performance on a range of datasets (where performance is measured in one of the ways discussed in class, such as test accuracy). Alternatively, a similar but different kind of project might focus on computational issues and how to make one particular algorithm as fast as possible without sacrificing performance.
  • Study a particular application domain, such as classification of visual images, clustering of email messages, or automatic recommendation of movie titles. Consider a number of algorithms for your problem, and determine which seems to perform the best. Or, design and test a probabilistic model that is especially appropriate to your problem. Think about the issues that are most relevant to your specific problem; for instance, what features are most appropriate, what independence assumptions are reasonable, and how do the algorithms you are using fit or not fit this problem?
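As a rough illustration of the first kind of project, the sketch below compares two simple classifiers while varying the number of features. Everything in it is an assumption made for illustration: the synthetic data generator, the two classifiers (nearest centroid and 1-nearest-neighbor), and the function names are hypothetical stand-ins, not methods prescribed by the course. The point is only the experimental design: both algorithms see exactly the same training and test sets at each setting, so differences in accuracy are attributable to the algorithms rather than to the data.

```python
import math
import random

def make_dataset(n_rows, n_noise, rng):
    """Two classes separated on feature 0, plus n_noise pure-noise features (synthetic)."""
    rows = []
    for i in range(n_rows):
        label = i % 2                      # alternate labels so classes are balanced
        signal = rng.gauss(2.0 * label, 1.0)
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        rows.append(([signal] + noise, label))
    return rows

def nearest_centroid(train, x):
    """Predict the label whose class mean is closest to x."""
    centroids = {}
    for label in {lab for _, lab in train}:
        points = [p for p, lab in train if lab == label]
        centroids[label] = [sum(col) / len(points) for col in zip(*points)]
    return min(centroids, key=lambda lab: math.dist(centroids[lab], x))

def one_nn(train, x):
    """Predict the label of the single closest training point."""
    return min(train, key=lambda pair: math.dist(pair[0], x))[1]

def accuracy(classifier, train, test):
    return sum(classifier(train, x) == y for x, y in test) / len(test)

def compare(noise_levels, n_train=60, n_test=60, seed=0):
    """Evaluate both classifiers on the same split for each number of noise features."""
    rng = random.Random(seed)
    results = []
    for n_noise in noise_levels:
        data = make_dataset(n_train + n_test, n_noise, rng)
        train, test = data[:n_train], data[n_train:]
        results.append((n_noise + 1,
                        accuracy(nearest_centroid, train, test),
                        accuracy(one_nn, train, test)))
    return results

results = compare([0, 5, 20])
for n_features, acc_centroid, acc_1nn in results:
    print(f"{n_features:3d} features  centroid={acc_centroid:.2f}  1-NN={acc_1nn:.2f}")
```

A real project would replace the synthetic generator with actual datasets and would repeat each setting several times before drawing conclusions from the differences.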

These examples are only meant to provide a starting point. You are strongly encouraged to be creative and original in your choice of topic! You will need to do some background reading on your topic to help you decide how to proceed, and to give context to your work.

Here are some places to look to get ideas for topics, and for background reading:

It is okay to do a project that is related to independent research that you are doing as part of your graduate study, junior project or senior thesis. In this case, you will need to carve out a project that is focused and relevant to this course. If this is part of a junior or senior project, please inform your advisor of this so as to recalibrate expectations. Needless to say, turning in a project based on previously completed research is not appropriate.

Getting data and running experiments

Every project must involve at least one dataset, ideally one which was not already used in the course. There are many interesting and freely available datasets that you can find with Google searches. Here are some of the other places you might look for data:

  • The repository at University of California, Irvine (click on “summary page”, or follow links to explore some of the other machine learning resources available from this site). These datasets are mostly, but not entirely, oriented toward classification problems. Some of the datasets have separate test sets. Others only provide a training set. In this case, you can randomly partition the dataset into a training set and test set. If you end up with a rather small test set, you will probably want to repeat this many times to get reliable results.
  • For text data, you can try the David Lewis dataset page.
  • For image data, a good place to start is CalTech-101.
  • Collect your own data! For instance, you can collect data from the web, or from your own email, pictures, document files, etc. Or you may have access to data as part of some independent research. Do not underestimate the amount of work involved.
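For datasets that come without a separate test set, the repeated random partitioning mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the toy data, the majority-class baseline, and the function names are all hypothetical, and the reported standard error is only a rough guide, since the repeated splits overlap and are not independent.

```python
import random
import statistics

def random_split(rows, test_fraction, rng):
    """Shuffle a copy of rows and return (train, test)."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def majority_class_accuracy(train, test):
    """Hypothetical baseline: always predict the most common training label."""
    labels = [label for _, label in train]
    majority = max(sorted(set(labels)), key=labels.count)
    return sum(1 for _, label in test if label == majority) / len(test)

def repeated_holdout(rows, evaluate, n_repeats=100, test_fraction=0.3, seed=0):
    """Average a score over many random train/test partitions."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        train, test = random_split(rows, test_fraction, rng)
        scores.append(evaluate(train, test))
    mean = statistics.mean(scores)
    stderr = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, stderr

# Toy data: 30 rows of (features, label), with about two thirds labeled "a".
rows = [((i,), "a" if i % 3 else "b") for i in range(30)]
mean_acc, stderr = repeated_holdout(rows, majority_class_accuracy)
print(f"accuracy = {mean_acc:.3f} +/- {stderr:.3f} over 100 random splits")
```

Fixing the seed makes the experiment reproducible, which helps when the results have to be described in enough detail for someone else to replicate them.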

Depending on the project, you can create and use your own synthetic data. However, the project should involve at least one “real” dataset.

Although you are encouraged to use data that you find or gather on your own, it should go without saying that if you plan to use data that is private, confidential, classified, copyrighted, controlled, sensitive, etc., it is your responsibility to be sure that it is legally and ethically okay for you to use the data for the purposes of this project (including possibly sharing the data with the graders, should the need arise). Please do not use any data in any way that might be considered illegal, unethical, immoral or inappropriate.

You can implement your project using R or you can use another software environment of your choice. You also may need to pre-process your data to get it into an appropriate format, for instance, using Perl or Python. The project should involve a number of experiments, and a detailed exploration and analysis of the results using visualization. Your analysis should keep in mind the overall purpose you have chosen for your project.
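As one example of what such pre-processing might look like, the sketch below turns whitespace-delimited raw text into CSV that R or another tool can read. The raw format, the column names, and the decision to drop rows with missing values (marked "?") are assumptions made for illustration; your own data will need its own rules.

```python
import csv
import io

# Hypothetical raw input: a comment line, a blank line, and one missing value.
RAW = """\
# id  height_cm  label
1  172.5  yes
2  ?      no

3  160.0  yes
"""

def clean_rows(lines):
    """Yield (id, height_cm, label), skipping comments, blanks, and incomplete rows."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 3 or "?" in fields:
            continue  # assumption: incomplete rows are dropped, not imputed
        yield int(fields[0]), float(fields[1]), fields[2]

def to_csv(rows):
    """Serialize cleaned rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "height_cm", "label"])
    writer.writerows(rows)
    return buffer.getvalue()

rows = list(clean_rows(RAW.splitlines()))
csv_text = to_csv(rows)
print(csv_text)
```

Whatever cleaning rules you choose, it is worth counting how many rows they discard and mentioning that in the report, since silent filtering can change the results.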

You can use any of the library functions built into R, or that you download online, or you can use other publicly available software packages. If you use any software that you did not write yourself, please note this in your report, and, as with any project, demonstrate in your report that you understand how the underlying algorithm works.

If you implement code yourself, be aware as always that it can be tricky to be sure that this kind of program is actually working properly. Be sure that it is carefully tested before running your experiments. For instance, check the output of the program carefully on tiny datasets where you know what the output should be (for instance, you have computed it by hand, or you have found or implemented another program, say, in another language or using a different technique, that computes it for you). Also keep an eye out for clues that your program might have problems, for instance, if the results violate proven theorems or differ substantially from results in the published literature. Your report should include a brief description of what measures you took to be sure that your program is working properly.
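The two checks suggested above, comparing against hand-computed answers on a tiny dataset and comparing against a second implementation written a different way, might look like the following. The 1-nearest-neighbor classifier, the four training points, and the function names are hypothetical choices made for illustration only.

```python
import math

def nearest_neighbor_predict(train, query):
    """Implementation under test: label of the closest training point (Euclidean)."""
    best_label, best_dist = None, math.inf
    for point, label in train:
        dist = math.dist(point, query)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def nearest_neighbor_reference(train, query):
    """Independent cross-check: same rule, written with min() and squared distances."""
    def squared_distance(point):
        return sum((a - b) ** 2 for a, b in zip(point, query))
    return min(train, key=lambda pair: squared_distance(pair[0]))[1]

# Tiny dataset where the right answers are easy to work out on paper.
train = [((0.0, 0.0), "red"), ((0.0, 1.0), "red"),
         ((5.0, 5.0), "blue"), ((6.0, 5.0), "blue")]

# Expected labels computed by hand from the distances noted in each comment.
hand_computed = {
    (0.0, 0.5): "red",   # 0.5 from both red points, more than 6 from any blue point
    (5.5, 5.0): "blue",  # 0.5 from both blue points
    (1.0, 1.0): "red",   # 1.0 from (0, 1) versus about 5.66 from (5, 5)
    (4.0, 4.0): "blue",  # about 1.41 from (5, 5) versus 5.0 from (0, 1)
}

for query, expected in hand_computed.items():
    assert nearest_neighbor_predict(train, query) == expected
    assert nearest_neighbor_reference(train, query) == expected
print("all sanity checks passed")
```

Checks like these do not prove the program is correct, but they catch many common mistakes cheaply, and they give you something concrete to describe in the report.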

Project proposal

The project proposal should typically be a half page in length, but certainly no more than a single full page. Your proposal, which will not be graded, should describe the following, as best as you can. Many details will have to be worked out as the project proceeds.

  • What is the problem that you will be investigating? Why is it interesting?
  • What data will you use? If you are collecting new datasets, how do you plan to collect them?
  • How do you plan to implement the algorithms that you are studying? What existing implementations will you use?
  • Which reading will you examine to provide context and background?
  • What do you hope to learn and understand from your analysis?
  • How will you evaluate your results, i.e., what are the kinds of plots that you can make and summary statistics that you can compute?

Poster session

We will hold a poster session during the last lecture (organization to be confirmed; probably in the “banana room” of the Computer Science building, which is the area right outside room 105). This is intended to be a fun “science fair” kind of event to give you a chance to present your own project, and to hear about the projects of others. Participation in the poster session is required, but will not be graded.

You (or your group) will be provided with a 4' x 4' bulletin board on which to present your project. You should prepare material to place in this space so that others can learn about your project. You can either prepare and print out an actual poster (if you can find a printer that handles oversize paper), or you can simply prepare PowerPoint-style slides which you can then print out on ordinary paper and tack to your bulletin board. Push-pins will be provided. Your poster should describe at a high level what you did and what results you got. During the poster session, you (or others in your group) should spend at least half the time physically at your poster so that you can explain it in a one-on-one fashion to anyone who is interested. The rest of your time can be spent looking at your classmates' posters.

You should be finished attaching your poster to your bulletin board before the poster session begins at 11am. The bulletin boards will be in place no later than 10am, so you can come to set up your poster any time between 10am and 11am on the morning of the poster session. Also, at the end of the poster session, please be sure to take down your poster, and return any push-pins that you used. Any materials that are left behind after the end of the poster session will be discarded.

Writing a final report

The end result of your project is a report that clearly and concisely describes what you did, the results you obtained, and what they mean. The report should be submitted in hard copy just like homework assignments. If you are working individually, your report should be 3-5 pages long. If you are working as part of a small group, your group should submit one report that is 5-7 pages long. The report should use a 12pt font, 1-inch margins, and single spacing. The page length limits do not include figures.

Your report should follow the general outline of a scholarly paper in this area. You should write your report as clearly as possible in a manner that would be understandable to a fellow COS424 student. In other words, you should not assume that the reader has background beyond what has been covered in class (as well as a general computer science background).

You should begin by describing the problem you are studying, some background (what's been done before) and the motivation for the problem (why it is worth studying). Previous work and outside sources should be cited throughout your report in a scholarly fashion following the style of academic papers in this area. (You can find examples by looking in the journals or conferences listed above.)

Next, you should clearly explain what you did, both from a high level, and then in more detail. You should explain the reason you chose the experiments that you did, and you should explain how the experiments were conducted in enough detail that a motivated reader can replicate them. Be sure to also describe the data you are using. You also should outline some of the theory or motivation underlying the algorithms you are using or studying. State the results of your experiments clearly, and think about visualizations that you can use to make your results clearer. (A table of numbers is usually less compelling than a well-chosen plot of the same data.) All reports must include at least one informative visualization.

In every case, be sure to discuss your results. Do not just give a table of results. Explain what the results mean, and what conclusions can be drawn from them. Again, do all this in a way that would be understandable and interesting to a fellow COS424 student. What did you expect to find? What did you find instead? What are the implications? If you found something surprising, can you think of how it might be explained? Be thoughtful, observant and critical.

What you will be graded on

Projects will be graded along the following dimensions:

  • originality and creativity
  • background material
  • data gathering and preparation
  • experimental design and execution
  • discussion and interpretation of results
  • poster presentation
  • writing of the final report, including clarity, completeness and conciseness
  • overall effort

As always, feel free to contact us at any time with questions or difficulties you encounter, or if you have trouble thinking of a topic or finding papers to read.

project.txt · Last modified: 2010/04/19 21:37 by sgerrish