For the final project, you are asked to explore the application of data analysis techniques to the data problem of your choice. The project is quite open ended. For instance, you can choose to study an algorithm and its variations in depth, making controlled experimental comparisons on various datasets. Or you can choose to study one particular data problem, giving special consideration to the unique properties of the problem domain, and testing one or more methods on it. Every project should involve the experimental application of at least one algorithm or technique to at least one dataset, although more than this minimal requirement will generally be expected. You may work individually or in groups of 2-3. We strongly encourage you to work in groups so as to be able to complete more ambitious projects.
Every project must involve at least one algorithm or technique, and at least one dataset; your analysis should go beyond what has already been explored in this course. Your project can focus either on the algorithm or on the data. Every project should have a clearly defined purpose.
Here are some examples of possible types of projects:
These examples are only meant to provide a starting point. You are strongly encouraged to be creative and original in your choice of topic! You will need to do some background reading on your topic to help you decide how to proceed, and to give context to your work.
Here are some places to look to get ideas for topics, and for background reading:
It is okay to do a project that is related to independent research that you are doing as part of your graduate study, junior project or senior thesis. In this case, you will need to carve out a project that is focused and relevant to this course. If this is part of a junior or senior project, please inform your advisor of this so as to recalibrate expectations. Needless to say, turning in a project based on previously completed research is not appropriate.
Every project must involve at least one dataset, ideally one which was not already used in the course. There are many interesting and freely available datasets that you can find with Google searches. Here are some of the other places you might look for data:
Depending on the project, you can create and use your own synthetic data. However, the project should involve at least one “real” dataset.
Although you are encouraged to use data that you find or gather on your own, it should go without saying that if you plan to use data that is private, confidential, classified, copyrighted, controlled, sensitive, etc., it is your responsibility to be sure that it is legally and ethically okay for you to use the data for the purposes of this project (including possibly sharing the data with the graders, should the need arise). Please do not use any data in any way that might be considered illegal, unethical, immoral or inappropriate.
You can implement your project using R or you can use another software environment of your choice. You may also need to pre-process your data to get it into an appropriate format, for instance, using Perl or Python. The project should involve a number of experiments, and a detailed exploration and analysis of the results using visualization. Your analysis should keep in mind the overall purpose you have chosen for your project.
You can use any of the library functions built into R, or that you download on-line, or you can use other publicly available software packages. If you use any software that you did not write yourself, please note this in your report, and, as with any project, demonstrate in your report that you understand how the underlying algorithm works.
If you implement code yourself, be aware as always that it can be tricky to be sure that this kind of program is actually working properly. Be sure that it is carefully tested before running your experiments. For instance, check the output of the program carefully on tiny datasets where you know what the output should be (for instance, you have computed it by hand, or you have found or implemented another program, say, in another language or using a different technique, that computes it for you). Also keep an eye out for clues that your program might have problems, for instance, if the results violate proven theorems or differ substantially from results in the published literature. Your report should include a brief description of what measures you took to be sure that your program is working properly.
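As an illustration of the kind of check described above, here is a minimal sketch in Python. It assumes, purely for the example, that the code under test is a hand-written least-squares line fit; the function names `fit_line` and `check_fit_line` are hypothetical and not part of any course material. The expected values were worked out by hand on two tiny datasets.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def check_fit_line():
    """Compare fit_line against answers computed by hand on tiny inputs."""
    # Case 1: the three points lie exactly on y = 2x + 1.
    slope, intercept = fit_line([0, 1, 2], [1, 3, 5])
    assert abs(slope - 2.0) < 1e-9
    assert abs(intercept - 1.0) < 1e-9

    # Case 2: points that do not lie on a line.
    # By hand: mean_x = 1, mean_y = 2/3, sxx = 2, sxy = 1,
    # so slope = 1/2 and intercept = 2/3 - 1/2 = 1/6.
    slope, intercept = fit_line([0, 1, 2], [0, 1, 1])
    assert abs(slope - 0.5) < 1e-9
    assert abs(intercept - 1.0 / 6.0) < 1e-9
    return True


if __name__ == "__main__":
    check_fit_line()
    print("all hand-computed checks passed")
```

The same pattern applies to other methods: pick inputs small enough to work through on paper, and compare with a tolerance rather than exact equality when the output is floating point.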
The project proposal should typically be a half page in length, and certainly no more than a single full page. Your proposal, which will not be graded, should describe the following, as best you can. Many details will have to be worked out as the project proceeds.
We will hold a poster session during the last lecture (probably in the “banana room” of the Computer Science building, which is the area right outside room 105). This is intended to be a fun “science fair” kind of event to give you a chance to present your own project, and to hear about the projects of others. Participation in the poster session is required, but will not be graded.
You (or your group) will be provided with a 4' x 4' bulletin board on which to present your project. You should prepare material to place in this space so that others can learn about your project. You can either prepare and print out an actual poster (if you can find a printer that handles oversize paper), or you can simply prepare powerpoint-style slides which you can then print out on ordinary paper and tack to your bulletin board. Push-pins will be provided. Your poster should describe at a high level what you did and what results you got. During the poster session, you (or others in your group) should spend at least half the time physically at your poster so that you can explain it in a one-on-one fashion to anyone who is interested. The rest of your time can be spent looking at your classmates' posters.
You should be finished attaching your poster to your bulletin board before the poster session begins at 11am. The bulletin boards will be in place no later than 10am, so you can come to set up your poster any time between 10am and 11am on the morning of the poster session. Also, at the end of the poster session, please be sure to take down your poster, and return any push-pins that you used. Any materials that are left behind after the end of the poster session will be discarded.
The end result of your project is a report that clearly and concisely describes what you did, the results you obtained, and what they mean. The report should be submitted in hard copy just like homework assignments. If you are working individually, your report should be 3-5 pages long. If you are working as part of a small group, your group should submit one report, which should be 5-7 pages long. The report should use a 12pt font, 1-inch margins, and single spacing. The page length limits do not include figures.
Your report should follow the general outline of a scholarly paper in this area. You should write your report as clearly as possible in a manner that would be understandable to a fellow COS424 student. In other words, you should not assume that the reader has background beyond what has been covered in class (as well as a general computer science background).
You should begin by describing the problem you are studying, some background (what has been done before), and the motivation for the problem (why it is worth studying). Previous work and outside sources should be cited throughout your report in a scholarly fashion following the style of academic papers in this area. (You can find examples by looking in the journals or conferences listed above.)
Next, you should clearly explain what you did, both from a high level, and then in more detail. You should explain the reason you chose the experiments that you did, and you should explain how the experiments were conducted in enough detail that a motivated reader can replicate them. Be sure to also describe the data you are using. You also should outline some of the theory or motivation underlying the algorithms you are using or studying. State the results of your experiments clearly, and think about visualizations that you can use to make your results clearer. (A table of numbers is usually less compelling than a well-chosen plot of the same data.) All reports must include at least one informative visualization.
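One common way to make an experiment easier to replicate, in the sense asked for above, is to fix and report any random seed it uses. The following is a minimal sketch under that assumption, using only the Python standard library; the function name `train_test_split` and its parameters are hypothetical choices for this illustration, not a requirement of the project.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows with a fixed seed and split it into (train, test)."""
    # A local generator keeps the split independent of any other code
    # that draws from the global random module.
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    data = list(range(8))
    train_a, test_a = train_test_split(data, seed=424)
    train_b, test_b = train_test_split(data, seed=424)
    # The same seed gives the same split on the same Python version,
    # so a reader who knows the seed can rebuild the exact experiment.
    assert (train_a, test_a) == (train_b, test_b)
    print(len(train_a), "training rows,", len(test_a), "test rows")
```

Stating the seed, the split fraction, and the software version in the report gives a reader what they need to reconstruct the same train and test sets.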
In every case, be sure to discuss your results. Do not just give a table of results. Explain what the results mean, and what conclusions can be drawn from them. Again, do all this in a way that would be understandable and interesting to a fellow COS424 student. What did you expect to find? What did you find instead? What are the implications? If you found something surprising, can you think of how it might be explained? Be thoughtful, observant and critical.
Projects will be graded along the following dimensions:
As always, feel free to contact us at any time with questions or difficulties you encounter, or if you have trouble thinking of a topic or finding papers to read.