COS Independent Work Seminar:
A Brave New Data World
Thomas Funkhouser (office hours Thurs 2-4PM, Fri 1:30-2:30PM)
Meeting time and place:
Fridays 3:00-4:20PM, Room 301
This seminar is an opportunity for students to develop individual or small-team software projects in the still new and emerging area of Data Science, which combines concepts, methods, and tools typically taught in Probability and Statistics, Data Mining, and Programming Languages courses, among others. The projects will use the Python stack for data crunching and analysis, including IPython, pandas, matplotlib, SciPy, NumPy, and scikit-learn.
Each team is expected to develop a relatively complete data processing and analysis pipeline, including components for data acquisition, extraction, cleaning, normalization, transformation, aggregation, statistical analysis, classification, and interactive visualization. Students may choose their own data sets and project objectives, in consultation with the instructor. Objectives may include contributions to the open source software libraries used in the seminar.
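As a rough illustration of what such a pipeline looks like at toy scale, here is a minimal sketch covering extraction, cleaning, transformation, and aggregation. The sample data, the column names, and all function names are made up for this example. It deliberately uses only the Python standard library so that it runs without any installation; in an actual project, pandas and the rest of the seminar's stack would typically take over these steps.

```python
from __future__ import division, print_function

# Hypothetical raw data: noisy city names and missing readings are intentional.
RAW = """city,temp_f
Princeton,68
princeton ,72
Trenton,n/a
Trenton,75
Newark,
Newark,70
"""

def extract(text):
    """Split raw CSV text into (city, temp) string pairs, skipping the header."""
    lines = text.strip().splitlines()
    return [tuple(line.split(",")) for line in lines[1:]]

def clean(rows):
    """Normalize city names and drop rows whose reading is not a number."""
    cleaned = []
    for city, temp in rows:
        try:
            cleaned.append((city.strip().title(), float(temp)))
        except ValueError:
            continue  # missing or non-numeric reading
    return cleaned

def transform(rows):
    """Convert Fahrenheit readings to Celsius."""
    return [(city, (temp - 32) * 5 / 9) for city, temp in rows]

def aggregate(rows):
    """Compute the mean reading per city."""
    groups = {}
    for city, temp in rows:
        groups.setdefault(city, []).append(temp)
    return {city: sum(values) / len(values) for city, values in groups.items()}

def run_pipeline(text):
    """Chain the stages: extract -> clean -> transform -> aggregate."""
    return aggregate(transform(clean(extract(text))))

if __name__ == "__main__":
    for city, mean_c in sorted(run_pipeline(RAW).items()):
        print("%s: %.1f C" % (city, mean_c))
```

Keeping each stage as a separate function makes it easy to test stages in isolation and to swap one out later, for example replacing `aggregate` with a pandas group-by once the data outgrows plain lists.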
The end goal of each project is to draw relevant conclusions based on sound methods and real-world data -- which can sometimes be noisy, inconsistent, and insufficient -- processed and analyzed using the Python stack. Conclusions may consist of explanatory theories, predictions, or proposed courses of action regarding the phenomenon that generated the data. Alternative end goals may include demonstrations of the limitations or unintended consequences (e.g., privacy concerns) of the increasingly large and interrelated data sets that have become available in recent years.
Frequently Asked/Anticipated Questions:
- How do I learn Python?
Start with the Python 2.7 tutorial (note the version: 2.7 is the one to learn for this seminar). Completing the tutorial will take a couple of days if you already know a programming language in the C++/Java family, or a bit longer otherwise. After the tutorial, think of a program that you have already developed in another programming language, or one that you have always wanted to write. It could be a little game, a simple task automation tool, or just your favorite sorting algorithm. Write it in Python and compare. You might need to use the Python Standard Library Reference -- that will be your most valuable Python resource until you become an expert. Then develop a couple more programs to learn how typical things are done in Python, such as file and directory manipulation, simple GUI creation, mathematical functions, random numbers, DB access, and processing standard data formats such as XML, CSV, HTML, and JSON. If you like Python and would like to become an expert, grab a thorough (though not entertaining) book like Mark Lutz's Learning Python and read about what's going on under the hood, while you keep developing simple applications to sharpen your skills.
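For instance, if your favorite sorting algorithm happens to be insertion sort, a first Python version might look like the sketch below. The choice of algorithm and the function name are only illustrations; the point is to compare how the same logic reads in Python and in the language you already know.

```python
from __future__ import print_function

def insertion_sort(items):
    """Return a new sorted list, leaving the input untouched."""
    result = list(items)
    for i in range(1, len(result)):
        current = result[i]
        j = i - 1
        # Shift larger elements one slot to the right to make room for current.
        while j >= 0 and result[j] > current:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = current
    return result

if __name__ == "__main__":
    data = [5, 2, 9, 1, 5, 6]
    print(insertion_sort(data))
    # Cross-check against the built-in sorted() function.
    print(insertion_sort(data) == sorted(data))
```

Checking your own version against the built-in `sorted()` is a quick way to catch mistakes while you are still getting used to the language.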
- I already know Python and would like a practical intro to Data Science using Python. Where do I start?
See the three books in the Intro to Data Science with Python section of our library.
- How do I install all the Python software libraries for this seminar on my computer?
See the first two books (Python for Data Analysis and Practical Data Science Cookbook) in the main section of our library. Chapter 1, in either book, deals with installation and configuration issues.
- I already know Python and the basics of Data Science, and I'm ready to look for data sets and project ideas for this seminar. Where can I find them?
Several books in our library provide links to data sources, as well as project ideas and Python code to start with.
Since this is an independent work seminar, ideally you should try to come up with your own project idea, using one of the data sources mentioned above or another one of your choice.
- Chapter 2 Introductory Examples in Python for Data Analysis has a few data sources and use cases.
- Chapters 6 to 10 in Practical Data Science Cookbook each present a project in more detail: analyzing tax data, analyzing automobile data, analyzing social network data, recommending movies, and analyzing Twitter data. The same book also presents some data science projects in R (Chapters 2 to 5) that analyze American football data, stock market data, and employment data. Use all these projects as examples or starting points.
- Several chapters in Beautiful Data can inspire good projects.
- In the library sections on finance, biology, geospatial systems, and astronomy, you can find books that link to data sets.
- Python Playground: Geeky Projects for the Curious Programmer has projects such as translating image pixels into ASCII art, an autostereogram generator that creates patterns with 3D images, a Conway's Game of Life simulator, creating laser patterns based on audio input with Arduino, and others -- not necessarily data-science projects.
- I'm already working on my project. I've just realized that I need to learn more about numpy (or some other Python tool). Where do I start?
For each major Python tool you will find at least three books in the main section of our library. You can also take a look at the Python 2.7 documentation home page and the Python Books wiki page.
- What are some good sources for the latest Data Science news, videos, conferences, talks, presentations, and such?
The Twitter #datascience, #analytics, #machinelearning and #bigdata feeds are good places to start.