STA613/CBB540: Statistical methods in computational biology: Spring 2013

Prof: Barbara Engelhardt, barbara.engelhardt@duke.edu; OH: Fridays 2:15-3:15pm, 223C Old Chem
Class: Tu/Thu 8:30-9:45am, 063 BioSci

Description

This course is based on case studies of statistical approaches to problems in computational biology. We will learn about statistical modeling in computational biology by formulating biological questions and repeating the following steps:
  1. formalize the question as a probabilistic model (typically via a likelihood);
  2. clarify the interpretation of model parameters and the model assumptions;
  3. develop methods for parameter estimation;
  4. quantify uncertainty in parameter estimation;
  5. interpret the parameters to address the biological question.
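
As a minimal sketch of these five steps on a toy question (not one of the course case studies), suppose we ask how common an allele is in a population after observing k copies among n sampled chromosomes. The binomial model, the function names, and the counts below are all assumptions made for illustration, and Python is used here only for the sketch; the course resources point to R.

```python
import math


def log_likelihood(p, k, n):
    """Step 1: binomial log-likelihood of allele frequency p,
    up to an additive constant that does not depend on p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def wald_summary(k, n, z=1.96):
    """Steps 3 and 4: maximum likelihood estimate of p, its standard
    error, and an approximate 95% Wald interval (z = 1.96)."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    # Step 2 (assumption): chromosomes are sampled independently and
    # each carries the allele with the same probability p.
    p_hat = k / n  # maximizer of the binomial likelihood
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    lower = max(0.0, p_hat - z * se)
    upper = min(1.0, p_hat + z * se)
    return p_hat, se, (lower, upper)


if __name__ == "__main__":
    # Hypothetical data: 30 copies of the allele among 100 chromosomes.
    p_hat, se, interval = wald_summary(30, 100)
    # Step 5: report the estimated frequency with its uncertainty.
    print(f"estimated frequency {p_hat:.3f}, SE {se:.3f}, "
          f"approx. 95% interval ({interval[0]:.3f}, {interval[1]:.3f})")
```

The Wald interval is only one simple way to quantify uncertainty; it behaves poorly when the estimate is near 0 or 1 or when n is small, which is one reason step 2 (checking the model assumptions) matters.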

Statistics at the level of STA611 (Introduction to Statistical Methods) is expected, along with knowledge of linear algebra and multivariate calculus.

The course grade is based on homeworks (45%), an in-class midterm (15%), a final project (30%), and class participation and scribe notes (10%). The project can be either a reanalysis of the data in one of the case studies covered during the semester or a project of interest to the student (rotation projects are great).

Homeworks are handed out at the beginning of class and are due to me exactly one week later. Programs can be emailed to me (barbara.engelhardt -at- duke.edu) before class on the due date. Late homeworks will not be accepted, with one exception: each student may turn in one homework late (by at most one week) during the course. Students may (and should) discuss the homework assignments with each other; however, I expect each student to program and write up their own homework solutions. Please write the names of the students with whom you discussed the assignment at the top of your solutions.

Each lecture will have a scribe, who will type up notes in the LaTeX template. Within a week of class, the scribe should send me the LaTeX file, at which point I will read the notes over and post them to the website. If you have never used LaTeX before, there are online tutorials, Mac GUIs, and even online compilers that might help you. I'll put up an example of an uncompiled scribe note file when I have one.

A second set of references, for R, will also be useful. First, you can download R from the CRAN website. There are many resources, such as RStudio, that can help with the programming interface, and tutorials on R are easy to find online. If you are getting bored with the standard graphics package, I really like using ggplot2 for beautiful graphics and figures. Finally, you can integrate R code and output with plain text using knitr, but that might be going a bit too far for beginners.

We will have daily readings for the course, but there is no formal text for this class. However, some texts and notes that may be useful include:

  1. Michael Lavine, Introduction to Statistical Thought (an introductory statistical textbook with plenty of R examples, and it's online too)
  2. Ewens and Grant, Statistical Methods in Bioinformatics
  3. Cristianini and Hahn, Introduction to Computational Genomics
  4. Sayan Mukherjee, Statistical methods for computational biology
  5. Kevin Murphy, Machine Learning: A Probabilistic Perspective
  6. Durbin, Eddy, Krogh, Mitchison, Biological Sequence Analysis
  7. Joseph Felsenstein, Inferring Phylogenies

This syllabus is tentative, and will almost surely be superseded. Reload your browser for the current version.

Note: as of 2/7/13, I'm switching homework hand-out days and due dates to Tuesdays, mostly because it will give you more time to look at the homework before office hours on Friday.

Note: Scribes, please email me for a LaTeX template, and please follow the symbols and notation within. I expect both the statistical model aspect of the lecture and the case study to be described clearly and neatly. Please return the scribe notes to me within a week of the lecture.

Note: The final project TeX template and final project style file should be used in preparation of your final project report. Please follow the instructions and let me know if you have questions. Presentations will be on April 11th and 16th; the reports will be due on April 19th (Friday).


Syllabus

Week | Topic | Homework | Scribe
Jan 10 | Introductory statistics [Fisher, 1918] | |
Jan 15 | Introductory statistics [Lander & Botstein, 1989] | |
Jan 17 | Linear regression [Stranger et al., 2007] | hw1: gene expression matrix, genotype matrix | Ethan Hada
Jan 22 | Hypothesis testing [Storey et al., 2003] | | Lisa Cervia
Jan 24 | Logistic regression [Sladek et al., 2007] | hw2: case control status | Dinesh Manandhar
Jan 29 | Generalized linear models [Marioni et al., 2008] | | Yangxiaolu Go
Jan 31 | Bayesian regression [Stephens & Balding, 2009] | hw3 | Amanda Lea
Feb 5 | Sparse regression [Tibshirani, 1996] | | Rumen Stamatov
Feb 7 | Mixed effects models [Segura et al., 2012] | | Shiwen Zhao
Feb 12 | Mixture models [Bailey et al., 1995] | data set, hw4 | Brittany Stokes
Feb 14 | Admixture models [Pritchard et al., 2000] | | Qinglong Zeng
Feb 19 | PCA [Patterson et al., 2006] | | Renjie Tan
Feb 21 | Factor analysis [Engelhardt & Stephens, 2010] | | Peter Tonner
Feb 26 | Markov chains [Der et al., 2011] | hw5: Population A, Population B | Ryan Muraglia
Feb 28 | Continuous time Markov models [Suchard et al., 2001] | | Kayla Hudson
Mar 5 | Hidden Markov models [Burge & Karlin, 1997] | | Meng He
Mar 7 | In-class Midterm | |
Mar 19 | Trees [Siepel & Haussler, 2004] | | Ning Shen
Mar 21 | Coalescent processes [Li & Durbin, 2010] | | Florian Wagner
Mar 26 | Clustering [Eisen et al., 1998] | | Goke Ojewole
Mar 28 | Classification [Diaz-Uriarte et al., 2006] | | Colbert Sesanker
Apr 2 | Support vector machines [Saigo et al., 2004] | hw6: classification data |
Apr 4 | Gaussian graphical models [Schafer & Strimmer, 2005] | |
Apr 9 | Infinite mixture models [Medvedovic & Sivaganesan, 2002] | hw9 |
Apr 11 | Gaussian processes | Final project presentations |
Apr 16 | Bayesian nonparametric models | Final project presentations, projects due |