STA613/CBB540, Spring 2013

STA613/CBB540: Statistical methods in computational biology: Spring 2013

Prof:	Barbara Engelhardt		barbara.engelhardt@duke.edu		OH: Fridays 2:15-3:15pm, 223C Old Chem
Class:	Tu/Thu 8:30-9:45am				063 BioSci

Description

This course is based on case studies of statistical approaches to problems in computational biology. We will learn about statistical modeling in computational biology by formulating biological questions and repeating the following steps:

formalize the question as a probabilistic model (typically via a likelihood);
clarify the interpretation of model parameters and the model assumptions;
develop methods for parameter estimation;
quantify uncertainty in parameter estimation;
interpret the parameters to address the biological question.
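As a small illustration of these five steps, here is a minimal sketch in Python (the course itself points to R). Everything in it is an assumption made for illustration: the question (what is the frequency of an allele in a population?), the made-up counts (37 copies among 100 sampled chromosomes), the function names, and the choice of a binomial model with a Wald-style normal approximation for the interval.

```python
import math


def binomial_log_likelihood(p, k, n):
    """Step 1: binomial log-likelihood of allele frequency p, given k copies
    of the allele among n sampled chromosomes (constant term dropped)."""
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def estimate_allele_frequency(k, n, z=1.96):
    """Steps 3 and 4: maximum likelihood estimate, standard error, and an
    approximate 95% Wald interval. Assumes 0 < k < n and independent draws
    (step 2: the model assumptions)."""
    p_hat = k / n                               # MLE of a binomial proportion
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)   # plug-in standard error
    ci = (p_hat - z * se, p_hat + z * se)       # normal-approximation interval
    return p_hat, se, ci


# Hypothetical data, not taken from any case study in the course.
k, n = 37, 100
p_hat, se, ci = estimate_allele_frequency(k, n)

# Step 5: interpret the estimate in terms of the biological question.
print(f"estimated allele frequency: {p_hat:.3f} (SE {se:.3f})")
print(f"approximate 95% interval: ({ci[0]:.3f}, {ci[1]:.3f})")
```

The closed-form estimate k/n is the value that maximizes `binomial_log_likelihood`; for models without a closed form, the same log-likelihood function would be handed to a numerical optimizer instead.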

Statistics at the level of STA611 (Introduction to Statistical Methods) is expected, along with knowledge of linear algebra and multivariate calculus.

The course grade is based on homework (45%), an in-class midterm (15%), a final project (30%), and class participation and scribe notes (10%). The project can be either a reanalysis of the data from one of the case studies covered during the semester or a project of the student's own interest (rotation projects are great).

Homework is handed out at the beginning of class and is due exactly one week later. Programs can be emailed to me (barbara.engelhardt -at- duke.edu) before class on the due date. Late homework will not be accepted, with one exception: each student may turn in one homework late (by at most one week) during the course. Students may (and should) discuss the homework assignments collaboratively; however, I expect each student to program and write up their own solutions. Please write the names of the students with whom you discussed the assignment at the top of your solutions.

Each lecture will have a scribe, who will type up notes using the LaTeX template. Within a week of class, the scribe should send me the LaTeX file; I will then read the notes over and post them to the website. If you have never used LaTeX before, there are online tutorials, Mac GUIs, and even online compilers that might help you. I'll put up an example of an uncompiled scribe note file when I have one.

A second set of references, for R, will also be useful. First, you can download R from the CRAN website. There are many resources, such as RStudio, that can help with the programming interface, and R tutorials are widely available online. If you are getting bored with the standard graphics package, I really like using ggplot2 for beautiful graphics and figures. Finally, you can integrate R code and output with plain text using knitr, but that might be going a bit too far for beginners.

We will have daily readings for the course, but there is no formal text for this class. However, some texts and notes that may be useful include:

Michael Lavine, Introduction to Statistical Thought (an introductory statistical textbook with plenty of R examples, and it's online too)
Ewens and Grant, Statistical Methods in Bioinformatics
Cristianini and Hahn, Introduction to Computational Genomics
Sayan Mukherjee, Statistical methods for computational biology
Kevin Murphy, Machine Learning: a probabilistic perspective
Durbin, Eddy, Krogh, Mitchison, Biological Sequence Analysis
Joseph Felsenstein, Inferring phylogenies

This syllabus is tentative and will almost surely be superseded. Reload your browser for the current version.

Note: as of 2/7/13, I'm switching homework hand-out days and due dates to Tuesdays, mostly because it will give you more time to look at the homework before office hours on Friday.

Note: Scribes, please email me for a LaTeX template, and please follow the symbols and notation within. I expect both the statistical model aspect of the lecture and the case study to be described clearly and neatly. Please return the scribe notes to me within a week of the lecture.

Note: The final project TeX template and final project style file should be used in preparation of your final project report. Please follow the instructions and let me know if you have questions. Presentations will be on April 11th and 16th; the reports will be due on April 19th (Friday).

Syllabus

Date	Topic	Reading / Homework	Scribe

Jan 10	Introductory statistics	[Fisher, 1918]
Jan 15	Introductory statistics	[Lander & Botstein, 1989]
Jan 17	Linear regression	[Stranger et al., 2007], hw1 gene expression matrix genotype matrix	Ethan Hada
Jan 22	Hypothesis testing	[Storey et al., 2003]	Lisa Cervia
Jan 24	Logistic regression	[Sladek et al., 2007], hw2 case control status	Dinesh Manandhar
Jan 29	Generalized linear models	[Marioni et al., 2008]	Yangxiaolu Go
Jan 31	Bayesian regression	[Stephens & Balding, 2009], hw3	Amanda Lea
Feb 5	Sparse regression	[Tibshirani, 1996]	Rumen Stamatov
Feb 7	Mixed effects models	[Segura et al., 2012]	Shiwen Zhao
Feb 12	Mixture models	[Bailey et al., 1995], data set hw4	Brittany Stokes
Feb 14	Admixture models	[Pritchard et al., 2000]	Qinglong Zeng
Feb 19	PCA	[Patterson et al., 2006]	Renjie Tan
Feb 21	Factor analysis	[Engelhardt & Stephens, 2010]	Peter Tonner
Feb 26	Markov chains	[Der et al., 2011], hw5 Population A Population B	Ryan Muraglia
Feb 28	Continuous time Markov models	[Suchard et al., 2001]	Kayla Hudson
Mar 5	Hidden Markov models	[Burge & Karlin, 1997]	Meng He
Mar 7	In-class Midterm
Mar 19	Trees	[Siepel & Haussler, 2004]	Ning Shen
Mar 21	Coalescent processes	[Li & Durbin, 2010]	Florian Wagner
Mar 26	Clustering	[Eisen et al., 1998]	Goke Ojewole
Mar 28	Classification	[Diaz-Uriarte et al., 2006]	Colbert Sesanker
Apr 2	Support vector machines	[Saigo et al., 2004], hw6 classification data
Apr 4	Gaussian graphical models	[Schafer & Strimmer, 2005]
Apr 9	Infinite mixture models	[Medvedovic & Sivaganesan, 2002], hw9
Apr 11	Gaussian processes	Final project presentations
Apr 16	Bayesian nonparametric models	Final project presentations, projects due