Prof: | Barbara Engelhardt | barbara.engelhardt@duke.edu | OH: Fridays 2:15-3:15pm, 223C Old Chem |

Class: | Tu/Thu 8:30-9:45am | 063 BioSci |

- formalize the question as a probabilistic model (typically via a likelihood);
- clarify the interpretation of model parameters and the model assumptions;
- develop methods for parameter estimation;
- quantify uncertainty in parameter estimation;
- interpret the parameters to address the biological question.
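As a toy illustration of these five steps (not drawn from any of the course case studies), the Python sketch below estimates an allele frequency from made-up counts. The counts, the function names, and the choice of a binomial model with a Wald interval are all assumptions made for this example.

```python
import math


def binomial_log_likelihood(p, k, n):
    """Log-likelihood of allele frequency p after seeing k copies of the
    allele among n sampled chromosomes (binomial model; the constant
    binomial coefficient is dropped because it does not depend on p)."""
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def estimate_allele_frequency(k, n, z=1.96):
    """Return the maximum likelihood estimate of p and an approximate
    95% Wald confidence interval (z = 1.96)."""
    p_hat = k / n                                 # closed-form MLE for the binomial
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)     # standard error of p_hat
    return p_hat, (p_hat - z * se, p_hat + z * se)


# Hypothetical data: 30 copies of the minor allele among 200 chromosomes.
k, n = 30, 200
p_hat, (lo, hi) = estimate_allele_frequency(k, n)
print(f"estimated allele frequency: {p_hat:.3f}")
print(f"approximate 95% interval: ({lo:.3f}, {hi:.3f})")
```

Here the model is the binomial likelihood, the parameter p is interpreted as the population allele frequency under the assumption of independently sampled chromosomes, the estimate is the maximizer of the likelihood, and the interval quantifies uncertainty in that estimate.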

Statistics at the level of STA611 (Introduction to Statistical Methods) is expected, along with knowledge of linear algebra and multivariate calculus.

Course grade is based on homeworks (45%), an in-class midterm (15%), a final project (30%), and class participation and scribe notes (10%). The project can be either a reanalysis of the data in one of the case studies covered during the semester or a project of interest to the student (rotation projects are great).

Homeworks are due to me exactly one week after they are handed out at the beginning of class. Programs can be emailed to me (barbara.engelhardt -at- duke.edu) before class on the due date. Late homeworks will not be accepted, with one exception: each student is allowed one late homework (maximum one week late) for the course. Students may (and should) discuss the homework assignments collaboratively; however, I expect each student to program and write up their own homework solutions. Please write the names of the students with whom you discussed the homework assignment at the top of your solutions.

Each lecture will have a scribe, who will type up notes in the LaTeX template. Within a week of class, the scribe should send me the LaTeX file, at which point I will read the notes over and post them to the website. If you have never used LaTeX before, there are online tutorials, Mac GUIs, and even online compilers that might help you. I'll put up an example of an uncompiled scribe note file when I have one.

A second set of references for R will also be useful. First, you can download R from the CRAN website. There are many resources, such as RStudio, that can help with the programming interface, and tutorials on R are widely available online. If you are getting bored with the standard graphics package, I really like using ggplot2 for beautiful graphics and figures. Finally, you can integrate R code and output with plain text using knitr, but that might be going a bit too far for beginners.

We will have daily readings for the course, but there is no formal text for this class. However, some texts and notes that may be useful include:

- Michael Lavine, Introduction to Statistical Thought (an introductory statistical textbook with plenty of R examples, and it's online too)
- Ewens and Grant, Statistical Methods in Bioinformatics
- Cristianini and Hahn, Introduction to Computational Genomics
- Sayan Mukherjee, Statistical methods for computational biology
- Kevin Murphy, Machine Learning: A Probabilistic Perspective
- Durbin, Eddy, Krogh, Mitchison, Biological Sequence Analysis
- Joseph Felsenstein, Inferring Phylogenies

This syllabus is *tentative*, and will almost surely be superseded. Reload your browser for the current version.

Note: as of 2/7/13, I'm switching homework hand-out days and due dates to Tuesdays, mostly because it will give you more time to look at the homework before office hours on Friday.

Note: Scribes, please email me for a LaTeX template, and please follow the symbols and notation within. I expect both the statistical model aspect of the lecture and the case study to be described clearly and neatly. Please return the scribe notes to me within a week of the lecture.

Note: The final project TeX template and final project style file should be used in preparation of your final project report. Please follow the instructions and let me know if you have questions. Presentations will be on April 11th and 16th; the reports will be due on April 19th (Friday).

Date | Topic | Reading and homework | Scribe
---|---|---|---
Jan 10 | Introductory statistics | [Fisher, 1918] |
Jan 15 | Introductory statistics | [Lander & Botstein, 1989] |
Jan 17 | Linear regression | [Stranger et al., 2007], hw1 (gene expression matrix, genotype matrix) | Ethan Hada
Jan 22 | Hypothesis testing | [Storey et al., 2003] | Lisa Cervia
Jan 24 | Logistic regression | [Sladek et al., 2007], hw2 (case control status) | Dinesh Manandhar
Jan 29 | Generalized linear models | [Marioni et al., 2008] | Yangxiaolu Go
Jan 31 | Bayesian regression | [Stephens & Balding, 2009], hw3 | Amanda Lea
Feb 5 | Sparse regression | [Tibshirani, 1996] | Rumen Stamatov
Feb 7 | Mixed effects models | [Segura et al., 2012] | Shiwen Zhao
Feb 12 | Mixture models | [Bailey et al., 1995], data set, hw4 | Brittany Stokes
Feb 14 | Admixture models | [Pritchard et al., 2000] | Qinglong Zeng
Feb 19 | PCA | [Patterson et al., 2006] | Renjie Tan
Feb 21 | Factor analysis | [Engelhardt & Stephens, 2010] | Peter Tonner
Feb 26 | Markov chains | [Der et al., 2011], hw5 (Population A, Population B) | Ryan Muraglia
Feb 28 | Continuous time Markov models | [Suchard et al., 2001] | Kayla Hudson
Mar 5 | Hidden Markov models | [Burge & Karlin, 1997] | Meng He
Mar 7 | In-class Midterm | |
Mar 19 | Trees | [Siepel & Haussler, 2004] | Ning Shen
Mar 21 | Coalescent processes | [Li & Durbin, 2010] | Florian Wagner
Mar 26 | Clustering | [Eisen et al., 1998] | Goke Ojewole
Mar 28 | Classification | [Diaz-Uriarte et al., 2006] | Colbert Sesanker
Apr 2 | Support vector machines | [Saigo et al., 2004], hw6 (classification data) |
Apr 4 | Gaussian graphical models | [Schafer & Strimmer, 2005] |
Apr 9 | Infinite mixture models | [Medvedovic & Sivaganesan, 2002], hw9 |
Apr 11 | Gaussian processes | Final project presentations |
Apr 16 | Bayesian nonparametric models | Final project presentations, projects due |