Analysis & Visualization of Large Scale Genomic Data Sets 


Tuesdays 1:00pm-3:30pm

Rm. 280 in CIL (2nd floor of the genomics building)


Course Info

The goal of this course is to introduce students to the computational issues involved in the analysis and display of large-scale biological data sets.  Techniques covered will include clustering and machine learning techniques for gene expression microarray and proteomics data analysis, modeling of biological networks and pathways, data integration in genomics, and visualization issues for large-scale data sets.


A short introduction to the field of bioinformatics and the nature of biological data will be provided; no prior knowledge of biology is required. In-depth knowledge of computer science is not required either, but students must have some understanding of computation (programming experience is not needed).


The course will be taught in a mixed lecture and seminar format and will involve completing a project and a final exam.  The course is open to graduate and advanced undergraduate students from all departments.


Administrative info:

Level:              Graduate and upper level undergraduate

Background:   Some understanding of computation; a basic understanding of molecular biology can be acquired through the suggested readings below

Format:            Mixed lectures and seminar-style

Grading:          40% presentations

15% quizzes

15% participation (including attendance and participation in discussions)

30% final project (10% project proposal, 20% final project report)

Auditors:         Auditors are welcome but must participate in presentations and discussions (they do not need to do the final project).


Please register on Blackboard as soon as possible (even if you are an auditor).  That’s where I’ll post the lecture slides, and it’s how I’ll email announcements to the class.


When you access papers below, make sure you are doing so from the PU domain (you can use VPN if you are doing so from off-campus). 




For more admin info, see the syllabus.



There is no required book for this class.  Material will be presented in lectures, and readings will be based on current literature.  However, here are a few recommendations for the curious.



Some suggested readings:

If you need to catch up on molecular biology and genetics: 

DOE primer on human genetics

R. Brent. Genomic Biology. Cell 100:169-183, 2000.

L. Hunter. Molecular Biology for Computer Scientists. In Artificial Intelligence and Molecular Biology, L. Hunter editor, 1993, AAAI Press.


Introduction to bioinformatics:

NCBI bioinformatics primer

NCBI primer on microarray analysis


Some primers on computational techniques in bioinformatics:

What is principal component analysis?

What is the expectation maximization algorithm?

Getting started in probabilistic graphical models





Each presentation should be 30 minutes, with a 15-minute discussion afterwards.  Presentations should be in PowerPoint (or another slide format), and you must e-mail me the slides after your presentation before I can grade it.


A good presentation would include:

-a brief overview of the paper

-outline of major methods and findings, with background of important concepts (e.g. if the paper uses Dynamic Bayesian Networks, give an intro of what they are)

-a critical evaluation of the paper: what the paper did well, *what are problems/issues with the approach*, what puzzled you

-your view of where this method should go next (don’t just retype the “future work” section; we’re looking for your analysis here)


Course Announcements (check here often):


PLEASE sign up for the course on Blackboard, or you won’t get any of the course-related emails, which are important.

If you are auditing, sign up for audit.  If you are a postdoc and can’t officially sign up, let me know, and I’ll make sure to copy you on e-mails.



Course schedule:








1 (2/3)


Introduction to the course and bio


Intro to the course and introduction to biology and bioinformatics

Reading: “Systems biology 101-what you need to know”




2 (2/10)




Microarray analysis introduction and overview

Hand and Heard “Finding groups in gene expression data” (a very nice review of clustering microarray data; the presenter should cover the general concepts and choose 2-3 methods other than hierarchical clustering, k-means, or SOM to describe in a bit more detail)



Suggested readings:

Lockhart et al. “Genomics, gene expression, and DNA microarrays” (general microarray)

Kaminski N et al. “Practical approaches to analyzing results of microarray experiments” (review)

Ehrenreich A. “DNA microarray technology for the microbiologist: an overview” (a nice intro to types of microarrays and how microarray experiments work)



Chris C.


3 (2/17)




Microarray data analysis: from disarray to consolidation and consensus

Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms

Global survey of organ and organelle protein expression in mouse: Combined proteomic and transcriptomic profiling (bio)












4 (2/24)


Regulation (modules and pathways)


A factor graph nested effects model to identify networks from genetic perturbations

Inferring transcriptional modules from ChIP-chip, motif and microarray data.

A modular approach for integrative analysis of large-scale gene-expression and drug-response data






Chris P



Jesse F

5 (3/3)


Next Generation Sequencing


Introduction to next-generation sequencing

Computational methods for next-generation sequencing

What would you do if you could sequence everything?



Lecture: Lars




6 (3/10)


Bayesian methods in biology and medicine


Guest Lecture

Required reading:

Inference in Bayesian Networks


Guest Lecture

Guest Lecture


7 (3/24)

Note that 3/17 is spring break (no class)

Data integration

Introduction and overview of data integration and network prediction

Guest Lecture: Intro to metabolomics



Chris C.


8 (3/31)


Interactions and Networks


Activity motifs reveal principles of timing in transcriptional control of the yeast metabolic network

A genomewide functional network for the laboratory mouse

Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets (bio + analysis).


Alice Z






9 (4/7)


Project proposal presentations

Students present proposals for their final projects.  Proposals are graded on the quality of the proposed project, the investigation of related prior work, and the quality of the presentation.  Feedback on projects will be provided.

All students taking the course for credit


10 (4/14)


Interactions, networks, and pathways


Global mapping of pharmacological space

Refinement and expansion of signaling pathways: The osmotic response network in yeast

Inferring gene networks from time series microarray data using dynamic Bayesian networks.


Joe Irgon


Matt Rich


Chong W


11 (4/21)


Function prediction & Visualization


Predicting gene function in a hierarchical context with an ensemble of classifiers

Dynamic querying for pattern identification in microarray and genomic data

Click and Expander: a system for clustering and visualizing gene expression data


Timothy L


Wei H


David S.


12 (4/28)


Project in-progress presentations


Presentation of progress on the final project.  Graded on progress and presentation; students will receive feedback to help them proceed.

All students taking the course for credit



Additional articles of interest (not required and won't be discussed in class)


Inferring pathways and networks with a Bayesian framework

Click and Expander: a system for clustering and visualizing gene expression data

Cluster stability and the use of noise in interpretation of clustering. (Interesting clustering algorithm + visualization)

Factorgrams: A tool for visualizing multi-way associations in biological data. V Cheung, I Givoni, D Dueck, BJ Frey. University of Toronto Technical Report PSI-2006-44, May 15, 2006.

Normalization of Microarray Data: Single-Labeled and Dual-Labeled Arrays

Getting started in tiling microarray analysis