CS Department Colloquium Series
In 2003, the Human Genome Project marked a major scientific milestone by releasing the first consensus DNA sequence of the human genome. The ENCODE Project (Encyclopedia of DNA Elements) was launched to pick up where the Human Genome Project left off, with the ambitious goal of systematically deciphering the potential function of every base (letter) in the genome. ENCODE has generated the largest collection of functional genomic data in humans to date, measuring the activity of thousands of cellular moieties in a variety of normal and diseased cellular contexts. In this talk, I will describe novel computational and machine learning approaches that I developed for the integrative analysis of massive compendia of diverse biological data, such as ENCODE, to unravel the functional heterogeneity and variation of regulatory elements in the human genome and their implications for human disease.
I will begin with a gentle introduction to the diversity and scale of ENCODE data and a brief overview of the robust statistical methods that we developed for automated detection of DNA binding sites of hundreds of regulatory proteins from noisy experimental data. Regulatory proteins can perform multiple functions by interacting with, and co-binding DNA with, different combinations of other regulatory proteins. I developed a novel discriminative machine learning formulation based on regularized rule-based ensembles that was able to sort through the combinatorial complexity of possible regulatory interactions and learn statistically significant item-sets of co-binding events at an unprecedented level of detail. I found extensive evidence that regulatory proteins can switch partners at different sets of genomic domains, both within a single cell-type and across different cell-types, affecting structural and chemical properties of DNA and regulating different functional categories of target genes. Using regulatory elements discovered from ENCODE data, we were also able to provide putative functional interpretations for up to 81% of all publicly available sequence variants (mutations) identified in large-scale disease studies and to generate new hypotheses by integrating multiple sources of data.
Finally, I will present a brief overview of my recent efforts to use multivariate hidden Markov models to analyze the dynamics of various chemical modifications to DNA along three key axes of variation: across multiple species, across different cell-types in a single species (human), and across multiple human individuals for the same cell-type. Our results indicate a remarkable universality of chemical modifications defining hidden regulatory states across the animal kingdom, with dramatic differences in the variation and functional impact of these regulatory elements between cell-types and individuals.
Together, these efforts take us one step closer to learning comprehensive models of gene regulation in humans in order to improve our system-level understanding of cellular processes and complex diseases.