The Engelhardt Group develops statistical models and methods to elucidate the biological mechanisms of complex phenotypes and disease. Measurements of biological systems contain both noise and systematic bias, and the analytical goal is often to identify low-dimensional substructure within a high-dimensional space. Model-based analyses address these qualities well, but the high dimension and scale of biological data make parameter estimation in sophisticated models challenging. We address these challenges by developing hierarchical statistical models and approximate parameter estimation methods that give access to interesting biological phenomena.

Statistical Analysis of Genetic Association Studies

High-dimensional GWAS. Genome-wide association studies (GWAS) identify genetic variants that are associated with the occurrence of a complex phenotype or disease in a set of individuals. Many phenotypes are difficult to quantify with a single measure. I am building methods for conducting GWAS using survey data as the phenotype. Standard dimensionality reduction techniques were not effective for reducing the size of these data because the resulting phenotype summaries were not interpretable. In prior work, we applied sparse factor analysis (SFA) and found that the sparse solution had phenotypic interpretations for all of the factors and genetic associations for a number of phenotypes. Our current work extends this model for greater robustness and to infer the number of factors from the underlying data.
Publications: [Hart, Engelhardt et al., 2012], [Zhao et al., 2014]

Epistatic QTLs. Although it is straightforward to determine whether a SNP impacts transcription of a gene, it is less clear how to test whether a SNP regulates transcription of a gene differently in the presence of a chemical modifier. With collaborators from the Children's Hospital Oakland Research Institute (CHORI), I am applying a Bayesian test based on regression with multiple correlated responses to determine whether statins change how a SNP modulates transcription. So far, we have found several differential eQTLs affecting genes in a cholesterol pathway, along with thousands of eQTLs; one differential eQTL was shown to be protective against a toxic side effect of statins in two clinical cohorts. We are currently developing methods that consider types of epistasis beyond GxE.
Publications: [Mangravite, Engelhardt, et al., 2013]

Bayesian tests for association. We are developing models for Bayesian tests of association between multiple genotypes and a phenotype that take into account local structure on SNPs to improve the statistical power to detect associations. We are extending these ideas to methods that produce better estimates of effect size, are faster, and can handle binary traits.
Publications: [Engelhardt & Adams 2014]
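
As a minimal illustration of what a Bayesian test of association computes, the sketch below scores a single SNP with Wakefield's approximate Bayes factor. It is a stand-in and not the model described above: it tests one SNP at a time and ignores local structure among SNPs. The function names, the prior standard deviation of 0.5, and the toy genotype and phenotype values are all assumptions made for illustration.

```python
import math

def slope_and_se(genotypes, phenotypes):
    """Least-squares slope of phenotype on genotype and its standard error."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotypes))
    beta = sxy / sxx
    rss = sum((y - mean_y - beta * (g - mean_g)) ** 2
              for g, y in zip(genotypes, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)
    return beta, se

def log10_bayes_factor(beta, se, prior_sd=0.5):
    """log10 Bayes factor for H1: effect ~ N(0, prior_sd^2) versus H0: effect = 0.

    Uses the normal approximation beta_hat ~ N(effect, se^2), so the marginal
    of beta_hat is N(0, se^2 + prior_sd^2) under H1 and N(0, se^2) under H0.
    """
    v, w = se ** 2, prior_sd ** 2
    z2 = (beta / se) ** 2
    log_bf = 0.5 * math.log(v / (v + w)) + 0.5 * z2 * w / (v + w)
    return log_bf / math.log(10)

# Hypothetical toy data: 60 individuals, genotypes coded 0/1/2, and a
# phenotype with a true effect of 0.8 per allele plus deterministic "noise".
genotypes = [i % 3 for i in range(60)]
noise = [((7 * i) % 11 - 5) / 10 for i in range(60)]
phenotypes = [0.8 * g + e for g, e in zip(genotypes, noise)]

beta_hat, se_hat = slope_and_se(genotypes, phenotypes)
print("effect estimate:", round(beta_hat, 3))
print("log10 Bayes factor:", round(log10_bayes_factor(beta_hat, se_hat), 1))
```

A positive log10 Bayes factor favors association and a negative one favors the null; the prior standard deviation controls how large an effect the alternative hypothesis expects.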

Studying the mechanistic underpinnings of functional SNPs

For SNPs associated with complex traits and disease to be medically actionable, it is essential that we understand how they work. As part of the GTEx consortium, and in collaboration with Casey Brown, we conducted a large-scale replication analysis across eleven studies in seven tissue types. We have overlaid these results onto regulatory element data to enable a deeper mechanistic understanding of eQTL data by studying where eQTLs and cell-type-specific eQTLs are co-located with specific cis-regulatory elements. In collaboration with Tim Reddy, we studied long intergenic non-coding RNAs (lincRNAs) and, using protein-coding RNAs as a control, we found no evidence that lincRNAs ubiquitously affect gene transcription, in contrast to their protein-coding counterparts. We are currently developing statistical models for understanding eQTLs and variants that influence mRNA isoform levels in RNA-seq data. We are also working on predictive models for eQTLs across tissue types and models that consider replication in trans-eQTLs.
Publications: [Brown, Mangravite, Engelhardt 2013], [McDowell et al. 2015]

Sparse latent factor models for recovering latent structure in genomic data

Sparse latent factor models applied to genomic data have the ability to recover interpretable latent linear structure. Applied to genotype data from individuals with discrete population structure, we can recover the underlying ancestral populations; applied to individuals with continuous population structure, we find a recapitulation of their geographic ancestry.

We developed latent factor models for application to gene expression data, adapting flexible continuous sparsity-inducing priors to support an overcomplete representation and recovering a large number of sparse latent components. We also added a two-component mixture model to support recovery of non-sparse, low-rank structure, which captures variance due to confounders such as population structure and technical effects. Using this general framework, we have developed canonical correlation analysis and group factor analysis models to jointly reduce dimension across multiple data observations (e.g., genotype and gene expression data) and biclustering models with sparsity on both the genes and the samples. By interpreting the latent structure as regularized covariance matrix estimation, we build ubiquitous, subset-specific, and subset-differential Gaussian graphical models (Gaussian Markov random fields, gene co-expression networks).
We have validated these approaches by recovering trans-eQTLs that cannot be detected using standard methods. We are extending this work in a number of ways.
Publications: [Engelhardt & Stephens, 2010], [Gao, Brown, Engelhardt 2013], [Zhao et al. 2014], [Srivastava, Engelhardt, Dunson 2014], [Gao et al. 2014]
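
To make the idea of a sparse latent component concrete, the sketch below recovers a single factor with sparse loadings from a small genes-by-samples matrix. It swaps in a simpler technique than the Bayesian models described above: an L1-penalized rank-one decomposition fit by alternating updates with soft thresholding, rather than sparsity-inducing priors. The function names, the threshold `tau`, and the toy matrix are assumptions made for illustration.

```python
import math

def soft_threshold(value, tau):
    """Shrink value toward zero by tau; values within tau of zero become zero."""
    return math.copysign(max(abs(value) - tau, 0.0), value)

def sparse_rank_one(Y, tau, n_iter=50):
    """Approximate Y (rows = genes, columns = samples) by loadings x factor.

    Alternates a soft-thresholded update of the gene loadings with a
    unit-norm update of the sample factor.
    """
    n_genes, n_samples = len(Y), len(Y[0])
    # Initialize the factor from the row with the largest norm.
    start = max(Y, key=lambda row: sum(v * v for v in row))
    norm = math.sqrt(sum(v * v for v in start))
    factor = [v / norm for v in start]
    loadings = [0.0] * n_genes
    for _ in range(n_iter):
        loadings = [soft_threshold(sum(y * f for y, f in zip(row, factor)), tau)
                    for row in Y]
        scores = [sum(Y[j][i] * loadings[j] for j in range(n_genes))
                  for i in range(n_samples)]
        norm = math.sqrt(sum(s * s for s in scores))
        if norm == 0.0:  # every loading was thresholded away
            break
        factor = [s / norm for s in scores]
    return loadings, factor

# Hypothetical toy data: 6 genes, 10 samples; only the first 3 genes load
# on the latent factor, and small deterministic "noise" is added everywhere.
true_factor = [1, -1, 2, -2, 1, -1, 0.5, -0.5, 1.5, -1.5]
true_loadings = [1.0, -1.0, 0.8, 0.0, 0.0, 0.0]
Y = [[true_loadings[j] * true_factor[i] + ((3 * i + 5 * j) % 7 - 3) / 20
      for i in range(10)] for j in range(6)]

loadings, factor = sparse_rank_one(Y, tau=1.0)
print([round(l, 2) for l in loadings])
```

The sign of the recovered component is arbitrary, as in any factor model, so only the pattern of zero and nonzero loadings and their relative signs are meaningful.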

Epigenome-wide association studies

We are currently developing methods for performing epigenome-wide scans for association of methylation status with phenotypes of interest. Current work involves developing and applying causal inference methods to unravel the relationships between epigenetic effects.

Publications: [Zhang et al. 2015]

Protein molecular function prediction

As a graduate student with Dr. Michael Jordan, collaborating with Dr. Steven Brenner, I created a statistical methodology, SIFTER (Statistical Inference of Function Through Evolutionary Relationships), to capture how protein molecular function evolves within a phylogeny in order to accurately predict function for unannotated proteins, improving over existing methods that use pairwise sequence comparisons. We relied on the assumption that function evolves in parallel with sequence evolution, implying that phylogenetic distance is the natural measure of functional divergence. In SIFTER, molecular function evolves as a first-order Markov chain within a phylogenetic tree. Posterior probabilities are computed exactly using message-passing, with an approximate method for large or functionally diverse protein families; model parameters are estimated using generalized expectation maximization. Functional predictions are extracted from protein-specific posterior probabilities for each function. I applied SIFTER to a genome-scale fungal data set, which included families of proteins from 46 fully-sequenced fungal genomes, and SIFTER substantially outperformed state-of-the-art methods in producing correct and specific predictions.

Publications: [Engelhardt et al., 2006], [Engelhardt et al., 2007], [Engelhardt et al., 2011]
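
The following toy sketch illustrates the general idea of propagating function annotations through a phylogeny by message passing. It is not the SIFTER implementation: it assumes a single binary function (present or absent), a symmetric two-state Markov chain along each branch, a fixed rate, and a uniform prior at the root. The tree, the branch lengths, and all function names are hypothetical.

```python
import math

def transition(branch_length, rate=1.0):
    """P(child state | parent state) for a symmetric two-state Markov chain."""
    same = 0.5 + 0.5 * math.exp(-2.0 * rate * branch_length)
    return [[same, 1.0 - same], [1.0 - same, same]]

def upward(node, evidence):
    """Upward message: P(evidence below node | node = s) for s in (0, 1)."""
    if "children" not in node:
        observed = evidence.get(node["name"])
        if observed is None:  # unannotated leaf carries no information
            return [1.0, 1.0]
        return [1.0 if s == observed else 0.0 for s in (0, 1)]
    message = [1.0, 1.0]
    for child, branch_length in node["children"]:
        P = transition(branch_length)
        child_message = upward(child, evidence)
        for s in (0, 1):
            message[s] *= sum(P[s][c] * child_message[c] for c in (0, 1))
    return message

def likelihood(tree, evidence, prior=(0.5, 0.5)):
    """Probability of the leaf annotations, summing over internal states."""
    root_message = upward(tree, evidence)
    return sum(prior[s] * root_message[s] for s in (0, 1))

def posterior_has_function(tree, evidence, query):
    """Posterior that leaf `query` has the function, given annotated leaves.

    Runs the upward pass once with the query leaf clamped to each state
    and normalizes the two joint probabilities.
    """
    joint = [likelihood(tree, {**evidence, query: s}) for s in (0, 1)]
    return joint[1] / (joint[0] + joint[1])

# Hypothetical four-protein family: A and B are close relatives, as are C and D.
tree = {"name": "root", "children": [
    ({"name": "X", "children": [({"name": "A"}, 0.1), ({"name": "B"}, 0.1)]}, 0.5),
    ({"name": "Y", "children": [({"name": "C"}, 0.1), ({"name": "D"}, 0.1)]}, 0.5),
]}

# A is annotated with the function, C and D are annotated without it,
# and B is the unannotated protein we want a prediction for.
p_b = posterior_has_function(tree, {"A": 1, "C": 0, "D": 0}, "B")
print("P(B has the function):", round(p_b, 3))
```

Because B sits a short phylogenetic distance from the annotated protein A, its posterior is pulled toward A's annotation even though the more distant proteins C and D lack the function.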