My long-term research goal is to change the way scientists analyze high-dimensional biomedical data for scientific discovery. Technologies for collecting observations about DNA, single cells, and tissue samples advance so quickly that analytic methods for these observations rapidly become obsolete. Furthermore, the complexity of the biological phenomena we attempt to quantify and understand overwhelms current methods. General approaches to data analysis, including principal component analysis and linear regression, are not sufficient for the intricacy of modern biomedical data; new approaches using statistical models that include analysis- and technology-specific structure must be developed for many types of studies.

Multiple hypothesis testing frameworks for quantitative genetics

The field of quantitative genetics aims to understand the genetic basis of complex traits. To accomplish this aim, statistical tests must be developed to identify associations between genotypes and quantitative or binary traits in studies with limited sample sizes. These studies often include multiple complex traits and whole-genome analyses, pushing the number of statistical tests into the trillions and requiring new approaches to multiple hypothesis testing correction. Moreover, samples for studies of complex traits often carry study- and technology-related artifacts, such as batch effects, population structure among the samples, or biases in age, sex, or body mass index arising from biased sample acquisition.
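
As a point of reference for the multiple testing problem, the sketch below implements the standard Benjamini-Hochberg false discovery rate procedure and compares it with a Bonferroni threshold. It is a textbook baseline, not one of the corrections developed in this research; the function name `benjamini_hochberg` and the p-values are made up for illustration.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the test is rejected at FDR level alpha."""
    m = len(pvalues)
    if m == 0:
        return []
    # Indices of the tests, sorted by ascending p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Largest 1-based rank k such that p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every test ranked at or below max_k.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from ten association tests.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(pvals, alpha=0.05)

# Bonferroni controls the family-wise error rate and is more conservative.
bonferroni = [p <= 0.05 / len(pvals) for p in pvals]
```

With these made-up values, Benjamini-Hochberg rejects the two smallest p-values and Bonferroni rejects only the smallest. As the number of tests grows, both thresholds become more stringent, which is one reason standard corrections lose power at this scale.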

Statistical tests for functional genomics. Expression quantitative trait loci (eQTLs) are genetic mutations that regulate gene transcription and often drive disease risk. As a postdoc, I was involved in an early eQTL paper using data from RNA-sequencing; my contribution was in the statistical methodologies for eQTL discovery in the presence of population structure. Later, I developed a statistical model for differential eQTLs, or eQTLs that are regulatory under one condition but not another. We found six differential eQTLs in our study. The strongest differential eQTL, which was regulatory after exposure to statins but not in their absence, was found to be protective against myopathy in two separate studies of the myopathic toxicity of statins.
Publications: [Pickrell et al., 2010], [Mangravite, Engelhardt, et al., 2013]
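
The idea of a differential eQTL can be illustrated by comparing the genotype effect on expression across two conditions. The sketch below is only a toy illustration on made-up data. It is not the statistical model from the cited publications, and a real analysis would need a formal test of the difference and adjustment for covariates such as population structure. The function name `ols_slope` and all values are hypothetical.

```python
def ols_slope(x, y):
    """Least-squares slope of y regressed on x (simple linear regression)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical genotype dosages (copies of the minor allele) for six individuals.
genotype = [0, 0, 1, 1, 2, 2]

# Hypothetical expression of one gene in the same individuals, measured
# without and with exposure to a treatment.
expr_control = [5.0, 5.2, 5.1, 4.9, 5.0, 5.1]
expr_treated = [5.0, 5.2, 6.1, 5.9, 7.0, 7.2]

slope_control = ols_slope(genotype, expr_control)  # near zero: no eQTL effect
slope_treated = ols_slope(genotype, expr_treated)  # clear per-allele effect

# A large difference in slopes suggests a condition-specific (differential) eQTL.
effect_difference = slope_treated - slope_control
```

In practice the same comparison is often expressed as a genotype-by-condition interaction term in a single regression, which provides a direct test of whether the two slopes differ.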

High-dimensional GWAS. Genome-wide association studies (GWAS) identify genetic variants that are associated with the occurrence of a complex phenotype or disease in a set of individuals. Many phenotypes are difficult to quantify with a single measure. I am building methods for conducting GWAS using survey data as the phenotype. Standard dimensionality reduction techniques are not effective for reducing the dimension of these data because the resulting phenotype summaries are not interpretable. In prior work, we applied sparse factor analysis (SFA) and found that the sparse solution had phenotypic interpretations for all of the factors, and genetic associations for a number of phenotypes. Our current work extends this model for greater robustness and to infer the number of factors from the underlying data.
Publications: [Hart, Engelhardt et al., 2012], [Zhao et al., 2014]

Bayesian tests for association. We are developing models for Bayesian tests of association between multiple genotypes and a phenotype that account for local structure among SNPs to improve the statistical power to detect associations. We are extending these ideas to methods that produce better estimates of effect size, are faster, and can handle binary traits.
Publications: [Engelhardt & Adams 2014]

Studying the mechanistic underpinnings of functional SNPs

In order for SNPs associated with complex traits and disease to be medically actionable, it is essential that we understand how they work. As part of the GTEx consortium, and in collaboration with Casey Brown, we conducted large-scale replication analyses across eleven studies in seven tissue types. We have overlaid these results onto regulatory element data to build a deeper mechanistic understanding of eQTLs by studying where eQTLs and cell type-specific eQTLs are co-located with specific cis-regulatory elements. In collaboration with Tim Reddy, we studied long intergenic non-coding RNAs (lincRNAs) and, using protein-coding RNA as a control, we found no evidence that lincRNAs ubiquitously affect gene transcription, in contrast to their protein-coding counterparts. We are currently developing statistical models for understanding eQTLs and variants that influence mRNA isoform levels in RNA-seq data. We are also working on predictive models for eQTLs across tissue types and models that consider replication in trans-eQTLs.
Publications: [Brown, Mangravite, Engelhardt 2013], [McDowell et al. 2015]

Sparse latent factor models for recovering latent structure in genomic data

Sparse latent factor models applied to genomic data have the ability to recover interpretable latent linear structure. Applied to genotype data from individuals with discrete population structure, we can recover the underlying ancestral populations; applied to individuals with continuous population structure, we find a recapitulation of their geographic ancestry.
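
For readers unfamiliar with this model class, a linear latent factor model can be written in generic notation as follows. The symbols are assumptions chosen for illustration and are not taken from the cited papers.

```latex
Y = \Lambda F + E, \qquad
Y \in \mathbb{R}^{p \times n},\quad
\Lambda \in \mathbb{R}^{p \times K},\quad
F \in \mathbb{R}^{K \times n},\quad
E_{ij} \sim \mathcal{N}(0, \psi_i)
```

Here Y is the observed data matrix, the K columns of \Lambda are the factor loadings, F holds the latent factors, and E is residual noise. Sparsity is typically induced by a prior that shrinks most entries of \Lambda toward zero, so that each factor involves only a small subset of the rows of Y and is therefore easier to interpret.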

We developed latent factor models for application to gene expression data, adapting flexible continuous sparsity-inducing priors to support an overcomplete representation and recovering a large number of sparse latent components. We also added a two-component mixture model to support recovery of non-sparse, low-rank structure, which captures variance effects due to confounding such as population structure and technical effects. Using this general framework, we have developed canonical correlation analysis and group factor analysis models to jointly reduce dimension across multiple data observations (e.g., genotype and gene expression data) and biclustering models with sparsity on both the genes and the samples. By interpreting the latent structure as regularized covariance matrix estimation, we build ubiquitous, subset-specific, and subset-differential Gaussian graphical models (Gaussian Markov random fields, gene co-expression networks).
We have validated these approaches by recovering trans-eQTLs that cannot be detected using standard methods. We are extending this work in a number of ways.
Publications: [Engelhardt & Stephens, 2010], [Gao, Brown, Engelhardt 2013], [Zhao et al. 2014], [Srivastava, Engelhardt, Dunson 2014], [Gao et al. 2014]

Epigenome-wide association studies

We are currently developing methods for performing epigenome-wide scans for association of methylation status with phenotypes of interest. Ongoing work develops and applies methods for causal inference to unravel the relationships between epigenetic effects.

Publications: [Zhang et al. 2015]

Protein molecular function prediction

As a graduate student with Dr. Michael Jordan, collaborating with Dr. Steven Brenner, I created a statistical methodology, SIFTER (Statistical Inference of Function Through Evolutionary Relationships), to capture how protein molecular function evolves within a phylogeny in order to accurately predict function for unannotated proteins, improving over existing methods that use pairwise sequence comparisons. We relied on the assumption that function evolves in parallel with sequence evolution, implying that phylogenetic distance is the natural measure of functional divergence. In SIFTER, molecular function evolves as a first-order Markov chain within a phylogenetic tree. Posterior probabilities are computed exactly using message-passing, with an approximate method for large or functionally diverse protein families; model parameters are estimated using generalized expectation maximization. Functional predictions are extracted from protein-specific posterior probabilities for each function. I applied SIFTER to a genome-scale fungal data set, which included families of proteins from 46 fully sequenced fungal genomes, and SIFTER substantially outperformed state-of-the-art methods in producing correct and specific predictions.

Publications: [Engelhardt et al., 2006], [Engelhardt et al., 2007], [Engelhardt et al., 2011]
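
The general idea of exact posterior computation by message-passing on a tree can be sketched on a toy example. The code below is not SIFTER: it assumes a single binary function (present or absent), one symmetric flip probability shared by every branch, a uniform prior at the root, and a hypothetical five-node tree. All names (`TREE`, `MUTATION`, `leaf_posterior`, and so on) are made up for illustration. It computes the posterior for an unannotated leaf with an upward (pruning) pass and checks the result against brute-force enumeration.

```python
from itertools import product

# Hypothetical toy phylogeny: each internal node maps to its children.
TREE = {"root": ["anc", "C"], "anc": ["A", "B"]}
NODES = ["root", "anc", "A", "B", "C"]
STATES = (0, 1)      # 0 = protein lacks the function, 1 = protein has it
PRIOR = (0.5, 0.5)   # assumed prior over the state at the root
MUTATION = 0.1       # assumed probability that the state flips along a branch


def transition(parent_state, child_state):
    """First-order Markov transition probability along one branch."""
    return 1.0 - MUTATION if parent_state == child_state else MUTATION


def upward_message(node, evidence):
    """Return [P(evidence at or below node | node = s) for s in STATES]."""
    if node not in TREE:  # leaf node
        if node in evidence:
            return [1.0 if s == evidence[node] else 0.0 for s in STATES]
        return [1.0 for _ in STATES]  # unobserved leaf carries no information
    message = [1.0 for _ in STATES]
    for child in TREE[node]:
        child_msg = upward_message(child, evidence)
        for s in STATES:
            message[s] *= sum(transition(s, t) * child_msg[t] for t in STATES)
    return message


def evidence_likelihood(evidence):
    """Probability of the observed leaf states, summing out all other nodes."""
    root_msg = upward_message("root", evidence)
    return sum(PRIOR[s] * root_msg[s] for s in STATES)


def leaf_posterior(query, evidence):
    """Posterior over the state of an unannotated leaf given annotated leaves."""
    joint = [evidence_likelihood({**evidence, query: s}) for s in STATES]
    total = sum(joint)
    return [j / total for j in joint]


def brute_force_posterior(query, evidence):
    """Same posterior by enumerating every joint assignment (for checking)."""
    joint = [0.0 for _ in STATES]
    for assignment in product(STATES, repeat=len(NODES)):
        state = dict(zip(NODES, assignment))
        if any(state[n] != v for n, v in evidence.items()):
            continue
        p = PRIOR[state["root"]]
        for parent, children in TREE.items():
            for child in children:
                p *= transition(state[parent], state[child])
        joint[state[query]] += p
    total = sum(joint)
    return [j / total for j in joint]


# Leaf B is unannotated; its sibling A has the function, the outgroup C does not.
posterior = leaf_posterior("B", {"A": 1, "C": 0})
check = brute_force_posterior("B", {"A": 1, "C": 0})
```

In this toy setting the prediction for B leans toward the annotation of its closer relative A, which mirrors the assumption that phylogenetic distance tracks functional divergence. The message-passing cost grows linearly with the number of nodes, whereas enumeration grows exponentially.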