My long-term research goal is to change the way scientists analyze high-dimensional biomedical data for scientific discovery. Technologies for collecting observations about DNA, single cells, and tissue samples advance so quickly that analytic methods for these observations rapidly become obsolete. Furthermore, the complexity of the biological phenomena we attempt to quantify and understand overwhelms current methods. General approaches to data analysis, including principal component analysis and linear regression, are not sufficient for the intricacy of modern biomedical data; new approaches using statistical models that include analysis- and technology-specific structure must be developed for many types of studies.
Statistical tests for functional genomics. Expression quantitative trait loci (eQTLs) are genetic mutations that regulate gene transcription and often drive disease risk. As a postdoc, I was involved in an early eQTL paper using RNA-sequencing data; my contribution was in the statistical methodology for eQTL discovery in the presence of population structure. Later, I developed a statistical model for differential eQTLs, that is, eQTLs that are regulatory under one condition but not another. We found six differential eQTLs in our study. The strongest differential eQTL, which was regulatory after exposure to statins but not in their absence, was found to be protective against myopathy in two separate studies of the myopathic toxicity of statins.
Publications: [Pickrell et al., 2010], [Mangravite, Engelhardt, et al., 2013]
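The notion of a differential eQTL described above can be illustrated with a small simulation. The sketch below is not the statistical model from the cited work; it is a simple least-squares stand-in that assumes, purely for illustration, a paired design in which each individual's expression is measured both with and without exposure. The function names (`slope_with_se`, `differential_eqtl`, `simulate`), the sample size, and the effect size are all hypothetical.

```python
import random
from statistics import mean


def slope_with_se(x, y):
    """Ordinary least-squares slope of y on x and its standard error."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx
    intercept = y_bar - beta * x_bar
    rss = sum((yi - intercept - beta * xi) ** 2 for xi, yi in zip(x, y))
    se = (rss / (n - 2) / sxx) ** 0.5
    return beta, se


def differential_eqtl(genotype, expr_control, expr_treated):
    """Per-condition eQTL slopes and a t-statistic for their difference.

    Assumes a paired design: regressing the within-individual change in
    expression on genotype yields a slope equal to
    (treated slope - control slope), with a standard error that
    accounts for the pairing.
    """
    beta_control, _ = slope_with_se(genotype, expr_control)
    beta_treated, _ = slope_with_se(genotype, expr_treated)
    change = [t - c for t, c in zip(expr_treated, expr_control)]
    beta_diff, se_diff = slope_with_se(genotype, change)
    return beta_control, beta_treated, beta_diff / se_diff


def simulate(n=400, effect=1.0, seed=0):
    """Simulated data: the variant affects expression only after exposure."""
    rng = random.Random(seed)
    genotype = [rng.choice([0, 1, 2]) for _ in range(n)]  # allele dosage
    expr_control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    expr_treated = [effect * g + rng.gauss(0.0, 1.0) for g in genotype]
    return genotype, expr_control, expr_treated


genotype, expr_control, expr_treated = simulate()
b_ctrl, b_trt, t_diff = differential_eqtl(genotype, expr_control, expr_treated)
print(f"control slope {b_ctrl:.2f}, treated slope {b_trt:.2f}, "
      f"t-statistic for the difference {t_diff:.1f}")
```

In this toy setting the control slope should be near zero and the treated slope near the simulated effect, so the t-statistic for the difference is large. A real analysis would also need to handle population structure, covariates, and multiple testing, none of which is modeled here.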
High-dimensional GWAS. Genome-wide association studies (GWAS)
identify genetic variants that are associated with the occurrence of a
complex phenotype or disease in a set of individuals. Many phenotypes
are difficult to quantify with a single measure. I am building methods
for conducting GWAS using survey data as the phenotype. Standard
dimensionality reduction techniques are not effective for reducing
the dimension of these data because the resulting phenotype summaries
are not interpretable. In prior work, we applied sparse factor analysis
(SFA) and found that the sparse solution had phenotypic interpretations
for all of the factors and genetic associations for a number of
phenotypes. Our current work extends this model to achieve greater
robustness and to infer the number of factors from the underlying data.
Publications: [Hart, Engelhardt et al., 2012], [Zhao et al., 2014]
For SNPs associated with complex traits and disease to be medically actionable, it is essential that we understand how they work. As part of the GTEx consortium, and in collaboration with Casey Brown, we conducted large-scale replication analyses across eleven studies in seven tissue types. We have overlaid these results onto regulatory element data to gain a deeper mechanistic understanding of eQTLs by studying where eQTLs and cell-type-specific eQTLs are co-located with specific cis-regulatory elements. In collaboration with Tim Reddy, we studied long intergenic non-coding RNAs (lincRNAs) and, using protein-coding RNAs as a control, we found no evidence that lincRNAs ubiquitously affect gene transcription, in contrast to their protein-coding counterparts.
We are currently developing statistical models for understanding eQTLs and variants that influence mRNA isoform levels in RNA-seq data. We are also working on predictive models for eQTLs across tissue types and models that consider replication in trans-eQTLs.
Publications: [Brown, Mangravite, Engelhardt 2013], [McDowell et al. 2015]
Sparse latent factor models applied to genomic data have the ability to recover interpretable latent linear structure. Applied to genotype data from individuals with discrete population structure, these models recover the underlying ancestral populations; applied to individuals with continuous population structure, they recapitulate the individuals' geographic ancestry.
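To make the idea of recovering discrete population structure with a sparse factor concrete, here is a minimal sketch on simulated genotypes. It uses a generic rank-one sparse PCA heuristic (power iteration with soft-thresholded loadings), not the Bayesian sparse factor models from the cited work. The function names, the `shrink` parameter, the allele frequencies, and the sample sizes are all assumptions made for this illustration.

```python
import random


def center_columns(matrix):
    """Subtract each column (SNP) mean from a samples-by-SNPs matrix."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]


def normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm > 0 else list(vec)


def soft_threshold(vec, lam):
    """Shrink entries toward zero by lam; entries within lam of zero become 0."""
    return [(abs(v) - lam if abs(v) > lam else 0.0) * (1.0 if v > 0 else -1.0)
            for v in vec]


def sparse_rank_one(matrix, shrink=0.5, n_iter=50):
    """Rank-one sparse factor: scores u (per sample), loadings v (per SNP).

    `shrink` is a hypothetical tuning knob: the threshold at each step is
    this fraction of the largest absolute unthresholded loading.
    """
    x = center_columns(matrix)
    p = len(x[0])
    v = normalize([1.0] * p)
    u = []
    for _ in range(n_iter):
        # Scores: project each sample onto the current loadings.
        u = normalize([sum(xij * vj for xij, vj in zip(row, v)) for row in x])
        # Loadings: project each SNP onto the scores, then sparsify.
        raw = [sum(row[j] * ui for row, ui in zip(x, u)) for j in range(p)]
        lam = shrink * max(abs(r) for r in raw)
        v = normalize(soft_threshold(raw, lam))
    return u, v


def simulate_genotypes(n_per_pop=30, n_informative=10, n_neutral=30, seed=0):
    """Two populations that differ in allele frequency at the first SNPs only."""
    rng = random.Random(seed)

    def genotype(freq):
        return sum(rng.random() < freq for _ in range(2))  # dosage 0, 1, or 2

    rows = []
    for freq in (0.95, 0.05):  # population A rows first, then population B
        for _ in range(n_per_pop):
            rows.append([genotype(freq) for _ in range(n_informative)]
                        + [genotype(0.5) for _ in range(n_neutral)])
    return rows


genotypes = simulate_genotypes()
scores, loadings = sparse_rank_one(genotypes)
n_nonzero = sum(1 for v in loadings if v != 0.0)
print(f"{n_nonzero} of {len(loadings)} SNPs have nonzero loadings")
print("population A scores positive:", sum(s > 0 for s in scores[:30]), "of 30")
print("population B scores positive:", sum(s > 0 for s in scores[30:]), "of 30")
```

In this simulation the sign of each sample's score is expected to separate the two populations, and the nonzero loadings are expected to concentrate on the SNPs whose frequencies differ between them. The example is deliberately easy; real genotype data have weaker, more diffuse structure.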
We developed latent factor models for application to gene expression
data, adapting flexible continuous sparsity-inducing priors to support
an overcomplete representation and recover a large number of sparse
latent components. We also added a two-component mixture model to
support recovery of non-sparse, low-rank structure, which captures
variance due to confounders such as population structure and
technical effects. Using this general framework, we have developed
canonical correlation analysis and group factor analysis models to
jointly reduce dimension across multiple data types (e.g.,
genotype and gene expression data) and biclustering models with
sparsity on both the genes and the samples. By interpreting the latent
structure as regularized covariance matrix estimation, we build
ubiquitous, subset-specific, and subset-differential Gaussian
graphical models (Gaussian Markov random fields, gene co-expression networks).
We have validated these approaches by recovering trans-eQTLs that cannot be detected using standard methods. We are extending this work in a number of ways.
Publications: [Engelhardt & Stephens, 2010], [Gao, Brown, Engelhardt 2013], [Zhao et al. 2014], [Srivastava, Engelhardt, Dunson 2014], [Gao et al. 2014]
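The connection between latent factor structure, covariance estimation, and Gaussian graphical models mentioned above can be written out under a generic linear-Gaussian factor model. The notation here (loadings $\Lambda$, factors $x_i$, diagonal residual covariance $\Psi$, precision matrix $\Omega$) is chosen for illustration and is not taken from the cited papers.

```latex
\begin{align}
y_i &= \Lambda x_i + \epsilon_i,
  \qquad x_i \sim \mathcal{N}(0, I_K),
  \quad \epsilon_i \sim \mathcal{N}(0, \Psi), \\
\Sigma &= \operatorname{Cov}(y_i) = \Lambda \Lambda^{\top} + \Psi, \\
\Omega &= \Sigma^{-1},
  \qquad \Omega_{jk} = 0 \iff y_{ij} \perp y_{ik} \mid y_{i,-\{j,k\}}.
\end{align}
```

Under these assumptions, a sparse estimate of $\Lambda$ implies a structured, regularized estimate of the gene-by-gene covariance $\Sigma$, and the nonzero off-diagonal entries of $\Omega$ define the edges of the corresponding Gaussian graphical model.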
Publications: [Zhang et al. 2015]