My long-term research goal is to change the way scientists analyze high-dimensional biomedical data for scientific discovery. Technologies for collecting genomic observations, including DNA, single cells, and tissue samples, advance so quickly that analytic methods for these observations rapidly become obsolete. Furthermore, the complexity of the biological phenomena we attempt to quantify and understand overwhelms current methods, which oversimplify that complexity in order to scale to the size of the data. General approaches to data analysis, including principal component analysis and linear regression, are insufficient for the intricacy of modern biomedical data; new approaches using statistical models and machine learning methods that include analysis- and technology-specific structure must be developed for many types of genomic studies.

My group builds and applies structured hierarchical models and approximate methods for the analysis of high-dimensional genomic data. Our work in developing methods for modern genomic technologies and scientific questions requires three types of innovations. First, statistical models need to be adapted to capture the complexity of the data. Second, inference algorithms for these structured models need to scale to the size of the data. Third, software infrastructure must be usable by the biomedical community. Addressing these issues accelerates the pace of discovery and actionable results from biomedical research, because the analytic solutions from advanced platforms become broadly available and immediately applicable. The development of these frameworks is specific to technology and analytic goals, and is not easily generalized. To this end, our work has broadly focused on innovations in two types of statistical analyses: structured regression models for hypothesis testing, and hierarchical latent variable models for dimension reduction and exploratory data analysis, as detailed below.

Alongside the development of these frameworks come adaptations of inference methods, using ideas from machine learning, for robust and tractable posterior inference in these models, as well as experimental validation of the recovered latent structure and of the hypothesis tests.

*Statistical tests for functional genomics.*

- Expression quantitative trait loci (eQTLs) are genetic mutations that regulate gene transcription and often drive disease risk. As a postdoc, I was involved in an early eQTL paper using data from RNA-sequencing; my contribution was in the statistical methodologies for eQTL discovery in the presence of population structure [Pickrell et al., 2010].
- I developed a statistical model for differential eQTLs, or eQTLs that are regulatory under one condition but not another. We found six differential eQTLs in our study. The strongest differential eQTL, which was regulatory after exposure to statins but not in their absence, was found to be protective against myopathy in two separate studies of the myopathic toxicity of statins [Mangravite, Engelhardt, et al., 2013].
- With Julien Ayroles, we have developed a test for variance QTLs using a Bayesian heteroskedastic model, with iterated Laplace approximations to compute the Bayes factor [Dumitrascu et al., 2015].
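
The notion of a differential eQTL in the list above can be illustrated with a small sketch. This is not the Bayesian model from the cited study: it assumes a plain least-squares fit of expression on genotype dosage within each condition and reports the difference in slopes. The function names and toy data are hypothetical.

```python
# Illustrative sketch only: a least-squares contrast, not the Bayesian model
# from the cited study. All names and numbers here are hypothetical.

def ols_slope(genotypes, expression):
    """Least-squares slope of expression on genotype dosage (0, 1, or 2)."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(expression) / n
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, expression))
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    return sxy / sxx


def differential_effect(genotypes, expr_control, expr_treated):
    """Per-condition eQTL slopes and their difference (treated minus control)."""
    slope_control = ols_slope(genotypes, expr_control)
    slope_treated = ols_slope(genotypes, expr_treated)
    return slope_control, slope_treated, slope_treated - slope_control


# Toy data: six samples, one SNP, expression measured with and without exposure.
genotypes = [0, 0, 1, 1, 2, 2]
expr_control = [5.0, 5.2, 5.1, 4.9, 5.0, 5.2]   # no genotype effect
expr_treated = [5.0, 5.2, 6.1, 5.9, 7.0, 7.2]   # about one unit per allele

slope_control, slope_treated, delta = differential_effect(
    genotypes, expr_control, expr_treated
)
print(f"control slope={slope_control:.2f}, treated slope={slope_treated:.2f}, "
      f"difference={delta:.2f}")
```

A slope near zero in one condition and a clearly nonzero slope in the other is the pattern a differential eQTL test looks for; a real analysis would also need a measure of uncertainty and a correction for population structure.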

- I was involved in work to understand the genetic basis of response to *d*-amphetamine from high-dimensional and noisy time series survey data. In this work, we developed a sparse latent factor model to capture each subject's unique response using ten interpretable response types. With this approach, we identified a novel genetic association with the low-dosage, positive response to *d*-amphetamine. Importantly, we found that our genetic association was also protective against schizophrenia and attention deficit hyperactivity disorder. As a part of the Genetics of Personality Consortium, I was involved in two meta-analysis papers across multiple GWAS that studied the genetic basis of neuroticism, finding a relationship with major depressive disorder risk, and the genetic basis of extraversion [Hart, Engelhardt et al., 2012] [Hart et al., 2014] [de Moor et al., 2015] [van den Berg et al., 2016].
- We developed a canonical correlation analysis with a structured prior to identify associations between multiple observation types from the same set of samples, and we used this to identify collections of genetic variants associated with collections of genes [Zhao et al., 2016].
- We developed an association test between covariates, including genotypes, and traits collected over time, such as gene expression response to perturbation, using a Gaussian process regression model. This model naturally allows the incorporation of covariates [Tonner et al., 2016].

- Sparse latent factor models applied to genomic data have the ability to recover interpretable latent linear structure. Applied to genotype data from individuals with discrete population structure, we can recover the underlying ancestral populations; applied to individuals with continuous population structure, we find a recapitulation of their geographic ancestry [Engelhardt & Stephens, 2010].
- We developed latent factor models for application to gene expression data, adapting flexible continuous sparsity-inducing priors to support an overcomplete representation and recovering a large number of sparse latent components. We also added a two-component mixture model to support recovery of non-sparse, low-rank structure, which captures variance effects due to confounding such as population structure and technical effects [Gao, Brown, Engelhardt 2013].
- Using this sparsity prior, we have developed biclustering models with sparsity on both the genes and the samples. By interpreting the latent structure as regularized covariance matrix estimation, we build ubiquitous, subset-specific, and subset-differential Gaussian graphical models (Gaussian Markov random fields, gene co-expression networks) [Gao et al., 2016].

- Often the data contain a high proportion of observations that are missing non-uniformly at random (e.g., single-cell RNA-seq; Netflix ratings). We developed the hierarchical compound Poisson factorization (HCPF) model to address this issue, explicitly modeling the missingness structure of the data and sharing information between the model of the observations and the missingness model [Basbug and Engelhardt, 2016].
- We extended the HCPF to allow observations across time using a gamma-Markov chain [Jerfel et al., 2017].
- Sparse latent factor models generally require a choice of the number of latent components. We developed a nonparametric latent variable model using a generalized double Pareto prior and fast inference to address this problem in a computationally tractable expandable factor analysis model [Srivastava et al., 2017].
- We are very interested in methods for model checking for latent variable models. We proposed an approach using posterior predictive checks for the admixture model of genomic data, where the discrepancy functions correspond to population genetic statistics of interest.
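
The logic of a posterior predictive check can be sketched on a much simpler model than the admixture model. The example below is an assumption-laden illustration, not the published method: a single-population Beta-Binomial model for one SNP stands in for the admixture model, and the variance of genotype dosages stands in for a population genetic statistic. All names and data are hypothetical.

```python
import random

# Illustrative sketch only: a one-population Beta-Binomial model stands in for
# the admixture model, and genotype variance stands in for a population
# genetic statistic. All names and data are hypothetical.

def genotype_variance(genotypes):
    """Discrepancy function: variance of genotype dosages across individuals."""
    n = len(genotypes)
    mean = sum(genotypes) / n
    return sum((g - mean) ** 2 for g in genotypes) / n


def posterior_predictive_pvalue(genotypes, n_draws=500, seed=0):
    """Fraction of replicated data sets whose discrepancy is at least the observed one."""
    rng = random.Random(seed)
    n = len(genotypes)
    alt_alleles = sum(genotypes)
    ref_alleles = 2 * n - alt_alleles
    observed = genotype_variance(genotypes)
    exceed = 0
    for _ in range(n_draws):
        # Posterior draw of the allele frequency under a Beta(1, 1) prior.
        freq = rng.betavariate(1 + alt_alleles, 1 + ref_alleles)
        # Replicate the data: two independent alleles per individual.
        replicate = [(rng.random() < freq) + (rng.random() < freq) for _ in range(n)]
        if genotype_variance(replicate) >= observed:
            exceed += 1
    return exceed / n_draws


# Two sub-populations fixed for different alleles: a single population fits poorly.
structured = [0] * 20 + [2] * 20
p_value = posterior_predictive_pvalue(structured)
print(f"posterior predictive p-value: {p_value:.3f}")
```

A posterior predictive p-value close to 0 or 1 indicates that the fitted model rarely reproduces the observed statistic, which is the signal of misfit that such a check is designed to expose.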

In order for SNPs associated with complex traits and disease to be medically actionable, it is essential that we understand how they work.

- We conducted large-scale replication studies across eleven studies in seven tissue types. We have overlaid these results onto regulatory element data to enable a much more profound mechanistic understanding of eQTL data by studying where eQTLs and cell-type-specific eQTLs are co-located with specific cis-regulatory elements [Brown et al., 2013].
- As a PI in the GTEx Consortium, my group is involved in the association mapping for trans-eQTLs and the downstream functional analyses [Jo et al., 2016].

- With UPenn Hospitals, we are working with records from over 260,000 inpatients, including patient demographics, vital signs, lab test results, and major events. We have designed a multivariate Gaussian process regression model to smooth, impute, and predict future patient state for doctors in a hospital setting using a structured kernel [Cheng et al., 2017].
- We developed an off-policy reinforcement learning approach to identify the optimal policy to wean patients from ventilators. This is particularly difficult as we assumed the doctors' policies were suboptimal; validating our learned policy is challenging in this situation [Prasad et al., 2017].
- To ensure that statistical inference in these high-dimensional Gaussian processes was feasible, we developed Large Linear GP (LLGP), an extension of the linear model of co-regionalization to multi-output GPs that exploits the structure of the Gram matrix and naturally allows nonstationary processes [Feinberg et al., 2017].
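
The Gram matrix structure that the linear model of co-regionalization (LMC) gives rise to can be sketched as follows. This builds the dense matrix only and does not implement LLGP or its structured solver. It assumes the common parameterization in which each latent process contributes an output covariance of rank one plus a diagonal, multiplied (as a Kronecker product) with a kernel matrix over shared inputs. The kernel choice, weights, and inputs are hypothetical.

```python
import math

# Illustrative sketch only: builds the dense Gram matrix of a linear model of
# co-regionalization (LMC). It does not implement LLGP's structured solver.
# Kernel choices, weights, and inputs are hypothetical.

def rbf_gram(xs, lengthscale):
    """Squared-exponential kernel matrix over scalar inputs."""
    return [[math.exp(-0.5 * ((a - b) / lengthscale) ** 2) for b in xs] for a in xs]


def coregionalization(weights, kappa):
    """Output covariance B = a a^T + diag(kappa) for one latent process."""
    d = len(weights)
    return [[weights[i] * weights[j] + (kappa[i] if i == j else 0.0)
             for j in range(d)] for i in range(d)]


def kronecker(b, k):
    """Kronecker product of two square matrices given as nested lists."""
    d, n = len(b), len(k)
    return [[b[i // n][j // n] * k[i % n][j % n] for j in range(d * n)]
            for i in range(d * n)]


def lmc_gram(xs, components):
    """Sum over latent processes q of the Kronecker product of B_q and K_q."""
    size = len(components[0][0]) * len(xs)
    total = [[0.0] * size for _ in range(size)]
    for weights, kappa, lengthscale in components:
        term = kronecker(coregionalization(weights, kappa), rbf_gram(xs, lengthscale))
        for i in range(size):
            for j in range(size):
                total[i][j] += term[i][j]
    return total


times = [0.0, 1.0, 2.0]                      # shared measurement times
components = [
    ([1.0, 0.5], [0.1, 0.1], 1.0),           # (weights, kappa, lengthscale)
    ([0.2, 1.0], [0.0, 0.0], 3.0),
]
gram = lmc_gram(times, components)
print(len(gram), len(gram[0]))
```

With D outputs and N shared inputs the dense matrix is DN by DN, so naive inference scales cubically in DN; the sum-of-Kronecker-products form is the kind of structure a scalable method can exploit instead of materializing the matrix.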

**Publications:** [Zhang et al. 2015]

**Publications:** [Engelhardt et al., 2006], [Engelhardt et al., 2007], [Engelhardt et al., 2011]