MELD: Fast moment estimation for generalized latent Dirichlet models

In this work, we integrate out the latent components of a latent Dirichlet model and then develop a generalized method of moments for fast parameter estimation in the integrated model. This approach makes the model agnostic to the distribution of the observations. The work is described in:

Zhao S, Engelhardt BE, Mukherjee S, and Dunson DB. "Fast moment estimation for generalized latent Dirichlet models" Journal of the American Statistical Association (JASA) [pdf]

The MELD software, written and maintained by Shiwen Zhao, is publicly available: [Software]

ARSVD: Adaptive randomized dimension reduction on massive data, with application for linear mixed models

This approach uses an adaptive random matrix projection to accelerate computation of the principal components of a large matrix. We show the implicit regularization properties of this approach and evaluate its impact on estimation for linear mixed models in genomic applications. The work is described in:

Darnell G, Georgiev S, Mukherjee S, and Engelhardt BE. "Adaptive randomized dimension reduction on massive data" Journal of Machine Learning Research (JMLR) [pdf]

The ARSVD software, written and maintained by Gregory Darnell, is publicly available: [Software]
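
The random-projection idea behind the ARSVD entry above can be illustrated with a minimal fixed-rank randomized SVD in NumPy. This is only a sketch under stated assumptions: the function name `randomized_svd` and its parameters (`rank`, `oversample`, `n_power_iter`) are hypothetical, the target rank is fixed by the caller, and the adaptive rank selection described in the paper is not reproduced. It is not the ARSVD software's interface.

```python
import numpy as np

def randomized_svd(A, rank, oversample=10, n_power_iter=2, seed=0):
    """Approximate the top `rank` singular triplets of A via random projection.

    Illustrative sketch only: fixed rank, no adaptive rank selection.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = A.shape
    k = min(rank + oversample, n_rows, n_cols)
    # Project A onto k random Gaussian directions to sample its column space.
    omega = rng.standard_normal((n_cols, k))
    Y = A @ omega
    # Power iterations sharpen the spectrum; QR keeps the basis well conditioned.
    for _ in range(n_power_iter):
        Q, _ = np.linalg.qr(Y)
        Z, _ = np.linalg.qr(A.T @ Q)
        Y = A @ Z
    Q, _ = np.linalg.qr(Y)
    # Exact SVD of the small (k x n_cols) matrix B = Q^T A.
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :rank], s[:rank], Vt[:rank, :]

# Hypothetical usage on a synthetic matrix of exact rank 5.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 50))
U, s, Vt = randomized_svd(A, rank=5)
rel_err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
```

The expensive decomposition is applied only to the small projected matrix, which is where the speedup on large matrices comes from.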

BASS: Bayesian canonical correlation analysis and group factor analysis

Given two or more paired observation matrices, BGFA finds sparse and dense latent components that capture either observation-specific covariance or covariance shared across observations. In the case of m=2 observations, this model is the canonical correlation model, in which the linear latent space is the linear projection that maximizes the correlation across the two observations. The work is described in:

Zhao S, Gao C, Mukherjee S, and Engelhardt BE. "Bayesian group latent factor analysis with structured sparsity" Journal of Machine Learning Research (JMLR) [pdf]

The BASS software, written and maintained by Shiwen Zhao, is publicly available: [Software]
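
For the m=2 case mentioned above, the projection that maximizes correlation can be illustrated with classical (non-Bayesian) canonical correlation analysis. The sketch below is an assumption-laden illustration and not the BASS model or its interface: the function name `cca`, the small ridge term `reg`, and the synthetic data are all hypothetical, and no sparsity prior is involved.

```python
import numpy as np

def cca(X, Y, reg=1e-8):
    """Classical CCA: canonical correlations and projection directions.

    Illustrative sketch only; `reg` is a small ridge term for stability.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    # Whiten each block using its Cholesky factor: S = L L^T.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    M = np.linalg.solve(Lx, Sxy) @ np.linalg.inv(Ly).T
    # Singular values of the whitened cross-covariance are the correlations.
    U, corr, Vt = np.linalg.svd(M, full_matrices=False)
    # Map the whitened directions back to the original coordinates.
    Wx = np.linalg.solve(Lx.T, U)
    Wy = np.linalg.solve(Ly.T, Vt.T)
    return corr, Wx, Wy

# Hypothetical paired data driven by one shared latent component z.
rng = np.random.default_rng(0)
z = rng.standard_normal(500)
X = np.outer(z, [1.0, 0.5, -1.0]) + 0.1 * rng.standard_normal((500, 3))
Y = np.outer(z, [2.0, -1.0]) + 0.1 * rng.standard_normal((500, 2))
corr, Wx, Wy = cca(X, Y)
# Empirical correlation of the first pair of projected variables.
proj_corr = np.corrcoef((X - X.mean(axis=0)) @ Wx[:, 0],
                        (Y - Y.mean(axis=0)) @ Wy[:, 0])[0, 1]
```

Here the first canonical correlation should be close to 1 because both matrices are driven by the same latent variable, and it matches the correlation of the projected data.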

BicMix: Bayesian biclustering via a doubly-sparse latent factor model

This software finds two sparse low dimensional matrices that capture sparse covariance structure in the response matrix. The work is described in:

Gao C, Zhao S, McDowell IC, Brown CD, and Engelhardt BE. "Differential gene co-expression networks via Bayesian biclustering models" (submitted) [arXiv]

The BicMix software, written and maintained by Dr. Chuan Gao, is publicly available: [Software]; send questions and comments to:

Bayesian structured sparse regression

This software computes the posterior probability of inclusion for each covariate given a set of predictors (and a positive definite matrix describing their similarity) and a quantitative response. The work is described in:

Engelhardt BE and Adams RP. "Bayesian structured sparsity from Gaussian fields" (in review) [arXiv]

The software is available on GitHub [Software]

Posterior predictive checks (PPCs) for admixture models

This software fits the original admixture model to genomic data and encodes the process of performing a posterior predictive check with five possible discrepancy functions. The work is described in:

Mimno D, Blei DM, and Engelhardt BE. "Posterior predictive checks to quantify lack-of-fit in admixture models of latent population structure" (in review) [arXiv]

The software is available on GitHub [Software]
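
The general shape of a posterior predictive check can be sketched in a few lines. The example below is a simplified illustration, not the admixture model or any of the five discrepancy functions from the paper: it assumes a single-population Beta-Binomial allele-frequency model with a uniform prior, and the names `posterior_predictive_pvalue` and `variance`, as well as the genotype data, are hypothetical.

```python
import random

def variance(xs):
    """Population variance, used here as the discrepancy function."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def posterior_predictive_pvalue(data, n_trials, discrepancy,
                                n_draws=2000, seed=0):
    """PPC under a Binomial likelihood with a Beta(1, 1) prior (an assumption).

    Returns the fraction of replicated data sets whose discrepancy is at
    least as large as the discrepancy of the observed data.
    """
    rng = random.Random(seed)
    successes = sum(data)
    failures = n_trials * len(data) - successes
    observed = discrepancy(data)
    exceed = 0
    for _ in range(n_draws):
        # Draw a frequency from the posterior, then replicate the data set.
        p = rng.betavariate(1 + successes, 1 + failures)
        replicate = [sum(rng.random() < p for _ in range(n_trials))
                     for _ in data]
        if discrepancy(replicate) >= observed:
            exceed += 1
    return exceed / n_draws

# Hypothetical genotypes (0, 1, or 2 allele copies) from two very different
# subpopulations, which a single-population model cannot reproduce.
genotypes = [0] * 20 + [2] * 20
p_value = posterior_predictive_pvalue(genotypes, n_trials=2,
                                      discrepancy=variance)
```

A posterior predictive p-value near 0 or 1 signals that the fitted model fails to reproduce the chosen feature of the data; here the single-population model cannot generate the observed genotype variance.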

Methylation data analysis

Given a set of methylation data across the human genome, this software predicts methylation levels at single-CpG-site resolution. The methods are written in R and include all code required to regenerate each figure and table in the manuscript, including the related classifiers that we compared. [Code]

These methods and results are described in the paper:

Zhang W, Spector TD, Deloukas P, Bell JT, and Engelhardt BE. "Predicting genome-wide DNA methylation using methylation marks, genomic position, and DNA regulatory elements" (accepted) [arXiv]

The Methylation 450K array data on 100 individuals from the TwinsUK consortium, described and analyzed in this paper, have also been posted online at Gene Expression Omnibus with ID GSE62992: [Data]

Sparse and dense factor analysis (SFAmix)

This software computes a low-rank matrix factorization with a combination of both sparse and dense factor loadings for a given matrix, as described in

Gao C, Brown CD, and Engelhardt BE. "A latent factor model with a mixture of sparse and dense factors to model gene expression data with confounding effects" (submitted) [arXiv]

Download C++ code, instructions, and documentation for SFAmix 1.0.

Data: publicly available eQTL study data with a uniform processing pipeline

These data sets have been processed through a single pipeline for gene expression and genotype data as described in

Brown CD, Mangravite LM, Engelhardt BE (2013). "Integrative modeling of eQTLs and cis-regulatory elements suggests mechanisms underlying cell type specificity of eQTLs" PLoS Genetics 9(8): e1003649. [PDF]

One change from the pipeline noted above is that we include genotypes imputed using Impute2 software with prephasing. We impute up to the 1000 Genomes reference data from March 2012 and do not filter low-MAF SNPs. Note that the resulting imputed genotype files are in CHIAMO format.

[HapMap 3]

Sparse factor analysis (SFA)

This software uses ECME to compute a sparse, low-rank matrix factorization for a given matrix, as described in

Engelhardt BE, Stephens M (2010) "Analysis of population structure: a unifying framework and novel methods based on sparse factor analysis." PLoS Genetics 6(9):e1001117.

Download C++ code and instructions for SFA 1.0 and further documentation for the SFA model.

SIFTER: Statistical Inference of Function Through Evolutionary Relationships

SIFTER software and instructions reside at the Brenner Lab at UC Berkeley, although I am still actively maintaining the code. This software uses a statistical model to predict protein molecular function for unannotated proteins using functional annotations from a set of homologous proteins, described in:

Engelhardt BE, Jordan MI, Srouji JR, and Brenner SE (2011) "Genome-scale phylogenetic function annotation of large and diverse protein families." Genome Research (in press).