Scalable inference of discrete data: user behavior, networks and genetic variation | Computer Science Department at Princeton University

Report ID:

TR-976-15

Authors:

Gopalan, Prem

Date:

January 9, 2015

Pages:

157

Download Formats:

[PDF]

Abstract:

Recent years have seen explosive growth in data, models and
computation. Massive data sets and sophisticated probabilistic models
are increasingly used in the fields of high-energy physics, biology,
genetics and in personalization applications; however, many
statistical algorithms remain inefficient, impeding scientific
progress.

In this thesis, we present several efficient statistical algorithms
for learning from massive discrete data sets. We focus on discrete
data because complex and structured activity such as chromosome
folding in three dimensions, variations in human genotype, social
network interactions and product ratings are often encoded as simple
matrices of discrete numerical observations. Our algorithms derive
from a Bayesian perspective and lie in the framework of directed
graphical models and mean-field variational inference. Situated in
this framework, we gain computational and statistical efficiency
through modeling insights and through subsampling informative data
during inference.

We begin with additive Poisson factorization models for recommending
items to users based on user consumption or ratings. These models
provide sparse latent representations of users and items, and capture
the long-tailed distributions of user consumption. We use them as
building blocks for article recommendation models by sharing latent
spaces across readership and article text. We demonstrate that our
algorithms scale to massive data sets, are easy to implement and
provide competitive user recommendations. Then, we develop a
Bayesian nonparametric model in which the latent representations of
users and items grow to accommodate new data.

In the second part of the thesis, we develop novel algorithms for
discovering overlapping communities in large networks. These
algorithms interleave non-uniform subsampling of the network with
model estimation. Our network models capture the basic ways in which
nodes connect to each other, through similarity and popularity, using
mixed-memberships representations and generalized linear model
formulation.

Finally, we present the TeraStructure algorithm to fit Bayesian models
of genetic variation in human populations on tera-sample-sized data
sets ($10^{12}$ observed genotypes, e.g, 1M individuals at 1M SNPs).
On real genomic data collected from thousands of individuals,
TeraStructure is faster than existing methods and recovers the latent
population structure with equal accuracy. On genomic data simulated at
the tera-sample-size scales, TeraStructure is highly accurate and is
the only method that can complete its analysis.