Rong Ge

I received my Ph.D. from the Computer Science Department of Princeton University, where my advisor was Sanjeev Arora. I was a post-doc at Microsoft Research, New England. Starting in August I will be an assistant professor in the Computer Science Department at Duke University.

I am interested in Theoretical Computer Science and Machine Learning.

My Microsoft email is no longer available; please send all future email to rongge AT cs.duke.edu

Research Interest

I'm interested in applying techniques from theoretical computer science to problems in machine learning, with the hope of developing new algorithms with provable guarantees. The meta-problem is the following: practitioners in machine learning routinely apply heuristic algorithms, such as local search or EM-type algorithms, to find solutions to many hard problems. Are these problems really easy or hard? If they are easy, can we design algorithms with provable guarantees? If they are hard, is there a reasonable model under which they become easy, or under which the known heuristic algorithms can be proved to work?

My thesis: Provable Algorithms for Machine Learning Problems


Publications

Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization Arxiv
with Roy Frostig, Sham M. Kakade and Aaron Sidford. In ICML 2015. [Abstract]

We develop a family of accelerated stochastic algorithms that minimize sums of convex functions. Our algorithms improve upon the fastest running time for empirical risk minimization (ERM), and in particular linear least-squares regression, across a wide range of problem settings. To achieve this, we establish a framework based on the classical proximal point algorithm. Namely, we provide several algorithms that reduce the minimization of a strongly convex function to approximate minimizations of regularizations of the function. Using these results, we accelerate recent fast stochastic algorithms in a black-box fashion. Empirically, we demonstrate that the resulting algorithms exhibit notions of stability that are advantageous in practice. Both in theory and in practice, the provided algorithms reap the computational benefits of adding a large strongly convex regularization term, without incurring a corresponding bias to the original problem.
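As a rough illustration of the reduction (not the paper's exact algorithm), the sketch below repeatedly adds a proximal term (lam/2)||x - center||^2 and approximately minimizes the regularized subproblem with a cheap stochastic solver; plain SGD stands in for the accelerated solvers used in the paper, and grad_fi, lam and lr are illustrative placeholders.

    import numpy as np

    def approx_prox_point(grad_fi, n, x0, lam=1.0, outer_iters=20,
                          inner_iters=200, lr=0.05):
        """Hedged sketch: each outer step approximately minimizes
        F(x) + (lam/2)||x - center||^2, which is lam-strongly convex even if
        F is only weakly so; the point of the framework is that fast
        stochastic solvers benefit from this extra regularization."""
        x = x0.copy()
        for _ in range(outer_iters):
            center, y = x.copy(), x.copy()
            for _ in range(inner_iters):
                i = np.random.randint(n)             # sample one component f_i
                g = grad_fi(i, y) + lam * (y - center)
                y = y - lr * g                       # inner SGD step on the subproblem
            x = y                                    # next proximal center
        return x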


Escaping From Saddle Points – Online Stochastic Gradient for Tensor Decomposition Arxiv
with Furong Huang, Chi Jin and Yang Yuan. In COLT 2015. [Abstract]

We analyze stochastic gradient descent for optimizing non-convex functions. For many non-convex functions the goal is to find a reasonable local minimum, and the main concern is that gradient updates get trapped in saddle points. In this paper we identify a strict saddle property for non-convex problems that allows for efficient optimization. Using this property we show that stochastic gradient descent converges to a local minimum in a polynomial number of iterations. To the best of our knowledge this is the first work that gives global convergence guarantees for stochastic gradient descent on non-convex functions with exponentially many local minima and saddle points.

Our analysis can be applied to orthogonal tensor decomposition, which is widely used for learning a rich class of latent variable models. We propose a new optimization formulation of the tensor decomposition problem that has the strict saddle property. As a result we obtain the first online algorithm for orthogonal tensor decomposition with a global convergence guarantee.
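A minimal sketch of the kind of update the analysis considers (assuming access to a stochastic gradient oracle; the constants and the exact noise distribution are illustrative, not those from the paper): adding an explicit isotropic perturbation to each stochastic gradient step is what lets the iterates escape saddle points of strict saddle functions.

    import numpy as np

    def noisy_sgd(stoch_grad, x0, lr=0.01, noise=0.1, iters=10000):
        """Hedged sketch of stochastic gradient descent with added noise."""
        x = x0.copy()
        for _ in range(iters):
            g = stoch_grad(x)                        # unbiased stochastic gradient
            xi = np.random.randn(*x.shape)
            xi *= noise / np.linalg.norm(xi)         # small uniform-direction perturbation
            x = x - lr * (g + xi)
        return x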


Learning Mixtures of Gaussians in High Dimensions Arxiv
with Qingqing Huang and Sham M. Kakade. In STOC 2015. [Abstract]

Efficiently learning mixtures of Gaussians is a fundamental problem in statistics and learning theory. Given samples, each drawn from one of k Gaussian distributions in R^n chosen at random, the learning problem asks to estimate the means and covariance matrices of these Gaussians. This learning problem arises in many areas ranging from the natural sciences to the social sciences, and has also found many machine learning applications.

Unfortunately, learning mixtures of Gaussians is an information-theoretically hard problem: in order to learn the parameters up to a reasonable accuracy, the number of samples required is exponential in the number of Gaussian components in the worst case. In this work, we show that provided we are in high enough dimensions, the class of Gaussian mixtures is learnable in its most general form under a smoothed analysis framework, where the parameters are randomly perturbed from an adversarial starting point. In particular, given samples from a mixture of Gaussians with randomly perturbed parameters, when n > \Omega(k^2), we give an algorithm that learns the parameters with polynomial running time and a polynomial number of samples.

The central algorithmic ideas consist of new ways to decompose the moment tensor of the Gaussian mixture by exploiting its structural properties. The symmetries of this tensor are derived from the combinatorial structure of higher order moments of Gaussian distributions (sometimes referred to as Isserlis' theorem or Wick's theorem). We also develop new tools for bounding the smallest singular values of structured random matrices, which could be useful in other smoothed analysis settings.
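As a quick reminder (a standard fact, not specific to this paper): Isserlis'/Wick's theorem expresses the higher moments of a zero-mean Gaussian x ~ N(0, \Sigma) through its covariance, e.g. for the fourth moment

    E[x_i x_j x_k x_l] = \Sigma_{ij}\Sigma_{kl} + \Sigma_{ik}\Sigma_{jl} + \Sigma_{il}\Sigma_{jk},

and it is exactly this combinatorial structure that gives the moment tensor of a Gaussian mixture the symmetries exploited by the decomposition.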


Simple, Efficient, and Neural Algorithms for Sparse Coding Arxiv
with Sanjeev Arora, Tengyu Ma and Ankur Moitra. In COLT 2015. [Abstract]

Sparse coding is a basic task in many fields including signal processing, neuroscience and machine learning where the goal is to learn a basis that enables a sparse representation of a given set of data, if one exists. Its standard formulation is as a non-convex optimization problem which is solved in practice by heuristics based on alternating minimization. Recent work has resulted in several algorithms for sparse coding with provable guarantees, but somewhat surprisingly these are outperformed by the simple alternating minimization heuristics. Here we give a general framework for understanding alternating minimization which we leverage to analyze existing heuristics and to design new ones also with provable guarantees. Some of these algorithms seem implementable on simple neural architectures, which was the original motivation of Olshausen and Field (1997a) in introducing sparse coding. We also give the first efficient algorithm for sparse coding that works almost up to the information theoretic limit for sparse recovery on incoherent dictionaries. All previous algorithms that approached or surpassed this limit run in time exponential in some natural parameter. Finally, our algorithms improve upon the sample complexity of existing approaches. We believe that our analysis framework will have applications in other settings where simple iterative algorithms are used.
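The sketch below is a minimal version of the alternating-minimization heuristic discussed above, assuming data Y = AX with k-sparse columns of X: decode by keeping the k largest correlations with the current dictionary, then take a gradient-style step on the dictionary. The parameter names and the thresholding rule are illustrative choices, not the paper's exact neural update.

    import numpy as np

    def sparse_coding_altmin(Y, m, k, iters=50, lr=0.1):
        """Hedged sketch of alternating minimization for sparse coding."""
        n, N = Y.shape
        A = np.random.randn(n, m)
        A /= np.linalg.norm(A, axis=0)               # unit-norm dictionary columns
        for _ in range(iters):
            # decoding step: hard-threshold to the k largest correlations per sample
            C = A.T @ Y
            X = np.zeros_like(C)
            idx = np.argsort(-np.abs(C), axis=0)[:k]
            np.put_along_axis(X, idx, np.take_along_axis(C, idx, axis=0), axis=0)
            # dictionary update: gradient step on ||Y - A X||_F^2
            A += lr * (Y - A @ X) @ X.T / N
            A /= np.linalg.norm(A, axis=0)
        return A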

Competing with the Empirical Risk Minimizer in a Single Pass Arxiv
with Roy Frostig, Sham M. Kakade and Aaron Sidford. In COLT 2015. [Abstract]

Many optimization problems that arise in science and engineering are those in which we only have a stochastic approximation to the underlying objective (e.g. estimation problems such as linear regression). That is, given some distribution D over functions f(x), we wish to minimize P(x)=E[f(x)], using as few samples from D as possible. In the absence of computational constraints, the empirical risk minimizer (ERM) -- the minimizer on a sample average of observed data -- is widely regarded as the estimation strategy of choice due to its desirable statistical convergence properties. Our goal is to do as well as the empirical risk minimizer on every problem while minimizing the use of computational resources such as running time and space usage.

This work provides a simple streaming algorithm for solving this problem, with performance guarantees competitive with that of the ERM. In particular, under standard regularity assumptions on D, our algorithm enjoys the following properties:

  • The algorithm can be implemented in linear time with a single pass of the observed data, using space linear in the size of a single sample.
  • The algorithm achieves the same statistical rate of convergence as the empirical risk minimizer on every problem D (even considering constant factors).
  • The algorithm's performance depends on the initial error at a rate that decreases super-polynomially.

Moreover, we quantify the rate of convergence of both our algorithm and that of the ERM, showing that, after a number of samples that depends only on the regularity parameters, the streaming algorithm rapidly approaches the leading-order rate of the ERM in expectation.
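To make the single-pass regime concrete, here is a hedged sketch of a one-pass, constant-memory estimator: one stochastic gradient step per observed sample plus iterate averaging. The paper's actual streaming algorithm is more refined (and is what achieves the ERM-matching rates); this only illustrates the resource constraints described above.

    import numpy as np

    def single_pass_averaged_sgd(stream, grad, x0, lr=0.05):
        """Hedged sketch: one pass over the data, space linear in one sample."""
        x = x0.copy()
        x_bar = np.zeros_like(x0)
        t = 0
        for sample in stream:                        # single pass over the data
            x = x - lr * grad(x, sample)
            t += 1
            x_bar += (x - x_bar) / t                 # running average of iterates
        return x_bar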


Understanding Overcomplete 3rd Order Tensors
with Anima Anandkumar and Majid Janzamin. Short version in COLT 2015. [Details]

This is a series of ongoing works in which we try to understand how to decompose overcomplete third-order tensors. In general, decomposing third-order tensors can be useful for learning many latent variable models (multi-view models, mixtures of Gaussians, etc.). Previous algorithms can usually either handle only the undercomplete regime, where the number of components is smaller than the number of dimensions, or require access to higher-order tensors, which often results in higher sample complexity and running time. Surprisingly, many heuristics (such as the Tensor Power Method or Alternating Least Squares) work very well in practice. In these works we try to understand when decomposing third-order tensors is tractable. Several parts are available on arXiv.

The first part [Arxiv] analyzes the Tensor Power Method. In this paper we show that given a good initialization (one that is constant-close to some component), the Tensor Power Method converges to a vector that is very close to the true component. We also give an algorithm that converges to the exact true component given the (already reasonably close) results of the Tensor Power Method. Good initializations can be found if some supervised information is available, or by a spectral method if the number of components k is only a constant factor larger than the dimension d.

The second part [Arxiv] tries to understand how many samples we need in order to estimate the moment tensor accurately enough for several models. We show that, surprisingly, in many natural cases the sample complexity is nearly linear in the number of components. In general the bounds in this paper are tighter than those in many previous works.

The third part (more independent from the first two) [Arxiv] tries to understand how well the Tensor Power Method performs without a really good initialization. We use techniques from Approximate Message Passing to analyze the dynamics of the tensor power method, and show that the initialization does not need to be very close to the true component for it to perform well. This partly explains why the Tensor Power Method still works in the mildly overcomplete setting.
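For readers unfamiliar with the basic iteration analyzed in these papers, here is a minimal sketch of the plain tensor power method for a symmetric third-order tensor (deflation, the exact-recovery post-processing, and the initialization procedures discussed above are all omitted; function names are mine):

    import numpy as np

    def tensor_power_step(T, u):
        """One power update u -> T(I, u, u) / ||T(I, u, u)|| for a symmetric
        third-order tensor T of shape (d, d, d)."""
        v = np.einsum('ijk,j,k->i', T, u, u)
        return v / np.linalg.norm(v)

    def tensor_power_method(T, u0, iters=100):
        """Hedged sketch: with a good enough initialization u0, the iterates
        approach a true component."""
        u = u0 / np.linalg.norm(u0)
        for _ in range(iters):
            u = tensor_power_step(T, u)
        return u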

See also Majid Janzamin's page about this project for more information.


Provable Bounds for Learning Some Deep Representations Arxiv
with Sanjeev Arora, Aditya Bhaskara and Tengyu Ma. In ICML 2014. [Abstract]

We give algorithms with provable guarantees that learn a class of deep nets in the generative model view popularized by Hinton and others. Our generative model is an n node multilayer neural net that has degree at most n^{\gamma} for some \gamma < 1 and each edge has a random edge weight in [-1,1]. Our algorithm learns almost all networks in this class with polynomial running time. The sample complexity is quadratic or cubic depending upon the details of the model. The algorithm uses layerwise learning. It is based upon a novel idea of observing correlations among features and using these to infer the underlying edge structure via a global graph recovery procedure. The analysis of the algorithm reveals interesting structure of neural networks with random edge weights.

New Algorithms for Learning Incoherent and Overcomplete Dictionaries Arxiv
with Sanjeev Arora and Ankur Moitra. In COLT 2014. [Abstract]

An m*n matrix A is said to be mu-incoherent if each pair of columns has inner product at most mu/\sqrt{n}. Starting with the pioneering work of Donoho and Huo, such matrices (often called dictionaries) have played a central role in signal processing, statistics and machine learning. They allow sparse recovery: there are efficient algorithms for representing a given vector as a sparse linear combination of the columns of A (if such a combination exists). However, in many applications ranging from sparse coding in machine learning to image denoising, the dictionary is unknown and has to be learned from random examples of the form Y=AX where X is drawn from an appropriate distribution --- this is the dictionary learning problem. Existing proposed solutions such as the Method of Optimal Directions (MOD) or K-SVD do not provide any guarantees on their performance, nor do they necessarily learn a dictionary for which one can solve sparse recovery problems. The only exception is the recent work of Spielman, Wang and Wright, which gives a polynomial time algorithm for dictionary learning when A has full column rank (in particular m is at most n). However, in most settings of interest, dictionaries need to be overcomplete (i.e., m is larger than n).

Here we give the first polynomial time algorithm for dictionary learning when A is overcomplete. It succeeds under natural conditions on how X is generated, provided that X has at most

k = C \min( \sqrt{n}/(\mu \log n), m^{1/2-\epsilon} )

non-zero entries (for any \epsilon > 0). In other words it can handle almost as many non-zeros as the best sparse recovery algorithms could tolerate even if one knew the dictionary A exactly.


Towards a better approximation for sparsest cut? Arxiv
with Sanjeev Arora and Ali Kemal Sinop. In FOCS 2013. [Abstract]

We give a new (1+eps)-approximation for the sparsest cut problem on graphs where small sets expand significantly more than the sparsest cut (sets of size n/r expand by a factor \sqrt{\log n \log r} more, for some small r; this condition holds for many natural graph families). We give two different algorithms. One involves Guruswami-Sinop rounding on the level-r Lasserre relaxation. The other is combinatorial and involves a new notion called Small Set Expander Flows (inspired by the expander flows of ARV), which we show exist in the input graph. Both algorithms run in time 2^{O(r)} poly(n). We also show similar approximation algorithms for graphs of genus g with an analogous local expansion condition. This is the first algorithm we know of that achieves a (1+eps)-approximation on such a general family of graphs.

A Tensor Approach to Learning Mixed Membership Community Models Arxiv
with Anima Anandkumar, Daniel Hsu and Sham M. Kakade. In COLT 2013. Also JMLR Vol 15. [Abstract]

Modeling community formation and detecting hidden communities in networks is a well studied problem. However, theoretical analysis of community detection has been mostly limited to models with non-overlapping communities such as the stochastic block model. In this paper, we remove this restriction, and consider a family of probabilistic network models with overlapping communities, termed the mixed membership Dirichlet model, first introduced by Airoldi et al. This model allows nodes to have fractional memberships in multiple communities and assumes that the community memberships are drawn from a Dirichlet distribution. We propose a unified approach to learning these models via a tensor spectral decomposition method. Our estimator is based on a low-order moment tensor of the observed network, consisting of 3-star counts. Our learning method is fast and is based on simple linear algebra operations, e.g. singular value decomposition and tensor power iterations. We provide guaranteed recovery of community memberships and model parameters and present a careful finite sample analysis of our learning method. Additionally, our results match the best known scaling requirements in the special case of the stochastic block model.

A Practical Algorithm for Topic Modeling with Provable Guarantees Arxiv
with Sanjeev Arora, Yoni Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu and Michael Zhu.
In ICML 2013. [Abstract]

Topic models provide a useful method for dimensionality reduction and exploratory data analysis in large text corpora. Most approaches to topic model inference have been based on a maximum likelihood objective. Efficient algorithms exist that approximate this objective, but they have no provable guarantees. Recently, algorithms have been introduced that provide provable bounds, but these algorithms are not practical because they are inefficient and not robust to violations of model assumptions. In this paper we present an algorithm for topic model inference that is both provable and practical. The algorithm produces results comparable to the best MCMC implementations while running orders of magnitude faster.

Tensor decompositions for learning latent variable models Arxiv
with Anima Anandkumar, Daniel Hsu, Sham M. Kakade, Matus Telgarsky. In JMLR Vol 15. [Abstract]

This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models, including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation, which exploits a certain tensor structure in their low-order observable moments (typically, of second- and third-order). Specifically, parameter estimation is reduced to the problem of extracting a certain (orthogonal) decomposition of a symmetric tensor derived from the moments; this decomposition can be viewed as a natural generalization of the singular value decomposition for matrices. Although tensor decompositions are generally intractable to compute, the decomposition of these specially structured tensors can be efficiently obtained by a variety of approaches, including power iterations and maximization approaches (similar to the case of matrices). A detailed analysis of a robust tensor power method is provided, establishing an analogue of Wedin's perturbation theorem for the singular vectors of matrices. This implies a robust and computationally tractable estimation approach for several popular latent variable models.
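For concreteness (a sketch of the standard setup, not a restatement of the paper's robustness analysis): after whitening, the moment tensor has an orthogonal decomposition T = \sum_i \lambda_i v_i \otimes v_i \otimes v_i with orthonormal v_i, and the power method repeatedly maps u to T(I, u, u)/||T(I, u, u)||, where T(I, u, u)_j = \sum_{l,m} T_{jlm} u_l u_m. The components v_i are the stable fixed points of this update, and the robust analysis quantifies how well this continues to hold when T is only known up to a perturbation.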

Provable ICA with Unknown Gaussian Noise, and Implications for Gaussian Mixtures and Autoencoders Arxiv
with Sanjeev Arora, Ankur Moitra and Sushant Sachdeva. In NIPS 2012. [Abstract]

We present a new algorithm for Independent Component Analysis (ICA) which has provable performance guarantees. In particular, suppose we are given samples of the form y = Ax + \eta where A is an unknown n*n matrix, x is a random variable whose components are independent and have a fourth moment strictly less than that of a standard Gaussian random variable, and \eta is an n-dimensional Gaussian random variable with unknown covariance \Sigma. We give an algorithm that provably recovers A and \Sigma up to an additive \epsilon, and whose running time and sample complexity are polynomial in n and 1/\epsilon. To accomplish this, we introduce a novel "quasi-whitening" step that may be useful in other contexts in which the covariance of the Gaussian noise is not known in advance. We also give a general framework for finding all local optima of a function (given an oracle for approximately finding just one); this is a crucial step in our algorithm, one that has been overlooked in previous attempts, and it allows us to control the accumulation of error when we find the columns of A one by one via local search.

Learning Topic Models - Going beyond SVD Arxiv
with Sanjeev Arora and Ankur Moitra. In FOCS 2012. [Abstract]

In this work we revisit the topic modeling problem, which is used for automatic comprehension and classification of data in a variety of settings. A number of foundational works both in machine learning and in theory have suggested a probabilistic model for documents, whereby documents arise as a convex combination of (i.e. distribution on) a small number of topic vectors, each topic vector being a distribution on words (i.e. a vector of word-frequencies).

Theoretical studies of topic modeling focus on learning the model's parameters assuming the data is actually generated from it. Existing approaches for the most part rely on Singular Value Decomposition (SVD), and consequently have one of two limitations: these works need to either assume that each document contains only one topic, or else can only recover the span of the topic vectors instead of the topic vectors themselves.

This paper formally justifies Nonnegative Matrix Factorization (NMF), an analog of SVD in which all vectors are required to be nonnegative, as a main tool in this context. Using this tool we give the first polynomial-time algorithm for learning topic models without either of the above two limitations. The algorithm uses a fairly mild assumption about the underlying topic matrix called separability, which is usually found to hold in real-life data.

We hope that this paper will motivate further theoretical results that use NMF as a replacement for SVD - just as NMF has come to replace SVD in many applications.

Computing a Nonnegative Matrix Factorization --- Provably Arxiv
with Sanjeev Arora, Ravi Kannan and Ankur Moitra. In STOC 2012. [Abstract]

The nonnegative matrix factorization problem asks for factorizing an n*m matrix M into AW, where A is n*r and W is r*m, and the entries of both A and W are required to be nonnegative. This problem has an extremely broad range of applications, from machine learning to communication complexity. We give algorithms for exact and approximate versions of the problem when r is small (which is the case for most applications), and a hardness result that shows our results cannot be substantially improved.

Finding Overlapping Communities in Social Networks: Toward a Rigorous Approach Arxiv
with Sanjeev Arora, Sushant Sachdeva and Grant Schoenebeck. In EC 2012. [Abstract]

A community in a social network is usually understood to be a group of nodes more densely connected with each other than with the rest of the network. Many existing methods for finding them use heuristic algorithms where the solution concept is defined in terms of an NP-complete problem such as CLIQUE or HIERARCHICAL CLUSTERING. We introduce a more formal approach whereby we make rigorous definitions that sidestep the pitfall of high computational complexity and allow efficient algorithms for finding communities.

Computational Complexity and Information Asymmetry in Financial Products PDF
with Sanjeev Arora, Boaz Barak and Markus Brunnermeier. In ICS 2010. [More Info]

For some frequently asked questions about the paper, see our FAQ page.

A short version appeared in Communications of the ACM, 2011 Issue 5.
Full text PDF

See some blog comments on this paper: Intractability Center, Freedom to Tinker, Boing Boing, Daily Kos, In Theory, Healthy Algorithms, Lipton's blog

We put forward the issue of Computational Complexity in the analysis of Financial Derivatives, and show that there is a fundamental difficulty in pricing financial derivatives even in very simple settings.

This is still a working paper. The latest version (updated Jan. 2012) can be found here: Computational Complexity and Information Asymmetry in Financial Products.

New Tools for Graph Coloring PDF
with Sanjeev Arora. In APPROX 2011. [Abstract]

We explore the possibility of better coloring algorithms using the new concept of threshold rank. We are able to analyze higher levels of the Lasserre relaxation for low threshold-rank graphs and distance transitive graphs, and our algorithm makes significantly better progress (towards an O(log n)-coloring) in these cases compared to previous algorithms (where the number of colors is polynomial). We also propose a plausible conjecture that, if true, would yield a good sub-exponential time algorithm for graph coloring.

New Algorithms for Learning in Presence of Errors PDF
with Sanjeev Arora. In ICALP 2011. [Abstract]

We give new algorithms for a variety of randomly-generated instances of computational problems, using a linearization technique that reduces them to solving a system of linear equations. Our techniques can be applied to the low-noise case of the well-known LWE (Learning With Errors) problem, giving a slightly subexponential time algorithm which suggests that known hardness results are almost tight.

New Results on Simple Stochastic Games PDF
with Decheng Dai. In ISAAC 2009. [Abstract]

We give a new algorithm for Simple Stochastic Games when the number of RANDOM vertices is small. We also show that coarse approximation is as hard as fine approximation for Simple Stochastic Games.

Contact
Email: rongge AT cs DOT duke DOT edu