|
Learning Topic Models - Going beyond SVD manuscript |
|
with Sanjeev Arora and
Ankur
Moitra. Submitted. In this work we
revisit the topic modeling problem, which is used for automatic
comprehension and classification of data in a variety of settings. A
number of foundational works both in machine learning and in theory
have suggested a probabilistic model for documents, whereby documents
arise as a convex combination of (i.e. distribution on) a small number
of topic vectors, each topic vector being a distribution on words (i.e.
a vector of word-frequencies).
Theoretical studies of topic
modeling focus on learning the model's parameters assuming the data is
actually generated from it. Existing approaches for the most part rely
on Singular Value Decomposition(SVD), and consequently have one of two
limitations: these works need to either assume that each document
contains only one topic, or else can only recover the span of the topic
vectors instead of the topic vectors themselves.
This paper
formally justifies Nonnegative Matrix Factorization(NMF) as a main tool
in this context, which is an analog of SVD where all vectors are
nonnegative. Using this tool we give the first polynomial-time
algorithm for learning topic models without the above two limitations.
The algorithm uses a fairly mild assumption about the underlying topic
matrix called separability, which is usually found to hold in real-life
data.
We hope that this paper will motivate further theoretical
results that use NMF as a replacement for SVD - just as NMF has come to
replace SVD in many applications.
|
Computing
a Nonnegative Matrix Factorization --- Provably manuscript |
|
with Sanjeev Arora, Ravi Kannan and
Ankur
Moitra. In STOC
2012, to appear.
The nonnegative matrix factorization
problem asks for factorizing an n*m matrix M into AW where A is n*r and
W is r*m, further the entries of the two matrices A and W are required
to be nonnegative. This problem has a extremely broad range of
applications, from machine learning to communication complexity. We
give algorithms for exact and approximate versions of the problem when
r is small (which is the case for most applications) and a hardness
result that shows our results cannot be substancially improved.
|
Finding
Overlapping Communities in Social Networks: Toward a
Rigorous Approach
manuscript |
|
with Sanjeev Arora, Sushant Sachdeva and
Grant Schoenebeck, In EC 2012, to appear.
A community in a social network is usually understood to be a group of
nodes more densely connected with each other than with the rest of the
network. Many existing methods for finding them use heuristic
algorithms where the solution concept is defined in terms of an
NP-complete problem such as CLIQUE or HIERARCHICAL CLUSTERING. We
introduce a more formal approach whereby we make rigorous definitions
that sidestep the pitfall of high computational complexity and allow
efficient algorithms for finding communities.
|
Computational
Complexity and Information Asymmetry in
Financial Products |
|
with Sanjeev Arora, Boaz Barak and Markus
Brunnermeier. In ICS
2010
A short version appeared in Communications
of the ACM,
2011 Issue 5.
Full text PDF
See some blog comments on this paper: Intractability Center, Freedom
to Tinker,
Boing
Boing, Daily
Kos, In
Theory, Healthy
Algorithms, Lipton's
blog
We put forward the issue of Computational
Complexity in the analysis of Financial Derivatives, and show that
there is a fundamental difficulty in pricing financial derivatives even
in very simple settings.
This is still a working paper. The latest version (updated Jan. 2012)
can be found here:
Computational
Complexity and Information Asymmetry in
Financial Products
For some frequently asked questions about the paper, see our FAQ page. |
New
Tools for Graph Coloring PDF |
|
with Sanjeev Arora, In APPROX
2011
We explore the possibility of better coloring algorithm using the new
concept threshold rank. We are able to analyze high level
Lasserre Relaxations for low threshold rank graphs and distance
transitive graphs, and our algorithm makes significantly better
progress
(towards O(log n) coloring) in these cases compared to previous
algorithms (where the number of colors is polynomial). We also purpose
a plausible conjecture, that if true would yield good sub-exponential
time algorithm for graph coloring.
|
New
Algorithms for Learning in Presence of Errors PDF |
|
with Sanjeev Arora, In ICALP 2011
We give new algorithms for a variety of randomly-generated instances of
computational
problems using a linearization technique that reduces to solving a
system of linear equations. Our techniques can be applied to the low
noise case of the well-known LWE
(Learning With Errors) problem, giving a slightly
subexponential time algorithm which suggests known hardness results are
almost tight.
|
New
Results on Simple Stochastic Games PDF |
|
with Decheng Dai. In ISAAC 2009
We give a new algorithm for Simple Stochastic Games when the number of
RANDOM vertices is small. We also show that coarse approximation
is as hard as fine approximation for Simple Stochastic Games.
|