Princeton University
Computer Science Dept.

Spring 2012: COS598C
Advanced Topics in CS: Is Learning Easy?
Sanjeev Arora




                                     







Course Summary

  Machine learning is a flourishing field with a plethora of algorithmic techniques. In almost all cases, we have no provable guarantees on the performance of these algorithms/heuristics --- neither on their running time nor on the quality of the solutions they return. In fact, in many settings the underlying algorithmic problems are provably intractable (e.g., NP-hard or worse)! This course is an attempt to bring more rigor to machine learning. Given the known intractability results, theoretical progress and theorems will probably come from a variety of new tacks: subtle reformulations of the problem, average-case approaches, approaches in between worst case and average case, etc. We will study some early results of this kind that have been proven, and identify other open problems. The last few classes will be devoted to formulating open problems and thinking about them.

Of course, machine learning has substantial overlap with mathematical fields such as statistics and linear algebra, and we will bring the same viewpoint to problems in those fields.

Prerequisites: Prior exposure to algorithms theory and AI/learning, e.g., at the level of COS 423 and COS 402. Fluency in linear algebra (matrix determinants, inverses, eigenvalues/eigenvectors, norms) and probability.

Interested undergrads should contact the instructor. Undergrads and grads will be graded separately. Auditors are welcome with prior permission.

Course Information


Professor: Sanjeev Arora - 307 CS Building - 258-3869 - arora@cs.princeton.edu. Office hrs: Mon 3-4pm, or by appointment. (Or try dropping by any afternoon.)

Meets: Fridays 1:30pm-4:30pm (Room 402, CS Bldg).

Important: Due to monthly meetings at the Intractability Center, there will be no class on Mar 9 and Apr 6. To compensate, we will hold one class during reading period, and may also go a little beyond 4:30pm on Fridays when class is held.


Grading: Based upon occasional assignments, class participation and presentations, and a final project.  By late March, students will have split into research teams of 2-3 and will work on a final project, to be presented in class during reading period.


Lecture Notes and Readings


Each entry below lists the lecture topics, then the main readings, and finally additional readings.
Lecture 1: The basic problem in machine learning.
Intro to themes and methodology of the rest of the course.
Max Likelihood, Maximum Entropy Principle, and their duality. Philosophical musings. Bayesian Learning. Examples.  The simplest cases are NP-hard! What does it mean to have provable guarantees?

Survey on max entropy by K. Conrad. (Almost everything you need to know, except the duality with MLE, which is discussed in this brief note by J. Coughlan.)
Johnson-Lindenstrauss Dimension Reduction (Old Princeton lecture notes; read Section 4).
Lecture notes on ME and MLE by S. Mahadevan.

An article that describes the philosophical difficulties of Max-Entropy principle by J. Uffink.

Andrew Moore's slides on learning Gaussian Mixtures and EM (just to get a taste of how such problems are currently solved)
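The ME/MLE duality above can be summarized in one display. This is a standard sketch in my own notation (not taken from the readings), for the exponential-family case with feature functions f_j and empirical moments \hat{\mu}_j:

    % Maximum entropy subject to moment constraints yields a Gibbs/log-linear
    % (exponential-family) distribution, and fitting its parameters to match
    % the empirical moments is exactly maximum-likelihood estimation in that family.
    \[
      \max_{p}\; H(p) \quad \text{s.t.} \quad \mathbb{E}_p[f_j(x)] = \hat{\mu}_j
      \;\Longleftrightarrow\;
      \max_{\lambda}\; \sum_i \log p_\lambda(x_i),
      \qquad
      p_\lambda(x) = \frac{\exp\!\big(\sum_j \lambda_j f_j(x)\big)}{Z(\lambda)} .
    \]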

Lec 2: How "Clustering" problems arise in learning.
A Swiss-army knife for provably solving various "clustering" and "learning" problems.

Singular Value Decomposition and its properties. Read this extract from a manuscript of a forthcoming book by Hopcroft and Kannan.
Its use as a Swiss-army knife for clustering is introduced in this extract from the same book. (The version done in today's lecture.)
The recent paper of Kannan and Kumar that describes the more general Swiss Army Knife.
An older paper of McSherry that shows how to learn "planted" patterns in matrices (which the above paper solves with the Swiss Knife).
Lecture notes (by D. Spielman) on eigenvalues of random matrices and an intro to McSherry's results.
Paper on algorithms for k-means clustering by Jaiswal, Kumar, Sen. (One of many papers out there.)
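To make the low-rank approximation concrete (and as simple tooling for homework 2(a) below), here is a minimal numpy sketch of the rank-k truncation; the synthetic matrix is a placeholder for an actual image.

    # Rank-k approximation via SVD: keep the top k singular values/vectors.
    import numpy as np

    def rank_k_approx(A, k):
        """Best rank-k approximation of A in Frobenius/spectral norm."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    if __name__ == "__main__":
        # Toy stand-in for an image: a noisy low-rank matrix.
        rng = np.random.default_rng(0)
        A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 300))
        A += 0.01 * rng.standard_normal(A.shape)
        for k in (1, 3, 5, 10):
            err = np.linalg.norm(A - rank_k_approx(A, k)) / np.linalg.norm(A)
            print(f"k={k:2d}  relative Frobenius error = {err:.4f}")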
Lec 3: Part 1: Ravi Kannan's guest lecture on his Swiss-Army Knife Algorithm (SVD followed by Lloyd's k-means algorithm). Part 2: Algorithms for nonnegative matrix factorization (NMF). Techniques introduced: matrix pseudoinverses,  limited exponential search, epsilon nets.
Part 1: recent paper of Kannan and Kumar.
Part 2: Paper of Arora, Ge, Kannan, Moitra.

Lec 4: Part 1: Solving NMF under the separability condition. Applications to Blind Source Separation, learning LDAs (latent Dirichlet allocation), and other related matrix models.   Part 2: Intro to graphical models. Markov random fields. The two representations and their equivalence via the Hammersley-Clifford Theorem. Examples of uses in vision, statistical physics, neural nets, etc.

Part 1: Manuscript by Arora, Ge, Moitra. (Handed out in class; available by email request.)
Part 2: Chapter on Gibbs states from the online book "Probability on Graphs" by G. Grimmett.
Paper by Geman and Geman introducing MRF's for computer vision.
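As a small, self-contained illustration of the Markov random field view from Part 2 (a sketch only; the grid size, coupling strength, and periodic boundary are arbitrary choices of mine, not from the readings):

    # Gibbs sampling for a 2D Ising-model MRF: each spin is resampled from its
    # conditional given its four grid neighbors. That the conditional depends
    # only on the neighbors is the Markov property which Hammersley-Clifford
    # connects to the product-of-clique-potentials (Gibbs) form.
    import numpy as np

    def gibbs_ising(n=32, beta=0.6, sweeps=100, seed=0):
        rng = np.random.default_rng(seed)
        spins = rng.choice([-1, 1], size=(n, n))
        for _ in range(sweeps):
            for i in range(n):
                for j in range(n):
                    nbr = (spins[(i - 1) % n, j] + spins[(i + 1) % n, j] +
                           spins[i, (j - 1) % n] + spins[i, (j + 1) % n])
                    p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * nbr))
                    spins[i, j] = 1 if rng.random() < p_plus else -1
        return spins

    if __name__ == "__main__":
        print("average magnetization:", gibbs_ising().mean())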
Mar 9: No class due to CCI group meeting.


Mar 16: Provably learning mixtures of Gaussians. (Guest lecture by Ankur.)
Paper by Moitra and Valiant.
Shorter CACM paper (introduces the main ideas)
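For contrast with the provable, method-of-moments approach of Moitra and Valiant, here is a minimal sketch of the EM heuristic mentioned in the Lecture 1 readings, specialized to a 1-D mixture of two unit-variance Gaussians (the initialization and all simplifications are mine):

    # EM heuristic for a two-component 1-D Gaussian mixture with unit variances.
    # This is the standard heuristic, NOT the Moitra-Valiant algorithm, which
    # matches low-order moments and comes with provable guarantees.
    import numpy as np

    def em_two_gaussians(x, iters=100):
        mu = np.array([x.min(), x.max()], dtype=float)   # crude initialization
        w = np.array([0.5, 0.5])
        for _ in range(iters):
            # E-step: responsibilities (normalizing constants cancel since
            # both components have the same variance).
            dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2)
            resp = dens / dens.sum(axis=1, keepdims=True)
            # M-step: reweighted mixing weights and means.
            w = resp.mean(axis=0)
            mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
        return w, mu

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])
        print(em_two_gaussians(x))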
Mar 30: Graphical models. The inference problem. Variable elimination, junction tree theorem, message passing.
Examples: error-correcting codes, Discrete Boltzmann Machines (DBMs).
Brief lecture notes on junction trees by Mark Paskin
Survey article on graphical models by Michael Jordan.
Article on inference by Jordan and Weiss.
How sensitive are graphical models to imprecision in the model specification? An early analysis of Pathfinder
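A minimal sketch of variable elimination on a chain of binary variables (the pairwise potentials are arbitrary placeholders); the same message-passing idea underlies the junction tree and sum-product algorithms in the readings.

    # Variable elimination on a chain x1 - x2 - ... - xn of binary variables:
    # p(x) is proportional to the product of pairwise potentials psi_i(x_i, x_{i+1}).
    # Eliminating variables left to right computes the partition function in
    # time linear in n, instead of summing over all 2^n configurations.
    import numpy as np

    def chain_partition_function(psis):
        """psis: list of 2x2 nonnegative arrays, one per edge of the chain."""
        msg = np.ones(2)                      # message carrying the eliminated prefix
        for psi in psis:
            msg = msg @ psi                   # sum out the current variable
        return msg.sum()

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        psis = [rng.random((2, 2)) for _ in range(10)]
        Z_fast = chain_partition_function(psis)
        # Brute-force check over all 2^11 assignments of the 11 chain variables.
        n = len(psis) + 1
        Z_slow = 0.0
        for bits in range(2 ** n):
            x = [(bits >> i) & 1 for i in range(n)]
            val = 1.0
            for i, psi in enumerate(psis):
                val *= psi[x[i], x[i + 1]]
            Z_slow += val
        print(Z_fast, Z_slow)   # should agree up to floating-point error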
Apr 6: No class due to CCI group meeting.


April 13: (a) Who's afraid of hardness results (in PAC learning)? (b) Sushant gives an introduction to variational methods in ML. (c) Brief student presentations about planned course projects.
Tutorial by T. Jaakkola on variational methods.
Class handout on PAC and possible modifications.

April 20: (a) Nonlinear optimization gone wild: autoencoding, dimension reduction, MDS (multidimensional scaling), sparse coding, RBMs, deep learning, etc.  A plethora of open problems. (b) The issue of noise stability. (c) Batting around ideas on student projects.
Freund and Haussler's paper on training DBMs (v. cleanly written).
The autoencoding view of deep learning is explained in this paper by Ranzato et al.
On the irrelevance of Kolmogorov's theorem (A note on the difficulty of encoding arbitrary multivariate distributions by a fixed basis of well-behaved functions) (Girosi & Poggio)

Many many papers and slides on Dasgupta's course homepage (link below).
April 27: (a) Problems in natural language processing (NLP). Bigram, trigram, and loglinear models. Perplexity, a measure of a model's goodness. Probabilistic Context-Free Grammars (PCFGs). (b) Speculation about the ubiquity of loglinear notions in ML. (c) Ankur talks about various Klivans-O'Donnell-Servedio papers on learning intersections of halfspaces and possible relevance to things we've been discussing, including loglinear-type models. (d) Continued discussion of student projects.
Tutorial notes by Michael Collins: Introduction to NLP, Loglinear Models, and PCFGs.
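A minimal sketch of a bigram model and the perplexity measure mentioned above (the add-one smoothing and toy corpus are my simplifications, not taken from Collins's notes):

    # Bigram language model with add-one smoothing, and perplexity:
    # perplexity = exp(-average log-probability per predicted token);
    # lower perplexity means the model assigns higher probability to the data.
    import math
    from collections import Counter

    def train_bigram(sentences):
        unigrams, bigrams = Counter(), Counter()
        for s in sentences:
            toks = ["<s>"] + s.split() + ["</s>"]
            unigrams.update(toks[:-1])
            bigrams.update(zip(toks[:-1], toks[1:]))
        vocab = {w for s in sentences for w in s.split()} | {"</s>"}
        return unigrams, bigrams, len(vocab)

    def perplexity(sentences, unigrams, bigrams, V):
        log_prob, count = 0.0, 0
        for s in sentences:
            toks = ["<s>"] + s.split() + ["</s>"]
            for prev, cur in zip(toks[:-1], toks[1:]):
                p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)  # add-one smoothing
                log_prob += math.log(p)
                count += 1
        return math.exp(-log_prob / count)

    if __name__ == "__main__":
        train = ["the dog runs", "the cat runs", "a dog sleeps"]
        test = ["the dog sleeps"]
        uni, bi, V = train_bigram(train)
        print("test perplexity:", perplexity(test, uni, bi, V))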





Homework

  1. Feb 9. (a) Do a literature search for maximum entropy/max likelihood and read an article on these ideas applied to an interesting topic (statistical mechanics, philosophy, computer vision, ...). Write a 1-page report and be prepared to discuss your findings in the next class. (b) Compute the entropy of the n-dimensional Gaussian distribution with variance-covariance matrix Q.
  2. Feb 17. (a) Take an image (viewed as a matrix of data) and compute its rank-k approximation for various k using matlab or any other tool. How large a k do you need to get a fairly accurate representation of the image? Include printouts with your answer. Try to explain at a plausible level (using today's lecture) why this method even works. (b) Suppose we try to learn mixtures of spherical Gaussians of equal radii. The datapoints are generated using the Gaussians, but are perturbed arbitrarily by some amount \epsilon. Sketch conditions under which you can (sort of) learn the Gaussians. Try to be succinct and precise; work out the solution separately and then write it up. (c) Look up the bounds on eigenvalues of random matrices and finish the analysis of the planted graph partitioning problem described in lecture. For what values of p and \epsilon does it find the planted bisection, and how many points are misclassified?
  3. Feb 24: (a) Come up with a couple of other approaches to formulating the Netflix problem. You don't have to solve the problem; just suggest a formulation and speculate about algorithmic approaches. (b) Try to run the Kannan-Kumar algorithm on the image matrix from the previous homework. What do the clusters correspond to? (You can use any reasonable k-means starting point, e.g., greedy, or some other code you find on the web; a minimal SVD + k-means sketch appears after this list.) Try on a couple of other images.
  4. Mar 30: (a) Look at the paper on metric labeling by Kleinberg and Tardos and formulate another algorithmic question that tries to capture image segmentation. (b) Identify a problem in ML that you would like to work on for the rest of the term and write in a page or two what you plan to do with it. A final 10-page paper on your work will be due at the end of the term.
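The SVD + Lloyd's k-means skeleton referenced in homework 3(b), in numpy. This is only the basic pipeline from Lectures 2-3, not the full Kannan-Kumar algorithm (which also specifies the initialization and carries the provable guarantees); the synthetic matrix is a stand-in for your image.

    # Skeleton of the SVD + Lloyd's k-means pipeline: project the rows onto the
    # span of the top-k right singular vectors, then run Lloyd's iterations in
    # the projected space.
    import numpy as np

    def svd_project(A, k):
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        return A @ Vt[:k].T                      # rows of A in the top-k subspace

    def lloyd_kmeans(X, k, iters=50, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
            for j in range(k):
                if np.any(labels == j):
                    centers[j] = X[labels == j].mean(axis=0)
        return labels

    if __name__ == "__main__":
        # Synthetic stand-in for the image matrix: three noisy row clusters.
        rng = np.random.default_rng(1)
        means = rng.standard_normal((3, 50)) * 5
        A = np.vstack([m + rng.standard_normal((40, 50)) for m in means])
        labels = lloyd_kmeans(svd_project(A, k=3), k=3)
        print(np.bincount(labels))               # roughly three groups of 40 rows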

Interesting links

  1. Lec 1: D. Aldous's lecture notes on how to view random graph distributions via Max Entropy.
  2. Stuart Russell's ugrad course; introduces basic notions in ML.
  3. Blei's lecture notes on probabilistic inference. Unfortunately, no listing by topic. Lectures 3-7 concern inference on graphical models, junction trees etc.
  4. Andrea Montanari's course website. (Lots more theorems than is usual in ML.)
  5. Sanjoy Dasgupta's course website. (Esp. relevant to April 20 lecture.)
  6. The comprehensive survey of graphical models and their inference algorithms by Jordan and Wainwright. (Essentially a book.)



Open Problem list

  1. Prove NP-hardness of MLE learning for a mixture of k spherical Gaussians (i.e., fix the proof from class, which is slightly incomplete).
  2. Alternatives to Max Entropy/MLE?
  3. A version of clustering a la McSherry when the blocks are allowed to overlap.
  4. Try to use the Swiss Army Knife to prove graph inequalities such as Cheeger's inequality.
  5. Fast algorithms for O(1)-approximation to k-means (to make the Swiss Army Knife practical). For starters, assume k = O(1).
  6. Extend the NP-hardness result for exact NMF in the AGKM paper to approximate NMF.
  7. Come up with a good model for a "random image." We discussed some ideas in class.
  8. Make the Moitra-Valiant type algorithm for learning gaussians resistant to noise. (The Kannan-Kumar Swiss Army knife does this for simple distance-based clustering.)
  9. Show that variational methods of some kind provably work for separable nonnegative factorization or topic learning. (i.e., reprove the results of Arora et al. using a specific algorithm based upon variational methods)
  10. Take any of the autoencoding methods from the lecture of April 20. Prove a statement analogous to what has been done for Gaussian mixtures and topic models: if the data is actually generated by the generative model in question, can we recover the parameters of the model?



Honor Code for this class

Collaborating with your classmates on assignments is encouraged (and may be essential to get the most out of the course). You must, however, list your collaborators for each problem.