%
% This is the LaTeX template file for lecture notes for CS294-8,
% Computational Biology for Computer Scientists. When preparing
% LaTeX notes for this class, please use this template.
%
% To familiarize yourself with this template, the body contains
% some examples of its use. Look them over. Then you can
% run LaTeX on this file. After you have LaTeXed this file then
% you can look over the result either by printing it out with
% dvips or using xdvi.
%
% This template is based on the template for Prof. Sinclair's CS 270.
\documentclass[twoside]{article}
\usepackage{amsmath, amssymb, amsthm, amsfonts}
\usepackage{fullpage}
\usepackage{times}
\usepackage{scribe_style}
\usepackage{graphics}
\setlength{\oddsidemargin}{0.25 in}
\setlength{\evensidemargin}{-0.25 in}
\setlength{\topmargin}{-0.6 in}
\setlength{\textwidth}{6.5 in}
\setlength{\textheight}{8.5 in}
\setlength{\headsep}{0.75 in}
\setlength{\parindent}{0 in}
\setlength{\parskip}{0.1 in}
\usepackage{mathrsfs}
%
% The following commands set up the lecnum (lecture number)
% counter and make various numbering schemes work relative
% to the lecture number.
%
\newcounter{lecnum}
\renewcommand{\thepage}{\thelecnum-\arabic{page}}
\renewcommand{\thesection}{\thelecnum.\arabic{section}}
\renewcommand{\theequation}{\thelecnum.\arabic{equation}}
\renewcommand{\thefigure}{\thelecnum.\arabic{figure}}
\renewcommand{\thetable}{\thelecnum.\arabic{table}}
%
% The following macro is used to generate the header.
%
\newcommand{\lecture}[4]{
\pagestyle{myheadings}
\thispagestyle{plain}
\newpage
\setcounter{lecnum}{#1}
\setcounter{page}{1}
\noindent
\begin{center}
\framebox{
\vbox{\vspace{2mm}
\hbox to 6.28in { {\bf COS 597G: Toward Theoretical Understanding of Deep Learning
\hfill Fall 2018} }
\vspace{4mm}
\hbox to 6.28in { {\Large \hfill Lecture #1: #2 \hfill} }
\vspace{2mm}
\hbox to 6.28in { {\it Lecturer: #3 \hfill Scribe: #4} }
\vspace{2mm}}
}
\end{center}
\markboth{Lecture #1: #2}{Lecture #1: #2}
%{\bf Note}: {\it LaTeX template courtesy of UC Berkeley EECS dept.}
\vspace*{1mm}
}
%
% Convention for citations is authors' initials followed by the year.
% For example, to cite a paper by Leighton and Maggs you would type
% \cite{LM89}, and to cite a paper by Strassen you would type \cite{S69}.
% (To avoid bibliography problems, for now we redefine the \cite command.)
% Also commands that create a suitable format for the reference list.
\renewcommand{\cite}[1]{[#1]}
\def\beginrefs{\begin{list}%
{[\arabic{equation}]}{\usecounter{equation}
\setlength{\leftmargin}{2.0truecm}\setlength{\labelsep}{0.4truecm}%
\setlength{\labelwidth}{1.6truecm}}}
\def\endrefs{\end{list}}
\def\bibentry#1{\item[\hbox{[#1]}]}
%Use this command for a figure; it puts a figure in wherever you want it.
%usage: \fig{NUMBER}{SPACE-IN-INCHES}{CAPTION}
\newcommand{\fig}[3]{
\vspace{#2}
\begin{center}
Figure \thelecnum.#1:~#3
\end{center}
}
% **** IF YOU WANT TO DEFINE ADDITIONAL MACROS FOR YOURSELF, PUT THEM HERE:
\begin{document}
%FILL IN THE RIGHT INFO.
%\lecture{**LECTURE-NUMBER**}{**DATE**}{**LECTURER**}{**SCRIBE**}
\lecture{2}{19 September 2018}{Sanjeev Arora}{Mikhail Khodak}
% **** YOUR NOTES GO HERE:
\section{Gradient Descent}
Our goal in this lecture is to explore what can be proved about (non-stochastic) {\em gradient descent} (GD) when optimizing a nonconvex function $f:\mathbb{R}^d\to\mathbb{R}$.
It is well-known that if $f$ is strongly convex then the standard GD iteration $x_{k+1}=x_k-\eta\nabla f(x_k)$ converges quickly to a global optimum for appropriate choice of $\eta$.
However, what can we say about the nonconvex case?
Can we even guarantee that each iteration decreases the function value?
Will it at least arrive at a local minimum?
The answer to the first question turns out to be yes (Lemma~\ref{lem:decrease}).
Proving the second turns out to require the introduction of random jumps into the gradient algorithm that allow it to escape from saddle points (Theorem~\ref{thm:escape}).
In this lecture we'll show both of these facts under common regularity assumptions on $f$, which together show that {\em perturbed gradient descent} (PGD) reaches an approximate {\em second-order local minimum} in polynomial time.
Although these results do not guarantee global optimality for all problems, there are multiple nonconvex settings where any second-order local minimum is enough for global optimality, such as matrix completion or phase retrieval.
These results are based on the work of Jin-Ge-Lee-Jordan (2017).
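To fix ideas, the plain GD iteration $x_{k+1}=x_k-\eta\nabla f(x_k)$ can be sketched in a few lines; the strongly convex quadratic used below is an illustrative stand-in, not an example from the lecture.

```python
import numpy as np

def gd(grad_f, x0, eta=0.1, steps=100):
    # Plain (non-stochastic) GD: x_{k+1} = x_k - eta * grad_f(x_k).
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - eta * grad_f(x)
    return x

# Illustrative strongly convex case: f(x) = ||x||^2 / 2, so grad f(x) = x.
# Each step contracts x by a factor (1 - eta), so GD converges to the
# global optimum x = 0 geometrically.
x_final = gd(lambda x: x, x0=[1.0, -2.0], eta=0.1, steps=200)
```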
\section{Reaching a Stationary Point}
We consider functions $f$ that satisfy the following two properties:
\begin{enumerate}
\item $\rho$-Hessian Lipschitzness: $\|\nabla^2f(x)-\nabla^2f(x')\|\le\rho\|x-x'\|$
\item $\ell$-smoothness: $\|\nabla f(x)-\nabla f(x')\|\le\ell\|x-x'\|$
\end{enumerate}
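Both regularity conditions can be probed numerically on a hypothetical test function; here $f(x)=\cos x$ satisfies both with $\ell=\rho=1$, since $|f''|\le1$ and $|f'''|\le1$.

```python
import numpy as np

# Illustrative check of the two regularity conditions for f(x) = cos(x):
# grad f(x) = -sin(x) is 1-Lipschitz and hess f(x) = -cos(x) is 1-Lipschitz.
grad = lambda x: -np.sin(x)
hess = lambda x: -np.cos(x)

rng = np.random.default_rng(0)
xs, ys = rng.uniform(-5, 5, 1000), rng.uniform(-5, 5, 1000)
# Empirical Lipschitz ratios over random pairs; both should stay <= 1.
ell_est = max(abs(grad(x) - grad(y)) / abs(x - y) for x, y in zip(xs, ys))
rho_est = max(abs(hess(x) - hess(y)) / abs(x - y) for x, y in zip(xs, ys))
```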
Let's first see that the function value decreases using GD with appropriate step-size:
\begin{lemma}
\label{lem:decrease}If $\eta\le\frac{1}{\ell}$ then taking a GD step from $x_t$ to $x_{t+1}$ results in $f(x_{t+1})-f(x_t)\le-\frac{1}{2\eta}\|x_{t+1}-x_t\|^2$.
\end{lemma}
\begin{proof}
We apply the $\ell$-smoothness of $f$ to get
\begin{align*}
f(x_{t+1})&\le f(x_t)+\nabla f(x_t)\cdot(x_{t+1}-x_t)+\frac{\ell}{2}\|x_{t+1}-x_t\|^2\\
&=f(x_t)-\eta\|\nabla f(x_t)\|^2+\frac{\eta^2\ell}{2}\|\nabla f(x_t)\|^2\\
&\le f(x_t)-\frac{\eta}{2}\|\nabla f(x_t)\|^2\\
&=f(x_t)-\frac{1}{2\eta}\|x_{t+1}-x_t\|^2
\end{align*}
\end{proof}
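As a sanity check (not part of the lecture), the inequality of Lemma~\ref{lem:decrease} can be verified numerically along a GD trajectory for the $1$-smooth nonconvex function $f(x)=\cos x$, taking $\eta\le\frac{1}{\ell}=1$:

```python
import numpy as np

# f(x) = cos(x) is nonconvex with ell = 1; take eta <= 1/ell.
f = lambda x: float(np.cos(x))
grad_f = lambda x: float(-np.sin(x))
eta = 0.5

x = 1.0  # arbitrary start
violations = 0
for _ in range(50):
    x_next = x - eta * grad_f(x)
    # Lemma: f(x_next) - f(x) <= -(1/(2*eta)) * (x_next - x)^2.
    if f(x_next) - f(x) > -(x_next - x) ** 2 / (2 * eta) + 1e-12:
        violations += 1
    x = x_next
```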
Thus despite the nonconvexity we know that for nice functions we can always decrease the function value using gradient descent.
However, this doesn't tell us what kind of stationary point ($\|\nabla f\|<\varepsilon$ for small $\varepsilon$) we will reach after iterating.
Assuming random initialization, with probability 1 we won't reach a local maximum, as we can't descend into it and so can only get stuck there by starting there.
We could also reach a local minimum, which is likely the best we can do.
Finally, we could also reach a saddle point: a point with zero gradient that is maximal in some directions and minimal in others.
Visualizing this problem in two dimensions, escaping seems trivial: try a couple of random directions and one of them will likely let you start descending again.
However, the number of directions we'd need to try might be exponentially large in the dimension $d$.
A standard approach here is to compute the local Hessian $\nabla^2f$: if there is an eigenvector with a negative eigenvalue then take that descent direction.
However, second-order methods are expensive, and we would like to show that a first-order algorithm can also succeed.
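The second-order check just described can be sketched as follows; the quadratic saddle used here is a hypothetical example, not one from the lecture.

```python
import numpy as np

def negative_curvature_direction(hess, tol=0.0):
    # Return the eigenvector of the smallest Hessian eigenvalue if that
    # eigenvalue is negative (a descent direction at a saddle), else None.
    eigvals, eigvecs = np.linalg.eigh(hess)  # eigenvalues in ascending order
    if eigvals[0] < tol:
        return eigvecs[:, 0]
    return None

# At the saddle of f(x, y) = x^2 - y^2 the Hessian is diag(2, -2);
# the negative-curvature direction is the y-axis.
H = np.diag([2.0, -2.0])
v = negative_curvature_direction(H)
```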
To gain an understanding of what to do next, let's consider what happens geometrically when we reach a stationary point.
By a simple application of Cauchy-Schwarz we have the following lemma, which says that if the function value doesn't change after several iterations then gradient descent must be stuck in a ball of small radius:
\begin{lemma}
\label{lem:stuck}If $f(x_T)-f(x_0)\ge-\mathscr{F}$ then $\forall~t\le T$ we have $\|x_t-x_0\|\le\sqrt{2\eta T\mathscr{F}}$.
\end{lemma}
\begin{proof}
We apply Cauchy-Schwarz followed by the inequality from Lemma~\ref{lem:decrease}:
$$\|x_t-x_0\|\le\sum\limits_{\tau=1}^t\|x_\tau-x_{\tau-1}\|\le\sqrt{T\sum\limits_{\tau=1}^t\|x_\tau-x_{\tau-1}\|^2}\le\sqrt{2\eta T(f(x_0)-f(x_t))}\le\sqrt{2\eta T\mathscr{F}}$$
\end{proof}
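As an illustration (not from the lecture), Lemma~\ref{lem:stuck} can be checked numerically with the $1$-smooth objective $f(x)=\cos x$:

```python
import numpy as np

# Illustrative 1-smooth nonconvex objective: f(x) = cos(x).
f = lambda x: float(np.cos(x))
grad_f = lambda x: float(-np.sin(x))
eta, T = 0.5, 30

xs = [1.0]
for _ in range(T):
    xs.append(xs[-1] - eta * grad_f(xs[-1]))

# With F = f(x_0) - f(x_T) the hypothesis f(x_T) - f(x_0) >= -F holds
# with equality, so every iterate should stay within sqrt(2*eta*T*F) of x_0.
F = f(xs[0]) - f(xs[-1])
bound = (2 * eta * T * F) ** 0.5
max_dist = max(abs(x - xs[0]) for x in xs)
```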
\section{Escaping from Saddle Points}
Lemma~\ref{lem:stuck} shows that escaping from a saddle point will require escaping a ball of small radius in which the iteration is stuck.
As mentioned before, the second-order solution is to take the Hessian and find the descent direction via the smallest eigenvalue.
However, we can do just as well in polynomially many tries by just picking a random direction:
\begin{enumerate}
\item for $t=1,\dots,T$:
\item\quad if no progress in a fixed number of steps: $x_t\gets x_t+\xi_t$ for $\xi_t\sim B_0(r)$
\item\quad $x_{t+1}\gets x_t-\eta\nabla f(x_t)$
\end{enumerate}
where $B_0(r)$ is the uniform distribution over the ball of radius $r$.
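A minimal sketch of this loop is below; the progress test, perturbation radius, step-size, and the test function $f(x,y)=x^2-y^2+y^4$ are all illustrative choices, not the tuned parameters of the theorem. Started on the stable manifold of the saddle at the origin, plain GD stalls there, while the random jumps let PGD descend to a minimum at $y=\pm1/\sqrt{2}$ with $f=-1/4$.

```python
import numpy as np

def pgd(grad_f, f, x0, eta=0.05, r=0.1, t_thresh=20, steps=2000, seed=0):
    # Perturbed gradient descent: if f has made no progress for t_thresh
    # steps, add a perturbation drawn uniformly from the ball of radius r.
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    last_f, since_progress = f(x), 0
    for _ in range(steps):
        if since_progress >= t_thresh:
            xi = rng.normal(size=x.shape)
            xi *= r * rng.uniform() ** (1 / len(x)) / np.linalg.norm(xi)
            x = x + xi  # random jump inside B_0(r)
            since_progress = 0
        x = x - eta * grad_f(x)
        if f(x) < last_f - 1e-10:
            last_f, since_progress = f(x), 0
        else:
            since_progress += 1
    return x

# Illustrative objective with a saddle at the origin (f = 0 there) and
# global minima at (0, +-1/sqrt(2)) with f = -1/4.
f = lambda z: z[0] ** 2 - z[1] ** 2 + z[1] ** 4
grad_f = lambda z: np.array([2 * z[0], -2 * z[1] + 4 * z[1] ** 3])
x_final = pgd(grad_f, f, x0=[1.0, 0.0])
```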
For this {\em perturbed gradient descent} (PGD) algorithm, Theorem~\ref{thm:escape} provides a guarantee for reaching an {\em $\varepsilon$-approximate second-order stationary point}, which is defined as a point $x$ with $\|\nabla f(x)\|\le\varepsilon$ and $\lambda_\textrm{min}(\nabla^2f(x))\ge-\sqrt{\rho\varepsilon}$ (recall that when the Hessian is positive semi-definite, i.e.\ has all nonnegative eigenvalues, then the stationary point is a local minimum).
\begin{theorem}
\label{thm:escape}PGD with appropriate parameters for $\eta,r$, and the number of steps between perturbations will reach an $\varepsilon$-approximate second order stationary point with $1-\delta$ probability in $\tilde{\mathcal{O}}\left(\frac{1}{\varepsilon^2}\log\frac{1}{\delta}\right)$ iterations.
\end{theorem}
\textbf{Note:} Here the parameters are $\eta=\frac{1}{\ell}$ and $r=\frac{1}{200\chi^3\sqrt{\kappa}}\sqrt{\frac{\varepsilon}{\rho}}$, where $\kappa=\frac{\ell}{\sqrt{\varepsilon\rho}}$, $\chi=\Omega\left(\log\frac{\Delta_fd\sqrt{\kappa}}{\eta\varepsilon^2\delta}\right)$, the number of steps between perturbations is $\chi\kappa$, and $\Delta_f=f(x_0)-f(x^*)$.
The $\tilde{O}$ in the statement hides log factors in $\frac{1}{\varepsilon}$; specifically the number of iterations is $\mathcal{O}\left(\frac{\ell\Delta_f}{\varepsilon^2}\chi^4\right)$.
We prove this theorem by showing that when the iteration is stuck near a saddle point then the region in the ball of sufficient radius from which regular GD won't make progress (i.e. start decreasing again) is small, so making a random jump will work after a small number of attempts.
Combined with the fact that GD decreases the function value even for nonconvex functions (Lemma~\ref{lem:decrease}), this will complete the proof.
We first show that if GD cannot escape from a point $x_0$ near a saddle point $\tilde{x}$, then for large enough $r_0$ it can escape from the point $x_0+r_0e_1$, where $e_1$ is the eigenvector of the smallest eigenvalue of $\nabla^2f(\tilde{x})$.
The approach is to take two GD sequences starting from $x_0$ and $x_0+r_0e_1$ and show that after some time $T$ the distance between them is large, so at least one must have escaped from $\tilde{x}$.
\begin{lemma}
\label{lem:escape}
Consider $\tilde{x}$ such that $\lambda_\textrm{min}(\nabla^2f(\tilde{x}))\le-\sqrt{\rho\varepsilon}$ and $x_0,x_0'$ at most distance $r$ away from $\tilde{x}$ and $x_0'=x_0+r_0e_1$ for $e_1$ the minimum eigendirection of $\nabla^2f(\tilde{x})$.
Then for appropriate choice of $T$ and $r$ (depending on $\varepsilon,\delta$, and $f$) and $\mathscr{F}=\tilde{\Omega}\left(\varepsilon^\frac{3}{2}\right)$ we have
$$\min\{f(x_T)-f(x_0),f(x_T')-f(x_0')\}\le-\mathscr{F}$$
where $\{x_t\}_{t=1}^T$ and $\{x_t'\}_{t=1}^T$ are sequences of GD steps starting from $x_0$ and $x_0'$, respectively.
\end{lemma}
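Before the proof, the two-sequence coupling can be seen in a small numerical experiment (the objective $f(x,y)=x^2-y^2+y^4$ and the constants are illustrative): start GD exactly at the saddle, where it never moves, and from a copy shifted by $r_0$ along the negative-curvature eigendirection $e_1$; the shifted copy escapes and its function value drops.

```python
import numpy as np

# Illustrative objective: saddle at the origin, Hessian diag(2, -2) there,
# so the minimum eigendirection is e1 = (0, 1).
f = lambda z: z[0] ** 2 - z[1] ** 2 + z[1] ** 4
grad_f = lambda z: np.array([2 * z[0], -2 * z[1] + 4 * z[1] ** 3])

eta, T, r0 = 0.05, 300, 1e-3
e1 = np.array([0.0, 1.0])
x = np.array([0.0, 0.0])   # exactly at the saddle: the gradient is zero
xp = x + r0 * e1           # shifted copy
x0, xp0 = x.copy(), xp.copy()
for _ in range(T):
    x = x - eta * grad_f(x)
    xp = xp - eta * grad_f(xp)

# min over the two sequences of the total decrease, as in the lemma.
decrease = min(f(x) - f(x0), f(xp) - f(xp0))
```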
\begin{proof}
Suppose for contradiction that both $f(x_T)-f(x_0)>-\mathscr{F}$ and $f(x_T')-f(x_0')>-\mathscr{F}$; then Lemma~\ref{lem:stuck} applied to each sequence gives, $\forall~t\le T$,
$$\max\{\|x_t-\tilde{x}\|,\|x_t'-\tilde{x}\|\}\le\max\{\|x_t-x_0\|+\|x_0-\tilde{x}\|,\|x_t'-x_0'\|+\|x_0'-\tilde{x}\|\}\le\sqrt{2\eta T\mathscr{F}}+r$$
Denoting $\mathcal{H}=\nabla^2f(\tilde{x})$ and $\mathscr{S}=\sqrt{2\eta T\mathscr{F}}+r$ we can track the difference $w_t=x_t-x_t'$ between the two GD sequences using the gradient update:
$$w_{t+1}=w_t-\eta\left[\nabla f(x_t)-\nabla f(x_t')\right]=(I-\eta\mathcal{H})w_t-\eta\Delta_tw_t=(I-\eta\mathcal{H})^{t+1}w_0-\eta\sum\limits_{\tau=0}^t(I-\eta\mathcal{H})^{t-\tau}\Delta_\tau w_\tau$$
where $\Delta_t=\int_0^1\left[\nabla^2f(x_t'+\theta(x_t-x_t'))-\mathcal{H}\right]d\theta$, which satisfies $\|\Delta_t\|\le\rho\max\{\|x_t-\tilde{x}\|,\|x_t'-\tilde{x}\|\}\le\rho\mathscr{S}$ by the Hessian Lipschitzness of $f$.
We now consider the following statement, which bounds the second quantity above in terms of the first:
$$\left\|\eta\sum\limits_{\tau=0}^{t-1}(I-\eta\mathcal{H})^{t-1-\tau}\Delta_\tau w_\tau\right\|\le\frac{1}{2}\|(I-\eta\mathcal{H})^tw_0\|$$
For $t=0$ this is immediate.
Assume the statement holds for all $t'\le t$; then for each such $t'$ the triangle inequality gives
$$\|w_{t'}\|\le\|(I-\eta\mathcal{H})^{t'}w_0\|+\left\|\eta\sum\limits_{\tau=0}^{t'-1}(I-\eta\mathcal{H})^{t'-1-\tau}\Delta_\tau w_\tau\right\|\le 2\|(I-\eta\mathcal{H})^{t'}w_0\|$$
Letting $\gamma=-\lambda_\textrm{min}(\nabla^2f(\tilde{x}))\ge\sqrt{\rho\varepsilon}$, since $w_0=r_0e_1$ is an eigenvector of $\mathcal{H}$ we have $\|(I-\eta\mathcal{H})^tw_0\|=(1+\eta\gamma)^tr_0$; the inductive step for $t+1$ then follows by bounding each term of the sum using $\|\Delta_\tau\|\le\rho\mathscr{S}$ and $\|w_\tau\|\le2(1+\eta\gamma)^\tau r_0$, together with the choice of $T$ and $r$.
The claim thus gives $\|w_t\|\ge\frac{1}{2}(1+\eta\gamma)^tr_0$, which grows exponentially in $t$, whereas $\|w_t\|\le\|x_t-\tilde{x}\|+\|x_t'-\tilde{x}\|\le2\mathscr{S}$ for all $t\le T$.
For $T=\tilde{\Omega}\left(\frac{1}{\eta\sqrt{\rho\varepsilon}}\right)$ these two bounds are contradictory, completing the proof.
\end{proof}