Least squares with least thinking

We use the least squares problem to illustrate the approach of applied machine learning.

Given a set of pairs $(x_i, y_i)$, $i = 1, \dots, n$, with $x_i, y_i \in \mathbb{R}$, we would like to obtain $w$ that minimizes the squared error,

\begin{align} L(w) = \sum_{i=1}^n (x_i \cdot w - y_i)^2 \end{align}

18th century math

Method: derive a formula

It is easy to derive a closed form solution for this problem:

If $w$ is a vector in $\mathbb{R}^d$ (with each $x_i \in \mathbb{R}^d$ accordingly), the problem can be written as $L(w) = \lVert Xw - y \rVert^2$, where $X$ stacks the $x_i$ as rows and $y$ stacks the $y_i$, and following the rules of vector calculus, we can derive the solution

\begin{align} w^* = (X^\top X)^{-1} X^\top y \end{align}
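As a sanity check, here is a minimal numpy sketch of the closed-form solution; the data, dimensions, and noise level are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))                 # rows are the x_i
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Closed form: w* = (X^T X)^{-1} X^T y
w_closed = np.linalg.inv(X.T @ X) @ (X.T @ y)
print(w_closed)
```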

Computer science

Method: theoretically justified algorithms

However, inverting $X^\top X$ is numerically unstable and computationally expensive ($O(d^3)$). Computer science does not insist on expressing a solution in closed form, as long as there is an algorithm. If we have an algorithm to solve linear equations, it suffices to specify that we want the $w$ that solves $X^\top X\, w = X^\top y$. Algorithms that do this range from matrix reduction to conjugate gradient.
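For instance, a sketch in numpy/scipy on synthetic data: `np.linalg.solve` goes through an LU-style factorization, while `scipy.sparse.linalg.cg` is one conjugate gradient routine (the sizes and data here are my own choices for illustration):

```python
import numpy as np
from scipy.sparse.linalg import cg

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

A, b = X.T @ X, X.T @ y                     # normal equations A w = b

w_direct = np.linalg.solve(A, b)            # direct: matrix reduction (LU)
w_iter, info = cg(A, b)                     # iterative: conjugate gradient
print(np.abs(w_direct - w_iter).max())      # the two agree up to the solver tolerance
```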

There are still issues to be concerned about:

  • The system might be singular (e.g. if $n < d$). For a minimizer $w^*$, any $w^* + v$ is equally good if $Xv = 0$ (see the sketch after this list).
  • Solving linear systems requires matrix reductions such as Gaussian elimination or LU factorization. What if we do not have enough compute to do this?
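To make the first point concrete, here is a small numpy sketch (with made-up sizes) of an underdetermined problem, where perturbing a minimizer along the null space of $X$ leaves the loss unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                                # fewer equations than unknowns
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w_star = np.linalg.lstsq(X, y, rcond=None)[0]   # one particular minimizer

_, _, Vt = np.linalg.svd(X)                 # rows n..d-1 of Vt span the null space of X
v = Vt[-1]

def loss(w):
    return np.sum((X @ w - y) ** 2)

print(loss(w_star), loss(w_star + 10 * v))  # essentially the same loss
```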

These issues require us to think more if we want the exact, guaranteed solution.

Applied machine learning

Method: try a flexible algorithm on the actual data and model

In machine learning, an algorithm does not need to work for all inputs and models. In fact, we might not even care too much about the actual loss, since the goal is to generalize as opposed to minimizing the training loss. So, we want an algorithm that gives a reasonable result without having to fully understand the data and the model. The most successful algorithm in ML is (stochastic) gradient descent, where at each step $t$ it picks a random example $i_t$ and does the following:

\begin{align} w_{t+1} = w_t - \eta \, \nabla L_{i_t}(w_t) \end{align}

where $L_i(w)$ is the loss on the $i$-th example.

The gradient for the least squares problem is

\begin{align} \nabla L_i(w) = 2\,(x_i \cdot w - y_i)\, x_i \end{align}
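A minimal SGD loop for this objective might look as follows; the step size, iteration count, and synthetic data are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

w = np.zeros(d)
eta = 0.01                                  # step size
for t in range(10_000):
    i = rng.integers(n)                     # sample one example
    grad = 2 * (X[i] @ w - y[i]) * X[i]     # gradient of (x_i . w - y_i)^2
    w -= eta * grad

print(np.sum((X @ w - y) ** 2))             # training loss after SGD
```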

For least squares, SGD is rather nice:

  • The gradient for the vector case is no more complicated than the scalar case.
  • If we have a limited budget and stop in the middle, we get a higher training loss. However, this is often okay (or even done deliberately, as in early stopping) because the goal is to generalize well, as opposed to minimizing the training loss.
  • (might) work for ill-conditioned/underdetermined problems
    • If the initial value is $w_0 = 0$, SGD finds a solution with small norm, rather than a faraway point in the solution subspace (see the sketch below).
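Here is a sketch of that last point on an underdetermined problem (sizes and step size are my own choices): starting from $w_0 = 0$, every SGD update is a multiple of some $x_i$, so the iterates stay in the row space of $X$ and end up near the minimum-norm interpolating solution, which `np.linalg.pinv` computes directly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10, 50                               # underdetermined: infinitely many exact solutions
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w = np.zeros(d)                             # start at zero
eta = 0.005
for t in range(50_000):
    i = rng.integers(n)
    w -= eta * 2 * (X[i] @ w - y[i]) * X[i]

w_min_norm = np.linalg.pinv(X) @ y          # minimum-norm solution
# the gap should be small relative to the norm of the solution
print(np.linalg.norm(w - w_min_norm), np.linalg.norm(w_min_norm))
```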

In fact, SGD does not even care that we are solving least squares; $L_i$ can be some other loss function. SGD is popular and effective in machine learning precisely because it behaves robustly even when we do not fully understand the model or the data (which is always the case). Instead of relying on models and methods that we fully understand, it seems more successful in ML to empirically test whether SGD solves the problem – it often does; if not, analyze the issue, change something, and try again.
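For example, here is the same loop with the gradient of the logistic loss swapped in (my choice of loss and data, not from the text), turning it into logistic regression without touching anything else:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d))         # labels in {-1, +1}

w = np.zeros(d)
eta = 0.1
for t in range(10_000):
    i = rng.integers(n)
    # gradient of the logistic loss log(1 + exp(-y_i * x_i . w))
    grad = -y[i] * X[i] / (1 + np.exp(y[i] * (X[i] @ w)))
    w -= eta * grad

print(np.mean(np.sign(X @ w) == y))         # training accuracy
```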