COS 429 - Computer Vision

Fall 2019



Assignment 4: Deep Learning

Due Thursday, Dec. 5


Part I. Implementing SGD

Most deep networks are trained using some variant of stochastic gradient descent (SGD). Its main advantage is that it operates on one training example at a time, avoiding both the memory and computational costs of inverting a matrix the size of the entire training dataset.

Your first task is to re-implement the logistic regression code you wrote for Assignment 2, Part 1 so that it uses stochastic gradient descent. Recall that this optimizes the following loss function (where the terminology has been changed just a bit to be more consistent with the rest of this assignment): $$ L = \sum_i L_i(x_i, z_i, w) = \sum_i \bigl(s(w_1 x_{i1} + w_2 x_{i2} + w_3 x_{i3} + \ldots) - z_i\bigr)^2 = \sum_i \bigl(s(x_i \cdot w) - z_i\bigr)^2 , $$ where $x_i$ is the $i$-th datapoint, $z_i$ is its ground-truth label, and $w$ is the vector of parameters $(w_1,w_2,w_3,\ldots)^T$. (In the code, the vector $w$ is called params.) As before, $s$ is the logistic function, defined as $$ s(a) = \frac{1}{1 + e^{-a}}. $$
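For concreteness, here is a minimal NumPy sketch of the logistic function and the per-datapoint loss $L_i$ defined above; the function names (logistic, loss_i) are illustrative and may not match the starter code.

```python
import numpy as np

def logistic(a):
    """Logistic function s(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def loss_i(x_i, z_i, params):
    """Squared loss for a single datapoint: (s(x_i . w) - z_i)^2."""
    z_hat = logistic(np.dot(x_i, params))
    return (z_hat - z_i) ** 2
```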

SGD starts with a random initial estimate for the parameters $w^{(0)}$ and then repeatedly updates the estimate $w$ by walking along the negative of the gradient of one component of the loss function: $$ w^{(k)} = w^{(k-1)} - \eta_k \nabla L_i(x_i,z_i,w^{(k-1)}), $$ where $i$ is the index of a random datapoint and $\eta_k$ is the learning rate. (If we were doing regular gradient descent, we would have to sum up the gradient across all datapoints.)

As we saw in Assignment 2, because the derivative of $s$ is simple, evaluating gradients is simple as well: $$ \nabla L_i(x_i,z_i,w^{(k-1)}) = 2 \; (\hat{z}_i - z_i) \; \hat{z}_i \; (1-\hat{z}_i) \; x_i, $$ where $$ \hat{z}_i = s(x_i \cdot w^{(k-1)}). $$
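Putting the gradient and the update rule together, one SGD step on a single datapoint could look like the following sketch (it reuses the logistic helper from the sketch above; the names are again illustrative, not those of the starter code):

```python
def sgd_step(x_i, z_i, params, eta):
    """One SGD update on a single datapoint.

    The gradient of L_i is 2 * (z_hat - z_i) * z_hat * (1 - z_hat) * x_i,
    where z_hat = s(x_i . params).
    """
    z_hat = logistic(np.dot(x_i, params))
    grad = 2.0 * (z_hat - z_i) * z_hat * (1.0 - z_hat) * x_i
    return params - eta * grad
```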

The choice of learning rate $\eta_k$ can dramatically affect performance, and in fact a great deal of current research involves schemes for intelligently setting and varying $\eta$ (including varying the learning rate per component of the weight vector). We will use a simple scheme that decreases the rate over time: $$ \eta_k = \frac{1}{\text{epoch}}. $$ Recall that learning proceeds in a series of epochs, where one epoch consists of an SGD step for each of the input datapoints, in a random order. (This is a pretty high learning rate, since it is 1 for the entire first epoch. Larger networks typically work much better with a smaller learning rate.)
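A possible shape for the overall training loop, with one shuffled pass over the data per epoch and $\eta_k = 1/\text{epoch}$, is sketched below. It builds on the sgd_step sketch above; the fixed epoch count and the small random initialization are illustrative assumptions, not requirements of the starter code.

```python
def train_sgd(X, z, num_epochs=10):
    """Train logistic-regression parameters with SGD.

    X: (n, d) array of datapoints, one row per example.
    z: (n,) array of ground-truth labels.
    """
    n, d = X.shape
    params = 0.01 * np.random.randn(d)      # random initial estimate w^(0) (illustrative scale)
    for epoch in range(1, num_epochs + 1):
        eta = 1.0 / epoch                   # learning rate: 1 for the first epoch, then decreasing
        for i in np.random.permutation(n):  # each epoch visits every datapoint in random order
            params = sgd_step(X[i], z[i], params, eta)
    return params
```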

As with the choice of learning rate, determining when to terminate is an interesting research question. We will simply terminate after a fixed number of epochs.


Download the starter code for this part. It contains the following files:

Do:

What to turn in:




Last update 25-Nov-2019 13:57:23