COS 429 - Computer Vision

Fall 2019



Assignment 4: Deep Learning

Due Thursday, Dec. 5


Part II. Adding a hidden layer

The "network" that you trained in part 1 was rather simple, and still only supported a linear decision boundary. The next step up in complexity is to add additional nodes to the network, so that the decision boundary can be made nonlinear. That will allow classifying datasets such as this one:

We will implement a simple network built only from "fully connected" layers, ReLU activations, and the logistic function. Here is a simple example of the architecture that we will use: it contains one "hidden" fully connected layer with two neurons ($u$ and $v$) and one output layer with one neuron:



For convenience, we've given the names $u$ and $v$ to the two hidden neurons, and $\hat{z}$ to the final output after it has been squished by the logistic function. As before, $z$ is the ground-truth label. The various weights $w$ are learned during training and then used during testing.
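As a concrete reference for the forward pass, here is a minimal sketch in Python/NumPy. The parameter names (`W1`, `b1`, `W2`, `b2`) are illustrative stand-ins for the individual weights $w_1, w_2, \ldots$ in the figure, not the starter code's own conventions:

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tiny_forward(x, W1, b1, W2, b2):
    """Forward pass of the two-hidden-neuron network on one input x (length 2).

    W1 (2x2) and b1 (length 2) produce the hidden neurons u and v;
    W2 (length 2) and b2 (scalar) produce the output. These matrix/vector
    names are illustrative stand-ins for the scalar weights in the figure.
    """
    h = relu(W1 @ x + b1)          # hidden activations: [u, v]
    z_hat = sigmoid(W2 @ h + b2)   # output after the logistic squashing
    return z_hat, h
```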

To train this network using SGD, we need to evaluate $$ \nabla L = \begin{pmatrix} \frac{\partial L}{\partial w_1} \\ \frac{\partial L}{\partial w_2} \\ \frac{\partial L}{\partial w_3} \\ \vdots \end{pmatrix} $$
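Each SGD step then nudges the weights against this gradient with a learning rate $\eta$, $$ w \;\leftarrow\; w - \eta \, \nabla L, $$ evaluated on one training example (or a small batch) at a time.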

As discussed in class, evaluating these partial derivatives is done using "backpropagation", which is just a fancy name for repeatedly applying the chain rule of differentiation, collapsing the result into a row vector for efficiency, and repeating. For example, suppose we wish to evaluate $\frac{\partial L}{\partial w_5}$. Tracing back through the network to find the dependency, we know that $L$ depends on $\hat{z}$, which depends on $v$, which in turn depends on $w_5$. So, using the chain rule, we can write $$ \frac{\partial L}{\partial w_5} = \frac{\partial L}{\partial \hat{z}} \frac{\partial \hat{z}}{\partial v} \frac{\partial v}{\partial w_5} $$ and then proceed to write out all of those partial derivatives: $$ \frac{\partial L}{\partial w_5} = \bigl( 2 \; (\hat{z} - z) \bigr) \bigl( \hat{z} \; (1 - \hat{z}) \; w_9 \bigr) \bigl( Z(v) \; x \bigr) $$ Here $Z$ is the function that returns 1 if its argument is $>$ 0 and 0 otherwise; it is the derivative of the ReLU, just as $\hat{z}\;(1-\hat{z})$ is the derivative of the logistic (sigmoid) function. (Note that it is critical for $Z$ to be defined with $>$ 0 and not $\geq$ 0: otherwise $Z(v)$ would be 1 whenever the ReLU is inactive and $v = 0$, even though the ReLU is flat there.)
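A practical way to catch mistakes in expressions like the one above is to compare your analytic gradients against a finite-difference estimate. Here is a minimal, generic checker (not part of the starter code; `loss_fn` is whatever scalar loss you implement as a function of the flattened weight vector):

```python
import numpy as np

def numeric_grad(loss_fn, w, eps=1e-6):
    """Central-difference estimate of dL/dw for a scalar-valued loss_fn(w).

    Compare this against the gradients your backward pass computes; they
    should agree to several decimal places, except possibly right at a
    ReLU kink (pre-activation exactly 0), which is why the > 0 convention
    above matters.
    """
    w = np.asarray(w, dtype=float)
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (loss_fn(w_plus) - loss_fn(w_minus)) / (2.0 * eps)
    return grad
```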

In practical implementations, for efficiency and modularity, each layer of the network supports a backward pass, which takes as input the derivatives of the loss with respect to the outputs of the layer and returns the derivatives of the loss with respect to the inputs of the layer (where "inputs" includes both the input vector $x$ to the layer and the parameters of that layer). To get comfortable with backprop, we suggest that you write out the partial derivatives of $L$ at each layer before moving on.
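As an illustration of this layer-wise interface (a sketch only; follow the starter code's actual function signatures), the forward and backward steps of a fully connected layer and of the ReLU might look like this, where `d_out` holds $\partial L / \partial(\text{output})$:

```python
import numpy as np

def fc_forward(x, W, b):
    """Fully connected layer: out = W x + b. Also returns a cache for backprop."""
    return W @ x + b, (x, W)

def fc_backward(d_out, cache):
    """Given dL/d(out), return dL/dx, dL/dW, dL/db."""
    x, W = cache
    d_x = W.T @ d_out          # passed back to the previous layer
    d_W = np.outer(d_out, x)   # used in the SGD update of W
    d_b = d_out                # used in the SGD update of b
    return d_x, d_W, d_b

def relu_backward(d_out, pre):
    """ReLU backward: gradient flows only where the pre-activation is > 0."""
    return d_out * (pre > 0)
```

Chaining these backward calls from the loss back to the first layer reproduces the chain-rule products written out above.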

The starter code for this part defines a function tinynet_sgd that supports the above architecture, but with a few important generalizations:


Do the following:

What to turn in:






Last update 25-Nov-2019 14:19:12