least squares with least thinking

Outline

The midterm covers lecture material and non-optional reading. Double sided notes on an A4 paper is allowed.

Some tips:

Nothing in the overview (lecture 1) will appear in the midterm
Make sure you understand losses and their properties fairly well, especially what was in assignment 1
Understand backprop, automatic differentiation and the graph formulation
Understand modules and how to implement them
Model structures
- How backprop works when parameters are shared
- Understand the limitations/advantages of each model type
- Apply a model to problems with certain properties
- Equations for model architectures will be provided
Word vectors
- Distributional hypothesis and its implications
- From Turney and Pantel, focus on the types of matrices (term-doc, word-context, pair-pattern, etc.) and what they capture
- The glove and word2vec papers should be read carefully
Encoder decoder models
Regularization and optimization (less emphasized)
- Equations for optimization method will be provided, questions about their properties
- Questions about when to change regularization type/strength

Example questions

Generally will require a bit of thinking in addition to knowing some basic concepts.

Loss functions

When the hinge loss is written as $\max(0, 1 + s_{y'} - s_y)$ , what is $y'$ ?

Show that the hinge loss is an upperbound of the 0-1 loss:

$\max(0, 1 + s_{y'} - s_y) > 1[y' \neq y]$

The sparse classifier in assignment 1 allowed for efficient updates when each data point has sparse features. What is the $L_2$ regularizer, and what happens to the sparse classifier updates with a $L_2$ regularizer?

Neural networks basics

What happens to the training process if the parameters $W_i, b_i$ before going through a ReLu layer are initialized at 0

Suppose we are using a neural networks with sigmoid units, where $W \in R^{m \times n}$ with $W_{ij} \sim \mathcal{N}(0, 0.1)$ are the weights. The network designer decided that the model is not powerful enough, and added 10 times as many hidden units to each layer. What happens to the gradient of this network?

Explain the forward and backward functions in a module. What are the inputs and outputs, what are their dimensions? Write clear pseudo-code (or python) for the forward and backward functions in the following modules

$\begin{align} y &= [x; x] \\ y &= x_1 + x_2 \\ y &= ||x||^2 \end{align}$

Neural network structures

Given update equations for an RNN model, draw the diagram.

Write weight matrix for a recurrent neural network that does the following

Whenever there is q, u will follow, otherwise i.i.d from $p(x)$

Can a RNN model that a word never ends in v or j? If so, specify the weights, if not propose a modification then specify the weights.