In [1]:
__author__ = "Sida Wang"
__version__ = "COS 495 NLP Spring 2018"

Representations and loss functions

Recall that we considered the least squares problem,

\begin{align} L(w) = \sum_{i=1}^n (\phi(x_i) \cdot w - y_i)^2. \end{align}

What does this have to do with ML/NLP? A common approach in ML is to define a loss function that reflects what we desire, and then optimize it. Least squares is the simplest such loss function. Suppose we want to solve a sentiment classification problem under this loss; then $x_i$ refers to a document, $y_i \in \mathbb R$ is its label, and $\phi(x)$ is a suitable feature map.

Setup

Data

Concretely, we tackle the problem of classifying movie reviews, where each data point consists of a short review and its polarity. Get the data here:

git clone https://github.com/cos495nlp/data

In [2]:
import numpy as np
import random
def sentence_polarity():
    # label reviews from the positive file 'pos' and from the negative file 'neg'
    with open('./data/rt-polarity/rt-polarity.pos.utf8') as f:
        pos = [{'y':'pos', 'x':sent} for sent in f.readlines()]
    with open('./data/rt-polarity/rt-polarity.neg.utf8') as f:
        neg = [{'y':'neg', 'x':sent} for sent in f.readlines()]
    data = neg + pos
    random.seed(1)
    random.shuffle(data)
    print(len(data), len(pos))
    return data
data = sentence_polarity()
train_data,test_data = data[:9000],data[9000:]
test_data[:3]
10662 5331
Out[2]:
[{'x': "if you're burnt out on it's a wonderful life marathons and bored with a christmas carol , it might just be the movie you're looking for . it depends on how well flatulence gags fit into your holiday concept . \n",
  'y': 'pos'},
 {'x': "the film is beautifully mounted , but , more to the point , the issues are subtly presented , managing to walk a fine line with regard to the question of joan's madness . \n",
  'y': 'pos'},
 {'x': 'how many more times will indie filmmakers subject us to boring , self-important stories of how horrible we are to ourselves and each other ? \n',
  'y': 'neg'}]

Bag of features with hashing

If we can convert y = pos|neg to a scalar and the review to a vector, then we can apply least squares. For y it is easy: say 1 for pos and -1 for neg. At test time, we predict whichever label is closer to $\phi(x) \cdot w$, i.e. pos if $\phi(x) \cdot w > 0$ and neg otherwise.

How about the text itself? One intuitive and effective representation is the bag-of-words representation, where "bag" hints that word order is ignored. Here is an implementation of bag-of-words using hashing:

In [3]:
d = 100000
def bow_hash(x, d=d):
    one_ind = [hash(x_i) % d for x_i in x.split(' ')]
    phi_x = np.zeros(d)
    phi_x[one_ind] = 1
    return phi_x

phi = bow_hash
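
As a quick sanity check (a small sketch, not part of the original notebook), phi maps a sentence to a d-dimensional 0/1 vector with one active bucket per hashed word; note that distinct words can collide into the same bucket.

example = bow_hash('a great movie')
print(example.shape, int(example.sum()))  # (100000,) with at most 3 active buckets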

(Stochastic) gradient descent

Once we have defined the loss and converted each $x$ to a vector $\phi(x) \in \mathbb R^{d}$, we can use the stochastic gradient descent (SGD) algorithm, which, for one example at a time, performs the update \begin{align} w_i \gets w_{i-1} - \eta \nabla L(w_{i-1}). \end{align}

In [4]:
def sgd(w, gradloss, data, T=6, eta=1e-2, printIterval=3):
    for t in range(T):
        loss_t = 0
        for data_i in data:
            grad, loss = gradloss(w, data_i)
            w -= eta * grad
            loss_t += loss
        if t % printIterval == 0:
            print(t, '\t', loss_t / len(data))
    return w

# dataset-specific conversion for both x and y
def convert_y(lossf):
    return lambda w, data_i:\
        lossf(w, phi(data_i['x']), y = 1 if data_i['y']=='pos' else -1);

def predict(x, w):
    return 'pos' if np.dot(w, phi(x)) > 0 else 'neg'

def accuracy(predict, data):
    return np.mean([predict(data_i['x'])==data_i['y'] for data_i in data])

def print_results(w):
    print('train', accuracy(lambda x: predict(x, w), train_data))
    print('test', accuracy(lambda x: predict(x, w), test_data))

Loss functions

Squared difference

We can now classify documents using the squared loss and its gradient:

\begin{align} L(w) &= (\phi(x) \cdot w - y)^2, \\ \nabla L(w) &= 2\phi(x)(\phi(x) \cdot w - y). \end{align}

In [5]:
def gradloss_ls(w, phi_x, y):
    w_phi = np.dot(phi_x, w)
    loss = (w_phi - y)**2
    grad = 2 * phi_x * (w_phi - y)
    return grad, loss

w_ls = sgd(np.zeros(d), convert_y(gradloss_ls), train_data, T=5)
print_results(w_ls)
0 	 0.856967998556
3 	 0.401182978589
train 0.944888888889
test 0.758122743682

I got a test accuracy of over 70%, which is not bad for such a simple method, given that the squared loss does not even seem well suited to classification! Some results are in table 2 of baselines. Let us look at two other common loss functions.

Hinge loss

The hinge loss function and its (sub)gradient are

$$ \begin{align} L(w) &= \max(1 - y \phi(x) \cdot w, 0), \\ \nabla L(w) &= \begin{cases} -y \phi(x) & \text{if } 1 - y \phi(x)\cdot w > 0, \\ 0 & \text{otherwise.} \end{cases} \end{align} $$

This is also known as the SVM loss or the margin loss.

In [6]:
def gradloss_svm(w, phi_x, y):
    w_phi = np.dot(phi_x, w)
    loss = max(1 - w_phi*y, 0)
    grad = -phi_x*y if 1 - w_phi*y > 0 else 0
    return grad, loss

w_svm = sgd(np.zeros(d), convert_y(gradloss_svm), train_data, T=5)
print_results(w_svm)
0 	 0.812316666667
3 	 0.475362222222
train 0.870777777778
test 0.759927797834

Logistic loss

The logistic loss and its derivative are

\begin{align} L(w) &= \log(1 + \exp(-y \phi(x) \cdot w)), \\ \nabla L(w) &= -y \phi(x) \frac{\exp(-y \phi(x) \cdot w)}{1 + \exp(-y \phi(x) \cdot w)}. \end{align}

In [7]:
from numpy import exp, log
def gradloss_lr(w, phi_x, y):
    w_phi = np.dot(phi_x, w)
    loss = log(1 + exp(-w_phi*y)) # what could go wrong?
    grad = -phi_x*y * exp(-w_phi*y)/(1 + exp(-w_phi*y))
    return grad, loss

w_lr = sgd(np.zeros(d), convert_y(gradloss_lr), train_data, T=5)
print_results(w_lr)
0 	 0.647221821267
3 	 0.511347912144
train 0.821222222222
test 0.743080625752
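
The comment above ("what could go wrong?") hints at a numerical issue: exp(-w_phi*y) overflows once the score times the label is a large negative number. Here is a minimal sketch of a more stable variant (not from the original notebook; gradloss_lr_stable is a hypothetical name) that routes both the loss and the sigmoid factor through np.logaddexp:

def gradloss_lr_stable(w, phi_x, y):
    z = np.dot(phi_x, w) * y
    loss = np.logaddexp(0.0, -z)          # log(1 + exp(-z)) without overflow
    sig = np.exp(-np.logaddexp(0.0, z))   # sigmoid(-z) = exp(-log(1 + exp(z)))
    grad = -phi_x * y * sig
    return grad, loss

Swapping it in for gradloss_lr in the sgd call should give the same results whenever the original version does not overflow.
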
In [8]:
from vega import VegaLite
scores = np.linspace(-2, 4, num=500)
data = [(score, gradloss_lr(1, score, 1)[1], 'LR') for score in scores]
data += [(score, gradloss_svm(1, score, 1)[1], 'SVM') for score in scores]
data += [(score, 0.5*gradloss_ls(1, score, 1)[1], 'LS') for score in scores]

plotdata = list(zip(*data))
spec = \
{
    "width": 250, "height": 250,
    "mark": "line",
    "encoding": {
        "x": {
          "field": "score",
        },
        "y": {
          "field": "loss",
        },
         "color": {
          "field": "Loss type",
          "type": "nominal"
        }
    }
}
display(VegaLite(spec, {'score': plotdata[0], 'loss': plotdata[1], 'Loss type': plotdata[2]}))

Some predictions

In [9]:
def test(x):
    print(predict(x, w_svm), '\t', x)

test('a great movie')
test('a not so good movie')
test('it is hard to imagine something more sleep inducing')
test('there were many memorable moments')
test('it made me laugh so many times, its not even funny') # error
test('the beginning was great, but as a whole it sucked')
test('worth my money')
test('it was not bad')
test('no one on earth will say it is bad')

words = ['no', 'not', 'bad', 'awesome', 'good']
print([(x, w_svm[hash(x) % d]) for x in words])
pos 	 a great movie
neg 	 a not so good movie
neg 	 it is hard to imagine something more sleep inducing
pos 	 there were many memorable moments
neg 	 it made me laugh so many times, its not even funny
neg 	 the beginning was great, but as a whole it sucked
pos 	 worth my money
neg 	 it was not bad
neg 	 no one on earth will say it is bad
[('no', -0.68000000000000038), ('not', -0.049999999999999975), ('bad', -0.98000000000000065), ('awesome', 0.19000000000000003), ('good', 0.13999999999999996)]

Generalizing loss functions

We would like to generalize the loss function so it can handle a variety of tasks. With no assumptions on what $y$ might be, we can define the featurizer $\phi(x,y)$ to be a function of $y$ as well. Let the score of $y$ be $s_y = w \cdot \phi(x,y)$; then the prediction is the candidate with the maximum score

$$ \hat{y} = \arg\max_y w \cdot \phi(x,y). $$

The structured hinge loss is

$$ L(x,y,w) = \max(0, 1 - (w \cdot \phi(x,y) - \max_{y' \neq y} w \cdot \phi(x,y'))). $$
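
As a toy illustration (a sketch, not part of the original notebook; phi_xy and the candidate set are hypothetical), the structured hinge loss can be computed by enumerating a small candidate set:

def structured_hinge(w, phi_xy, x, y, candidates):
    scores = {yp: np.dot(w, phi_xy(x, yp)) for yp in candidates}  # s_{y'} for every candidate
    best_wrong = max(s for yp, s in scores.items() if yp != y)    # max_{y' != y} s_{y'}
    return max(0.0, 1.0 - (scores[y] - best_wrong))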

While the hinge loss has the advantage of being friendly to discrete search methods that find $\max_{y' \neq y} w \cdot \phi(x,y')$, it only pays attention to the first and second highest scores. Having probability values for all the predictions is useful for sampling and for soft labels. The softmax loss (i.e. the negative log likelihood) is:

\begin{align} \log p_w(y|x) = -L(x,y,w) &= \log\frac{\exp(w \cdot \phi(x,y))}{\sum_{y'} \exp(w \cdot \phi(x,y'))}\\ & = w \cdot \phi(x,y) - \log {\sum_{y'} \exp(w \cdot \phi(x,y'))}. \end{align}

If we are given a soft label $p^*(y)$, a slight generalization is the cross-entropy loss, which up to a constant independent of $w$ equals $$ L(x,y,w) = \operatorname{KL}(p^*(y) \,\|\, p_w(y | x)) = \sum_y p^*(y) \log \frac{p^*(y)}{p_w(y | x)}. $$
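
In the common multi-class case where $w \cdot \phi(x,y) = w_y \cdot \phi(x)$ (discussed in the examples section below), the softmax loss and its gradient take a simple form. Here is a minimal sketch (not from the original notebook; gradloss_softmax is a hypothetical name), where W is a K-by-d matrix of per-class weights and y is an integer class index:

def gradloss_softmax(W, phi_x, y):
    scores = W.dot(phi_x)                      # s_{y'} = w_{y'} . phi(x) for every class
    scores -= scores.max()                     # stabilize the exponentials
    p = np.exp(scores) / np.exp(scores).sum()  # p_w(y'|x)
    loss = -np.log(p[y])                       # negative log likelihood
    grad = np.outer(p, phi_x)                  # gradient is (p - one_hot(y)) outer phi(x)
    grad[y] -= phi_x
    return grad, loss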

Beyond accuracy

The hinge loss is a (convex) upper bound of the 0-1 loss $1[y \neq y']$. More generally, we might have a preference $\operatorname{Cost}(y,y')$ when the true answer is $y$ and $y'$ was predicted. For example, if we are predicting the numerical ratings of product reviews, it is worse when a review with true rating 5 is classified as rating 1 than when a review with true rating 5 is classified as rating 4. We want to modify the hinge loss to use $\operatorname{Cost}(y,y')=|y - y'|$ in place of $1[y \neq y']$. It should have the following properties to be considered a hinge loss:

  • The loss is an upper bound of $\operatorname{Cost}(y,y')$, where $y'$ is the max scoring prediction and $y$ is the right label
  • Increasing the scores of all wrong answers by $\Delta$ also increases the loss by $\Delta$ (assuming there is a loss, and one of the wrong answers is predicted)
  • The upper bound is tight when the score of the prediction is equal to the score of the label ($s_y = s_{y'}$) and the scores of all other candidates are small

Exercise: find an example of such a loss.

Examples of $\phi(x,y)$

For multi-class classification, the feature function usually just simulates a matrix-vector product: $w \cdot \phi(x,y) = w_y \cdot \phi(x)$. On the other hand, if $y$ has structure, the feature function might break down $y$. For example, if $y = [\text{Noun, Verb, Noun}]$ is a sequence of part-of-speech tags, a possible feature function is $$ \phi(x,y) = \left[I_\text{Noun}(y_1)\phi_1(x), I_\text{Verb before Noun}(y_1, y_2)\phi_2(x), I_\text{Noun before Verb}(y_1, y_2)\phi_3(x),\ldots\right], $$ where $I_{c}(y)$ are indicator functions of the condition $c$ on $y$.
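
For concreteness, here is a toy sketch of such a feature map (not from the original notebook; for simplicity it reuses the same $\phi(x)$ for every block, whereas the formula allows different $\phi_k(x)$):

def phi_xy(phi_x, tags):
    blocks = [
        phi_x * (tags[0] == 'Noun'),                        # I_Noun(y_1) phi_1(x)
        phi_x * (tags[0] == 'Verb' and tags[1] == 'Noun'),  # I_{Verb before Noun}(y_1, y_2) phi_2(x)
        phi_x * (tags[0] == 'Noun' and tags[1] == 'Verb'),  # I_{Noun before Verb}(y_1, y_2) phi_3(x)
    ]
    return np.concatenate(blocks)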

How well this works strongly depends on how good these features are. Instead of thinking too hard about good features, it is preferable to use a more flexible model that can learn the features. The starting point of feature learning is to put everything about the data into a distributed representation, that is, dense vectors.

Bag of vectors

The bag-of-words representation does not capture any similarities between words. Every word is treated as being completely different from every other word. An alternative that could capture similarity between words is the bag-of-vectors representation. To start, we just use random vectors, which do not perform too well.

In [10]:
vec_dim = 200
word_vecs = np.random.rand(d, vec_dim)  # random word vectors for now
def bov_hash(x):
    one_ind = [hash(x_i) % d for x_i in x.split(' ')]
    return np.mean(word_vecs[one_ind, :], 0)
phi = bov_hash
w_svm_bov = sgd(np.zeros(vec_dim), convert_y(gradloss_svm), train_data, T=30, eta=1e-3, printIterval=5)
print_results(w_svm_bov)
0 	 1.0047791627
5 	 0.962820148762
10 	 0.927296812348
15 	 0.903846992199
20 	 0.888806987751
25 	 0.87851092327
train 0.599
test 0.589049338147