Learning the Covariance Matrix


When we defined the covariance of Feature i and Feature j, we gave the formula for computing it from examples. That formula lets us write the covariance matrix as follows. Let { x(1), x(2), ... , x(n) } be a collection of n examples, all from the same class, and let m be the estimated mean vector for that class. Then

C = [ (x(1) - m)(x(1) - m)' + (x(2) - m)(x(2) - m)' + ... + (x(n) - m)(x(n) - m)' ] / (n - 1) .
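To make the formula concrete, here is a minimal sketch in Python with NumPy (the function name and array layout are our own choices, not part of the original text). It stacks the n examples as the rows of an n-by-d array, estimates m, and forms the sum of outer products divided by n - 1:

    import numpy as np

    def estimate_covariance(X):
        # X is an n-by-d array whose rows are the examples x(1), ..., x(n).
        n, d = X.shape
        m = X.mean(axis=0)                # estimated mean vector for the class
        diffs = X - m                     # row k holds x(k) - m
        return diffs.T @ diffs / (n - 1)  # sum of outer products over n - 1

This is the same quantity NumPy's built-in np.cov(X, rowvar=False) computes, which can serve as a check.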

This formula conceals a trap for the unwary. Recall that x is a d-element vector, so C is a d-by-d matrix. Each term (x(k) - m)(x(k) - m)' is a matrix of rank one, and because m is estimated from the same examples, the n terms are not independent; as a result, C has rank at most n - 1. It follows that if n < d + 1, the matrix C is singular. That is very bad, since we need to invert C to form the Mahalanobis distance.
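The singularity is easy to demonstrate. The sketch below (the dimensions are arbitrary, chosen only for illustration) draws n = d random examples and checks the rank of the resulting C:

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 10, 10                      # n < d + 1, so C must be singular
    X = rng.normal(size=(n, d))        # n examples, each a d-element vector

    C = np.cov(X, rowvar=False)        # the formula given above
    print(np.linalg.matrix_rank(C))    # prints 9: at most n - 1, less than d
    # Attempting np.linalg.inv(C) fails or gives numerically meaningless values.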

Even if n > d + 1, this estimate of the true covariance matrix may be very poor. When you think about it, C contains d² elements. Taking into account that C has to be symmetric, we can show that C contains d(d+1)/2 independent elements. We should not expect a good estimate for C until our number n of examples gets close to the number d(d+1)/2 of unknown elements. This is not a big problem if d is small, but it is not unusual to have 50 or 100 features. If d = 100, d(d+1)/2 is about 5000, meaning that we need around 5000 examples!
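As a rough illustration of this sample-size effect (the dimension, the sample sizes, and the identity true covariance are our own choices for the demo, not from the text), the sketch below estimates C from random data and reports the relative error at several values of n:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 50                              # d(d+1)/2 = 1275 unknown elements
    I = np.eye(d)                       # true covariance of the data
    for n in (60, 250, 1000, 5000):
        X = rng.normal(size=(n, d))     # n examples with true covariance I
        C = np.cov(X, rowvar=False)
        err = np.linalg.norm(C - I) / np.linalg.norm(I)
        print(n, round(err, 3))         # relative error shrinks as n grows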
