Digit Recognition

In this project we will perform digit recognition on the original MNIST dataset, attempting to reproduce
many of the results presented in this paper. The state-of-the-art error rate on this dataset is roughly 0.5%; the methods
explored in this project achieve an error rate of approximately 2-3% (modulo implementation details). In this project I will exclusively
use linear kernels; the error rate can be reduced further by using non-linear kernels.
To keep the computational cost manageable, we will only train on a subset of the
data. Using the raw pixels as features to train a linear SVM, here is the plot of the error rate vs. the number of
training examples:

Error rates vs. Number of Training Samples

100 samples: 27.62%

200 samples: 23.23%

500 samples: 19.68%

1000 samples: 19.15%

2000 samples: 18.83%

5000 samples: 14.77%

10000 samples: 11.91%
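The baseline above can be sketched roughly as follows with scikit-learn. The bundled 8×8 digits dataset stands in for MNIST so the snippet is self-contained, and the helper name `pixel_svm_error` and the hyperparameters are illustrative assumptions, not the project's actual code.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def pixel_svm_error(n_train, X_train, y_train, X_test, y_test, C=1.0):
    """Train a linear SVM on the first n_train raw-pixel vectors
    and return the error rate on the held-out test set."""
    clf = LinearSVC(C=C, dual=False, max_iter=10000)
    clf.fit(X_train[:n_train], y_train[:n_train])
    return 1.0 - clf.score(X_test, y_test)

# scikit-learn's small 8x8 digits set stands in for MNIST here.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

for n in (100, 200, 500, 1000):
    err = pixel_svm_error(n, X_train, y_train, X_test, y_test)
    print(f"{n} samples: {100 * err:.2f}%")
```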

The performance can be improved by using more sophisticated features than raw pixel values.
Consider a feature in which the image is overlaid with a grid of c × c cells and the values inside
each cell are summed to produce one feature per cell. The paper also suggests using multiple overlapping grids,
offset from one another by c/2, instead of a single grid. We will construct two sets of grids, one with a cell size of 4 pixels and another
with a cell size of 7 pixels. The final feature vector will be the raw pixel values, together with the features from all
the cells, concatenated together.

Adding such cell features should let us characterize the data at a more macro scale: if there is noise that hurts accuracy when we look only at individual pixel values, summing over larger cells should, to some extent, average that noise away. Normalizing the individual cell features further limits how much any noise can contribute to the resulting errors. Theoretically, though, we should not expect the improvement to be dramatic.

Regarding the results below: the final feature vector is the concatenation of the raw pixel values and the cell features from both the 4 × 4 and 7 × 7 grids. All three portions of the final feature vector are max-normalized independently, which produced the best results at 10,000 samples. The small blip at 2,000 samples could be a byproduct of using max-normalization as opposed to L1 or L2 normalization.
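The per-portion max-normalization could look like the following sketch; `max_normalize` and `combined_features` are hypothetical names used here for illustration.

```python
import numpy as np

def max_normalize(block):
    """Scale a feature block so its largest absolute value is 1."""
    m = np.abs(block).max()
    return block / m if m > 0 else block

def combined_features(pixels, cells4, cells7):
    """Concatenate raw pixels with the 4-pixel and 7-pixel cell features,
    max-normalizing each of the three portions independently."""
    return np.concatenate([max_normalize(p) for p in (pixels, cells4, cells7)])

pixels = np.array([2.0, 4.0])
cells4 = np.array([10.0])
cells7 = np.array([0.5])
print(combined_features(pixels, cells4, cells7))
```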

Error rates vs. Number of Training Samples with 4×4 and 7×7 Non-overlapping Features

100 samples: 27.39%

200 samples: 22.72%

500 samples: 19.13%

1000 samples: 17.99%

2000 samples: 18.37%

5000 samples: 14.96%

10000 samples: 11.85%

Error rates vs. Number of Training Samples with 4×4 and 7×7 Overlapping Features

Overlap values of 2 for 4 × 4 and 4 for 7 × 7.

100 samples: 27.03%

200 samples: 22.09%

500 samples: 18.29%

1000 samples: 17.46%

2000 samples: 17.96%

5000 samples: 14.64%

10000 samples: 11.42%

Observations: the increase in accuracy is quite small, possibly because the feature vector is still dominated by the raw pixel values (which make up most of it), and because the cell features only reduce noise; but it is an improvement nonetheless. We also see a small gain from overlapping cells over non-overlapping ones, which makes sense: overlapping grids capture more information with a larger number of cells (and take up a larger portion of the feature vector). For a stronger increase in accuracy, though, we need a stronger feature than pixel intensity.

To get to state-of-the-art performance, we will need to go beyond pixel intensity. We will use a variant of
PHOG (Pyramid Histogram of Oriented Gradients). To construct the PHOG feature vector, we first compute
the gradient magnitude and orientation at each pixel in the image. We then construct "pyramid features" similar
to the spatial pyramids above, but instead of summing the pixel intensities in each cell we build a histogram of gradient orientations
in each cell. Orientation histograms typically have 8 to 10 bins, with angles folded into range by modular arithmetic.
The final feature vector is the concatenation of all the histograms from all the cells. Typically the histograms
are normalized (so that the bin counts sum to 1) before being concatenated.

Gradient orientation emphasizes the edges of the image: the stronger the gradient (as measured by its magnitude), the stronger the edge. Unlike classification on pixel intensities, which can fail when two images of the same digit sit in different parts of the image space, gradient orientation classifies the data by the overall contour of the digit learned from the training set (binned by edge orientation and weighted by edge strength).
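The per-cell orientation histograms can be sketched as below, using the [-1 0 1] tap filter for the gradients. The function names, the 9-bin count, and the unsigned-orientation (fold into [0, π)) choice are illustrative assumptions, not necessarily the project's exact settings.

```python
import numpy as np

def orientation_histogram(gx, gy, n_bins=9):
    """Magnitude-weighted histogram of gradient orientations, with angle
    and angle + pi folded into the same bin via modular arithmetic."""
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi                       # fold into [0, pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), mag.ravel())             # accumulate magnitudes
    return hist

def phog_cells(img, c, n_bins=9):
    """Per-cell orientation histograms, L2-normalized and concatenated."""
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]                 # tap filter [-1 0 1]
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    h, w = img.shape
    feats = []
    for i in range(0, h - c + 1, c):
        for j in range(0, w - c + 1, c):
            hist = orientation_histogram(gx[i:i+c, j:j+c], gy[i:i+c, j:j+c], n_bins)
            n = np.linalg.norm(hist)
            feats.append(hist / n if n > 0 else hist)
    return np.concatenate(feats)

img = np.tile(np.arange(8.0), (8, 1))   # simple horizontal ramp image
feats = phog_cells(img, 4)
print(feats.shape)
```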

Error rates vs. Number of Training Samples

PHOG feature vector computed with tap filter [-1 0 1] gradient over 4×4 and 7×7 cells.

L2 normalization is performed over orientation histograms.

100 samples: 13.95%

200 samples: 9.67%

500 samples: 6.27%

1000 samples: 4.98%

2000 samples: 3.63%

5000 samples: 3.05%

10000 samples: 2.58%

Observations: we see a significant boost in performance over the results from spatial pyramids. This makes sense considering that we are now using a much more sophisticated feature vector as opposed to the feature vector with spatial pyramids.

Error rates vs. Number of Training Samples

PHOG feature vector computed with tap filter [-1 0 1] gradient over 4×4 and 7×7 cells.

No normalization is performed over orientation histograms.

100 samples: 15.87%

200 samples: 9.41%

500 samples: 6.39%

1000 samples: 5.21%

2000 samples: 4.47%

5000 samples: 3.98%

10000 samples: 3.10%

Observations: we do see a very small drop in performance when we do not normalize the histograms before concatenation. This is likely because the feature blocks with larger overall gradient magnitudes (and hence larger norms) dominate those with smaller magnitudes. Since we are dealing with "pyramid features" of different cell sizes, this imbalance is inevitable.

Error rates vs. Number of Training Samples

PHOG feature vector computed with Gaussian derivative filter gradient over 4×4 and 7×7 cells.

L2 normalization is performed over orientation histograms.

100 samples: 14.44%

200 samples: 9.78%

500 samples: 6.04%

1000 samples: 5.64%

2000 samples: 4.14%

5000 samples: 3.21%

10000 samples: 2.45%

Observations: we get another small performance boost with the Gaussian derivative filter. The smoothing built into the filter eliminates some of the gradient noise that clouds classification.
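A Gaussian derivative gradient can be sketched as a pair of separable 1-D convolutions: differentiate along one axis while smoothing along the other. The kernel radius, σ, and function names below are illustrative assumptions.

```python
import numpy as np

def gaussian_derivative_kernels(sigma=1.0, radius=2):
    """Return 1-D Gaussian and Gaussian-derivative kernels."""
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x**2 / (2 * sigma**2))
    g /= g.sum()
    dg = -x / sigma**2 * g                      # derivative of the Gaussian
    return g, dg

def gaussian_gradient(img, sigma=1.0):
    """Smoothed gradients: derivative filter along one axis,
    Gaussian smoothing along the other (separable filtering)."""
    g, dg = gaussian_derivative_kernels(sigma)
    conv = lambda row, k: np.convolve(row, k, mode="same")
    gx = np.apply_along_axis(conv, 1, img, dg)  # differentiate horizontally
    gx = np.apply_along_axis(conv, 0, gx, g)    # smooth vertically
    gy = np.apply_along_axis(conv, 0, img, dg)  # differentiate vertically
    gy = np.apply_along_axis(conv, 1, gy, g)    # smooth horizontally
    return gx, gy

# On a horizontal ramp, gx is positive in the interior and gy is ~0 there.
gx, gy = gaussian_gradient(np.tile(np.arange(8.0), (8, 1)))
print(gx[4, 4])
```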

Left image: error cases from using a Tap Filter. Right image: error cases from using a Gaussian derivative filter

The system I have built is now almost on par with the state of the art as far as features are concerned. To get the full boost, however (in addition to raising the number of training samples used), we will have to shift to a non-linear kernel, such as an intersection kernel. I'll save that for a later date. :)