Statistical Classification Example

You're building a simple biometric system based on measurements of the human hand.
Your system is responsible for differentiating between male and female hands.

Your hardware engineer has already designed and built a feature extractor which gives three numbers for each handscan. You've collected five training data tokens for each of the sexes:

Male handscan training data
            L1    P1    T1
subjectM1   3.51  2.52  2.49
subjectM2   3.54  2.51  2.51
subjectM3   3.47  2.51  2.50
subjectM4   3.51  2.51  2.49
subjectM5   3.55  2.54  2.53
Female handscan training data
            L1    P1    T1
subjectF1   2.89  2.21  2.17
subjectF2   2.95  2.20  2.29
subjectF3   3.12  2.25  2.35
subjectF4   2.75  2.07  1.99
subjectF5   3.05  2.14  2.21

a. Using a Euclidean linear statistical classifier, compute how the following unknown
input handscan feature vectors would be classified:

      L1    P1    T1
x1    3.30  2.70  2.20
x2    2.90  1.99  2.20
x3    2.55  2.31  2.40

b. Take a ruler and place it between your middle and index fingers. Push a little firmly on the ruler to depress the skin between those fingers, then measure the length of your middle finger. That's L1 in the handscan feature set. Do the same for your pinkie, that's P1. Do the same for your thumb, that's T1. Compute what the Euclidean classifier would yield as a decision based on your hand metrics. Was it correct as to your sex? (don't freak out if not, I made up the data :).

c. Compute the covariance matrix for the Male handscan training data.

d. What does this matrix tell you about the features in that region?

e. If you were forced to pick only two features from the existing three, which ones would you pick? Why?
Plot the training data in 2D based on your selected two features.
Plot and label the Male and Female mean vectors.
Plot and label the unknown input data points x1, x2, and x3, and your own measurement point.


Solutions

Training Data:
        Male L1  Male P1  Male T1    Fem L1   Fem P1   Fem T1
M/F1    3.51     2.52     2.49       2.89     2.21     2.17
M/F2    3.54     2.51     2.51       2.95     2.20     2.29
M/F3    3.47     2.51     2.50       3.12     2.25     2.35
M/F4    3.51     2.51     2.49       2.75     2.07     1.99
M/F5    3.55     2.54     2.53       3.05     2.14     2.21

Means   3.516    2.518    2.504      2.952    2.174    2.202
Stdevs  0.0313   0.0130   0.0167     0.1436   0.0702   0.1375
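
The Means and Stdevs rows can be checked with a few lines of Python. This is only a sketch using the standard library; the dictionary name `columns` and its keys are labels chosen here, and `statistics.stdev` is the sample standard deviation (n-1 divisor), which is assumed to be what the table uses.

```python
import statistics

# Training data by column, copied from the table above.
columns = {
    "Male L1": [3.51, 3.54, 3.47, 3.51, 3.55],
    "Male P1": [2.52, 2.51, 2.51, 2.51, 2.54],
    "Male T1": [2.49, 2.51, 2.50, 2.49, 2.53],
    "Fem L1": [2.89, 2.95, 3.12, 2.75, 3.05],
    "Fem P1": [2.21, 2.20, 2.25, 2.07, 2.14],
    "Fem T1": [2.17, 2.29, 2.35, 1.99, 2.21],
}

# (mean, sample standard deviation) for every column.
summary = {
    name: (statistics.mean(vals), statistics.stdev(vals))
    for name, vals in columns.items()
}

for name, (mean, sd) in summary.items():
    print(f"{name:8s} mean={mean:.3f}  stdev={sd:.4f}")
```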


a)  We calculate the Euclidean distance from each unknown to the male
and female mean vectors.  Then we assign each unknown input handscan
to the class whose mean is nearer (minimum distance).

Unknown Data    L1      P1      T1    DistM3D   DistF3D  Class
        x1      3.3     2.7     2.2   0.41496   0.63070   Male     
        x2      2.9     1.99    2.2   0.86640   0.19121   Fem      
        x3      2.55    2.31    2.4   0.99359   0.46829   Fem      
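
The minimum-distance rule above can be sketched in Python as follows. The function names `mean_vector` and `classify` are made up for this illustration, and `math.dist` requires Python 3.8 or later.

```python
import math

# Training data from the problem statement: (L1, P1, T1) per subject.
male = [(3.51, 2.52, 2.49), (3.54, 2.51, 2.51), (3.47, 2.51, 2.50),
        (3.51, 2.51, 2.49), (3.55, 2.54, 2.53)]
female = [(2.89, 2.21, 2.17), (2.95, 2.20, 2.29), (3.12, 2.25, 2.35),
          (2.75, 2.07, 1.99), (3.05, 2.14, 2.21)]
unknowns = {"x1": (3.30, 2.70, 2.20),
            "x2": (2.90, 1.99, 2.20),
            "x3": (2.55, 2.31, 2.40)}

def mean_vector(rows):
    """Column-wise mean of a list of equal-length tuples."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))

def classify(x, means):
    """Return (label, distances) for the class mean nearest to x."""
    dists = {label: math.dist(x, m) for label, m in means.items()}
    return min(dists, key=dists.get), dists

means = {"Male": mean_vector(male), "Fem": mean_vector(female)}
results = {name: classify(x, means) for name, x in unknowns.items()}

for name, (label, dists) in results.items():
    print(f"{name}  DistM={dists['Male']:.5f}  DistF={dists['Fem']:.5f}  {label}")
```

To try part b, add your own (L1, P1, T1) measurement to `unknowns`, using the same units as the training data.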

b) Answers depend on your own finger measurements.  Expect some
mis-classifications, since the training data is made up.

c) Compute the Covariance Matrix:
   c(x,y) = ( (x[0]-mean_x)*(y[0]-mean_y) + ... +
               (x[n-1]-mean_x) * (y[n-1]-mean_y)) / (n-1)

        L         P         T
L   0.000980  0.000240  0.000345
P   0.000240  0.000170  0.000160
T   0.000345  0.000160  0.000280
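
The c(x,y) formula translates directly into Python. This is a minimal sketch; `sample_cov` and the short keys "L", "P", "T" are names picked for this example.

```python
# Male training data, one list per feature.
male = {
    "L": [3.51, 3.54, 3.47, 3.51, 3.55],
    "P": [2.52, 2.51, 2.51, 2.51, 2.54],
    "T": [2.49, 2.51, 2.50, 2.49, 2.53],
}

def sample_cov(x, y):
    """Sample covariance c(x,y) with the (n-1) divisor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

features = list(male)
cov = {(a, b): sample_cov(male[a], male[b]) for a in features for b in features}

# Print the matrix in the same L/P/T layout as above.
print("   " + "".join(f"{b:>10}" for b in features))
for a in features:
    print(f"{a:<3}" + "".join(f"{cov[(a, b)]:10.6f}" for b in features))
```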

d) What does it tell us?   Technically, NOTHING.  There aren't
really enough training vectors to reliably say much about
the covariance.  If it did tell us something, however, it would
be that the Long finger varies the most and the Pinkie the least
(main diagonal; we knew this anyway from the standard deviations
we calculated above).  To see how things co-vary, we need to divide
each covariance by the product of the two standard deviations
involved, then inspect.

    SD(L1) = sqrt( c(L1,L1) ) = .0313
    SD(P1) = sqrt( c(P1,P1) ) = .0130
    SD(T1) = sqrt( c(T1,T1) ) = .0167

Now we can find the correlation coefficient cc between each pair, defined by:

c(x,y) = cc * SD(x) * SD(y)

cc( L1, P1 ) = .588
cc( L1, T1 ) = .659
cc( P1, T1 ) = .733

From this, it seems that L1 and P1 are the LEAST correlated,
and P1 and T1 are the MOST correlated.
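
As a quick check, the relation c(x,y) = cc * SD(x) * SD(y) can be solved for cc in Python. This is a sketch with helper names (`sample_cov`, `correlation`) invented here. Working from unrounded standard deviations, the three values come out near 0.59, 0.66, and 0.73, so the ordering of the pairs is the same.

```python
import math

# Male training data, one list per feature.
male = {
    "L1": [3.51, 3.54, 3.47, 3.51, 3.55],
    "P1": [2.52, 2.51, 2.51, 2.51, 2.54],
    "T1": [2.49, 2.51, 2.50, 2.49, 2.53],
}

def sample_cov(x, y):
    """Sample covariance with the (n-1) divisor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """cc = c(x,y) / (SD(x) * SD(y)), where SD(x) = sqrt(c(x,x))."""
    return sample_cov(x, y) / math.sqrt(sample_cov(x, x) * sample_cov(y, y))

pairs = [("L1", "P1"), ("L1", "T1"), ("P1", "T1")]
cc = {pair: correlation(male[pair[0]], male[pair[1]]) for pair in pairs}

for (a, b), value in cc.items():
    print(f"cc( {a}, {b} ) = {value:.3f}")
```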

e) NOTE:  the covariance matrix and correlation coefficients were
only calculated using the male data, so they don't necessarily
tell us much about which feature to throw away.

To do that, the best method is "withholding," throwing out each
feature one at a time and seeing which pairs still give us a
"correct" (unchanged from the 3D case) classification.

Unknown  DistM-LP     DistF-LP     Class
x1       0.282453536  0.630698026  Male   CORRECT
x2       0.811319912  0.191206694  Fem    CORRECT
x3       0.988139666  0.424381903  Fem    CORRECT

Unknown  DistM-LT     DistF-LT     Class
x1       0.372923585  0.348005747  Fem    WRONG!!
x2       0.6869294    0.052038447  Fem    CORRECT
x3       0.971582215  0.448116056  Fem    CORRECT

Unknown  DistM-PT     DistF-PT     Class
x1       0.354316243  0.526003802  Male   CORRECT
x2       0.609261848  0.184010869  Fem    CORRECT
x3       0.23255107   0.240208243  Male   WRONG!!
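
The withholding experiment can be sketched as a short loop over feature pairs. The names `classify`, `baseline`, and `agreement` are invented for this example, the mean vectors are the rounded ones from the training-data table, and "unchanged" here means only that a pair reproduces the 3D decision for all three unknowns.

```python
import math
from itertools import combinations

FEATURES = ("L", "P", "T")
male_mean = (3.516, 2.518, 2.504)    # class means from the training data
female_mean = (2.952, 2.174, 2.202)
unknowns = {"x1": (3.30, 2.70, 2.20),
            "x2": (2.90, 1.99, 2.20),
            "x3": (2.55, 2.31, 2.40)}

def classify(x, keep):
    """Nearest-mean class using only the feature indices in `keep`."""
    dm = math.dist([x[i] for i in keep], [male_mean[i] for i in keep])
    df = math.dist([x[i] for i in keep], [female_mean[i] for i in keep])
    return "Male" if dm < df else "Fem"

# Reference decisions using all three features.
baseline = {name: classify(x, (0, 1, 2)) for name, x in unknowns.items()}

# Withhold one feature at a time and compare against the 3D decisions.
agreement = {}
for keep in combinations(range(3), 2):
    pair = "".join(FEATURES[i] for i in keep)
    labels = {name: classify(x, keep) for name, x in unknowns.items()}
    agreement[pair] = all(labels[n] == baseline[n] for n in unknowns)
    print(pair, labels, "unchanged" if agreement[pair] else "changed")
```

With only three unknowns this is a very small experiment, so the result should be read as a hint rather than proof that a feature is redundant.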

So from this experiment, we could throw away the
thumb dimension/measurement and still get the
same classification results.  Note that the covariance
matrix only told us that the thumb and the pinkie
were the most highly correlated pair; withholding
told us which one we could throw away.