Lecture 1.3: Interpreting results, descriptive statistics
- descriptive statistics [Lar82], sample mean and sample variance
After collecting data in a simulation experiment, we often want to
calculate some statistics to characterize the results, typically estimates
of the mean and variance of certain observed quantities. If you measure
the tail length of 10 laboratory mice, for example, you might very naturally
calculate the average length, and the average squared deviation from the
average length.
The average length is more properly called the sample mean,
and the average squared deviation from the sample mean, the sample variance.
Here's a point that's sometimes shrouded in mystery, but is actually
simple to understand: When we sum the squares of the differences between
the observations and the sample mean, we should divide by (N-1), not N,
where there are N observations. Intuitively, the reason is that
there are only (N-1) degrees of freedom among the N deviations from the sample
mean, because the deviations must sum to zero. The algebra shows that the sum
of the squared deviations divided by (N-1) has the right expected value,
namely the variance of the observed quantity, and is thus an unbiased
estimate of the true variance.
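As a concrete illustration, here is a small sketch in Python (assuming NumPy is
available; the tail lengths are invented purely for illustration) that computes
the sample mean and the (N-1)-divisor sample variance:

  import numpy as np

  # Hypothetical tail lengths (cm) for 10 laboratory mice (invented numbers)
  lengths = np.array([7.2, 6.9, 7.5, 7.1, 6.8, 7.4, 7.0, 7.3, 6.7, 7.6])

  N = len(lengths)
  sample_mean = lengths.sum() / N

  # Sum of squared deviations from the sample mean, divided by (N-1),
  # gives an unbiased estimate of the variance.
  deviations = lengths - sample_mean
  sample_var = (deviations ** 2).sum() / (N - 1)

  print(sample_mean, sample_var)
  # NumPy's built-in agrees when ddof=1 selects the (N-1) divisor:
  print(lengths.mean(), lengths.var(ddof=1))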
If the observed random variable has mean mu and variance sigma^2, and
we take N independent samples, the sample mean is an unbiased estimator
of mu (its expected value is mu), and its variance is (1/N)*sigma^2.
The square root of this, the standard deviation
of the sample mean, thus decreases in proportion to 1/sqrt(N), the reciprocal
of the square root of the number of observations. This holds in many
practical situations: you need a hundred times as many observations
to decrease the standard deviation of the sample mean by a factor of ten.
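This 1/sqrt(N) behavior is easy to check numerically; a minimal sketch in
Python, assuming NumPy and with arbitrarily chosen mean, standard deviation,
and sample sizes:

  import numpy as np

  rng = np.random.default_rng(0)
  sigma = 2.0          # true standard deviation of the observed quantity
  trials = 10000       # number of repeated experiments per sample size

  for N in (10, 100, 1000):
      # Each row is one experiment of N observations; take the mean of each row.
      sample_means = rng.normal(loc=5.0, scale=sigma, size=(trials, N)).mean(axis=1)
      # The spread of the sample means should be close to sigma / sqrt(N).
      print(N, sample_means.std(ddof=1), sigma / np.sqrt(N))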
- Importance of Gaussian (normal) distribution
The Gaussian, or normal, distribution plays a fundamental
role in probability theory and statistics. One reason is that the sum of
independent observations tends to this distribution rather quickly in practice.
Under broad conditions this can be proved mathematically, and the result
is called the Central Limit Theorem. Many observed variables in
nature are in fact the sum of many independent random effects, and are
very well approximated by a Gaussian random variable.
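One way to watch the Central Limit Theorem in action is to sum a modest number
of independent uniform random variables and see how Gaussian the result already
looks; a sketch in Python (the choice of 12 terms and 100000 repetitions is
arbitrary):

  import numpy as np

  rng = np.random.default_rng(1)

  # Sum 12 independent uniform(0,1) variables, many times over.
  # Each uniform has mean 1/2 and variance 1/12, so the sum has
  # mean 6 and variance 1, and should look very nearly Gaussian.
  sums = rng.uniform(0.0, 1.0, size=(100000, 12)).sum(axis=1)

  print(sums.mean(), sums.var(ddof=1))            # close to 6 and 1
  # Fraction within one standard deviation of the mean; about 0.683 for a Gaussian:
  print(np.mean(np.abs(sums - 6.0) < 1.0))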
It's common for experimentalists to assume that the variables, and especially
the noise, they are dealing with are roughly Gaussian when estimating confidence
intervals. That's OK, as long as you are aware of the assumption and are
prepared to think about exceptional cases where you might be led astray.
Important properties of the Gaussian:
- linear combinations are also Gaussian
- has maximum entropy for a given variance; that is, it is the ``most random''
distribution with that variance
- least-squares estimates are maximum-likelihood estimates when the errors are Gaussian
- many derived random variables have analytically known densities,
like chi-squared and Student t (see below)
- the sample mean and the sample variance of N independent, identically
distributed samples are independent, and the sample mean is also normal,
with the same mean as the parent distribution and (1/N) times the variance.
- distributions derived from Gaussian
Certain statistics derived from independent samples of a Gaussian have
special, well-understood distributions which can be computed
fairly easily. (They used to be published in big fat tables.) The two
most important are
- the normalized sample variance (the sum of the squared deviations from the
sample mean, divided by the true variance) of N independent samples from a
Gaussian is chi-squared distributed with N-1 degrees of freedom
- the deviation of the sample mean from the true mean, divided by
(sample standard deviation)/sqrt(N), is Student t distributed with N-1
degrees of freedom (see the sketch after this list).
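A sketch of both statistics in Python (assuming SciPy is available; mu, sigma,
and N are made-up parameters), forming them from simulated Gaussian samples and
evaluating their theoretical distributions:

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(2)
  mu, sigma, N = 3.0, 1.5, 8           # made-up parameters for illustration

  x = rng.normal(mu, sigma, size=N)
  xbar = x.mean()
  s2 = x.var(ddof=1)                   # (N-1)-divisor sample variance

  # Normalized sample variance: chi-squared with N-1 degrees of freedom.
  chi2_stat = (N - 1) * s2 / sigma**2

  # Normalized deviation of the sample mean: Student t with N-1 degrees of freedom.
  t_stat = (xbar - mu) / np.sqrt(s2 / N)

  # Probability of a value at least this extreme under each distribution:
  print(stats.chi2.sf(chi2_stat, df=N - 1))
  print(2 * stats.t.sf(abs(t_stat), df=N - 1))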
- confidence intervals and interpretation of results
If you plot, say, the sample mean of a number of measurements, it would
be very nice if you could say roughly how accurate your estimate is ---
that is, how far you can reasonably expect your estimate to differ from
the true mean. These "error bars" are expected in your graphs as a matter
of course in many experimental disciplines.
If you assume that the samples you are measuring are independent samples
from a Gaussian distribution, the distributions above enable you to estimate
confidence intervals. Take the sample mean, for example. Its (normalized)
deviation from the true mean has a known distribution (Student t), so you
can look up two values, say L and R, with the property that the normalized
deviation falls within the interval [L,R] with probability 99%. Converting
back to the original units gives an interval around the sample mean, which
you can then plot around your data point, giving the viewer an estimate
of how accurate the observation is. (The same idea works for the sample
variance if you use the chi-squared distribution.)
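For example, a 99% confidence interval for the mean might be computed along the
following lines; this is a sketch assuming SciPy, and the measurements are
invented:

  import numpy as np
  from scipy import stats

  x = np.array([4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 4.7, 5.4])   # invented measurements
  N = len(x)
  xbar = x.mean()
  s = x.std(ddof=1)                    # sample standard deviation, (N-1) divisor

  # Student t critical value for a two-sided 99% interval, N-1 degrees of freedom.
  t_crit = stats.t.ppf(0.995, df=N - 1)
  half_width = t_crit * s / np.sqrt(N)

  # These endpoints are the error bars to plot around the sample mean.
  print(xbar - half_width, xbar + half_width)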