Lecture 1.3: Interpreting results, descriptive statistics
- descriptive statistics [Lar82], sample mean and sample variance
After collecting data in a simulation experiment, we often want to
calculate some statistics to characterize the results, typically estimates
of the mean and variance of certain observed quantities. If you measure
the tail length of 10 laboratory mice, for example, you might very naturally
calculate the average length, and the average squared deviation from the
average length.
The average length is more properly called the sample mean,
and the average squared deviation from the sample mean, the sample variance.
Here's a point that's sometimes shrouded in mystery, but is actually
simple to understand: When we sum the squares of the differences between
the observations and the sample mean, we should divide by (N-1), not N,
where there are N observations. Intuitively, the reason is that
there are only (N-1) degrees of freedom among the N deviations from the sample
mean, because the deviations must sum to zero. The algebra shows that the sum
of the squared deviations divided by (N-1) has the right expected value,
namely the variance of the observed quantity, and is thus an unbiased
estimate of the true variance.
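As a concrete illustration, here is a small sketch in Python (assuming NumPy is
available; the tail lengths are invented purely for illustration) that computes
the sample mean and the (N-1)-divisor sample variance:

  import numpy as np

  # Hypothetical tail lengths (cm) for 10 laboratory mice (invented numbers)
  lengths = np.array([7.2, 6.9, 7.5, 7.1, 6.8, 7.4, 7.0, 7.3, 6.7, 7.6])

  N = len(lengths)
  sample_mean = lengths.sum() / N

  # Sum of squared deviations from the sample mean, divided by (N-1),
  # gives an unbiased estimate of the variance.
  deviations = lengths - sample_mean
  sample_var = (deviations ** 2).sum() / (N - 1)

  print(sample_mean, sample_var)
  # NumPy's built-in agrees when ddof=1 selects the (N-1) divisor:
  print(lengths.mean(), lengths.var(ddof=1))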
If the observed random variable has mean mu and variance sigma^2, and
we take N independent samples, the sample mean is an unbiased estimator
of mu (its expected value is mu), and its variance is (1/N)*sigma^2.
The square root of this, the standard deviation
of the sample mean, thus decreases in proportion to 1/sqrt(N), the reciprocal
of the square root of the number of observations. This holds in many
practical situations: you need a hundred times as many observations
to decrease the standard deviation of the sample mean by a factor of ten.
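This 1/sqrt(N) behavior is easy to check numerically; a minimal sketch in
Python, assuming NumPy and with arbitrarily chosen mean, standard deviation,
and sample sizes:

  import numpy as np

  rng = np.random.default_rng(0)
  sigma = 2.0          # true standard deviation of the observed quantity
  trials = 10000       # number of repeated experiments per sample size

  for N in (10, 100, 1000):
      # Each row is one experiment of N observations; take the mean of each row.
      sample_means = rng.normal(loc=5.0, scale=sigma, size=(trials, N)).mean(axis=1)
      # The spread of the sample means should be close to sigma / sqrt(N).
      print(N, sample_means.std(ddof=1), sigma / np.sqrt(N))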
- Importance of Gaussian (normal) distribution
The Gaussian, or normal, distribution plays a fundamental
role in probability theory and statistics. One reason is that the sum of
independent observations tends to this distribution rather quickly in practice.
Under broad conditions this can be proved mathematically, and the result
is called the Central Limit Theorem. Many observed variables in
nature are in fact the sum of many independent random effects, and are
very well approximated by a Gaussian random variable.
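One way to watch the Central Limit Theorem in action is to sum a modest number
of independent uniform random variables and see how Gaussian the result already
looks; a sketch in Python (the choice of 12 terms and 100000 repetitions is
arbitrary):

  import numpy as np

  rng = np.random.default_rng(1)

  # Sum 12 independent uniform(0,1) variables, many times over.
  # Each uniform has mean 1/2 and variance 1/12, so the sum has
  # mean 6 and variance 1, and should look very nearly Gaussian.
  sums = rng.uniform(0.0, 1.0, size=(100000, 12)).sum(axis=1)

  print(sums.mean(), sums.var(ddof=1))            # close to 6 and 1
  # Fraction within one standard deviation of the mean; about 0.683 for a Gaussian:
  print(np.mean(np.abs(sums - 6.0) < 1.0))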
It's common for experimentalists to assume that the variables, and especially
the noise, they are dealing with are roughly Gaussian when estimating confidence
intervals. That's OK, as long as you are aware of the assumption and are
prepared to think about exceptional cases where you might be led astray.
Important properties of the Gaussian:
- linear combinations are also Gaussian
- has maximum entropy for a given variance; that is, it is the ``most random''
distribution with that variance
- least-squares estimates are maximum-likelihood estimates when the errors are Gaussian
- many derived random variables have analytically known densities,
like chi-squared and Student t (see below)
- the sample mean and the sample variance of N independent, identically
distributed samples are independent, and the sample mean is also normal,
with the same mean as the parent distribution and (1/N) times the variance.
- distributions derived from Gaussian
Certain statistics derived from independent samples of a Gaussian have
special, well-understood distributions which can be computed
fairly easily. (They used to be published in big fat tables.) The two
most important are
- the normalized sample variance (the sum of the squared deviations from the
sample mean, divided by the true variance) of N independent samples from a
Gaussian is chi-squared distributed with N-1 degrees of freedom
- the deviation of the sample mean from the true mean, divided by
(sample standard deviation)/sqrt(N), is Student t distributed with N-1
degrees of freedom (see the sketch after this list).
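A sketch of both statistics in Python (assuming SciPy is available; mu, sigma,
and N are made-up parameters), forming them from simulated Gaussian samples and
evaluating their theoretical distributions:

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(2)
  mu, sigma, N = 3.0, 1.5, 8           # made-up parameters for illustration

  x = rng.normal(mu, sigma, size=N)
  xbar = x.mean()
  s2 = x.var(ddof=1)                   # (N-1)-divisor sample variance

  # Normalized sample variance: chi-squared with N-1 degrees of freedom.
  chi2_stat = (N - 1) * s2 / sigma**2

  # Normalized deviation of the sample mean: Student t with N-1 degrees of freedom.
  t_stat = (xbar - mu) / np.sqrt(s2 / N)

  # Probability of a value at least this extreme under each distribution:
  print(stats.chi2.sf(chi2_stat, df=N - 1))
  print(2 * stats.t.sf(abs(t_stat), df=N - 1))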
- confidence intervals and interpretation of results
If you plot, say, the sample mean of a number of measurements, it would
be very nice if you could say roughly how accurate your estimate is ---
that is, how far you can reasonably expect your estimate to differ from
the true mean. These "error bars" are expected in your graphs as a matter
of course in many experimental disciplines.
If you assume that the samples you are measuring are independent samples
from a Gaussian distribution, the distributions above enable you to estimate
confidence intervals. Take the sample mean, for example. Its (normalized)
deviation from the true mean has a known distribution (Student t), so you
can look up two values, say L and R, with the property that the normalized
deviation falls within the interval [L,R] with probability 99%. Converting
back to the original units gives an interval around the sample mean, which
you can then plot around your data point, giving the viewer an estimate
of how accurate the observation is. (The same idea works for the sample
variance if you use the chi-squared distribution.)
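For example, a 99% confidence interval for the mean might be computed along the
following lines; this is a sketch assuming SciPy, and the measurements are
invented:

  import numpy as np
  from scipy import stats

  x = np.array([4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 4.7, 5.4])   # invented measurements
  N = len(x)
  xbar = x.mean()
  s = x.std(ddof=1)                    # sample standard deviation, (N-1) divisor

  # Student t critical value for a two-sided 99% interval, N-1 degrees of freedom.
  t_crit = stats.t.ppf(0.995, df=N - 1)
  half_width = t_crit * s / np.sqrt(N)

  # These endpoints are the error bars to plot around the sample mean.
  print(xbar - half_width, xbar + half_width)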