Sample variance estimation formula: Why is (n-1) to be used instead of n as the denominator?
Let Y1, Y2, .., Yn be the sample data points for the random variable Y. Let μY and σY denote the mean and standard deviation of Y respectively, so that the variance of Y is σY². Assume that the data points are drawn independently of one another. This assumption usually holds true in case of sampling with replacement. Hence, E(YiYj) = E(Yi)E(Yj) = μY² for any i, j ∈ {1, .., n} such that i ≠ j.
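The estimator for the mean of Y is simply the average value of the data points, i.e.
$$\tilde{Y}_{\mathrm{est}} = \mathrm{Mean}_{\mathrm{est}}(Y) = \frac{1}{n}\sum_{i=1}^{n} Y_i$$
which seems intuitive. By the same reasoning, it would seem that the best estimate for the variance of Y would be
$$\mathrm{Var}_{\mathrm{est}}(Y) = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \tilde{Y}_{\mathrm{est}}\right)^2$$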
Let us compute the expected value of this estimated variance, since the estimate is itself a random variable that depends on the values of the data points Yi in the sample. Let A denote the variance estimator that uses n as the denominator. Let us denote by Sn the numerator in the formula for Ỹest, i.e.
$$S_n = \sum_{i=1}^{n} Y_i$$
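The intermediate steps between the definition of Sn and the conclusion can be sketched as follows. This is a reconstruction under the independence assumption stated earlier, using the identity E(Yi²) = σY² + μY². The circled labels are assigned here for illustration only, chosen so that the last line is the result referred to as equation ⑤.

$$
\begin{aligned}
A &= \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \frac{S_n}{n}\right)^2 = \frac{1}{n}\sum_{i=1}^{n} Y_i^2 - \frac{S_n^2}{n^2} && \text{①}\\
E(Y_i^2) &= \sigma_Y^2 + \mu_Y^2 && \text{②}\\
E(S_n^2) &= \sum_{i=1}^{n} E(Y_i^2) + \sum_{i \neq j} E(Y_i)E(Y_j) = n(\sigma_Y^2 + \mu_Y^2) + n(n-1)\mu_Y^2 && \text{③}\\
E(A) &= (\sigma_Y^2 + \mu_Y^2) - \frac{n\sigma_Y^2 + n^2\mu_Y^2}{n^2} && \text{④}\\
E(A) &= \frac{n-1}{n}\,\sigma_Y^2 && \text{⑤}
\end{aligned}
$$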
Thus, the estimated variance has a mean value that is less than the actual variance, i.e. A is a biased estimator. From equation ⑤, E(A) = ((n - 1)/n) σY², it is clear that if the estimator A is multiplied by n/(n - 1) (which is a linear transformation of A), then the resulting estimator Amod will have a mean value that equals the actual variance. Thus, the unbiased estimator of variance is:
$$A_{\mathrm{mod}} = \frac{n}{n-1}\,A = \frac{1}{n-1}\sum_{i=1}^{n}\left(Y_i - \tilde{Y}_{\mathrm{est}}\right)^2$$
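The bias can also be checked numerically. The following is a minimal simulation sketch, not part of the original argument: it assumes normally distributed data with mean 0 and σY = 2 (so the actual variance is 4), a sample size of n = 5, and a fixed random seed. The function names `variance_estimate` and `simulate` are hypothetical, chosen here for illustration.

```python
import random


def variance_estimate(sample, denominator):
    """Sum of squared deviations from the sample mean, divided by `denominator`."""
    mean = sum(sample) / len(sample)
    return sum((y - mean) ** 2 for y in sample) / denominator


def simulate(n=5, sigma=2.0, trials=20000, seed=0):
    """Average the n-denominator and (n - 1)-denominator estimators over many samples.

    Assumption: data points are independent draws from a normal
    distribution with mean 0 and standard deviation `sigma`.
    """
    rng = random.Random(seed)
    total_a = 0.0      # running sum of A (denominator n)
    total_amod = 0.0   # running sum of Amod (denominator n - 1)
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        total_a += variance_estimate(sample, n)
        total_amod += variance_estimate(sample, n - 1)
    return total_a / trials, total_amod / trials


mean_a, mean_amod = simulate()
print("average of A    :", mean_a)
print("average of Amod :", mean_amod)
```

With these assumed settings, the average of A should land near ((n - 1)/n) σY² = 3.2, while the average of Amod should land near the actual variance of 4. The exact printed values depend on the seed and the number of trials.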