Sunday, November 11, 2018

Basic statistics - Some notes and proofs


Sample variance estimation formula: Why is (n-1) to be used instead of n as the denominator?



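Let $Y_1, Y_2, \ldots, Y_n$ be the sample data points for the random variable $Y$. Let $\mu_Y$ and $\sigma_Y$ denote the mean and standard deviation of $Y$ respectively. Assume that the data points are drawn independently of one another, each with the same distribution as $Y$. This assumption usually holds in the case of sampling with replacement. Hence, $E(Y_i Y_j) = E(Y_i)E(Y_j)$ for any $i, j \in \{1, \ldots, n\}$ such that $i \ne j$.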
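
The role of sampling with replacement can be checked on a small case. The Python sketch below assumes a hypothetical four-value population from which Y is drawn uniformly; the values and the helper name `mean` are illustrative only. It computes E(YiYj) exactly by enumerating every equally likely ordered pair, once with replacement and once without.

```python
from fractions import Fraction
from itertools import permutations, product

# Hypothetical population (an assumption for illustration);
# Y is one value drawn uniformly from it.
population = [1, 2, 4, 7]


def mean(values):
    """Exact arithmetic mean, returned as a Fraction."""
    values = list(values)
    return sum(values, Fraction(0)) / len(values)


mu = mean(population)  # E(Yi) = E(Yj) = mu under both sampling schemes

# With replacement: every ordered pair (Yi, Yj) is equally likely.
e_with = mean(a * b for a, b in product(population, repeat=2))

# Without replacement: only ordered pairs of distinct items can occur.
e_without = mean(a * b for a, b in permutations(population, 2))

print(mu * mu)    # 49/4
print(e_with)     # 49/4
print(e_without)  # 21/2
```

With replacement the product of expectations is reproduced exactly. Without replacement the draws are dependent, so E(YiYj) differs from E(Yi)E(Yj) for this population.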

The estimator for the mean of $Y$ is simply the average value of the data points, i.e.

$$\mathrm{Mean}_{est}(Y) = \frac{1}{n}\sum_{i=1}^{n} Y_i$$

which seems intuitive. By the same reasoning, it would seem that the best estimate for the variance of $Y$ would be

$$\mathrm{Var}_{est}(Y) = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \mathrm{Mean}_{est}(Y)\right)^2$$

Let us compute the expected value of this variance estimator, since the estimator is itself a random variable that depends on the values of the data points $Y_i$ in the sample. Let $A$ denote the variance estimator that uses $n$ as the denominator. Let us denote by $S_n$ the numerator in the formula for $\mathrm{Mean}_{est}(Y)$, i.e.

$$S_n = \sum_{i=1}^{n} Y_i$$

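Now, by definition,

$$A = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \frac{S_n}{n}\right)^2 = \frac{1}{n}\sum_{i=1}^{n} Y_i^2 - \frac{S_n^2}{n^2}$$

Since $E(Y_i^2) = \sigma_Y^2 + \mu_Y^2$ for every $i$, and $E(Y_i Y_j) = E(Y_i)E(Y_j) = \mu_Y^2$ for $i \ne j$,

$$E(S_n^2) = \sum_{i=1}^{n} E(Y_i^2) + \sum_{i \ne j} E(Y_i Y_j) = n(\sigma_Y^2 + \mu_Y^2) + n(n-1)\mu_Y^2$$

Taking the expected value of $A$ and substituting,

$$E(A) = (\sigma_Y^2 + \mu_Y^2) - \frac{n(\sigma_Y^2 + \mu_Y^2) + n(n-1)\mu_Y^2}{n^2} = \frac{n-1}{n}\,\sigma_Y^2 \tag{⑤}$$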

Thus, the estimated variance has a mean value that is less than the actual variance. From equation ⑤, it is clear that if the estimator $A$ is multiplied by $n/(n-1)$ (which is a linear transformation of $A$, so its mean value is scaled by the same factor), then the resulting estimator $A_{mod}$ will have a mean value that equals the actual variance. Thus, the unbiased estimator of variance is:

$$A_{mod} = \frac{n}{n-1}\,A = \frac{1}{n-1}\sum_{i=1}^{n}\left(Y_i - \mathrm{Mean}_{est}(Y)\right)^2$$
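
The bias can also be checked exactly on a small case. The Python sketch below assumes a hypothetical four-value population sampled with replacement; the values, the sample size n = 3, and the helper names are illustrative and not part of the derivation. It enumerates all equally likely samples and averages both the estimator with denominator n and its rescaling by n/(n - 1).

```python
from fractions import Fraction
from itertools import product

# Hypothetical population and sample size (assumptions for illustration);
# Y is one value drawn uniformly from the population.
population = [1, 2, 4, 7]
n = 3


def mean(values):
    """Exact arithmetic mean, returned as a Fraction."""
    values = list(values)
    return sum(values, Fraction(0)) / len(values)


def var_n(sample):
    """Variance estimator A: squared deviations divided by n."""
    m = mean(sample)
    return sum((y - m) ** 2 for y in sample) / len(sample)


def var_n_minus_1(sample):
    """Rescaled estimator: A multiplied by n / (n - 1)."""
    return var_n(sample) * Fraction(len(sample), len(sample) - 1)


# True mean and variance of Y.
mu = mean(population)
sigma2 = mean((y - mu) ** 2 for y in population)

# All samples of size n drawn with replacement, each equally likely.
samples = list(product(population, repeat=n))

expected_a = mean(var_n(s) for s in samples)
expected_a_mod = mean(var_n_minus_1(s) for s in samples)

print(sigma2)          # 21/4
print(expected_a)      # 7/2, which is (n - 1)/n times 21/4
print(expected_a_mod)  # 21/4
```

Exact fractions are used instead of random simulation so that the averages match the theoretical values with no sampling noise. For everyday use, Python's standard `statistics` module provides both conventions: `statistics.pvariance` divides by n and `statistics.variance` divides by n - 1.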