a revisit to the χ²-test

hantian

February 11, 2025

in high-school statistics, the χ²-test is an important element that enriches students' power to apply statistics to real-life scenarios. however, it is rarely explained from scratch, save for a few words about 'the central limit theorem'. where and how is the clt used, and how come the degrees of freedom is one less than the number of categories? this is rather unsatisfactory.

1 goodness of fit

for categorical goodness-of-fit cases, the number of observations in each discrete category is binomially distributed, $O_i \sim B(n, p_i)$, where $n$ is the size of the sample. applying the normal approximation:

\[ Z_i = \frac{O_i - np_i}{\sqrt{np_i}} \sim N(0,\ 1 - p_i). \]

however, it should be noted that $Z_i$ and $Z_j$ are not independent, as an observation cannot fall in two categories simultaneously. we can, however, find the covariance between $Z_i$ and $Z_j$:

\[ \operatorname{Cov}(Z_i, Z_j) = E(Z_i Z_j) = \frac{E(O_i O_j) - n^2 p_i p_j}{n\sqrt{p_i p_j}}, \]

where, conditional on $O_i$, each of the remaining $n - O_i$ observations falls into category $j$ with probability $p_j/(1 - p_i)$, so we can show that

\[ E(O_i O_j) = E\bigl(O_i\, E(O_j \mid O_i)\bigr) = n(n-1)p_i p_j. \]

similar arguments can be made for the test of independence.

notice that $\operatorname{Cov}(Z_i, Z_j) = -\sqrt{p_i p_j}$, or
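the two identities above are exact, so for a small sample they can be verified by brute force. a minimal sketch in python, with hypothetical values $n = 5$ and $p = (0.2, 0.3, 0.5)$:

```python
# exhaustively enumerate every multinomial outcome for a small sample and
# check E(O_i O_j) = n(n-1) p_i p_j and Cov(Z_i, Z_j) = -sqrt(p_i p_j).
# the numbers n = 5 and p = (0.2, 0.3, 0.5) are hypothetical.
from itertools import product
from math import comb, isclose, sqrt

n = 5
p = [0.2, 0.3, 0.5]

def multinomial_pmf(counts, n, p):
    """probability of the outcome `counts` under Multinomial(n, p)."""
    prob, remaining = 1.0, n
    for c, pi in zip(counts, p):
        prob *= comb(remaining, c) * pi**c
        remaining -= c
    return prob

# E(O_1 O_2), summing over all outcomes (o1, o2, o3) with o1 + o2 + o3 = n
e_o1o2 = sum(
    o1 * o2 * multinomial_pmf((o1, o2, n - o1 - o2), n, p)
    for o1, o2 in product(range(n + 1), repeat=2)
    if o1 + o2 <= n
)
assert isclose(e_o1o2, n * (n - 1) * p[0] * p[1])  # 5·4·0.2·0.3 = 1.2

# the covariance of the standardised counts then follows directly
cov_z = (e_o1o2 - n**2 * p[0] * p[1]) / (n * sqrt(p[0] * p[1]))
assert isclose(cov_z, -sqrt(p[0] * p[1]))
```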

\[ Z = (Z_i)^T \sim N(0, \Sigma), \]

where

\[ \Sigma = \begin{pmatrix} 1 - p_1 & -\sqrt{p_1 p_2} & \cdots & -\sqrt{p_1 p_n} \\ -\sqrt{p_1 p_2} & 1 - p_2 & \cdots & -\sqrt{p_2 p_n} \\ \vdots & \vdots & \ddots & \vdots \\ -\sqrt{p_1 p_n} & -\sqrt{p_2 p_n} & \cdots & 1 - p_n \end{pmatrix} = I - \sqrt{p}\,\sqrt{p}^T, \]

where $\sqrt{p} = (\sqrt{p_i})^T$ is a unit vector by definition, since $\sum_i p_i = 1$. this tells us that $\Sigma$ has $n-1$ eigenvalues of 1 and one eigenvalue of 0, or equivalently put: there is an orthonormal transformation $A$ such that

\[ A \Sigma A^T = \begin{pmatrix} I_{n-1} & 0 \\ 0 & 0 \end{pmatrix}. \]

this result is also known as sylvester's law of inertia.
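the eigenvalue structure of $\Sigma$ is easy to check numerically; a quick sketch, assuming a hypothetical $p$ over four categories:

```python
# verify that Σ = I - √p √pᵀ has n-1 eigenvalues equal to 1 and a single
# eigenvalue 0; the vector p (with n = 4 categories) is hypothetical
import numpy as np

p = np.array([0.1, 0.2, 0.3, 0.4])
sqrt_p = np.sqrt(p)                      # √p is a unit vector since Σ p_i = 1
sigma = np.eye(len(p)) - np.outer(sqrt_p, sqrt_p)

eigvals = np.sort(np.linalg.eigvalsh(sigma))
assert np.allclose(eigvals, [0.0, 1.0, 1.0, 1.0])
```

the zero eigenvalue belongs to the eigenvector $\sqrt{p}$ itself, which is why exactly one degree of freedom is lost.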

this means the components of $AZ$, apart from the last one (which is identically 0), are independent standard normal variables, say $X_1, \dots, X_{n-1}$. measuring the vector $Z$ in this frame:

\[ |Z|^2 = |AZ|^2 = \sum_{i=1}^{n-1} X_i^2. \]

finally, writing $E_i = np_i$ so that $|Z|^2 = \sum_i (O_i - E_i)^2/E_i$, we have

\[ \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_{n-1}. \]
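in code, the statistic is a one-liner. a toy sketch with hypothetical observed counts over four equally likely categories:

```python
# the goodness-of-fit statistic for hypothetical observed counts over four
# equally likely categories (n = 100, so E_i = n·p_i = 25 for each)
observed = [18, 22, 30, 30]
n = sum(observed)
p = [0.25] * 4
expected = [n * pi for pi in p]          # E_i = n * p_i

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                   # one less than the number of categories

print(round(chi2, 2), df)  # 4.32 3
```

the value would then be compared against the upper tail of the $\chi^2_3$ distribution to decide the test.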

2 central limit theorem

however, in the above, the so-called 'normal approximation' is rather suspicious… the core of the argument lies in the well-known central limit theorem. what is it, and what are its conditions?

the central limit theorem states that if a random variable $X$ is distributed with mean $\mu$ and standard deviation $\sigma$, then for an independent random sample of $X$, we have the following:

\[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1). \]

that is, the average error from the population mean is normally distributed, and as the size of the sample increases, the deviation shrinks. this is essentially the theoretical basis for taking multiple measurements to improve precision. however, the normality is rather intriguing, as the theorem makes no assumption on the population distribution beyond a valid expectation and variance.

how come the error automatically converges to a normal distribution? well, the proof is quite bland: we look for its density function.

speaking of convolutions, the generating function is the magic word one wants to spell. remember that the moment generating function is defined by:

\[ M_X(t) = E\left(e^{tX}\right). \]

notice that $M(0) = E(e^0) = 1$, $M'(0) = E(X) = \mu$ and $M''(0) = E(X^2) = \sigma^2 + \mu^2$.

readers with a bit of exposure to fourier analysis will easily identify that the generating function is essentially a fourier transform (taken at an imaginary argument) of the density function:

\[ M(t) = \int_{-\infty}^{\infty} f(x)\, e^{tx}\, \mathrm{d}x. \]

if the $X_i$'s are i.i.d., then we have $E\left(\prod_i e^{tX_i/n}\right) = \prod_i E\left(e^{tX_i/n}\right)$, that is

\[ M_{\bar{X}}(t) = \prod_i M_{X_i/n}(t) = \left(M_X\!\left(\frac{t}{n}\right)\right)^n. \]

notice that $t/n \to 0$ as $n \to \infty$, thus we can approximate the logarithm of $M(t)$ by taylor expansion:

\[ m_{\bar{X}}(t) = \ln M_{\bar{X}}(t) = n \ln\left(M_X\!\left(\frac{t}{n}\right)\right). \]

thus,

\[ m_{\bar{X}}(t) = n\left(\ln 1 + \frac{M'(0)}{M(0)}\,\frac{t}{n} + \frac{1}{2}\,\frac{M''(0)M(0) - M'(0)^2}{M(0)^2}\,\frac{t^2}{n^2} + \cdots\right). \]

that is

\[ m_{\bar{X}}(t) = \mu t + \frac{1}{2n}\left(\sigma^2 + \mu^2 - \mu^2\right)t^2 + \frac{1}{n}(\cdots) = \mu t + \frac{1}{2}\left(\frac{\sigma}{\sqrt{n}}\right)^2 t^2 + O\!\left(\frac{1}{n}\right). \]
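the expansion can be sanity-checked against a distribution with a closed-form mgf. a sketch assuming $X \sim \mathrm{Exp}(1)$, where $M(t) = 1/(1-t)$ and $\mu = \sigma^2 = 1$; the values $t = 0.5$ and $n = 100$ are arbitrary:

```python
# compare n·ln M(t/n) with its expansion μt + σ²t²/(2n) for X ~ Exp(1),
# whose mgf is M(t) = 1/(1 - t) for t < 1; t = 0.5, n = 100 are arbitrary
from math import log, isclose

mu, sigma2 = 1.0, 1.0

def M(t):
    return 1.0 / (1.0 - t)

t, n = 0.5, 100
exact = n * log(M(t / n))                    # the exact lmgf of the mean
approx = mu * t + 0.5 * sigma2 * t**2 / n    # μt + σ²t²/(2n)

assert isclose(exact, approx, abs_tol=1e-4)  # they differ only by the remainder
```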

to recover the density function from a moment generating function would require an inverse transformation, whose difficulty varies with the actual form of $M(t)$. nevertheless, since we are trying to prove the central limit theorem, we know which distribution to look up; it would be a good idea to compare the moment generating function, or rather the logarithmic moment generating function, to that of a normal distribution.

the density function of $Y \sim N(\mu, \sigma^2)$ is

\[ f(y) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(y - \mu)^2}{2\sigma^2}}. \]

thus

\[ m_Y(t) = \mu t + \frac{1}{2}\sigma^2 t^2. \]

matching the two term by term (with variance $\sigma^2/n$), this proves the claim that $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ asymptotically, and finally we can have the 'normal approximation'.
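a monte-carlo illustration closes the loop: sample means of a decidedly non-normal exponential population concentrate around $\mu$ with variance $\sigma^2/n$. the sizes ($n = 50$, 100 000 replications) and the seed are hypothetical:

```python
# sample means of an Exp(1) population (μ = σ² = 1) behave like N(μ, σ²/n);
# n = 50 observations per sample, 100_000 replications, seeded for repeatability
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n, reps = 1.0, 1.0, 50, 100_000

means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

# empirical mean ≈ μ and empirical variance ≈ σ²/n = 0.02
assert abs(means.mean() - mu) < 0.01
assert abs(means.var() - sigma2 / n) < 0.005
```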

3 one more thing

it should always be noted that the aforementioned logic only applies if the distribution is discrete; for continuous distributions, the χ²-test is not the go-to solution. then, what is its counterpart for a continuous population?

the central problem lies in that, without categories, we no longer have any prior information about the distribution of the observations. that is, we are no longer sure whether the observation is normal, or asymptotically normal; the question switches from a parametric problem to a non-parametric one.

a simple example: given a sample of i.i.d. $X_i$, $i = 1, 2, \dots, n$, we want to test the claim

\[ H_0\colon X \sim D, \]

where D is some known distribution.

3.1 q-q plot

the q-q plot is a great way to visualise the case. by sorting the sample into $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$, we are not losing any information. notice that under the null hypothesis, $aX_{(i)} + b$ should lie somewhere around either

1. the $i$-th quantile of $D$, or

2. the expected value of the $i$-th order statistic $Y_{(i)}$ from $D$.

plotting $(X_{(i)}, Y_{(i)})$ on a plane, we expect to see a linear correlation if the null hypothesis is true. thus, by looking at the correlation of the two sequences, we can make the statistical inference. this would be a topic for a future discussion, though.

the test is known as the shapiro–francia test when $D$ is a normal distribution.