
Category
Statistics 📊
Published on
May 12, 2024

# Introduction

The Chi-squared test ($\chi^2$) is typically used in the following scenarios:

1. To test the independence between two categorical variables, i.e. whether the occurrence of one variable affects the probability of the occurrence of the other variable.
2. To determine if a sample data matches an expected distribution, also known as a “goodness-of-fit” test.
3. To test the equality of proportions of different populations ⇒ this is what we’ll examine here.
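As a quick illustration of scenario 2, a goodness-of-fit test can be sketched with `scipy.stats.chisquare` (the die-roll counts below are hypothetical, and the expected distribution is assumed uniform, i.e. a fair die):

```python
from scipy.stats import chisquare

# Hypothetical counts from 100 rolls of a die;
# a fair die expects 100/6 rolls per face
observed = [16, 18, 16, 14, 12, 24]

# f_exp defaults to a uniform distribution over the categories
result = chisquare(observed)

# statistic = 5.12, well below the df=5 critical value,
# so the p-value is far above 0.05: no evidence against fairness
print(result.statistic, result.pvalue)
```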

In statistical experiments, we may want to test whether a proportion metric differs significantly between groups (Control and Treatment, in the simplest case).

For proportions, we can use either the Z-test or the Chi-squared test.

• The Z-test is generally more powerful, but it assumes a large sample size (typically >30). It is a parametric test that assumes normality, which may not hold for small samples; by the Central Limit Theorem, however, the sampling distribution of a proportion becomes approximately normal as the sample size grows.
• The Chi-squared test is less powerful, but as a non-parametric test it remains valid for smaller samples that may not follow a normal distribution.
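For reference, the two-proportion Z-test statistic is built from the pooled proportion $\hat{p} = \frac{k_1+k_2}{n_1+n_2}$ (with $k$ successes out of $n$ trials in each group):

$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}}$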

# Formula

The Chi-squared statistic comparing observed and expected counts is:

$\chi^2 = \sum_{i,j}{\frac{(O_{i,j}-E_{i,j})^2}{E_{i,j}}}$

Where:

• $O_{i,j}$ are the observed values for each category $i$ in each group $j$
• $E_{i,j}$ are the expected values for each category $i$ in each group $j$, computed from the marginal totals of the overall distribution
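Under the null hypothesis that every group follows the overall distribution, each expected count is derived from the marginal totals:

$E_{i,j} = \frac{\left(\sum_{k} O_{i,k}\right)\left(\sum_{k} O_{k,j}\right)}{N}$

where $N$ is the grand total of observations.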

# Implementation in Python

Let’s see how we can apply the Chi-squared test to compare the proportions of two groups of simulated data:

1. Generate sample data:

```python
# Import libraries
import numpy as np
import statsmodels.stats.proportion as ssm

# Create two binomial samples: n trials, k successes
n1 = 1000; n2 = 800
k1 = 150; k2 = 140

# Compute proportions
p1 = k1 / n1
p2 = k2 / n2
p = (k1 + k2) / (n1 + n2)  # pooled proportion
```
2. Compute the results with StatsModels:

```python
# Chi-squared test for proportions
prop_chi = ssm.proportions_chisquare(
    count=[k2, k1],
    nobs=[n2, n1],
)

print("Chi-squared statistic: {:.4f}\np-value: {:.4f}".format(prop_chi[0], prop_chi[1]))
```

```
Chi-squared statistic: 2.0553
p-value: 0.1517
```

Since the p-value is greater than 0.05, we fail to reject the null hypothesis: we cannot conclude that there is a significant difference at the 95% confidence level.

3. We can also compare the results with a Z-test for proportions:

```python
# Z-test for proportions
prop_z_test = ssm.proportions_ztest(
    count=[k2, k1],
    nobs=[n2, n1],
    alternative='two-sided',
)

print("z-score: {:.4f}\np-value: {:.4f}".format(prop_z_test[0], prop_z_test[1]))
```

```
z-score: 1.4336
p-value: 0.1517
```

As expected, the Z-test p-value is identical to the Chi-squared p-value: for two groups, the Chi-squared statistic is simply the square of the z-score ($1.4336^2 \approx 2.0553$), so the two tests are equivalent.
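This equivalence can be verified by hand with numpy alone, recomputing both statistics from the counts used above (a minimal sketch, not relying on statsmodels):

```python
import numpy as np

# Counts from the example above
n1, n2 = 1000, 800
k1, k2 = 150, 140

# Pooled two-proportion z-statistic
p1, p2 = k1 / n1, k2 / n2
p = (k1 + k2) / (n1 + n2)
z = (p2 - p1) / np.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

# Chi-squared statistic from the 2x2 contingency table
# rows = groups, columns = (successes, failures)
observed = np.array([[k1, n1 - k1], [k2, n2 - k2]])
row = observed.sum(axis=1, keepdims=True)
col = observed.sum(axis=0, keepdims=True)
expected = row * col / observed.sum()
chi2 = ((observed - expected) ** 2 / expected).sum()

print(round(z, 4), round(chi2, 4))  # 1.4336 2.0553
assert np.isclose(z ** 2, chi2)
```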