Category: Statistics 📊

Published on: May 12, 2024

# Introduction

The Chi-squared test ($\chi^2$) is typically used in the following scenarios:

- To test the **independence between two categorical variables**, i.e. whether the occurrence of one variable affects the probability of occurrence of the other.
- To determine whether **sample data matches an expected distribution**, also known as a “goodness-of-fit” test.
- To test the **equality of proportions** across different populations *⇒ this is what we’ll examine here.*

In statistical experiments, we may want to test whether a **proportion metric** differs significantly between groups (Control and Treatment, in the simplest case).

For proportions, we could use either the Z-test or Chi-squared test.

The **Z-test** is generally **more powerful**, but it assumes a **large sample size (n > 30)**. Indeed, it is a **parametric test that assumes normality**, which may not hold for small samples; by the Central Limit Theorem, however, the sampling distribution of the proportion approaches normality as the sample grows. The **Chi-squared** test is **less powerful**, but since it is a non-parametric test, it remains **valid even for smaller sample sizes** that may not follow a normal distribution.

# Formula

The formula for comparing distributions is:

$\chi^2 = \sum_{i,j}{\frac{(O_{i,j}-E_{i,j})^2}{E_{i,j}}}$

Where:

- $O_{i,j}$ are the observed values for each category $i$ in each group $j$
- $E_{i,j}$ are the expected values for each category $i$ in each group $j$, computed from the overall (pooled) distribution
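As a sketch, the observed and expected counts and the resulting statistic can be computed by hand for a 2×2 table (the counts below are illustrative, chosen to match the simulated example later on):

```python
import numpy as np

# Observed 2x2 contingency table: rows = groups, columns = (successes, failures)
O = np.array([[150, 850],
              [140, 660]])

# Expected counts under the null hypothesis:
# (row total x column total) / grand total
row_totals = O.sum(axis=1, keepdims=True)
col_totals = O.sum(axis=0, keepdims=True)
E = row_totals * col_totals / O.sum()

# Chi-squared statistic: squared deviations, scaled by the expected counts
chi2 = ((O - E) ** 2 / E).sum()
print(round(chi2, 4))  # 2.0553
```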

# Implementation in Python

Let’s see how we can apply the Chi-squared test to compare proportions between two groups of simulated data:

**Generate sample data:**

```
# Import libraries
import numpy as np
import statsmodels.stats.proportion as ssm

# Two binomial samples: n trials, k successes
n1, n2 = 1000, 800
k1, k2 = 150, 140

# Observed proportions and pooled proportion
p1 = k1 / n1
p2 = k2 / n2
p = (k1 + k2) / (n1 + n2)
```
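As a quick sanity check, the pooled proportion computed above is exactly what the Z-test uses in its standard error; here is a minimal by-hand sketch, using `scipy.stats.norm` for the two-sided p-value:

```python
import numpy as np
from scipy.stats import norm

n1, n2 = 1000, 800
k1, k2 = 150, 140
p1, p2 = k1 / n1, k2 / n2
p = (k1 + k2) / (n1 + n2)  # pooled proportion under the null

# Standard error of the difference in proportions (pooled variance)
se = np.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
z = (p2 - p1) / se
p_value = 2 * norm.sf(abs(z))  # two-sided

print(f"z-score: {z:.4f}\np-value: {p_value:.4f}")
```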

**Compute the results with StatsModels:**

```
# Chi-squared test with statsmodels
prop_chi = ssm.proportions_chisquare(
    count=[k2, k1],
    nobs=[n2, n1],
)
print("Chi-squared statistic: {:.4f}\np-value: {:.4f}".format(prop_chi[0], prop_chi[1]))
```

```
Chi-squared statistic: 2.0553
p-value: 0.1517
```

Since the p-value is above 0.05, we cannot conclude that there is a significant difference at the 95% confidence level.

**We can also compare the results with a Z-test for proportions:**

```
# Z-test for proportions
prop_z_test = ssm.proportions_ztest(
    count=[k2, k1],
    nobs=[n2, n1],
    alternative='two-sided',
)
print("z-score: {:.4f}\np-value: {:.4f}".format(prop_z_test[0], prop_z_test[1]))
```

```
z-score: 1.4336
p-value: 0.1517
```

As expected, the Z-test p-value is identical to the Chi-squared p-value: with a single degree of freedom, the Chi-squared statistic is simply the square of the z-score ($\chi^2 = z^2$).