Frequentist hypothesis testing cheat sheet

Definitions

Hypothesis testing (the statistical machinery behind A/B testing) is used to assess whether an experiment \(H_1\) generates significantly different results from the baseline \(H_0\), i.e. whether the observed difference may have arisen by chance or is representative of the larger population. In this notebook, we will focus on classical frequentist hypothesis testing, but other approaches exist, such as Bayesian methods.

Most common distributions:

  • Standard normal distribution (aka Z-distribution): for continuous variables, used if the sample size is large and the population \(\sigma\) is known
  • Student distribution (aka T-distribution): similar to the Z-distribution but with thicker tails; used for smaller sample sizes or unknown \(\sigma\)
  • Binomial distribution: discrete distribution modeling the number of successes in \(n\) independent trials; used for proportions

Hypotheses:

  • \(H_0\), the null hypothesis
  • \(H_1\), the alternative hypothesis
  • Test statistic: usually the standardized difference between means or proportions

Significance is measured by:

  • \(\alpha = P(H_1 \mid H_0)\) is the significance level: the probability of rejecting \(H_0\) by error when it was in fact true (type I error). It is often set at 5%
  • p-value: the probability of observing results at least as extreme as those measured, assuming \(H_0\) is true. If the \(p\)-value is lower than \(\alpha\), we reject \(H_0\) in favor of \(H_1\)
  • \(\beta = P(H_0 \mid H_1)\) is the probability of failing to reject \(H_0\) when there was actually a change (type II error). It is often set at 20%
  • \(1 - \beta\) is the statistical power (aka sensitivity or recall): the probability of correctly detecting a change when there was indeed one. It is often set at 80%
  • Effect size, aka practical difference, is the minimum effect that you want to observe (e.g. +2% in click-through rate). Cohen’s \(d\) is a standardized measure of the effect size (see the sketch after this list)
  • Minimum sample size is the minimum number of observations needed to detect a significant difference, given an effect size. The smaller the effect size, the greater the minimum sample size
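As a quick illustration of Cohen’s \(d\), here is a minimal, self-contained sketch using the standard pooled-standard-deviation formula (the samples are made up):

# Cohen's d: standardized difference between two sample means (sketch)
import numpy as np

def cohens_d(a, b):
    na, nb = len(a), len(b)
    # Pooled standard deviation, n-1 degrees of freedom per sample
    pooled_var = ((na-1)*a.var(ddof=1) + (nb-1)*b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(0)
print("d = {:.2f}".format(cohens_d(rng.normal(0, 1, 100), rng.normal(0.35, 1, 100))))

By Cohen’s conventions, \(d \approx 0.2\) is a small effect, \(0.5\) medium, and \(0.8\) large.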
# Import libraries
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as st
import statsmodels.stats.power as smpr        # used in the sample-size sections below
import statsmodels.stats.proportion as smp    # used in the proportions sections below
sns.set()
# Summary graph
fig, ax = plt.subplots(2, 1, figsize=(12,12))

## Plot hypotheses distributions
for i in range(len(ax)):
    mu1 = 0
    x1 = np.linspace(mu1-5, mu1+5, 100)
    y1 = st.norm.pdf(x1, loc=mu1, scale=1)
    ax[i].plot(x1, y1, color='steelblue')
    ax[i].text(-1.1, .375, '$H_0$', color='steelblue')
    ax[i].fill_between(x1[:31], 0, y1[:31], color='steelblue', alpha=.3)
    ax[i].fill_between(x1[-31:], 0, y1[-31:], color='steelblue', alpha=.3)
    ax[i].set_xlim(mu1-4, 7)
    ax[i].set_ylim(0, )
    ax[i].margins(0,)

mu2 = 3.5
x2 = np.linspace(mu2-5, mu2+5, 100)
y2 = st.norm.pdf(x2, loc=mu2, scale=1.2)
ax[0].plot(x2, y2, color='firebrick')
ax[0].text(4.5, .31, '$H_1$', color='firebrick')
ax[0].set_xticks([-1.96, 0, 1.96, 3.5])

## Plot alphas and betas
ax[0].text(2.1, .01, '$\\alpha / 2$')
ax[0].text(-2.2, .01, '$\\alpha / 2$')
ax[0].text(1.2, .02, 'β')
ax[0].text(3.5, .15, '1-β')
ax[0].text(3, .12, 'Statistical power')
ax[0].fill_between(x2[:35], 0, y2[:35], color='firebrick', alpha=.3)
ax[0].vlines(1.96, 0, .35, color='black', linestyles='dashed', alpha=.5)
ax[0].text(1.5, .36, 'Decision\nthreshold')

## Plot test statistic
ax[1].set_xticks([-1.96, 0, 1.5, 1.96])
ax[1].set_xticklabels(['$-z_{α/2}$', r'$\mu$', 'z-score', '\n$z_{α/2}$'])
ax[1].vlines([-1.96, 1.96], 0, .06, color='steelblue', alpha=.5)
ax[1].vlines(0, 0, .4, color='steelblue', linestyles='dashed', alpha=.5)
ax[1].vlines(1.5, 0, .13, color='orange')
ax[1].scatter(1.5, 0, color='orange');

(Figure: top, the \(H_0\) and \(H_1\) distributions with \(\alpha\), \(\beta\), the statistical power and the decision threshold; bottom, the test statistic against the critical values \(\pm z_{\alpha/2}\).)

Standard Normal and Student distributions

Formulas

| Statistic | Notation | Formula |
|---|---|---|
| Population size | \(N\) | - |
| Sample size | \(n\) | - |
| Population mean | \(\mu\) | \(\frac{\sum{x}}{N}\) |
| Sample mean | \(\bar x\) | \(\frac{\sum{x}}{n}\) |
| Standard deviation | \(\sigma\) | \(\sqrt{\frac{\sum{(x - \mu)^2}}{N}}\) |
| Sample std deviation | \(s\) | \(\sqrt{\frac{\sum{(x - \bar x)^2}}{n-1}}\) |
| Variance | \(\sigma^2\) | \(\frac{\sum{(x - \mu)^2}}{N}\) |
| Sample variance | \(s^2\) | \(\frac{\sum{(x - \bar x)^2}}{n-1}\) |
| Standard error (of the mean) | \(SE\) | \(s/\sqrt n\) |
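A quick check of these formulas with NumPy (a sketch on a made-up sample; ddof=1 gives the \(n-1\) denominator used by the sample statistics):

# Sample statistics on a small made-up sample
x = np.array([2.1, 3.4, 1.8, 4.0, 2.9])
n = len(x)
x_bar = x.mean()              # sample mean
s = x.std(ddof=1)             # sample standard deviation
se = s / np.sqrt(n)           # standard error of the mean
print("mean={:.2f}, s={:.2f}, SE={:.2f}".format(x_bar, s, se))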

Hypothesis testing

| Test | Distribution | Standard Error | Test statistic | Confidence Interval |
|---|---|---|---|---|
| Sample mean vs population (n≥30) | z-distribution | \(\frac{\sigma}{\sqrt n}\) | \(\frac{\bar x - \mu}{\sigma / \sqrt n}\) | \(\bar x \pm z \frac{\sigma}{\sqrt n}\) |
| Sample mean vs population (n<30) | t-distribution | \(\frac{s}{\sqrt n}\) | \(\frac{\bar x - \mu}{s / \sqrt n}\) | \(\bar x \pm t_{n-1} \frac{s}{\sqrt n}\) |
| Difference in population means | z if n≥30 and \(\sigma\) known | \(\sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}}\) | \(\frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}}}\) | \((\bar{x}_1 - \bar{x}_2) \pm z \cdot SE\) |
| Difference in sample means | t if n<30 or \(\sigma\) unknown | \(\sqrt{\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2}}\) | \(\frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2}}}\) | \((\bar{x}_1 - \bar{x}_2) \pm t \cdot SE\) |
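The one-sample rows of this table can be checked directly against scipy.stats.ttest_1samp; here is a minimal sketch with a made-up sample tested against a population mean of 5:

# One-sample t-test: is the sample mean different from mu=5?
sample = np.array([5.2, 4.8, 5.9, 6.1, 4.5, 5.3, 5.7, 4.9])
t_manual = (sample.mean() - 5) / (sample.std(ddof=1) / np.sqrt(len(sample)))
print("manual t: {:.3f}".format(t_manual))
print(st.ttest_1samp(sample, popmean=5))   # same statistic, plus the p-value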

Implementation in Python

Let’s use the formulas above to test for the statistical difference between two sample means. We’ll first compute the formulas manually, then check our results against SciPy’s built-in functions.

# Create two normally distributed samples
np.random.seed(1)
h0 = np.random.normal(loc=0., size=100)
h1 = np.random.normal(loc=0.35, size=80)

# Sample sizes
n0 = len(h0)
n1 = len(h1)

# Means
x0 = np.mean(h0)
x1 = np.mean(h1)

# Sample standard deviations (ddof=1 gives the n-1 denominator)
s0 = np.std(h0, ddof=1)
s1 = np.std(h1, ddof=1)

# t-test
t = (x0 - x1) / np.sqrt(s0**2/n0 + s1**2/n1)
print("t-score: {:.3f}".format(t))
print("p-value: {:.4f}".format(st.t.sf(abs(t), df=n0+n1-2)*2))   # Multiply by 2 for two-tailed test

# 95% confidence intervals for the means
t0 = st.t.ppf(1-0.025, n0-1)
m0 = t0 * s0/np.sqrt(n0)
ci0_low = x0 - m0; ci0_hi = x0 + m0
print("Confidence interval for h0: ({:.4f}, {:.4f})".format(ci0_low, ci0_hi))

t1 = st.t.ppf(1-0.025, n1-1)
m1 = t1 * s1/np.sqrt(n1)
ci1_low = x1 - m1; ci1_hi = x1 + m1
print("Confidence interval for h1: ({:.4f}, {:.4f})".format(ci1_low, ci1_hi))

# Plot distributions and CIs
fig, ax = plt.subplots(1, 2, figsize=(14,6))

sns.histplot(h0, color='steelblue', ax=ax[0])
sns.histplot(h1, color='green', ax=ax[0])

sns.histplot(h0, color='steelblue', kde=True, alpha=0, ax=ax[1])
sns.histplot(h1, color='green', kde=True, alpha=0, ax=ax[1])
ax[1].axvline(x0, color='gold')
ax[1].axvline(x1, color='gold')
ax[1].axvspan(ci0_low, ci0_hi, alpha=0.5, color='steelblue')
ax[1].axvspan(ci1_low, ci1_hi, alpha=0.5, color='green');
t-score: -2.773
p-value: 0.0061
Confidence interval for h0: (-0.1159, 0.2371)
Confidence interval for h1: (0.2335, 0.6591)

(Figure: histograms and density estimates of the two samples, with their means and 95% confidence intervals.)

# Check results with scipy (Welch's t-test, which does not assume equal variances)
t_test = st.ttest_ind(h0, h1, equal_var=False)
print("t-score: {:.3f}\np-value: {:.4f}".format(t_test[0], t_test[1]))

scipy_h0 = st.t.interval(confidence=0.95, df=n0-1, loc=x0, scale=st.sem(h0))   # `confidence` was named `alpha` in SciPy < 1.9
scipy_h1 = st.t.interval(confidence=0.95, df=n1-1, loc=x1, scale=st.sem(h1))
print("Confidence interval for h0: ({:.4f}, {:.4f})".format(scipy_h0[0], scipy_h0[1]))
print("Confidence interval for h1: ({:.4f}, {:.4f})".format(scipy_h1[0], scipy_h1[1]))
t-score: -2.773
p-value: 0.0062
Confidence interval for h0: (-0.1159, 0.2371)
Confidence interval for h1: (0.2335, 0.6591)

As expected, the results are the same for our step-by-step implementation and the SciPy test (the tiny difference in p-values comes from SciPy’s use of the Welch–Satterthwaite degrees of freedom, computed below): with a p-value < 0.05, we can confidently conclude that the two means are different.
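For reference, here is the Welch–Satterthwaite approximation of the degrees of freedom that SciPy applies when equal_var=False (a sketch reusing s0, s1, n0, n1 and t from the cells above):

# Welch–Satterthwaite degrees of freedom
v0 = s0**2 / n0
v1 = s1**2 / n1
df_welch = (v0 + v1)**2 / (v0**2/(n0-1) + v1**2/(n1-1))
print("Welch df: {:.1f}".format(df_welch))
print("p-value: {:.4f}".format(st.t.sf(abs(t), df=df_welch)*2))   # should match scipy's p-value above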

Binomial distribution

The binomial distribution is discrete: it models the number of successes \(k\) in \(n\) independent trials, each with success probability \(p\).

Conditions for a binomial distribution:

  1. Only two possible, mutually exclusive outcomes per trial
  2. Events must be independent from each other
  3. Events must have identical distribution (i.e. \(p\) same for all events)
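These properties are easy to verify by simulation: the mean of a binomial is \(np\) and its variance \(np(1-p)\). A quick sketch:

# Simulate 10,000 binomial draws and compare with theory
rng = np.random.default_rng(1)
n_trials, p_true = 100, 0.2
draws = rng.binomial(n=n_trials, p=p_true, size=10_000)
print("mean: {:.2f} (theory: {:.2f})".format(draws.mean(), n_trials*p_true))
print("var:  {:.2f} (theory: {:.2f})".format(draws.var(), n_trials*p_true*(1-p_true)))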

Formulas

| Statistic | Notation | Formula |
|---|---|---|
| Probability of occurrence | \(p\) | - |
| Opposite probability | \(q\) | \(1-p\) |
| Sample size | \(n\) | - |
| Successes | \(k\) | - |
| Sample proportion | \(\hat p\) | \(k/n\) |
| Standard deviation | \(\sigma\) | \(\sqrt{p (1- p)}\) |
| Variance | \(\sigma^2\) | \(p (1- p)\) |
| Standard error | \(SE\) | \(\sqrt{\frac{\hat p (1- \hat p)}{n}}\) |

Hypothesis testing

| Test | Distribution | Standard Error | Test statistic | Confidence Interval |
|---|---|---|---|---|
| Sample proportion vs population | z (normal approximation) | \(\sqrt{\frac{\hat p (1- \hat p)}{n}}\) | \(\frac{\hat p - p_0}{\sqrt{\frac{p_0 (1- p_0)}{n}}}\) | \(\hat p \pm z \cdot SE\) |
| Difference in proportions | z (normal approximation) | \(\sqrt{\frac{\hat p_1 (1-\hat p_1)}{n_1}+\frac{\hat p_2 (1-\hat p_2)}{n_2}}\) | \(\frac{(\hat p_2 - \hat p_1) - 0}{\sqrt{\hat p (1- \hat p)(\frac{1}{n_1}+\frac{1}{n_2})}}\) | \((\hat p_2 - \hat p_1) \pm z \cdot SE\) |

Note that the two-proportion test statistic uses the pooled proportion \(\hat p = \frac{k_1 + k_2}{n_1 + n_2}\), while the confidence interval uses the unpooled standard error.
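Here is a minimal sketch of the first row, testing one sample proportion against a baseline \(p_0\) (the numbers are made up):

# One-proportion z-test: 110 successes out of 200 trials vs p0=0.5
k, n, p0_baseline = 110, 200, 0.5
p_hat = k / n
z = (p_hat - p0_baseline) / np.sqrt(p0_baseline*(1-p0_baseline)/n)
print("z-score: {:.3f}, p-value: {:.4f}".format(z, 2*st.norm.sf(abs(z))))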

Implementation in Python

Let’s now put the formulas into practice, and test for the difference between two binomial proportions. Again, we’ll check our results with the relevant SciPy functions.

# Draw two binomial samples
np.random.seed(1)
n0 = 1000; n1 = 500
k0 = np.random.binomial(n=n0, p=0.20)
k1 = np.random.binomial(n=n1, p=0.27)

# Compute p_hat
p0 = k0/n0
p1 = k1/n1
p = (k0+k1)/(n0+n1)
print("p0 = {}/{} = {:.3f}".format(k0, n0, p0))
print("p1 = {}/{} = {:.3f}".format(k1, n1, p1))
print("p = {}/{} = {:.3f}".format(k0+k1, n0+n1, p))

# z-test
z = (p1 - p0)/np.sqrt(p*(1-p)*(1/n0+1/n1))
print("z-score: {:.3f}".format(z))
print("p-value: {:.4f}".format(st.norm.sf(abs(z))*2))   # Multiply by 2 for two-tailed test

# Confidence intervals of each sample
m0 = 1.96 * np.sqrt(p0*(1-p0)/n0)
ci0_low = p0 - m0; ci0_hi = p0 + m0
m1 = 1.96 * np.sqrt(p1*(1-p1)/n1)
ci1_low = p1 - m1; ci1_hi = p1 + m1
print("Confidence interval for h0: ({:.4f}, {:.4f})".format(ci0_low, ci0_hi))
print("Confidence interval for h1: ({:.4f}, {:.4f})".format(ci1_low, ci1_hi))

# Plot probability mass function
fig, ax = plt.subplots(1, 1, figsize=(7,6))
x0 = np.arange(st.binom.ppf(0.001, n0, p0), st.binom.ppf(0.999, n0, p0))
sns.lineplot(x=x0/n0, y=st.binom.pmf(x0, n0, p0), color='steelblue', ax=ax)
x1 = np.arange(st.binom.ppf(0.001, n1, p1), st.binom.ppf(0.999, n1, p1))
sns.lineplot(x=x1/n1, y=st.binom.pmf(x1, n1, p1), color='green', ax=ax)

# Add means and confidence intervals to plot
plt.axvspan(ci0_low, ci0_hi, alpha=0.5, color='steelblue')
ax.axvline(p0, color='gold')

plt.axvspan(ci1_low, ci1_hi, alpha=0.5, color='green')
ax.axvline(p1, color='gold');
p0 = 199/1000 = 0.199
p1 = 129/500 = 0.258
p = 328/1500 = 0.219
z-score: 2.606
p-value: 0.0092
Confidence interval for h0: (0.1743, 0.2237)
Confidence interval for h1: (0.2196, 0.2964)

(Figure: binomial probability mass functions of the two samples, with their proportions and 95% confidence intervals.)

# Check with scipy: a t-test from summary statistics approximates the two-proportion z-test
prop_t_test = st.ttest_ind_from_stats(p0, np.sqrt(p0*(1-p0)), n0, p1, np.sqrt(p1*(1-p1)), n1)
print("t-score:{:.4f}\np-value: {:.4f}".format(prop_t_test[0], prop_t_test[1]))

scipy_bin_h0 = st.binomtest(k0, n0, p0).proportion_ci()
scipy_bin_h1 = st.binomtest(k1, n1, p1).proportion_ci()
print("Confidence interval for h0: ({:.4f}, {:.4f})".format(scipy_bin_h0[0], scipy_bin_h0[1]))
print("Confidence interval for h1: ({:.4f}, {:.4f})".format(scipy_bin_h1[0], scipy_bin_h1[1]))
t-score:-2.6120
p-value: 0.0091
Confidence interval for h0: (0.1747, 0.2251)
Confidence interval for h1: (0.2202, 0.2987)
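The confidence intervals differ slightly from our manual ones because binomtest returns exact (Clopper–Pearson) intervals rather than the normal approximation. statsmodels also offers a direct two-proportion z-test that uses the same pooled formula as our manual computation; a sketch reusing k0, k1, n0, n1 from above and the smp alias imported at the top:

# Two-proportion z-test with statsmodels (pooled variance by default)
z_stat, p_val = smp.proportions_ztest(count=[k1, k0], nobs=[n1, n0])
print("z-score: {:.3f}\np-value: {:.4f}".format(z_stat, p_val))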

Calculating the required sample size

General formula for calculating a minimum sample size \(n\) given:

  • \(z\)-score
  • standard deviation \(\sigma\)
  • effect size \(e\) in absolute value
\[n = 2 \left ( \frac{z\sigma}{e} \right )^2\]

For continuous metrics

Applied to continuous metrics, the formula is:

\[n = 2 \left ( \frac{(z_\alpha + z_\beta)\sigma}{e} \right )^2\]

where

  • \(\sigma\) is the standard deviation
  • \(z_\alpha\) and \(z_\beta\) are the z-scores associated with \(\alpha\) (two-sided) and the power \(1-\beta\)
  • \(e\) is the minimum detectable effect in absolute value
# Function to get minimum sample size (two-sided test)
def sample_size_cont(
    mde_abs, 
    variance,
    power=.8,
    alpha=.05
):
    z_alpha = st.norm.ppf(1-alpha/2)   # z-score for a two-sided alpha
    z_beta = st.norm.ppf(power)        # z-score for the target power
    result = 2*((z_alpha + z_beta) * np.sqrt(variance) / mde_abs)**2
    print("Sample size: {:.0f}".format(result))

Let’s apply our function for an MDE of 0.05 and a (pooled) variance of 5:

mde_abs = .05   # Minimum detectable effect in absolute value
variance = 5    # Pooled variance
sample_size_cont(mde_abs, variance)
Sample size: 31396

This can also be implemented with statsmodels’ TTestIndPower().solve_power() function:

# Statsmodel equivalent function
effect_size = mde_abs / np.sqrt(variance)
smresult = smpr.TTestIndPower().solve_power(
    effect_size=effect_size, 
    power=.8, 
    alpha=.05, 
    ratio=1, 
    alternative='two-sided'
)
print("Sample size: {:.0f}".format(smresult))
Sample size: 31396
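We can sanity-check this sample size by simulation: with 31,396 observations per group, a true effect of 0.05 and variance 5, the t-test should reject \(H_0\) about 80% of the time. A quick Monte Carlo sketch:

# Empirical power: fraction of simulated experiments rejecting H0
rng = np.random.default_rng(2)
n = 31396
n_sims = 200
rejections = 0
for _ in range(n_sims):
    a = rng.normal(0, np.sqrt(5), n)
    b = rng.normal(0.05, np.sqrt(5), n)
    if st.ttest_ind(a, b).pvalue < 0.05:
        rejections += 1
print("Empirical power: {:.2f}".format(rejections / n_sims))   # should be close to 0.80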

For proportions

Applied to proportions, the minimum sample size is calculated as:

\[n = \frac{2 (z_\alpha + z_\beta)^2 \, p (1-p)}{e^2}\]

where

  • \(p\) is the baseline proportion
  • \(z_\alpha\) and \(z_\beta\) are the z-scores associated with \(\alpha\) (two-sided) and the power \(1-\beta\)
  • \(e\) is the minimum detectable effect in absolute value
# Function to get minimum sample size
def sample_size_ratio(p, mde_abs, alpha=.05, power=.8):
    z_alpha = st.norm.ppf(1-alpha/2)
    z_beta = st.norm.ppf(power)
    result = 2 * (((z_alpha + z_beta)**2 * p*(1-p))/mde_abs**2)
    print("Sample size: {:.0f}".format(result))

Let’s try this with a proportion \(p\) of 0.20 and an absolute MDE of 0.01:

p = .20    # Baseline rate
mde = .01  # Absolute minimum detectable effect
sample_size_ratio(p=p, mde_abs=mde)
Sample size: 25116

Again, it can be implemented with statsmodels. Note that proportion_effectsize computes Cohen’s \(h\) (an arcsine-transformed effect size), so the result differs slightly from our normal-approximation formula:

# Statsmodels equivalent function
effect_size = smp.proportion_effectsize(p, p+mde)
smresult = smpr.NormalIndPower().solve_power(
    effect_size=effect_size, 
    power=.8, 
    alpha=.05, 
    ratio=1, 
    alternative='two-sided'
)
print("Sample size: {:.0f}".format(smresult))
Sample size: 25580
