Frequentist hypothesis testing cheat sheet

Definitions

Hypothesis testing (aka A/B testing) is used to assess whether an experiment \(H_1\) generates significantly different results from the baseline \(H_0\), i.e. whether the observed difference could have occurred by chance or reflects a real effect in the larger population. In this notebook, we will focus on classical frequentist hypothesis testing, but other approaches exist, such as Bayesian methods.

Most common distributions:

  • Standard normal distribution (aka Z-distribution): for continuous variables, used if sample size is large and population \(\sigma\) is known
  • Student distribution (aka T-distribution): similar to Z-distribution but with thicker tails; used for smaller sample sizes or unknown \(\sigma\)
  • Binomial distribution: discrete distribution modelling the number of successes in \(n\) independent trials with the same success probability \(p\) (e.g. conversions, clicks)

Hypotheses:

  • \(H_0\) aka null hypothesis
  • \(H_1\), alternative hypothesis
  • Test statistic: usually the standardized difference between means or proportions

Significance is measured by:

  • \(\alpha = P(H_1 \mid H_0)\) is the significance level: the probability of rejecting \(H_0\) by error, when in fact it was true (type I error). It is often set at 5%
  • p-value: probability of observing results at least as extreme as those measured, assuming \(H_0\) is true. If the \(p\)-value is lower than \(\alpha\), we reject \(H_0\) in favor of \(H_1\)
  • \(\beta = P(H_0 \mid H_1)\) is the probability of failing to reject \(H_0\) when there was actually a change (type II error). It is often set at 20%.
  • \(1 - \beta\) is the statistical power, aka sensitivity or recall: the probability of correctly detecting a change when there was indeed one. It is often set at 80%.
  • Effect size, aka practical difference, is the minimum effect that you want to observe (e.g. +2% in click-through rate). Cohen’s \(d\) is a standardized measure of the effect size.
  • Minimum sample size is the minimum number of observations needed to detect a significant difference of a given effect size: the smaller the effect size to detect, the greater the minimum sample size (the sketch below computes the critical values behind these thresholds).
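
The usual 5% significance and 80% power conventions correspond to critical z-values obtained from the inverse normal CDF; a minimal sketch:

# Critical z-values for the usual alpha and power conventions
import scipy.stats as st

alpha, power = 0.05, 0.80
z_alpha2 = st.norm.ppf(1 - alpha/2)   # two-tailed decision threshold, ~1.960
z_beta = st.norm.ppf(power)           # z-score associated with 80% power, ~0.842
print("z_alpha/2 = {:.3f}, z_beta = {:.3f}".format(z_alpha2, z_beta))
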
# Import libraries
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as st
sns.set()
# Summary graph
fig, ax = plt.subplots(2, 1, figsize=(12,12))

## Plot hypotheses distributions
for i in range(len(ax)):
    mu1 = 0
    x1 = np.linspace(mu1-5, mu1+5, 100)
    y1 = st.norm.pdf(x1, loc=mu1, scale=1)
    ax[i].plot(x1, y1, color='steelblue')
    ax[i].text(-1.1, .375, '$H_0$', color='steelblue')
    ax[i].fill_between(x1[:31], 0, y1[:31], color='steelblue', alpha=.3)
    ax[i].fill_between(x1[-31:], 0, y1[-31:], color='steelblue', alpha=.3)
    ax[i].set_xlim(mu1-4, 7)
    ax[i].set_ylim(0, )
    ax[i].margins(0,)

mu2 = 3.5
x2 = np.linspace(mu2-5, mu2+5, 100)
y2 = st.norm.pdf(x2, loc=mu2, scale=1.2)
ax[0].plot(x2, y2, color='firebrick')
ax[0].text(4.5, .31, '$H_1$', color='firebrick')
ax[0].set_xticks([-1.96, 0, 1.96, 3.5])

## Plot alphas and betas
ax[0].text(2.1, .01, '$\\alpha / 2$')
ax[0].text(-2.2, .01, '$\\alpha / 2$')
ax[0].text(1.2, .02, 'β')
ax[0].text(3.5, .15, '1-β')
ax[0].text(3, .12, 'Statistical power')
ax[0].fill_between(x2[:35], 0, y2[:35], color='firebrick', alpha=.3)
ax[0].vlines(1.96, 0, .35, color='black', linestyles='dashed', alpha=.5)
ax[0].text(1.5, .36, 'Decision\nthreshold')

## Plot test statistic
ax[1].set_xticks([-1.96, 0, 1.5, 1.96])
ax[1].set_xticklabels(['$-z_{α/2}$', r'$\mu$', 'z-score', '\n$z_{α/2}$'])
ax[1].vlines([-1.96, 1.96], 0, .06, color='steelblue', alpha=.5)
ax[1].vlines(0, 0, .4, color='steelblue', linestyles='dashed', alpha=.5)
ax[1].vlines(1.5, 0, .13, color='orange')
ax[1].scatter(1.5, 0, color='orange');

[Figure: \(H_0\) and \(H_1\) distributions with the \(\alpha/2\) rejection regions, \(\beta\), statistical power, and the decision threshold (top); test statistic against the \(\pm z_{\alpha/2}\) critical values (bottom)]

Standard Normal and Student distributions

Formulas

| Statistic | Notation | Formula |
|---|---|---|
| Population size | \(N\) | - |
| Sample size | \(n\) | - |
| Population mean | \(\mu\) | \(\frac{\sum{x}}{N}\) |
| Sample mean | \(\bar x\) | \(\frac{\sum{x}}{n}\) |
| Standard deviation | \(\sigma\) | \(\sqrt{\frac{\sum{(x - \mu)^2}}{N}}\) |
| Sample std deviation | \(s\) | \(\sqrt{\frac{\sum{(x - \bar x)^2}}{n-1}}\) |
| Variance | \(\sigma^2\) | \(\frac{\sum{(x - \mu)^2}}{N}\) |
| Sample variance | \(s^2\) | \(\frac{\sum{(x - \bar x)^2}}{n-1}\) |
| Standard error (of the mean) | \(SE\) | \(s/\sqrt n\) |
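
These estimators map one-to-one onto NumPy/SciPy calls; a minimal sanity check (the sample data is assumed for illustration):

# Check the estimators above against NumPy/SciPy built-ins
import numpy as np
import scipy.stats as st

x = np.random.default_rng(0).normal(size=50)   # illustrative sample
n = len(x)

x_bar = x.sum() / n                                  # sample mean
s = np.sqrt(((x - x_bar)**2).sum() / (n - 1))        # sample std deviation

assert np.isclose(x_bar, np.mean(x))
assert np.isclose(s, np.std(x, ddof=1))              # ddof=1 -> n-1 denominator
assert np.isclose(s / np.sqrt(n), st.sem(x))         # standard error of the mean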

Hypothesis testing

| Test | Distribution | Standard Error | Test statistic | Confidence Interval |
|---|---|---|---|---|
| Sample mean vs population (n≥30) | z-distribution | \(\frac{\sigma}{\sqrt n}\) | \(\frac{\bar x - \mu}{\sigma / \sqrt n}\) | \(\bar x \pm z \frac{\sigma}{\sqrt n}\) |
| Sample mean vs population (n<30) | t-distribution | \(\frac{s}{\sqrt n}\) | \(\frac{\bar x - \mu}{s / \sqrt n}\) | \(\bar x \pm t_{n-1} \frac{s}{\sqrt n}\) |
| Difference in population means | z if n≥30 and \(\sigma\) known | \(\sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}}\) | \(\frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}}}\) | \((\bar{x}_1 - \bar{x}_2) \pm z \cdot SE\) |
| Difference in sample means | t if n<30 or \(\sigma\) unknown | \(\sqrt{\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2}}\) | \(\frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2}}}\) | \((\bar{x}_1 - \bar{x}_2) \pm t \cdot SE\) |
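
For the one-sample rows, here is a minimal sketch checking the manual t-statistic against SciPy’s built-in test (sample data assumed for illustration):

# One-sample t-test: sample mean vs a hypothesized population mean
import numpy as np
import scipy.stats as st

x = np.random.default_rng(42).normal(loc=0.2, size=25)   # illustrative sample
mu0 = 0                                                  # hypothesized population mean

t = (np.mean(x) - mu0) / (np.std(x, ddof=1) / np.sqrt(len(x)))
p = st.t.sf(abs(t), df=len(x) - 1) * 2                   # two-tailed p-value

t_scipy, p_scipy = st.ttest_1samp(x, popmean=mu0)        # same test with SciPy
assert np.isclose(t, t_scipy) and np.isclose(p, p_scipy)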

Implementation in Python

Let’s use the formulas above to test for a statistically significant difference between two sample means. We’ll first compute the formulas manually, then check our results against SciPy’s built-in functions.

# Create two normally distributed samples
np.random.seed(1)
h0 = np.random.normal(loc=0., size=100)
h1 = np.random.normal(loc=0.35, size=80)

# Sample sizes
n0 = len(h0)
n1 = len(h1)

# Means
x0 = np.mean(h0)
x1 = np.mean(h1)

# Sample standard deviations (ddof=1 for the unbiased n-1 estimator)
s0 = np.std(h0, ddof=1)
s1 = np.std(h1, ddof=1)

# Welch's t-test (unequal variances)
t = (x0 - x1) / np.sqrt(s0**2/n0 + s1**2/n1)
print("t-score: {:.3f}".format(t))
print("p-value: {:.4f}".format(st.t.sf(abs(t), df=n0+n1-2)*2))   # Multiply by 2 for two-tailed test; n0+n1-2 approximates the Welch degrees of freedom

# 95% confidence intervals for the means
t0 = st.t.ppf(1-0.025, n0-1)
m0 = t0 * s0/np.sqrt(n0)
ci0_low = x0 - m0; ci0_hi = x0 + m0
print("Confidence interval for h0: ({:.4f}, {:.4f})".format(ci0_low, ci0_hi))

t1 = st.t.ppf(1-0.025, n1-1)
m1 = t1 * s1/np.sqrt(n1)
ci1_low = x1 - m1; ci1_hi = x1 + m1
print("Confidence interval for h1: ({:.4f}, {:.4f})".format(ci1_low, ci1_hi))

# Plot distributions and CIs
fig, ax = plt.subplots(1, 2, figsize=(14,6))

sns.histplot(h0, color='steelblue', ax=ax[0])
sns.histplot(h1, color='green', ax=ax[0])

sns.histplot(h0, color='steelblue', kde=True, alpha=0, ax=ax[1])
sns.histplot(h1, color='green', kde=True, alpha=0, ax=ax[1])
ax[1].axvline(x0, color='gold')
ax[1].axvline(x1, color='gold')
plt.axvspan(ci0_low, ci0_hi, alpha=0.5, color='steelblue')
plt.axvspan(ci1_low, ci1_hi, alpha=0.5, color='green');
t-score: -2.773
p-value: 0.0061
Confidence interval for h0: (-0.1159, 0.2371)
Confidence interval for h1: (0.2335, 0.6591)

[Figure: histograms of h0 and h1 (left); their KDEs with sample means and 95% confidence intervals (right)]

# Check results with scipy
t_test = st.ttest_ind(h0, h1, equal_var=False)
print("t-score: {:.3f}\np-value: {:.4f}".format(t_test[0], t_test[1]))

scipy_h0 = st.t.interval(confidence=0.95, df=n0-1, loc=x0, scale=st.sem(h0))
scipy_h1 = st.t.interval(confidence=0.95, df=n1-1, loc=x1, scale=st.sem(h1))
print("Confidence interval for h0: ({:.4f}, {:.4f})".format(scipy_h0[0], scipy_h0[1]))
print("Confidence interval for h1: ({:.4f}, {:.4f})".format(scipy_h1[0], scipy_h1[1]))
t-score: -2.773
p-value: 0.0062
Confidence interval for h0: (-0.1159, 0.2371)
Confidence interval for h1: (0.2335, 0.6591)

As expected, our step-by-step implementation and the SciPy test give the same results (the tiny difference in p-values comes from SciPy’s Welch–Satterthwaite degrees of freedom): with a p-value below 0.05, we reject \(H_0\) and conclude that the two means are significantly different.
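
As a complement, since Cohen’s \(d\) was introduced above as a standardized effect size, here is a sketch computing it for these two samples (reusing n0, n1, x0, x1, s0, s1 from the cells above, with the pooled standard deviation):

# Cohen's d: standardized effect size between the two samples
s_pooled = np.sqrt(((n0-1)*s0**2 + (n1-1)*s1**2) / (n0+n1-2))   # pooled std deviation
d = (x1 - x0) / s_pooled
print("Cohen's d: {:.3f}".format(d))   # rule of thumb: ~0.2 small, ~0.5 medium, ~0.8 large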

Binomial distribution

The binomial distribution is discrete: it models the number of successes \(k\) obtained over \(n\) independent trials, each with the same probability of success \(p\).

Conditions for a binomial distribution:

  1. Only two possible, mutually exclusive outcomes
  2. Events must be independent from each other
  3. Events must have identical distribution (i.e. \(p\) same for all events)
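
Under these conditions, probabilities follow directly from the probability mass function; a quick sketch (the numbers are assumed for illustration):

# Binomial probabilities, e.g. conversions out of visits (illustrative numbers)
import scipy.stats as st

n, p = 100, 0.2                    # 100 trials with a 20% success probability
print(st.binom.pmf(20, n, p))      # P(k = 20): exactly 20 successes
print(st.binom.cdf(25, n, p))      # P(k <= 25)
print(st.binom.sf(25, n, p))       # P(k > 25)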

Formulas

| Statistic | Notation | Formula |
|---|---|---|
| Probability of occurrence | \(p\) | - |
| Opposite probability | \(q\) | \(1-p\) |
| Sample size | \(n\) | - |
| Successes | \(k\) | - |
| Sample proportion | \(\hat p\) | \(k/n\) |
| Standard deviation | \(\sigma\) | \(\sqrt{p (1- p)}\) |
| Variance | \(\sigma^2\) | \(p (1- p)\) |
| Standard error | \(SE\) | \(\sqrt{\frac{\hat p (1- \hat p)}{n}}\) |

Hypothesis testing

| Test | Distribution | Standard Error | Test statistic | Confidence Interval |
|---|---|---|---|---|
| Sample proportion vs population | z (normal approximation) | \(\sqrt{\frac{\hat p (1- \hat p)}{n}}\) | \(\frac{\hat p - p_0}{\sqrt{\frac{p_0 (1- p_0)}{n}}}\) | \(\hat p \pm z \cdot SE\) |
| Difference in proportions | z (normal approximation) | \(\sqrt{\frac{\hat p_1 (1-\hat p_1)}{n_1}+\frac{\hat p_2 (1-\hat p_2)}{n_2}}\) | \(\frac{(\hat p_2 - \hat p_1) - 0}{\sqrt{\hat p (1- \hat p)(\frac{1}{n_1}+\frac{1}{n_2})}}\) | \((\hat p_2 - \hat p_1) \pm z \cdot SE\) |

In the test statistic for the difference in proportions, \(\hat p = \frac{k_1 + k_2}{n_1 + n_2}\) is the pooled proportion.

Implementation in Python

Let’s now put the formulas into practice, and test for the difference between two binomial proportions. Again, we’ll check our results with the relevant SciPy functions.

# Draw two binomial samples
np.random.seed(1)
n0 = 1000; n1 = 500
k0 = np.random.binomial(n=n0, p=0.20)
k1 = np.random.binomial(n=n1, p=0.27)

# Compute p_hat
p0 = k0/n0
p1 = k1/n1
p = (k0+k1)/(n0+n1)
print("p0 = {}/{} = {:.3f}".format(k0, n0, p0))
print("p1 = {}/{} = {:.3f}".format(k1, n1, p1))
print("p = {}/{} = {:.3f}".format(k0+k1, n0+n1, p))

# z-test
z = (p1 - p0)/np.sqrt(p*(1-p)*(1/n0+1/n1))
print("z-score: {:.3f}".format(z))
print("p-value: {:.4f}".format(st.norm.sf(abs(z))*2))   # Multiply by 2 for two-tailed test

# Confidence intervals of each sample
m0 = 1.96 * np.sqrt(p0*(1-p0)/n0)
ci0_low = p0 - m0; ci0_hi = p0 + m0
m1 = 1.96 * np.sqrt(p1*(1-p1)/n1)
ci1_low = p1 - m1; ci1_hi = p1 + m1
print("Confidence interval for h0: ({:.4f}, {:.4f})".format(ci0_low, ci0_hi))
print("Confidence interval for h1: ({:.4f}, {:.4f})".format(ci1_low, ci1_hi))

# Plot probability mass function
fig, ax = plt.subplots(1, 1, figsize=(7,6))
x0 = np.arange(st.binom.ppf(0.001, n0, p0), st.binom.ppf(0.999, n0, p0))
sns.lineplot(x=x0/n0, y=st.binom.pmf(x0, n0, p0), color='steelblue', ax=ax)
x1 = np.arange(st.binom.ppf(0.001, n1, p1), st.binom.ppf(0.999, n1, p1))
sns.lineplot(x=x1/n1, y=st.binom.pmf(x1, n1, p1), color='green', ax=ax)

# Add means and confidence intervals to plot
plt.axvspan(ci0_low, ci0_hi, alpha=0.5, color='steelblue')
ax.axvline(p0, color='gold')

plt.axvspan(ci1_low, ci1_hi, alpha=0.5, color='green')
ax.axvline(p1, color='gold');
p0 = 199/1000 = 0.199
p1 = 129/500 = 0.258
p = 328/1500 = 0.219
z-score: 2.606
p-value: 0.0092
Confidence interval for h0: (0.1743, 0.2237)
Confidence interval for h1: (0.2196, 0.2964)

[Figure: binomial PMFs of the two sample proportions, with their means and 95% confidence intervals]

# Check with scipy (a t-test from summary statistics approximates the two-proportion z-test)
prop_t_test = st.ttest_ind_from_stats(p0, np.sqrt(p0*(1-p0)), n0, p1, np.sqrt(p1*(1-p1)), n1)
print("t-score:{:.4f}\np-value: {:.4f}".format(prop_t_test[0], prop_t_test[1]))

scipy_bin_h0 = st.binomtest(k0, n0, p0).proportion_ci()   # exact (Clopper-Pearson) interval by default
scipy_bin_h1 = st.binomtest(k1, n1, p1).proportion_ci()
print("Confidence interval for h0: ({:.4f}, {:.4f})".format(scipy_bin_h0[0], scipy_bin_h0[1]))
print("Confidence interval for h1: ({:.4f}, {:.4f})".format(scipy_bin_h1[0], scipy_bin_h1[1]))
t-score:-2.6120
p-value: 0.0091
Confidence interval for h0: (0.1747, 0.2251)
Confidence interval for h1: (0.2202, 0.2987)
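
For a direct two-proportion z-test rather than the t-approximation above, statsmodels provides one out of the box; a sketch, assuming statsmodels is installed:

# Two-proportion z-test with statsmodels (uses the pooled proportion, like our manual test)
from statsmodels.stats.proportion import proportions_ztest

z_sm, p_sm = proportions_ztest(count=[k1, k0], nobs=[n1, n0])
print("z-score: {:.3f}\np-value: {:.4f}".format(z_sm, p_sm))   # should match our manual z-test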

Sample size and Minimum Detectable Effect

The general formula for calculating a minimum sample size given a margin of error \(e\) is:

\[n = \left ( \frac{z\sigma}{e} \right )^2\]

Applied to proportions, given a minimum detectable effect, the minimum sample size is calculated as:

\[n = \frac{2 p(1-p) (z_{1-\alpha/2} + z_{1-\beta})^2}{D^2}\]

where

  • \(p\) is the baseline proportion
  • \(z\) are the z-scores associated with \(\alpha\) and \(\beta\)
  • \(D\) is the minimum detectable effect in absolute value
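
To illustrate the general formula first, a sketch computing the sample size needed to estimate a mean within a given margin of error (the \(\sigma\) and \(e\) values are assumed for illustration):

# Minimum sample size for a margin of error e at 95% confidence (illustrative values)
import scipy.stats as st

sigma = 10                      # assumed population standard deviation
e = 1                           # desired margin of error
z = st.norm.ppf(1 - 0.05/2)     # ~1.96

n = (z * sigma / e)**2
print(round(n))                 # ~384 observations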

Implementation in Python

# Function to get minimum sample size
def sample_size_mde(p, mde_pct, alpha=.05, power=.8):
    delta = p * mde_pct                   # minimum detectable effect in absolute terms
    z_alpha2 = st.norm.ppf(1-alpha/2)     # critical z-value for the significance level
    z_beta = st.norm.ppf(power)           # critical z-value for the power
    return round((2*p*(1-p) * (z_alpha2 + z_beta)**2)/delta**2)

# We have a baseline rate of 20% and want to detect a relative 10% change:
sample_size_mde(p=.20, mde_pct=.10)
6279
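
We can cross-check this figure with statsmodels’ power solver (a sketch, assuming statsmodels is installed; it uses Cohen’s \(h\) as the effect size, so the result will be close to, but not exactly, our manual figure):

# Cross-check with statsmodels' power analysis
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

effect = proportion_effectsize(0.22, 0.20)   # baseline 20% vs +10% relative change
n_req = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)
print(round(n_req))                          # per-group sample size, same ballpark as above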

Sources