import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
np.random.seed(42)
print("Libraries loaded.")

When a marketing team reports that customers who received the new email spent 12% more than the control group, the natural question is: is that real, or is it noise?
Hypothesis testing is the formal framework for answering that question. Without it, we risk redesigning products based on random fluctuations in small samples, or killing successful interventions because we stopped the test too early. It is also how we know whether two groups are genuinely different, whether a new process actually improves quality, or whether an observed pattern in the data is worth acting on.
In this chapter we cover the hypothesis testing framework, the most common tests, the pitfall of running too many tests at once, the distinction between statistical and practical significance, how to plan the sample size before running an experiment, and a complete A/B test from start to finish.
Every hypothesis test follows the same five-step structure. It is worth knowing this cold, because every test we look at in this chapter is just a specialization of these steps.
We start by stating the null hypothesis (\(H_0\)) — the assumption that nothing interesting is happening, the status quo. For example: the new email has no effect on spending, or the two groups have equal means.
We then state the alternative hypothesis (\(H_1\)) — what we are trying to show. For example: the new email increases spending (one-tailed), or the email changes spending in either direction (two-tailed).
Before running any analysis, we choose a significance level (\(\alpha\)) — the probability we are willing to accept of falsely concluding there is an effect when there is none. The conventional choice is \(\alpha = 0.05\), though \(\alpha = 0.01\) is used in higher-stakes settings. This must be chosen before seeing the data.
We then run the test and compute a p-value — the probability of observing results at least as extreme as ours, assuming \(H_0\) is true. A small p-value means the data is unlikely under the null hypothesis.
Finally, we make a decision: if \(p \leq \alpha\), we reject \(H_0\). If \(p > \alpha\), we fail to reject \(H_0\) — which is not the same as proving it true.
Common misconception: A p-value of 0.03 does not mean there is a 3% chance the null hypothesis is correct. It means that if the null were true, we would see data this extreme only 3% of the time. The distinction matters a great deal in practice.
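A quick simulation makes the correct reading concrete. We build a world where \(H_0\) is exactly true (bolt diameters really have mean 10 mm, using the same parameters as the bolt example later in the chapter) and count how often random samples produce a t-statistic at least as extreme as a hypothetical observed value `t_obs` — a value assumed here purely for illustration. That fraction is what the p-value estimates; it says nothing about the probability that \(H_0\) itself is true.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# World where H0 is exactly true: diameters really have mean 10.0 mm.
# t_obs is a hypothetical observed t-statistic (an assumption for illustration).
t_obs = 2.26
n, n_sims = 30, 100_000
samples = rng.normal(loc=10.0, scale=0.4, size=(n_sims, n))
t_stats = (samples.mean(axis=1) - 10.0) / (samples.std(axis=1, ddof=1) / np.sqrt(n))
frac_as_extreme = (np.abs(t_stats) >= t_obs).mean()

print(f'Fraction of null-world samples with |t| >= {t_obs}: {frac_as_extreme:.4f}')
print(f'Theoretical two-tailed p-value: {2 * stats.t.sf(t_obs, df=n - 1):.4f}')
```

The simulated fraction and the theoretical p-value agree closely: both describe how often data this extreme arises when nothing is going on.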
Two types of errors are possible in any hypothesis test.
| | \(H_0\) is True | \(H_0\) is False |
|---|---|---|
| Reject \(H_0\) | Type I Error (False Positive) | Correct ✓ |
| Fail to reject \(H_0\) | Correct ✓ | Type II Error (False Negative) |
A Type I error means we conclude there is an effect when there is none. Its probability is exactly \(\alpha\) — the significance level we chose. A Type II error means we miss a real effect. Its probability is \(\beta\), and \(1 - \beta\) is called the statistical power of the test.
Lowering \(\alpha\) reduces Type I errors but increases Type II errors for the same sample size. The only way to reduce both simultaneously is a larger sample.
The right tradeoff depends on the context. In drug trials, a false positive — approving an ineffective drug — is catastrophic, so \(\alpha = 0.01\) or lower is appropriate. In a marketing A/B test, a false negative (missing a profitable email variant) may cost more than a false positive, so \(\alpha = 0.05\) or even 0.10 may be acceptable. The important thing is to decide this before seeing the data.
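Both error rates can be observed directly by simulation. The sketch below (distribution parameters are assumed for illustration) runs many experiments where \(H_0\) is true and counts false positives, then many where a real 5-unit difference exists and counts misses:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_sims, alpha = 50, 2000, 0.05

# Type I error rate: H0 is true, both groups share the same distribution.
false_pos = sum(
    stats.ttest_ind(rng.normal(100, 15, n), rng.normal(100, 15, n),
                    equal_var=False).pvalue < alpha
    for _ in range(n_sims)
)

# Type II error rate: H1 is true, the means differ by 5 units.
misses = sum(
    stats.ttest_ind(rng.normal(100, 15, n), rng.normal(105, 15, n),
                    equal_var=False).pvalue >= alpha
    for _ in range(n_sims)
)

print(f'Type I error rate (H0 true):  {false_pos / n_sims:.3f}  (targets alpha = {alpha})')
print(f'Type II error rate (H1 true): {misses / n_sims:.3f}  (1 - power)')
```

The Type I rate lands near the chosen \(\alpha\), while the Type II rate here is large — with \(n = 50\) per group, this modest effect is badly underpowered, which previews the sample size planning section later in the chapter.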
The one-sample t-test asks whether the mean of a sample is significantly different from a known reference value.
Consider a factory that claims its bolts have a mean diameter of 10 mm. We measure 30 bolts and want to know whether the production process is off-target. The test statistic is:
\[t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}\]
where \(\bar{x}\) is the sample mean, \(\mu_0\) the hypothesized mean, \(s\) the sample standard deviation, and \(n\) the sample size. This test is appropriate when the data is roughly normally distributed, or when \(n > 30\) (by the Central Limit Theorem).
# One-sample t-test: are bolt diameters on target?
# H0: mean = 10.0 mm; H1: mean != 10.0 mm (two-tailed)
bolt_diameters = np.random.normal(loc=10.15, scale=0.4, size=30)
t_stat, p_value = stats.ttest_1samp(bolt_diameters, popmean=10.0)
print(f'Sample mean: {bolt_diameters.mean():.3f} mm')
print(f't-statistic: {t_stat:.3f}')
print(f'p-value: {p_value:.4f}')
print()
alpha = 0.05
if p_value < alpha:
    print(f'Reject H0 (p={p_value:.4f} < {alpha}). Mean is significantly different from 10 mm.')
else:
    print(f'Fail to reject H0 (p={p_value:.4f}). Insufficient evidence of a difference.')
ci = stats.t.interval(0.95, df=len(bolt_diameters)-1,
                      loc=bolt_diameters.mean(), scale=stats.sem(bolt_diameters))
print(f'95% CI: ({ci[0]:.3f}, {ci[1]:.3f}) mm')

The two-sample t-test asks whether two independent groups have different means — the most common test in A/B testing.
Consider Group A (old checkout flow) and Group B (redesigned flow). Did Group B spend more? Passing equal_var=False to scipy's ttest_ind runs Welch's t-test, which does not assume equal variances — this is the safer choice. Note that scipy's default is equal_var=True, the pooled Student's t-test; use it only if you have confirmed equal variances with Levene's test. When the same subjects appear in both conditions (e.g., before and after treatment), use stats.ttest_rel instead.
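That variance check takes one line. The sketch below uses freshly simulated data (the parameters, including the deliberately larger spread in the second group, are assumptions chosen to make the unequal-variance case visible):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Simulated purchase amounts; the spread difference is an assumption
# chosen so that Levene's test has something to detect.
amounts_a = rng.normal(loc=85, scale=25, size=200)
amounts_b = rng.normal(loc=92, scale=50, size=200)

# Levene's test: H0 is that the two groups have equal variances.
stat, p = stats.levene(amounts_a, amounts_b)
print(f'Levene statistic: {stat:.3f}, p-value: {p:.4f}')
if p < 0.05:
    print('Variances differ; stick with Welch (equal_var=False).')
else:
    print('No evidence of unequal variances; a pooled t-test is defensible.')
```

In practice, defaulting to Welch costs almost nothing when the variances happen to be equal, so the test is mainly a diagnostic rather than a gatekeeper.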
# Two-sample t-test: A/B test on purchase amounts
group_a = np.random.normal(loc=85, scale=25, size=200) # control
group_b = np.random.normal(loc=92, scale=28, size=200) # treatment
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False) # Welch's
print(f'Group A mean: ${group_a.mean():.2f} (n={len(group_a)})')
print(f'Group B mean: ${group_b.mean():.2f} (n={len(group_b)})')
print(f'Difference: ${group_b.mean() - group_a.mean():.2f}')
print(f't-statistic: {t_stat:.3f}')
print(f'p-value: {p_value:.4f}')
print()
if p_value < 0.05:
    print(f'Reject H0: Group B mean is significantly different from Group A (p={p_value:.4f}).')
else:
    print(f'Fail to reject H0: No significant difference detected (p={p_value:.4f}).')
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(group_a, bins=30, alpha=0.6, label=f'Group A (mean=${group_a.mean():.0f})', color='steelblue')
ax.hist(group_b, bins=30, alpha=0.6, label=f'Group B (mean=${group_b.mean():.0f})', color='darkorange')
ax.axvline(group_a.mean(), color='steelblue', linestyle='--', lw=2)
ax.axvline(group_b.mean(), color='darkorange', linestyle='--', lw=2)
ax.set_xlabel('Purchase Amount ($)')
ax.set_ylabel('Count')
ax.set_title(f'A/B Test: Purchase Amounts (p={p_value:.4f})')
ax.legend()
plt.tight_layout()
plt.show()

When the outcome is a category — clicked or did not click, converted or did not convert — the t-test does not apply. We use the chi-square test of independence instead.
The test compares the observed frequency in each cell of a contingency table to the expected frequency if the two variables were independent:
\[\chi^2 = \sum \frac{(O - E)^2}{E}\]
The assumption is that expected counts in each cell are at least 5. If the cells are smaller than that, Fisher’s exact test is more appropriate.
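As a minimal sketch of that small-cell case (the counts below are invented for illustration): a 40-person pilot whose expected clicked counts per row are 4.5, under the rule of thumb of 5, so Fisher's exact test is the right tool.

```python
import numpy as np
from scipy import stats

# A small pilot (counts invented for illustration). Expected clicked counts
# per row are 20 * 9 / 40 = 4.5, below the chi-square rule of thumb of 5.
pilot = np.array([
    [2, 18],  # Variant A: 2 clicked, 18 did not
    [7, 13],  # Variant B: 7 clicked, 13 did not
])
odds_ratio, p_value = stats.fisher_exact(pilot, alternative='two-sided')
print(f'Sample odds ratio: {odds_ratio:.3f}')
print(f"Fisher's exact p-value: {p_value:.4f}")
```

When the cells are large enough, Fisher's exact and chi-square agree closely; the exact test simply remains valid when chi-square's approximation breaks down.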
# Chi-square test: email variant vs. click-through
observed = np.array([
    [120, 380],  # Variant A: 120 clicked, 380 did not
    [155, 345],  # Variant B: 155 clicked, 345 did not
])
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
df_obs = pd.DataFrame(observed, index=['Variant A', 'Variant B'],
columns=['Clicked', 'Not Clicked'])
print(df_obs)
print(f'Click rate A: {120/500:.1%}, Click rate B: {155/500:.1%}')
print()
print(f'Chi-square: {chi2:.3f}, df: {dof}, p-value: {p_value:.4f}')
print()
if p_value < 0.05:
    print('Reject H0: Variant significantly affects click-through rate.')
else:
    print('Fail to reject H0: No significant difference in click rates.')

The t-test assumes approximate normality. When that assumption is violated — small samples, ordinal data, or heavily skewed distributions — non-parametric alternatives make no distributional assumptions and are more robust.
The most common substitute for the two-sample t-test is the Mann-Whitney U test, which compares the rank orderings of two samples rather than their means. It is particularly appropriate for ordinal data such as satisfaction ratings.
| Parametric | Non-Parametric Alternative | Use When |
|---|---|---|
| One-sample t-test | Wilcoxon signed-rank test | Small sample, non-normal |
| Two-sample t-test | Mann-Whitney U test | Ordinal data, skewed distributions |
| Paired t-test | Wilcoxon signed-rank (paired) | Paired non-normal data |
| One-way ANOVA | Kruskal-Wallis test | 3+ groups, non-normal |
The trade-off is that non-parametric tests are slightly less powerful than their parametric counterparts when the normality assumption does hold. For small samples (\(n < 50\)), it is worth testing normality first with scipy.stats.shapiro.
# Mann-Whitney U test: satisfaction scores (ordinal 1-10) — t-test not appropriate
scores_old = np.array([5,6,7,5,8,6,4,7,6,5,7,8,6,5,7,6,5,6,7,8,5,6,4,7,6,5,6,7,6,5])
scores_new = np.array([7,8,7,9,8,7,8,9,7,8,9,7,8,8,9,7,8,7,9,8,8,7,9,8,7,8,9,8,7,8])
_, p_norm_old = stats.shapiro(scores_old)
_, p_norm_new = stats.shapiro(scores_new)
print(f'Shapiro-Wilk p (old group): {p_norm_old:.4f}')
print(f'Shapiro-Wilk p (new group): {p_norm_new:.4f}')
print('(p < 0.05 suggests non-normality)')
print()
u_stat, p_value = stats.mannwhitneyu(scores_old, scores_new, alternative='two-sided')
print(f'Median (old): {np.median(scores_old):.1f}')
print(f'Median (new): {np.median(scores_new):.1f}')
print(f'Mann-Whitney U: {u_stat:.1f}, p-value: {p_value:.6f}')
print()
if p_value < 0.05:
    print('The new design produces significantly higher satisfaction scores.')

Running 20 independent tests at \(\alpha = 0.05\) means we would expect approximately one false positive by chance alone, even if none of the treatments have any real effect. The probability of at least one false positive across \(k\) independent tests is:
\[P(\text{at least one false positive}) = 1 - (1 - \alpha)^k\]
For 20 tests: \(1 - 0.95^{20} \approx 64\%\). This is why running many tests and cherry-picking the one with \(p < 0.05\) is misleading.
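The formula is easy to tabulate directly, and the numbers escalate quickly:

```python
# Family-wise error rate: P(at least one false positive) = 1 - (1 - alpha)^k
# across k independent tests, each run at alpha = 0.05.
alpha = 0.05
for k in [1, 5, 10, 20, 50]:
    fwer = 1 - (1 - alpha) ** k
    print(f'k={k:3d} tests -> P(at least one false positive) = {fwer:.1%}')
```

By 50 tests, a false positive is more likely than not even when every null hypothesis is true.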
Bonferroni correction divides \(\alpha\) by the number of tests: \(\alpha_{\text{adj}} = \alpha / k\). It is conservative — appropriate when any false positive is costly.
Benjamini-Hochberg (FDR) controls the expected proportion of false discoveries among all rejected null hypotheses. It is less conservative and preferred when testing many hypotheses — feature importance rankings, dashboard metrics, genomics.
from statsmodels.stats.multitest import multipletests
# Simulate 20 p-values: 18 nulls + 2 genuine effects
null_pvals = np.random.uniform(0, 1, 18)
effect_pvals = np.array([0.003, 0.012])
all_pvals = np.concatenate([null_pvals, effect_pvals])
print(f'p-values < 0.05 before any correction: {(all_pvals < 0.05).sum()}')
print()
reject_bonf, pvals_bonf, _, _ = multipletests(all_pvals, alpha=0.05, method='bonferroni')
reject_bh, pvals_bh, _, _ = multipletests(all_pvals, alpha=0.05, method='fdr_bh')
print(f'After Bonferroni: {reject_bonf.sum()} significant')
print(f'After Benjamini-Hochberg: {reject_bh.sum()} significant')
print()
print('Last 4 results (the 2 genuine effects are at position 18-19):')
pd.DataFrame({
    'raw_p': all_pvals[-4:].round(4),
    'bonf_p': pvals_bonf[-4:].round(4),
    'bh_p': pvals_bh[-4:].round(4),
    'reject_bonf': reject_bonf[-4:],
    'reject_bh': reject_bh[-4:]
})

A result can be statistically significant without being practically meaningful. With a large enough sample, even a 0.01% difference in conversion rate will produce \(p < 0.05\). Effect size measures the magnitude of an effect, independent of sample size, and is the quantity that actually informs business decisions.
Cohen’s d is the standard effect size for comparing two means:
\[d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}}\]
A d of 0.2 is considered small, 0.5 medium, and 0.8 large. For proportions, Cramér’s V serves the same role for chi-square tests.
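Cramér's V falls out of the chi-square statistic directly: \(V = \sqrt{\chi^2 / (n \cdot (\min(r, c) - 1))}\) for an \(r \times c\) table. A minimal sketch, reusing the email-variant counts from the chi-square example earlier in the chapter:

```python
import numpy as np
from scipy import stats

# Cramér's V for an r x c table: V = sqrt(chi2 / (n * (min(r, c) - 1))).
# Counts reuse the email-variant example from the chi-square section.
observed = np.array([
    [120, 380],
    [155, 345],
])
chi2, p, dof, expected = stats.chi2_contingency(observed)
n_total = observed.sum()
v = np.sqrt(chi2 / (n_total * (min(observed.shape) - 1)))
print(f'chi2 = {chi2:.3f}, p = {p:.4f}')
print(f"Cramer's V = {v:.3f}  (rough guide: 0.1 small, 0.3 medium, 0.5 large)")
```

Note the pattern: a significant p-value alongside a V below 0.1 — a real but small association, exactly the combination the next paragraph warns about.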
It is good practice to always report both the p-value and the effect size. A p < 0.001 result with Cohen’s d = 0.02 is statistically very strong but may not justify the cost of a product change.
def cohens_d(x1, x2):
    n1, n2 = len(x1), len(x2)
    # Use sample standard deviations (ddof=1); numpy's default ddof=0 would
    # slightly understate the pooled spread.
    s_pooled = np.sqrt(((n1-1)*x1.std(ddof=1)**2 + (n2-1)*x2.std(ddof=1)**2) / (n1+n2-2))
    return (x1.mean() - x2.mean()) / s_pooled
d = cohens_d(group_b, group_a)
label = ('negligible' if abs(d) < 0.2 else 'small' if abs(d) < 0.5
         else 'medium' if abs(d) < 0.8 else 'large')
print(f"Cohen's d = {d:.3f} ({label} effect)")
print(f'Mean difference: ${group_b.mean()-group_a.mean():.2f}')
print(f'As % of Group A: {(group_b.mean()-group_a.mean())/group_a.mean():.1%}')
print()
print('Statistical significance tells us the effect is real.')
print('Effect size tells us whether the effect is worth acting on.')

Statistical power is the probability of correctly detecting an effect when one exists (\(1 - \beta\)). A conventional target is 80% power.
Four quantities determine each other: sample size, effect size, significance level, and power. We usually solve for sample size given the other three. The effect size we plug in should be the minimum detectable effect — the smallest lift that would actually be worth acting on, not the lift we hope to see.
This step is done before starting the experiment, and the sample size calculation commits us to how long to run it. Stopping early because \(p < 0.05\) inflates the Type I error rate considerably — if we must monitor, sequential testing methods such as mSPRT handle this correctly.
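How bad is the inflation? The simulation sketch below (number of peeks and sample sizes are assumed parameters) runs experiments where \(H_0\) is true, peeks every 20 observations per group, and stops the moment \(p < 0.05\):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# H0 is true in every simulated experiment, but the analyst peeks every
# 20 observations per group and declares victory the moment p < 0.05.
n_sims, n_max, peek_every = 500, 400, 20
false_pos = 0
for _ in range(n_sims):
    a = rng.normal(0.0, 1.0, n_max)
    b = rng.normal(0.0, 1.0, n_max)
    for n in range(peek_every, n_max + 1, peek_every):
        if stats.ttest_ind(a[:n], b[:n], equal_var=False).pvalue < 0.05:
            false_pos += 1
            break

print(f'False positive rate with peeking: {false_pos / n_sims:.1%} (nominal: 5.0%)')
```

With 20 looks at the data, the realized Type I error rate is several times the nominal 5% — each peek is another chance for noise to cross the threshold.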
from statsmodels.stats.power import TTestIndPower
analysis = TTestIndPower()
print('Required n per group to detect effect at 80% power, alpha=0.05:')
for d_val in [0.2, 0.5, 0.8]:
    n = analysis.solve_power(effect_size=d_val, alpha=0.05, power=0.8, alternative='two-sided')
    print(f'  d={d_val}: n={n:.0f}')
# Power curve
ns = np.arange(10, 600, 10)
pw_s = [analysis.solve_power(effect_size=0.2, alpha=0.05, nobs1=n, alternative='two-sided') for n in ns]
pw_m = [analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=n, alternative='two-sided') for n in ns]
pw_l = [analysis.solve_power(effect_size=0.8, alpha=0.05, nobs1=n, alternative='two-sided') for n in ns]
fig, ax = plt.subplots(figsize=(9, 5))
ax.plot(ns, pw_s, label='Small (d=0.2)', color='tomato')
ax.plot(ns, pw_m, label='Medium (d=0.5)', color='goldenrod')
ax.plot(ns, pw_l, label='Large (d=0.8)', color='steelblue')
ax.axhline(0.80, linestyle='--', color='black', lw=1, label='80% target')
ax.set_xlabel('Sample Size per Group')
ax.set_ylabel('Statistical Power')
ax.set_title('Power Curves — Two-Sample t-Test ($\\alpha=0.05$)')
ax.legend()
ax.set_ylim(0, 1)
plt.tight_layout()
plt.show()

Let us put all of the above together. An e-commerce site is testing a new product page (Variant B) against the existing one (Variant A). The metric is conversion rate.
We start by defining the minimum detectable effect — the smallest lift worth rolling out the change for. We use this to calculate the required sample size, commit to that number, run the experiment until we reach it, then analyze.
from statsmodels.stats.proportion import proportions_ztest, proportion_effectsize
from statsmodels.stats.power import NormalIndPower
baseline_rate = 0.05 # current 5% conversion
mde = 0.01 # minimum detectable effect: 1 pp lift
es = proportion_effectsize(baseline_rate + mde, baseline_rate)
n = int(np.ceil(NormalIndPower().solve_power(effect_size=es, alpha=0.05, power=0.80)))
print(f'Required sample per group: {n}')
print()
# Simulate experiment
conv_a = np.random.binomial(1, 0.050, n)
conv_b = np.random.binomial(1, 0.062, n)
rate_a, rate_b = conv_a.mean(), conv_b.mean()
z_stat, p_value = proportions_ztest([conv_b.sum(), conv_a.sum()], [n, n])
print(f'Variant A: {conv_a.sum()}/{n} = {rate_a:.2%}')
print(f'Variant B: {conv_b.sum()}/{n} = {rate_b:.2%}')
print(f'Absolute lift: {rate_b-rate_a:.2%} Relative: {(rate_b-rate_a)/rate_a:.1%}')
print(f'z={z_stat:.3f}, p={p_value:.4f}')
print()
monthly_visitors = 50_000
lift_revenue = (rate_b - rate_a) * monthly_visitors * 75 # avg order $75
if p_value < 0.05:
    print(f'Statistically significant. Estimated additional revenue/month: ${lift_revenue:,.0f}')
    print('Recommendation: roll out Variant B.')
else:
    print(f'Not significant (p={p_value:.4f}). Do not roll out.')

Peeking and stopping early. Continuously monitoring a test and stopping when \(p < 0.05\) dramatically inflates the false positive rate. Commit to the sample size before starting. If real-time monitoring is required, use sequential testing methods.
P-hacking. Testing many hypotheses and reporting only the significant ones is a form of cherry-picking. Apply multiple testing corrections, and pre-register hypotheses before collecting data when possible.
Ignoring effect size. A p < 0.001 result with Cohen’s d = 0.02 is statistically very strong but practically worthless. Always report effect sizes alongside p-values, and always report the absolute lift as well as the relative lift.
Using the wrong test. Ordinal ratings with a t-test, paired data with an independent t-test, small cell counts with chi-square — each of these introduces bias. Check the assumptions before running the test, not after.
Treating non-significance as proof of no effect. \(p = 0.4\) does not mean no effect exists. It means we did not find sufficient evidence. The study may simply be underpowered. Report the confidence interval. A wide interval that includes zero is much more informative than a bare p-value.
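A minimal sketch of that reporting practice (the conversion counts below are illustrative assumptions, not data from the chapter's experiment) — a Wald 95% confidence interval for the difference between two conversion rates:

```python
import numpy as np

# Wald 95% CI for a difference in conversion rates.
# Counts are illustrative assumptions, not the chapter's experiment data.
x_a, n_a = 250, 5000   # control:   5.0% conversion
x_b, n_b = 280, 5000   # treatment: 5.6% conversion
p_a, p_b = x_a / n_a, x_b / n_b
diff = p_b - p_a
se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(f'Observed lift: {diff:.2%}')
print(f'95% CI for the lift: ({lo:.2%}, {hi:.2%})')
```

Here the interval includes zero, so the result is not significant at the 5% level — but it also shows the plausible lift runs up to about 1.5 percentage points, which tells the business far more than "p > 0.05" alone.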