How to Perform a Two-Proportion Z-Test in Python
Key Insights
- The two-proportion z-test compares conversion rates, success rates, or any binary outcomes between two groups—making it the statistical backbone of A/B testing
- Use `statsmodels.stats.proportion.proportions_ztest()` for production code, but understand the manual calculation to debug edge cases and validate results
- Always check your assumptions: you need at least 10 successes and 10 failures in each group, and observations must be independent
Introduction & Use Cases
The two-proportion z-test answers a simple question: are these two proportions meaningfully different, or is the difference just noise? You’ll reach for this test constantly in product analytics and experimentation.
Common scenarios include:
- A/B testing: Did the new checkout flow increase conversion rate?
- Survey analysis: Do men and women respond differently to a question?
- Quality control: Does the defect rate differ between two manufacturing lines?
- Marketing: Which email subject line gets more clicks?
The test compares two independent groups, each with a binary outcome (converted/didn’t convert, clicked/didn’t click, yes/no). You’re testing whether the population proportions are equal.
- Null hypothesis (H₀): p₁ = p₂ (no difference between proportions)
- Alternative hypothesis (H₁): p₁ ≠ p₂ (two-tailed) or p₁ > p₂ / p₁ < p₂ (one-tailed)
Statistical Foundation
The z-statistic measures how many standard errors the observed difference is from zero. Here’s the formula:
z = (p̂₁ - p̂₂) / √[p̂(1 - p̂)(1/n₁ + 1/n₂)]
Where:
- p̂₁ and p̂₂ are the sample proportions
- p̂ is the pooled proportion: (x₁ + x₂) / (n₁ + n₂)
- n₁ and n₂ are the sample sizes
- x₁ and x₂ are the number of successes in each group
The pooled proportion assumes the null hypothesis is true (both groups have the same underlying rate), giving us the best estimate of that shared rate.
Assumptions to check:
- Independence: Observations within and between groups are independent
- Sample size: Each group needs np ≥ 10 and n(1-p) ≥ 10 (at least 10 successes and 10 failures)
- Random sampling: Samples are randomly selected from their populations
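The sample-size condition is easy to verify in code before running the test. Here's a small helper (the function name is mine, not from any library):

```python
def check_ztest_assumptions(x1, n1, x2, n2, minimum=10):
    """Return True if each group has at least `minimum` successes and failures."""
    counts = {
        'group 1 successes': x1,
        'group 1 failures': n1 - x1,
        'group 2 successes': x2,
        'group 2 failures': n2 - x2,
    }
    too_small = [name for name, count in counts.items() if count < minimum]
    if too_small:
        print(f"Warning: too few {', '.join(too_small)} (need >= {minimum} each)")
    return not too_small

print(check_ztest_assumptions(145, 1000, 120, 1000))  # True: all counts well above 10
```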
Let’s implement this manually:
```python
import numpy as np
from scipy import stats

def two_proportion_ztest_manual(x1, n1, x2, n2):
    """
    Calculate z-statistic and p-value for two-proportion z-test.

    Parameters:
    -----------
    x1, x2 : int
        Number of successes in each group
    n1, n2 : int
        Sample sizes for each group

    Returns:
    --------
    z_stat : float
        The z-statistic
    p_value : float
        Two-tailed p-value
    """
    # Sample proportions
    p1 = x1 / n1
    p2 = x2 / n2

    # Pooled proportion under null hypothesis
    p_pooled = (x1 + x2) / (n1 + n2)

    # Standard error
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n1 + 1/n2))

    # Z-statistic
    z_stat = (p1 - p2) / se

    # Two-tailed p-value
    p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))

    return z_stat, p_value

# Example: A/B test results
# Control: 120 conversions out of 1000
# Treatment: 145 conversions out of 1000
z, p = two_proportion_ztest_manual(145, 1000, 120, 1000)
print(f"Z-statistic: {z:.4f}")
print(f"P-value: {p:.4f}")
```
Output:

```
Z-statistic: 1.6489
P-value: 0.0992
```
Using statsmodels for the Z-Test
For production code, use statsmodels. It handles edge cases and provides additional functionality:
```python
from statsmodels.stats.proportion import proportions_ztest

# Data setup
# successes: array of success counts [treatment, control]
# nobs: array of total observations [treatment, control]
successes = np.array([145, 120])
nobs = np.array([1000, 1000])

# Perform the test
z_stat, p_value = proportions_ztest(
    count=successes,
    nobs=nobs,
    alternative='two-sided'  # or 'larger', 'smaller'
)

print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_value:.4f}")
```
The `alternative` parameter controls the hypothesis:
- 'two-sided': H₁: p₁ ≠ p₂
- 'larger': H₁: p₁ > p₂
- 'smaller': H₁: p₁ < p₂
For A/B tests where you specifically want to know if treatment beats control, use 'larger' with treatment as the first proportion.
```python
# One-tailed test: Is treatment conversion rate higher?
z_stat, p_value = proportions_ztest(
    count=successes,
    nobs=nobs,
    alternative='larger'
)
print(f"One-tailed p-value: {p_value:.4f}")
```
Output:

```
One-tailed p-value: 0.0496
```
Notice the one-tailed p-value is exactly half the two-tailed value (this holds whenever the observed difference falls in the hypothesized direction). It matters for your conclusion: at α = 0.05, the one-tailed test crosses the significance threshold while the two-tailed test does not.
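The halving is just geometry of the normal curve, which you can confirm with scipy's survival function (recomputing the pooled z-statistic for the 145/1000 vs. 120/1000 example):

```python
import numpy as np
from scipy import stats

# Pooled z-statistic for 145/1000 vs. 120/1000
p_pooled = (145 + 120) / 2000
se = np.sqrt(p_pooled * (1 - p_pooled) * (1/1000 + 1/1000))
z = (145/1000 - 120/1000) / se

p_one = stats.norm.sf(z)      # upper-tail area only
p_two = 2 * stats.norm.sf(z)  # both tails
print(f"one-tailed: {p_one:.4f}, two-tailed: {p_two:.4f}")
```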
Interpreting Results
The p-value represents the probability of observing a difference this extreme (or more extreme) if the null hypothesis were true. It does not tell you the probability that the null hypothesis is true.
Decision framework:
- If p-value < α (typically 0.05), reject the null hypothesis
- If p-value ≥ α, fail to reject the null hypothesis
Here’s a function that produces actionable interpretations:
```python
def interpret_proportion_test(x1, n1, x2, n2, alpha=0.05, alternative='two-sided'):
    """
    Perform and interpret a two-proportion z-test.

    Returns a dictionary with test results and plain-English interpretation.
    """
    successes = np.array([x1, x2])
    nobs = np.array([n1, n2])

    z_stat, p_value = proportions_ztest(
        count=successes,
        nobs=nobs,
        alternative=alternative
    )

    p1 = x1 / n1
    p2 = x2 / n2
    diff = p1 - p2
    relative_lift = (p1 - p2) / p2 * 100 if p2 > 0 else float('inf')

    # Check assumptions
    assumptions_met = all([
        x1 >= 10, (n1 - x1) >= 10,
        x2 >= 10, (n2 - x2) >= 10
    ])

    significant = p_value < alpha

    if significant:
        if diff > 0:
            conclusion = f"Group 1 ({p1:.2%}) significantly outperforms Group 2 ({p2:.2%})"
        else:
            conclusion = f"Group 2 ({p2:.2%}) significantly outperforms Group 1 ({p1:.2%})"
    else:
        conclusion = f"No significant difference detected between {p1:.2%} and {p2:.2%}"

    return {
        'z_statistic': z_stat,
        'p_value': p_value,
        'significant': significant,
        'group1_rate': p1,
        'group2_rate': p2,
        'absolute_difference': diff,
        'relative_lift_pct': relative_lift,
        'assumptions_met': assumptions_met,
        'conclusion': conclusion
    }

# Usage
result = interpret_proportion_test(145, 1000, 120, 1000)
for key, value in result.items():
    print(f"{key}: {value}")
```
Common pitfalls to avoid:
- Multiple comparisons: Testing many variants inflates false positive rates. Use Bonferroni correction or control the false discovery rate.
- Peeking at results: Don’t stop the test early when you see significance. Commit to a sample size upfront.
- Ignoring practical significance: A 0.1% improvement might be statistically significant with large samples but worthless in practice.
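Committing to a sample size upfront means computing it before launch. Here's a sketch of the classical calculation for two proportions (the `required_sample_size` helper and the 3.2% → 3.8% targets are mine, for illustration):

```python
import numpy as np
from scipy import stats

def required_sample_size(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for a one-tailed two-proportion z-test.

    Classical formula: pooled variance under H0, unpooled under H1.
    """
    p_bar = (p1 + p2) / 2
    z_alpha = stats.norm.ppf(1 - alpha)  # one-tailed critical value
    z_beta = stats.norm.ppf(power)       # quantile for desired power
    numerator = (z_alpha * np.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return int(np.ceil(numerator / (p1 - p2) ** 2))

# Visitors per group needed to detect a lift from 3.2% to 3.8%
print(required_sample_size(0.038, 0.032))  # roughly 11,600 per group
```

Small lifts on low base rates demand surprisingly large samples, which is exactly why peeking early is so tempting and so dangerous.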
Confidence Intervals for the Difference
P-values tell you whether there’s a difference. Confidence intervals tell you how big it might be:
```python
def proportion_difference_ci(x1, n1, x2, n2, alpha=0.05):
    """
    Calculate confidence interval for the difference between two proportions.

    Uses the Wald (normal-approximation) method with an unpooled standard error.
    """
    p1 = x1 / n1
    p2 = x2 / n2
    diff = p1 - p2

    # Standard error for the difference (unpooled)
    se = np.sqrt((p1 * (1 - p1) / n1) + (p2 * (1 - p2) / n2))

    # Critical value
    z_crit = stats.norm.ppf(1 - alpha / 2)

    # Confidence interval
    ci_lower = diff - z_crit * se
    ci_upper = diff + z_crit * se

    return diff, (ci_lower, ci_upper)

# Calculate 95% CI
diff, ci = proportion_difference_ci(145, 1000, 120, 1000)
print(f"Difference: {diff:.4f} ({diff*100:.2f} percentage points)")
print(f"95% CI: [{ci[0]:.4f}, {ci[1]:.4f}]")
print(f"95% CI: [{ci[0]*100:.2f}pp, {ci[1]*100:.2f}pp]")
```
Output:

```
Difference: 0.0250 (2.50 percentage points)
95% CI: [-0.0047, 0.0547]
95% CI: [-0.47pp, 5.47pp]
```
The confidence interval spans zero, which aligns with our non-significant two-tailed result. If zero isn’t in the interval, the difference is significant at that confidence level.
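If your statsmodels version has it (0.12+), `confint_proportions_2indep` computes the same kind of interval with a choice of methods; the `'wald'` option below should closely match the manual calculation, while the default Newcombe method is generally better behaved for small samples:

```python
from statsmodels.stats.proportion import confint_proportions_2indep

# Difference in proportions: treatment (145/1000) minus control (120/1000)
low, upp = confint_proportions_2indep(
    145, 1000,  # count1, nobs1
    120, 1000,  # count2, nobs2
    method='wald',
    compare='diff',
    alpha=0.05
)
print(f"95% CI (statsmodels, Wald): [{low:.4f}, {upp:.4f}]")
```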
Practical Example: E-commerce Conversion Test
Let’s work through a complete A/B test analysis. An e-commerce company tested a redesigned product page:
```python
import pandas as pd
import matplotlib.pyplot as plt

# Simulated A/B test data
np.random.seed(42)
control_visitors = 5000
treatment_visitors = 5000
control_conversion_rate = 0.032
treatment_conversion_rate = 0.038

# Generate data
data = pd.DataFrame({
    'variant': ['control'] * control_visitors + ['treatment'] * treatment_visitors,
    'converted': (
        list(np.random.binomial(1, control_conversion_rate, control_visitors)) +
        list(np.random.binomial(1, treatment_conversion_rate, treatment_visitors))
    )
})

# Aggregate results
summary = data.groupby('variant').agg(
    visitors=('converted', 'count'),
    conversions=('converted', 'sum'),
    rate=('converted', 'mean')
).round(4)
print("Test Summary:")
print(summary)
print()

# Extract values for the test
control = summary.loc['control']
treatment = summary.loc['treatment']

# Perform the z-test
successes = np.array([int(treatment['conversions']), int(control['conversions'])])
nobs = np.array([int(treatment['visitors']), int(control['visitors'])])
z_stat, p_value = proportions_ztest(successes, nobs, alternative='larger')

# Calculate confidence interval
diff, ci = proportion_difference_ci(
    int(treatment['conversions']), int(treatment['visitors']),
    int(control['conversions']), int(control['visitors'])
)

# Results
print(f"Control conversion rate: {control['rate']:.2%}")
print(f"Treatment conversion rate: {treatment['rate']:.2%}")
print(f"Absolute lift: {diff*100:.2f} percentage points")
print(f"Relative lift: {(treatment['rate'] - control['rate']) / control['rate'] * 100:.1f}%")
print(f"\nZ-statistic: {z_stat:.4f}")
print(f"P-value (one-tailed): {p_value:.4f}")
print(f"95% CI for difference: [{ci[0]*100:.2f}pp, {ci[1]*100:.2f}pp]")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Bar chart of conversion rates
colors = ['#2ecc71' if p_value < 0.05 else '#3498db', '#95a5a6']
axes[0].bar(['Treatment', 'Control'],
            [treatment['rate'] * 100, control['rate'] * 100],
            color=colors, edgecolor='black', linewidth=1.2)
axes[0].set_ylabel('Conversion Rate (%)')
axes[0].set_title('Conversion Rates by Variant')
axes[0].set_ylim(0, max(treatment['rate'], control['rate']) * 100 * 1.3)
for i, rate in enumerate([treatment['rate'], control['rate']]):
    axes[0].text(i, rate * 100 + 0.1, f'{rate:.2%}', ha='center', fontweight='bold')

# Confidence interval visualization
axes[1].errorbar(
    x=[diff * 100], y=[0.5],
    xerr=[[diff * 100 - ci[0] * 100], [ci[1] * 100 - diff * 100]],
    fmt='o', markersize=10, capsize=5, capthick=2, color='#2ecc71'
)
axes[1].axvline(x=0, color='red', linestyle='--', label='No difference')
axes[1].set_xlabel('Difference in Conversion Rate (percentage points)')
axes[1].set_title('95% Confidence Interval for Treatment Effect')
axes[1].set_ylim(0, 1)
axes[1].set_yticks([])
axes[1].legend()

plt.tight_layout()
plt.savefig('ab_test_results.png', dpi=150, bbox_inches='tight')
plt.show()

# Final recommendation
alpha = 0.05
if p_value < alpha:
    print(f"\n✅ RECOMMENDATION: Deploy treatment. The improvement is statistically significant (p={p_value:.4f}).")
else:
    print(f"\n⚠️ RECOMMENDATION: Continue testing or keep control. Results not significant (p={p_value:.4f}).")
```
This workflow gives you everything needed to make a data-driven decision: the raw numbers, statistical test, confidence interval, and visualization. The one-tailed test is appropriate here because we specifically want to know if the treatment is better—we wouldn’t ship it if it were significantly worse.
The two-proportion z-test is foundational for anyone working with conversion data. Master it, understand its assumptions, and you’ll have a reliable tool for making product decisions backed by evidence.