How to Perform a Two-Proportion Z-Test in Python
Key Insights
- The two-proportion z-test compares conversion rates, success rates, or any binary outcomes between two groups—making it the statistical backbone of A/B testing
- Use `statsmodels.stats.proportion.proportions_ztest()` for production code, but understand the manual calculation to debug edge cases and validate results
- Always check your assumptions: you need at least 10 successes and 10 failures in each group, and observations must be independent
Introduction & Use Cases
The two-proportion z-test answers a simple question: are these two proportions meaningfully different, or is the difference just noise? You’ll reach for this test constantly in product analytics and experimentation.
Common scenarios include:
- A/B testing: Did the new checkout flow increase conversion rate?
- Survey analysis: Do men and women respond differently to a question?
- Quality control: Does the defect rate differ between two manufacturing lines?
- Marketing: Which email subject line gets more clicks?
The test compares two independent groups, each with a binary outcome (converted/didn’t convert, clicked/didn’t click, yes/no). You’re testing whether the population proportions are equal.
- Null hypothesis (H₀): p₁ = p₂ (no difference between proportions)
- Alternative hypothesis (H₁): p₁ ≠ p₂ (two-tailed) or p₁ > p₂ / p₁ < p₂ (one-tailed)
Statistical Foundation
The z-statistic measures how many standard errors the observed difference is from zero. Here’s the formula:
z = (p̂₁ - p̂₂) / √[p̂(1 - p̂)(1/n₁ + 1/n₂)]
Where:
- p̂₁ and p̂₂ are the sample proportions
- p̂ is the pooled proportion: (x₁ + x₂) / (n₁ + n₂)
- n₁ and n₂ are the sample sizes
- x₁ and x₂ are the number of successes in each group
The pooled proportion assumes the null hypothesis is true (both groups have the same underlying rate), giving us the best estimate of that shared rate.
Assumptions to check:
- Independence: Observations within and between groups are independent
- Sample size: Each group needs np ≥ 10 and n(1-p) ≥ 10 (at least 10 successes and 10 failures)
- Random sampling: Samples are randomly selected from their populations
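The sample-size condition is easy to verify in code before running the test. Here's a small helper (the function name is mine, not from any library):

```python
def check_ztest_assumptions(x1, n1, x2, n2, minimum=10):
    """Return True if each group has at least `minimum` successes and failures."""
    counts = {
        'group 1 successes': x1,
        'group 1 failures': n1 - x1,
        'group 2 successes': x2,
        'group 2 failures': n2 - x2,
    }
    too_small = [name for name, count in counts.items() if count < minimum]
    if too_small:
        print(f"Warning: too few {', '.join(too_small)} (need >= {minimum} each)")
    return not too_small

print(check_ztest_assumptions(145, 1000, 120, 1000))  # True: all counts well above 10
```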
Let’s implement this manually:
```python
import numpy as np
from scipy import stats

def two_proportion_ztest_manual(x1, n1, x2, n2):
    """
    Calculate z-statistic and p-value for two-proportion z-test.

    Parameters:
    -----------
    x1, x2 : int
        Number of successes in each group
    n1, n2 : int
        Sample sizes for each group

    Returns:
    --------
    z_stat : float
        The z-statistic
    p_value : float
        Two-tailed p-value
    """
    # Sample proportions
    p1 = x1 / n1
    p2 = x2 / n2

    # Pooled proportion under null hypothesis
    p_pooled = (x1 + x2) / (n1 + n2)

    # Standard error
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n1 + 1/n2))

    # Z-statistic
    z_stat = (p1 - p2) / se

    # Two-tailed p-value
    p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))

    return z_stat, p_value

# Example: A/B test results
# Control: 120 conversions out of 1000
# Treatment: 145 conversions out of 1000
z, p = two_proportion_ztest_manual(145, 1000, 120, 1000)
print(f"Z-statistic: {z:.4f}")
print(f"P-value: {p:.4f}")
```
Output:

```
Z-statistic: 1.6489
P-value: 0.0992
```
Using statsmodels for the Z-Test
For production code, use statsmodels. It handles edge cases and provides additional functionality:
```python
from statsmodels.stats.proportion import proportions_ztest

# Data setup
# successes: array of success counts [treatment, control]
# nobs: array of total observations [treatment, control]
successes = np.array([145, 120])
nobs = np.array([1000, 1000])

# Perform the test
z_stat, p_value = proportions_ztest(
    count=successes,
    nobs=nobs,
    alternative='two-sided'  # or 'larger', 'smaller'
)

print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_value:.4f}")
```
The `alternative` parameter controls the hypothesis:
- 'two-sided': H₁: p₁ ≠ p₂
- 'larger': H₁: p₁ > p₂
- 'smaller': H₁: p₁ < p₂
For A/B tests where you specifically want to know if treatment beats control, use 'larger' with treatment as the first proportion.
```python
# One-tailed test: Is treatment conversion rate higher?
z_stat, p_value = proportions_ztest(
    count=successes,
    nobs=nobs,
    alternative='larger'
)
print(f"One-tailed p-value: {p_value:.4f}")
```
Output:

```
One-tailed p-value: 0.0496
```
Notice the one-tailed p-value is exactly half the two-tailed value (this holds whenever the observed difference falls in the hypothesized direction). It matters for your conclusion: at α = 0.05, the one-tailed test crosses the significance threshold while the two-tailed test does not.
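The halving is just geometry of the normal curve, which you can confirm with scipy's survival function (recomputing the pooled z-statistic for the 145/1000 vs. 120/1000 example):

```python
import numpy as np
from scipy import stats

# Pooled z-statistic for 145/1000 vs. 120/1000
p_pooled = (145 + 120) / 2000
se = np.sqrt(p_pooled * (1 - p_pooled) * (1/1000 + 1/1000))
z = (145/1000 - 120/1000) / se

p_one = stats.norm.sf(z)      # upper-tail area only
p_two = 2 * stats.norm.sf(z)  # both tails
print(f"one-tailed: {p_one:.4f}, two-tailed: {p_two:.4f}")
```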
Interpreting Results
The p-value represents the probability of observing a difference this extreme (or more extreme) if the null hypothesis were true. It does not tell you the probability that the null hypothesis is true.
Decision framework:
- If p-value < α (typically 0.05), reject the null hypothesis
- If p-value ≥ α, fail to reject the null hypothesis
Here’s a function that produces actionable interpretations:
```python
def interpret_proportion_test(x1, n1, x2, n2, alpha=0.05, alternative='two-sided'):
    """
    Perform and interpret a two-proportion z-test.

    Returns a dictionary with test results and plain-English interpretation.
    """
    successes = np.array([x1, x2])
    nobs = np.array([n1, n2])

    z_stat, p_value = proportions_ztest(
        count=successes,
        nobs=nobs,
        alternative=alternative
    )

    p1 = x1 / n1
    p2 = x2 / n2
    diff = p1 - p2
    relative_lift = (p1 - p2) / p2 * 100 if p2 > 0 else float('inf')

    # Check assumptions
    assumptions_met = all([
        x1 >= 10, (n1 - x1) >= 10,
        x2 >= 10, (n2 - x2) >= 10
    ])

    significant = p_value < alpha

    if significant:
        if diff > 0:
            conclusion = f"Group 1 ({p1:.2%}) significantly outperforms Group 2 ({p2:.2%})"
        else:
            conclusion = f"Group 2 ({p2:.2%}) significantly outperforms Group 1 ({p1:.2%})"
    else:
        conclusion = f"No significant difference detected between {p1:.2%} and {p2:.2%}"

    return {
        'z_statistic': z_stat,
        'p_value': p_value,
        'significant': significant,
        'group1_rate': p1,
        'group2_rate': p2,
        'absolute_difference': diff,
        'relative_lift_pct': relative_lift,
        'assumptions_met': assumptions_met,
        'conclusion': conclusion
    }

# Usage
result = interpret_proportion_test(145, 1000, 120, 1000)
for key, value in result.items():
    print(f"{key}: {value}")
```
Common pitfalls to avoid:
- Multiple comparisons: Testing many variants inflates false positive rates. Use Bonferroni correction or control the false discovery rate.
- Peeking at results: Don’t stop the test early when you see significance. Commit to a sample size upfront.
- Ignoring practical significance: A 0.1% improvement might be statistically significant with large samples but worthless in practice.
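Committing to a sample size upfront means computing it before launch. Here's a sketch of the classical calculation for two proportions (the `required_sample_size` helper and the 3.2% → 3.8% targets are mine, for illustration):

```python
import numpy as np
from scipy import stats

def required_sample_size(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for a one-tailed two-proportion z-test.

    Classical formula: pooled variance under H0, unpooled under H1.
    """
    p_bar = (p1 + p2) / 2
    z_alpha = stats.norm.ppf(1 - alpha)  # one-tailed critical value
    z_beta = stats.norm.ppf(power)       # quantile for desired power
    numerator = (z_alpha * np.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return int(np.ceil(numerator / (p1 - p2) ** 2))

# Visitors per group needed to detect a lift from 3.2% to 3.8%
print(required_sample_size(0.038, 0.032))  # roughly 11,600 per group
```

Small lifts on low base rates demand surprisingly large samples, which is exactly why peeking early is so tempting and so dangerous.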
Confidence Intervals for the Difference
P-values tell you whether there’s a difference. Confidence intervals tell you how big it might be:
```python
def proportion_difference_ci(x1, n1, x2, n2, alpha=0.05):
    """
    Calculate confidence interval for the difference between two proportions.

    Uses the Wald (normal-approximation) method with an unpooled standard error.
    """
    p1 = x1 / n1
    p2 = x2 / n2
    diff = p1 - p2

    # Standard error for the difference (unpooled)
    se = np.sqrt((p1 * (1 - p1) / n1) + (p2 * (1 - p2) / n2))

    # Critical value
    z_crit = stats.norm.ppf(1 - alpha / 2)

    # Confidence interval
    ci_lower = diff - z_crit * se
    ci_upper = diff + z_crit * se

    return diff, (ci_lower, ci_upper)

# Calculate 95% CI
diff, ci = proportion_difference_ci(145, 1000, 120, 1000)
print(f"Difference: {diff:.4f} ({diff*100:.2f} percentage points)")
print(f"95% CI: [{ci[0]:.4f}, {ci[1]:.4f}]")
print(f"95% CI: [{ci[0]*100:.2f}pp, {ci[1]*100:.2f}pp]")
```
Output:

```
Difference: 0.0250 (2.50 percentage points)
95% CI: [-0.0047, 0.0547]
95% CI: [-0.47pp, 5.47pp]
```
The confidence interval spans zero, which aligns with our non-significant two-tailed result. If zero isn’t in the interval, the difference is significant at that confidence level.
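If your statsmodels version has it (0.12+), `confint_proportions_2indep` computes the same kind of interval with a choice of methods; the `'wald'` option below should closely match the manual calculation, while the default Newcombe method is generally better behaved for small samples:

```python
from statsmodels.stats.proportion import confint_proportions_2indep

# Difference in proportions: treatment (145/1000) minus control (120/1000)
low, upp = confint_proportions_2indep(
    145, 1000,  # count1, nobs1
    120, 1000,  # count2, nobs2
    method='wald',
    compare='diff',
    alpha=0.05
)
print(f"95% CI (statsmodels, Wald): [{low:.4f}, {upp:.4f}]")
```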
Practical Example: E-commerce Conversion Test
Let’s work through a complete A/B test analysis. An e-commerce company tested a redesigned product page:
```python
import pandas as pd
import matplotlib.pyplot as plt

# Simulated A/B test data
np.random.seed(42)
control_visitors = 5000
treatment_visitors = 5000
control_conversion_rate = 0.032
treatment_conversion_rate = 0.038

# Generate data
data = pd.DataFrame({
    'variant': ['control'] * control_visitors + ['treatment'] * treatment_visitors,
    'converted': (
        list(np.random.binomial(1, control_conversion_rate, control_visitors)) +
        list(np.random.binomial(1, treatment_conversion_rate, treatment_visitors))
    )
})

# Aggregate results
summary = data.groupby('variant').agg(
    visitors=('converted', 'count'),
    conversions=('converted', 'sum'),
    rate=('converted', 'mean')
).round(4)
print("Test Summary:")
print(summary)
print()

# Extract values for the test
control = summary.loc['control']
treatment = summary.loc['treatment']

# Perform the z-test
successes = np.array([int(treatment['conversions']), int(control['conversions'])])
nobs = np.array([int(treatment['visitors']), int(control['visitors'])])
z_stat, p_value = proportions_ztest(successes, nobs, alternative='larger')

# Calculate confidence interval
diff, ci = proportion_difference_ci(
    int(treatment['conversions']), int(treatment['visitors']),
    int(control['conversions']), int(control['visitors'])
)

# Results
print(f"Control conversion rate: {control['rate']:.2%}")
print(f"Treatment conversion rate: {treatment['rate']:.2%}")
print(f"Absolute lift: {diff*100:.2f} percentage points")
print(f"Relative lift: {(treatment['rate'] - control['rate']) / control['rate'] * 100:.1f}%")
print(f"\nZ-statistic: {z_stat:.4f}")
print(f"P-value (one-tailed): {p_value:.4f}")
print(f"95% CI for difference: [{ci[0]*100:.2f}pp, {ci[1]*100:.2f}pp]")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Bar chart of conversion rates
colors = ['#2ecc71' if p_value < 0.05 else '#3498db', '#95a5a6']
axes[0].bar(['Treatment', 'Control'],
            [treatment['rate'] * 100, control['rate'] * 100],
            color=colors, edgecolor='black', linewidth=1.2)
axes[0].set_ylabel('Conversion Rate (%)')
axes[0].set_title('Conversion Rates by Variant')
axes[0].set_ylim(0, max(treatment['rate'], control['rate']) * 100 * 1.3)
for i, rate in enumerate([treatment['rate'], control['rate']]):
    axes[0].text(i, rate * 100 + 0.1, f'{rate:.2%}', ha='center', fontweight='bold')

# Confidence interval visualization
axes[1].errorbar(
    x=[diff * 100], y=[0.5],
    xerr=[[diff * 100 - ci[0] * 100], [ci[1] * 100 - diff * 100]],
    fmt='o', markersize=10, capsize=5, capthick=2, color='#2ecc71'
)
axes[1].axvline(x=0, color='red', linestyle='--', label='No difference')
axes[1].set_xlabel('Difference in Conversion Rate (percentage points)')
axes[1].set_title('95% Confidence Interval for Treatment Effect')
axes[1].set_ylim(0, 1)
axes[1].set_yticks([])
axes[1].legend()

plt.tight_layout()
plt.savefig('ab_test_results.png', dpi=150, bbox_inches='tight')
plt.show()

# Final recommendation
alpha = 0.05
if p_value < alpha:
    print(f"\n✅ RECOMMENDATION: Deploy treatment. The improvement is statistically significant (p={p_value:.4f}).")
else:
    print(f"\n⚠️ RECOMMENDATION: Continue testing or keep control. Results not significant (p={p_value:.4f}).")
```
This workflow gives you everything needed to make a data-driven decision: the raw numbers, statistical test, confidence interval, and visualization. The one-tailed test is appropriate here because we specifically want to know if the treatment is better—we wouldn’t ship it if it were significantly worse.
The two-proportion z-test is foundational for anyone working with conversion data. Master it, understand its assumptions, and you’ll have a reliable tool for making product decisions backed by evidence.