# How to Calculate a Confidence Interval for a Proportion in Python
## Key Insights
- The Wilson score interval outperforms the traditional Wald interval for most real-world applications, especially with small samples or proportions near 0 or 1
- Python’s statsmodels.stats.proportion.proportion_confint() provides multiple methods in a single function call, making it the go-to tool for production code
- Always visualize confidence intervals when comparing proportions—overlapping CIs don’t necessarily mean no significant difference, but non-overlapping CIs guarantee one
## Introduction to Confidence Intervals for Proportions
Proportions are everywhere in software engineering and data analysis. Your A/B test shows a 3.2% conversion rate. Your survey indicates 68% of users prefer the new design. Your error rate sits at 0.5% over the last hour. These are all proportions—counts of successes divided by total observations.
But a single number tells an incomplete story. That 3.2% conversion rate came from 1,000 visitors. How confident are you that the true conversion rate isn’t actually 4% or 2.5%? This is where confidence intervals become essential.
A 95% confidence interval gives you a range where, if you repeated your experiment many times, 95% of the calculated intervals would contain the true population proportion. It’s not a probability statement about the parameter itself—it’s a statement about the reliability of your estimation procedure.
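The "repeated experiments" interpretation is easy to check empirically. The sketch below (my own addition, using a simple Wald interval and arbitrary example numbers) simulates many experiments with a known true proportion and counts how often the computed interval actually contains it:

```python
import numpy as np

# Simulate "many repeated experiments": draw binomial samples from a known
# true proportion, build a 95% Wald interval for each, and measure how
# often the interval contains the truth.
rng = np.random.default_rng(42)
true_p, n, trials = 0.032, 1000, 10_000
z = 1.96  # critical value for 95% confidence

successes = rng.binomial(n, true_p, size=trials)
p_hat = successes / n
margin = z * np.sqrt(p_hat * (1 - p_hat) / n)
covered = (p_hat - margin <= true_p) & (true_p <= p_hat + margin)

# For a well-behaved method this lands near 0.95 (Wald often runs a bit low)
print(f"Empirical coverage: {covered.mean():.3f}")
```

The fraction printed is a property of the procedure, not of any single interval, which is exactly what the confidence level promises.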
Let’s explore how to calculate these intervals in Python, from first principles to production-ready code.
## The Math Behind Proportion Confidence Intervals
### The Wald Interval (Normal Approximation)
The most common formula you’ll encounter is the Wald interval:
$$\hat{p} \pm z \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
Where $\hat{p}$ is your sample proportion, $n$ is your sample size, and $z$ is the critical value (1.96 for 95% confidence).
```python
import numpy as np
from scipy import stats

def wald_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Calculate Wald (normal approximation) confidence interval."""
    p_hat = successes / n
    z = stats.norm.ppf((1 + confidence) / 2)
    margin = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - margin, p_hat + margin)

# Example: 32 conversions out of 1000 visitors
lower, upper = wald_interval(32, 1000)
print(f"Wald 95% CI: ({lower:.4f}, {upper:.4f})")
# Output: Wald 95% CI: (0.0211, 0.0429)
```
### The Wilson Score Interval
The Wald interval has a dirty secret: it performs poorly when proportions are near 0 or 1, or when sample sizes are small. The Wilson score interval addresses these issues:
```python
def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Calculate Wilson score confidence interval."""
    p_hat = successes / n
    z = stats.norm.ppf((1 + confidence) / 2)
    z2 = z ** 2
    denominator = 1 + z2 / n
    center = (p_hat + z2 / (2 * n)) / denominator
    margin = (z / denominator) * np.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n ** 2))
    return (center - margin, center + margin)

# Same example
lower, upper = wilson_interval(32, 1000)
print(f"Wilson 95% CI: ({lower:.4f}, {upper:.4f})")
# Output: Wilson 95% CI: (0.0228, 0.0448)
```
### When to Use Each Method
Use the Wald interval when you need quick mental math or when $np \geq 10$ and $n(1-p) \geq 10$. Use the Wilson interval for everything else—it’s more accurate and never produces intervals outside [0, 1].
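The mental-math claim can be made precise: the margin of error is never wider than at p = 0.5, where the 95% Wald margin 1.96·√(0.25/n) is approximately 1/√n. A quick check of that shortcut (my own example values):

```python
import numpy as np

# At p = 0.5 the variance term p(1 - p) peaks at 0.25, so the 95% Wald
# margin 1.96 * sqrt(0.25 / n) is the widest possible: about 1 / sqrt(n).
for n in (100, 400, 1600):
    exact = 1.96 * np.sqrt(0.25 / n)
    rough = 1 / np.sqrt(n)
    print(f"n={n:5d}  exact={exact:.4f}  1/sqrt(n)={rough:.4f}")
# n=400 gives a worst-case margin of about +/- 5 percentage points
```

So "n = 400 means roughly ±5 points" is a serviceable back-of-the-envelope bound for any proportion.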
## Using statsmodels for Proportion CIs
In practice, you shouldn’t implement these formulas yourself. The statsmodels library provides a battle-tested implementation with multiple methods.
```python
from statsmodels.stats.proportion import proportion_confint

successes = 32
n = 1000
confidence = 0.95

# Compare different methods
methods = ['normal', 'wilson', 'agresti_coull', 'jeffreys', 'binom_test']

print(f"Sample proportion: {successes/n:.4f}")
print(f"Sample size: {n}")
print("-" * 45)

for method in methods:
    lower, upper = proportion_confint(successes, n, alpha=1-confidence, method=method)
    width = upper - lower
    print(f"{method:15} CI: ({lower:.4f}, {upper:.4f}) width: {width:.4f}")
```
Output:
```
Sample proportion: 0.0320
Sample size: 1000
---------------------------------------------
normal          CI: (0.0211, 0.0429) width: 0.0218
wilson          CI: (0.0228, 0.0448) width: 0.0221
agresti_coull   CI: (0.0226, 0.0450) width: 0.0224
jeffreys        CI: (0.0223, 0.0440) width: 0.0217
binom_test      CI: (0.0221, 0.0451) width: 0.0230
```
The agresti_coull method (also called the “adjusted Wald”) adds z²/2 pseudo-successes and z²/2 pseudo-failures (about 2 of each at 95% confidence) before running the standard Wald calculation—a simple trick that dramatically improves coverage. The jeffreys method uses a Bayesian approach with a non-informative prior.
For most applications, Wilson or Agresti-Coull are your best choices.
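The Agresti-Coull adjustment is simple enough to spell out by hand: add z² pseudo-observations, half of them successes, then run the ordinary Wald formula on the adjusted counts. A sketch (my own helper, mirroring the earlier functions, not the statsmodels internals):

```python
import numpy as np
from scipy import stats

def agresti_coull_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wald interval computed on Agresti-Coull adjusted counts."""
    z = stats.norm.ppf((1 + confidence) / 2)
    n_adj = n + z**2                         # add z^2 pseudo-observations...
    p_adj = (successes + z**2 / 2) / n_adj   # ...half of them successes
    margin = z * np.sqrt(p_adj * (1 - p_adj) / n_adj)
    return (p_adj - margin, p_adj + margin)

# Same running example: 32 conversions out of 1000 visitors
lower, upper = agresti_coull_interval(32, 1000)
print(f"Agresti-Coull 95% CI: ({lower:.4f}, {upper:.4f})")
```

Note that the adjusted center (x + z²/2)/(n + z²) is exactly the Wilson midpoint, which is why the two methods track each other so closely.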
## Using scipy.stats for Quick Calculations
If you’re already using SciPy and want to avoid adding statsmodels as a dependency, you can use the beta distribution for an exact Bayesian credible interval:
```python
from scipy.stats import beta

def bayesian_credible_interval(
    successes: int,
    n: int,
    confidence: float = 0.95,
    prior_alpha: float = 0.5,
    prior_beta: float = 0.5
) -> tuple[float, float]:
    """
    Calculate Bayesian credible interval using beta distribution.
    Default prior (0.5, 0.5) is Jeffreys prior.
    Use (1, 1) for uniform prior.
    """
    alpha = (1 - confidence) / 2
    posterior_alpha = prior_alpha + successes
    posterior_beta = prior_beta + (n - successes)
    lower = beta.ppf(alpha, posterior_alpha, posterior_beta)
    upper = beta.ppf(1 - alpha, posterior_alpha, posterior_beta)
    return (lower, upper)

# Jeffreys prior (non-informative)
lower, upper = bayesian_credible_interval(32, 1000)
print(f"Bayesian 95% CI (Jeffreys): ({lower:.4f}, {upper:.4f})")

# Uniform prior
lower, upper = bayesian_credible_interval(32, 1000, prior_alpha=1, prior_beta=1)
print(f"Bayesian 95% CI (Uniform): ({lower:.4f}, {upper:.4f})")
```
The beta distribution is the conjugate prior for binomial data, making this calculation both elegant and exact.
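Conjugacy means the posterior is itself a beta distribution, so the whole Bayesian update is a single parameter shift. A brief illustration using scipy's frozen-distribution API (same running example):

```python
from scipy.stats import beta

# Beta(a, b) prior + x successes in n trials -> Beta(a + x, b + n - x) posterior
a, b = 0.5, 0.5          # Jeffreys prior
x, n = 32, 1000
posterior = beta(a + x, b + (n - x))

print(f"Posterior mean: {posterior.mean():.4f}")   # equals (a + x) / (a + b + n)
interval = (posterior.ppf(0.025), posterior.ppf(0.975))
print(f"95% credible interval: ({interval[0]:.4f}, {interval[1]:.4f})")
```

The frozen `posterior` object also gives you the full distribution (pdf, cdf, random draws), which is handy for Bayesian A/B comparisons beyond a single interval.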
## Handling Edge Cases and Sample Size Considerations
Edge cases reveal the differences between methods. Let’s examine what happens with small samples and extreme proportions:
```python
from statsmodels.stats.proportion import proportion_confint

# Edge cases: small samples and extreme proportions
test_cases = [
    (1, 20, "1/20 successes"),
    (0, 50, "0/50 successes"),
    (50, 50, "50/50 successes"),
    (5, 100, "5% rate, n=100"),
    (5, 1000, "0.5% rate, n=1000"),
]

methods = ['normal', 'wilson', 'agresti_coull']

for successes, n, description in test_cases:
    print(f"\n{description} (p̂ = {successes/n:.3f})")
    print("-" * 50)
    for method in methods:
        lower, upper = proportion_confint(successes, n, alpha=0.05, method=method)
        # Flag intervals that escape [0, 1]
        flag = ""
        if lower < 0:
            flag = " ⚠️ NEGATIVE"
        if upper > 1:
            flag = " ⚠️ >1"
        print(f"{method:15}: ({lower:7.4f}, {upper:.4f}){flag}")
```
Output:
```
1/20 successes (p̂ = 0.050)
--------------------------------------------------
normal         : (-0.0455, 0.1455) ⚠️ NEGATIVE
wilson         : ( 0.0089, 0.2361)
agresti_coull  : (-0.0091, 0.2541) ⚠️ NEGATIVE

0/50 successes (p̂ = 0.000)
--------------------------------------------------
normal         : ( 0.0000, 0.0000)
wilson         : ( 0.0000, 0.0713)
agresti_coull  : (-0.0139, 0.0852) ⚠️ NEGATIVE

50/50 successes (p̂ = 1.000)
--------------------------------------------------
normal         : ( 1.0000, 1.0000)
wilson         : ( 0.9287, 1.0000)
agresti_coull  : ( 0.9148, 1.0139) ⚠️ >1
```
Notice how the normal (Wald) method produces impossible negative bounds and degenerate zero-width intervals at the extremes, and how agresti_coull can also stray outside [0, 1]. Wilson handles these cases gracefully.
Minimum sample size rule of thumb: For the normal approximation to be reliable, you need both $np \geq 10$ and $n(1-p) \geq 10$. For a 5% proportion, that means $n \geq 200$.
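The rule of thumb is worth encoding as a guard before trusting a Wald interval. Since np̂ is just the observed success count, the check reduces to counting successes and failures (a small helper of my own, not part of any library):

```python
def normal_approx_ok(successes: int, n: int) -> bool:
    """Rule of thumb: the Wald interval needs np >= 10 and n(1-p) >= 10,
    i.e. at least 10 observed successes and 10 observed failures."""
    return successes >= 10 and (n - successes) >= 10

print(normal_approx_ok(32, 1000))   # True: 32 successes, 968 failures
print(normal_approx_ok(1, 20))      # False: only 1 success
print(normal_approx_ok(5, 100))     # False: only 5 successes
```

When the guard fails, fall back to Wilson rather than widening the Wald interval by hand.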
## Practical Example: A/B Test Analysis
Let’s put everything together with a realistic A/B test scenario:
```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.proportion import proportion_confint, proportions_ztest

# A/B test results
control = {"visitors": 5000, "conversions": 150}
treatment = {"visitors": 5000, "conversions": 185}

def analyze_variant(data: dict, name: str, method: str = 'wilson') -> dict:
    """Analyze a single variant and return statistics."""
    rate = data["conversions"] / data["visitors"]
    ci_lower, ci_upper = proportion_confint(
        data["conversions"],
        data["visitors"],
        alpha=0.05,
        method=method
    )
    return {
        "name": name,
        "rate": rate,
        "ci_lower": ci_lower,
        "ci_upper": ci_upper,
        "n": data["visitors"],
        "conversions": data["conversions"]
    }

# Analyze both variants
control_stats = analyze_variant(control, "Control")
treatment_stats = analyze_variant(treatment, "Treatment")

# Print results
print("A/B Test Results")
print("=" * 60)
for variant in [control_stats, treatment_stats]:
    print(f"\n{variant['name']}:")
    print(f"  Conversion rate: {variant['rate']:.2%}")
    print(f"  95% CI: ({variant['ci_lower']:.2%}, {variant['ci_upper']:.2%})")
    print(f"  Sample size: {variant['n']:,}")

# Statistical significance test
z_stat, p_value = proportions_ztest(
    [treatment["conversions"], control["conversions"]],
    [treatment["visitors"], control["visitors"]],
    alternative='two-sided'
)

relative_lift = (treatment_stats["rate"] - control_stats["rate"]) / control_stats["rate"]
print(f"\nRelative lift: {relative_lift:+.1%}")
print(f"Z-statistic: {z_stat:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Significant at α=0.05: {'Yes' if p_value < 0.05 else 'No'}")

# Visualization
fig, ax = plt.subplots(figsize=(8, 5))
variants = [control_stats, treatment_stats]
positions = range(len(variants))
colors = ['#3498db', '#2ecc71']

for i, (variant, color) in enumerate(zip(variants, colors)):
    ax.barh(i, variant["rate"], color=color, alpha=0.7, height=0.5)
    ax.errorbar(
        variant["rate"], i,
        xerr=[[variant["rate"] - variant["ci_lower"]],
              [variant["ci_upper"] - variant["rate"]]],
        fmt='none', color='black', capsize=5, capthick=2
    )
    ax.text(variant["ci_upper"] + 0.002, i,
            f'{variant["rate"]:.2%}', va='center', fontweight='bold')

ax.set_yticks(positions)
ax.set_yticklabels([v["name"] for v in variants])
ax.set_xlabel("Conversion Rate")
ax.set_title("A/B Test Results with 95% Confidence Intervals")
ax.set_xlim(0, max(v["ci_upper"] for v in variants) * 1.15)
plt.tight_layout()
plt.savefig('ab_test_results.png', dpi=150)
plt.show()
```
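The per-variant intervals answer "what is each rate?" but not "how large is the gap?". A confidence interval for the difference p₁ − p₂ addresses that directly. The sketch below uses the standard unpooled normal approximation (recent statsmodels versions also ship confint_proportions_2indep for this; verify that name against your installed version):

```python
import numpy as np
from scipy import stats

def diff_proportions_ci(x1: int, n1: int, x2: int, n2: int,
                        confidence: float = 0.95) -> tuple[float, float]:
    """Normal-approximation CI for p1 - p2 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    z = stats.norm.ppf((1 + confidence) / 2)
    se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return (diff - z * se, diff + z * se)

# Treatment (185/5000) minus control (150/5000)
lower, upper = diff_proportions_ci(185, 5000, 150, 5000)
print(f"95% CI for the difference: ({lower:+.4f}, {upper:+.4f})")
```

For these counts the interval hugs zero: the plausible lift runs from essentially nothing to roughly 1.4 percentage points, which is the interval-estimation view of a borderline p-value.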
## Conclusion and Method Selection Guide
Choosing the right confidence interval method matters. Here’s a quick reference:
| Situation | Recommended Method | Why |
|---|---|---|
| General purpose | Wilson | Best coverage, handles edge cases |
| Quick calculation | Agresti-Coull | Simple adjustment, nearly as good as Wilson |
| Large samples, p near 0.5 | Normal (Wald) | Simplest, adequate in ideal conditions |
| Bayesian analysis | Jeffreys | Proper posterior interval |
| Regulatory/exact requirements | Clopper-Pearson (binom_test) | Conservative, guaranteed coverage |
For production code, stick with statsmodels.stats.proportion.proportion_confint() using the Wilson method. It handles edge cases correctly, never produces impossible intervals, and is well-documented.
Further reading:
- statsmodels proportion documentation
- Brown, Cai, and DasGupta (2001): “Interval Estimation for a Binomial Proportion” — the definitive comparison of methods