# How to Calculate a Confidence Interval for a Proportion in Python
## Key Insights
- The Wilson score interval outperforms the traditional Wald interval for most real-world applications, especially with small samples or proportions near 0 or 1
- Python’s statsmodels.stats.proportion.proportion_confint() provides multiple methods in a single function call, making it the go-to tool for production code
- Always visualize confidence intervals when comparing proportions—overlapping CIs don’t necessarily mean no significant difference, but non-overlapping CIs guarantee one
## Introduction to Confidence Intervals for Proportions
Proportions are everywhere in software engineering and data analysis. Your A/B test shows a 3.2% conversion rate. Your survey indicates 68% of users prefer the new design. Your error rate sits at 0.5% over the last hour. These are all proportions—counts of successes divided by total observations.
But a single number tells an incomplete story. That 3.2% conversion rate came from 1,000 visitors. How confident are you that the true conversion rate isn’t actually 4% or 2.5%? This is where confidence intervals become essential.
A 95% confidence interval gives you a range where, if you repeated your experiment many times, 95% of the calculated intervals would contain the true population proportion. It’s not a probability statement about the parameter itself—it’s a statement about the reliability of your estimation procedure.
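The "repeated experiments" interpretation is easy to check empirically. The sketch below (my own addition, using a simple Wald interval and arbitrary example numbers) simulates many experiments with a known true proportion and counts how often the computed interval actually contains it:

```python
import numpy as np

# Simulate "many repeated experiments": draw binomial samples from a known
# true proportion, build a 95% Wald interval for each, and measure how
# often the interval contains the truth.
rng = np.random.default_rng(42)
true_p, n, trials = 0.032, 1000, 10_000
z = 1.96  # critical value for 95% confidence

successes = rng.binomial(n, true_p, size=trials)
p_hat = successes / n
margin = z * np.sqrt(p_hat * (1 - p_hat) / n)
covered = (p_hat - margin <= true_p) & (true_p <= p_hat + margin)

# For a well-behaved method this lands near 0.95 (Wald often runs a bit low)
print(f"Empirical coverage: {covered.mean():.3f}")
```

The fraction printed is a property of the procedure, not of any single interval, which is exactly what the confidence level promises.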
Let’s explore how to calculate these intervals in Python, from first principles to production-ready code.
## The Math Behind Proportion Confidence Intervals
### The Wald Interval (Normal Approximation)
The most common formula you’ll encounter is the Wald interval:
$$\hat{p} \pm z \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
Where $\hat{p}$ is your sample proportion, $n$ is your sample size, and $z$ is the critical value (1.96 for 95% confidence).
```python
import numpy as np
from scipy import stats

def wald_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Calculate Wald (normal approximation) confidence interval."""
    p_hat = successes / n
    z = stats.norm.ppf((1 + confidence) / 2)
    margin = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - margin, p_hat + margin)

# Example: 32 conversions out of 1000 visitors
lower, upper = wald_interval(32, 1000)
print(f"Wald 95% CI: ({lower:.4f}, {upper:.4f})")
# Output: Wald 95% CI: (0.0211, 0.0429)
```
### The Wilson Score Interval
The Wald interval has a dirty secret: it performs poorly when proportions are near 0 or 1, or when sample sizes are small. The Wilson score interval addresses these issues:
```python
def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Calculate Wilson score confidence interval."""
    p_hat = successes / n
    z = stats.norm.ppf((1 + confidence) / 2)
    z2 = z ** 2
    denominator = 1 + z2 / n
    center = (p_hat + z2 / (2 * n)) / denominator
    margin = (z / denominator) * np.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n ** 2))
    return (center - margin, center + margin)

# Same example
lower, upper = wilson_interval(32, 1000)
print(f"Wilson 95% CI: ({lower:.4f}, {upper:.4f})")
# Output: Wilson 95% CI: (0.0228, 0.0448)
```
### When to Use Each Method
Use the Wald interval when you need quick mental math or when $np \geq 10$ and $n(1-p) \geq 10$. Use the Wilson interval for everything else—it’s more accurate and never produces intervals outside [0, 1].
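The mental-math claim can be made precise: the margin of error is never wider than at p = 0.5, where the 95% Wald margin 1.96·√(0.25/n) is approximately 1/√n. A quick check of that shortcut (my own example values):

```python
import numpy as np

# At p = 0.5 the variance term p(1 - p) peaks at 0.25, so the 95% Wald
# margin 1.96 * sqrt(0.25 / n) is the widest possible: about 1 / sqrt(n).
for n in (100, 400, 1600):
    exact = 1.96 * np.sqrt(0.25 / n)
    rough = 1 / np.sqrt(n)
    print(f"n={n:5d}  exact={exact:.4f}  1/sqrt(n)={rough:.4f}")
# n=400 gives a worst-case margin of about +/- 5 percentage points
```

So "n = 400 means roughly ±5 points" is a serviceable back-of-the-envelope bound for any proportion.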
## Using statsmodels for Proportion CIs
In practice, you shouldn’t implement these formulas yourself. The statsmodels library provides a battle-tested implementation with multiple methods.
```python
from statsmodels.stats.proportion import proportion_confint

successes = 32
n = 1000
confidence = 0.95

# Compare different methods
methods = ['normal', 'wilson', 'agresti_coull', 'jeffreys', 'binom_test']

print(f"Sample proportion: {successes/n:.4f}")
print(f"Sample size: {n}")
print("-" * 45)

for method in methods:
    lower, upper = proportion_confint(successes, n, alpha=1-confidence, method=method)
    width = upper - lower
    print(f"{method:15} CI: ({lower:.4f}, {upper:.4f}) width: {width:.4f}")
```
Output:
```
Sample proportion: 0.0320
Sample size: 1000
---------------------------------------------
normal          CI: (0.0211, 0.0429) width: 0.0218
wilson          CI: (0.0228, 0.0448) width: 0.0221
agresti_coull   CI: (0.0226, 0.0450) width: 0.0224
jeffreys        CI: (0.0223, 0.0440) width: 0.0217
binom_test      CI: (0.0221, 0.0451) width: 0.0230
```
The agresti_coull method (also called the “adjusted Wald”) adds z²/2 pseudo-successes and z²/2 pseudo-failures (about 2 of each at 95% confidence) before running the standard Wald calculation—a simple trick that dramatically improves coverage. The jeffreys method uses a Bayesian approach with a non-informative prior.
For most applications, Wilson or Agresti-Coull are your best choices.
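The Agresti-Coull adjustment is simple enough to spell out by hand: add z² pseudo-observations, half of them successes, then run the ordinary Wald formula on the adjusted counts. A sketch (my own helper, mirroring the earlier functions, not the statsmodels internals):

```python
import numpy as np
from scipy import stats

def agresti_coull_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wald interval computed on Agresti-Coull adjusted counts."""
    z = stats.norm.ppf((1 + confidence) / 2)
    n_adj = n + z**2                         # add z^2 pseudo-observations...
    p_adj = (successes + z**2 / 2) / n_adj   # ...half of them successes
    margin = z * np.sqrt(p_adj * (1 - p_adj) / n_adj)
    return (p_adj - margin, p_adj + margin)

# Same running example: 32 conversions out of 1000 visitors
lower, upper = agresti_coull_interval(32, 1000)
print(f"Agresti-Coull 95% CI: ({lower:.4f}, {upper:.4f})")
```

Note that the adjusted center (x + z²/2)/(n + z²) is exactly the Wilson midpoint, which is why the two methods track each other so closely.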
## Using scipy.stats for Quick Calculations
If you’re already using SciPy and want to avoid adding statsmodels as a dependency, you can use the beta distribution for an exact Bayesian credible interval:
```python
from scipy.stats import beta

def bayesian_credible_interval(
    successes: int,
    n: int,
    confidence: float = 0.95,
    prior_alpha: float = 0.5,
    prior_beta: float = 0.5
) -> tuple[float, float]:
    """
    Calculate Bayesian credible interval using beta distribution.
    Default prior (0.5, 0.5) is Jeffreys prior.
    Use (1, 1) for uniform prior.
    """
    alpha = (1 - confidence) / 2
    posterior_alpha = prior_alpha + successes
    posterior_beta = prior_beta + (n - successes)
    lower = beta.ppf(alpha, posterior_alpha, posterior_beta)
    upper = beta.ppf(1 - alpha, posterior_alpha, posterior_beta)
    return (lower, upper)

# Jeffreys prior (non-informative)
lower, upper = bayesian_credible_interval(32, 1000)
print(f"Bayesian 95% CI (Jeffreys): ({lower:.4f}, {upper:.4f})")

# Uniform prior
lower, upper = bayesian_credible_interval(32, 1000, prior_alpha=1, prior_beta=1)
print(f"Bayesian 95% CI (Uniform): ({lower:.4f}, {upper:.4f})")
```
The beta distribution is the conjugate prior for binomial data, making this calculation both elegant and exact.
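Conjugacy means the posterior is itself a beta distribution, so the whole Bayesian update is a single parameter shift. A brief illustration using scipy's frozen-distribution API (same running example):

```python
from scipy.stats import beta

# Beta(a, b) prior + x successes in n trials -> Beta(a + x, b + n - x) posterior
a, b = 0.5, 0.5          # Jeffreys prior
x, n = 32, 1000
posterior = beta(a + x, b + (n - x))

print(f"Posterior mean: {posterior.mean():.4f}")   # equals (a + x) / (a + b + n)
interval = (posterior.ppf(0.025), posterior.ppf(0.975))
print(f"95% credible interval: ({interval[0]:.4f}, {interval[1]:.4f})")
```

The frozen `posterior` object also gives you the full distribution (pdf, cdf, random draws), which is handy for Bayesian A/B comparisons beyond a single interval.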
## Handling Edge Cases and Sample Size Considerations
Edge cases reveal the differences between methods. Let’s examine what happens with small samples and extreme proportions:
```python
from statsmodels.stats.proportion import proportion_confint

# Edge cases: small samples and extreme proportions
test_cases = [
    (1, 20, "1/20 successes"),
    (0, 50, "0/50 successes"),
    (50, 50, "50/50 successes"),
    (5, 100, "5% rate, n=100"),
    (5, 1000, "0.5% rate, n=1000"),
]

methods = ['normal', 'wilson', 'agresti_coull']

for successes, n, description in test_cases:
    print(f"\n{description} (p̂ = {successes/n:.3f})")
    print("-" * 50)
    for method in methods:
        lower, upper = proportion_confint(successes, n, alpha=0.05, method=method)
        # Flag intervals that escape [0, 1]
        flag = ""
        if lower < 0:
            flag = " ⚠️ NEGATIVE"
        if upper > 1:
            flag = " ⚠️ >1"
        print(f"{method:15}: ({lower:7.4f}, {upper:.4f}){flag}")
```
Output:
```
1/20 successes (p̂ = 0.050)
--------------------------------------------------
normal         : (-0.0455, 0.1455) ⚠️ NEGATIVE
wilson         : ( 0.0089, 0.2361)
agresti_coull  : (-0.0091, 0.2541) ⚠️ NEGATIVE

0/50 successes (p̂ = 0.000)
--------------------------------------------------
normal         : ( 0.0000, 0.0000)
wilson         : ( 0.0000, 0.0713)
agresti_coull  : (-0.0139, 0.0852) ⚠️ NEGATIVE

50/50 successes (p̂ = 1.000)
--------------------------------------------------
normal         : ( 1.0000, 1.0000)
wilson         : ( 0.9287, 1.0000)
agresti_coull  : ( 0.9148, 1.0139) ⚠️ >1
```
Notice how the normal (Wald) method produces impossible negative bounds and degenerate zero-width intervals at the extremes, and how agresti_coull can also stray outside [0, 1]. Wilson handles these cases gracefully.
Minimum sample size rule of thumb: For the normal approximation to be reliable, you need both $np \geq 10$ and $n(1-p) \geq 10$. For a 5% proportion, that means $n \geq 200$.
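The rule of thumb is worth encoding as a guard before trusting a Wald interval. Since np̂ is just the observed success count, the check reduces to counting successes and failures (a small helper of my own, not part of any library):

```python
def normal_approx_ok(successes: int, n: int) -> bool:
    """Rule of thumb: the Wald interval needs np >= 10 and n(1-p) >= 10,
    i.e. at least 10 observed successes and 10 observed failures."""
    return successes >= 10 and (n - successes) >= 10

print(normal_approx_ok(32, 1000))   # True: 32 successes, 968 failures
print(normal_approx_ok(1, 20))      # False: only 1 success
print(normal_approx_ok(5, 100))     # False: only 5 successes
```

When the guard fails, fall back to Wilson rather than widening the Wald interval by hand.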
## Practical Example: A/B Test Analysis
Let’s put everything together with a realistic A/B test scenario:
```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.proportion import proportion_confint, proportions_ztest

# A/B test results
control = {"visitors": 5000, "conversions": 150}
treatment = {"visitors": 5000, "conversions": 185}

def analyze_variant(data: dict, name: str, method: str = 'wilson') -> dict:
    """Analyze a single variant and return statistics."""
    rate = data["conversions"] / data["visitors"]
    ci_lower, ci_upper = proportion_confint(
        data["conversions"],
        data["visitors"],
        alpha=0.05,
        method=method
    )
    return {
        "name": name,
        "rate": rate,
        "ci_lower": ci_lower,
        "ci_upper": ci_upper,
        "n": data["visitors"],
        "conversions": data["conversions"]
    }

# Analyze both variants
control_stats = analyze_variant(control, "Control")
treatment_stats = analyze_variant(treatment, "Treatment")

# Print results
print("A/B Test Results")
print("=" * 60)
for variant in [control_stats, treatment_stats]:
    print(f"\n{variant['name']}:")
    print(f"  Conversion rate: {variant['rate']:.2%}")
    print(f"  95% CI: ({variant['ci_lower']:.2%}, {variant['ci_upper']:.2%})")
    print(f"  Sample size: {variant['n']:,}")

# Statistical significance test
z_stat, p_value = proportions_ztest(
    [treatment["conversions"], control["conversions"]],
    [treatment["visitors"], control["visitors"]],
    alternative='two-sided'
)

relative_lift = (treatment_stats["rate"] - control_stats["rate"]) / control_stats["rate"]
print(f"\nRelative lift: {relative_lift:+.1%}")
print(f"Z-statistic: {z_stat:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Significant at α=0.05: {'Yes' if p_value < 0.05 else 'No'}")

# Visualization
fig, ax = plt.subplots(figsize=(8, 5))
variants = [control_stats, treatment_stats]
positions = range(len(variants))
colors = ['#3498db', '#2ecc71']

for i, (variant, color) in enumerate(zip(variants, colors)):
    ax.barh(i, variant["rate"], color=color, alpha=0.7, height=0.5)
    ax.errorbar(
        variant["rate"], i,
        xerr=[[variant["rate"] - variant["ci_lower"]],
              [variant["ci_upper"] - variant["rate"]]],
        fmt='none', color='black', capsize=5, capthick=2
    )
    ax.text(variant["ci_upper"] + 0.002, i,
            f'{variant["rate"]:.2%}', va='center', fontweight='bold')

ax.set_yticks(positions)
ax.set_yticklabels([v["name"] for v in variants])
ax.set_xlabel("Conversion Rate")
ax.set_title("A/B Test Results with 95% Confidence Intervals")
ax.set_xlim(0, max(v["ci_upper"] for v in variants) * 1.15)
plt.tight_layout()
plt.savefig('ab_test_results.png', dpi=150)
plt.show()
```
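The per-variant intervals answer "what is each rate?" but not "how large is the gap?". A confidence interval for the difference p₁ − p₂ addresses that directly. The sketch below uses the standard unpooled normal approximation (recent statsmodels versions also ship confint_proportions_2indep for this; verify that name against your installed version):

```python
import numpy as np
from scipy import stats

def diff_proportions_ci(x1: int, n1: int, x2: int, n2: int,
                        confidence: float = 0.95) -> tuple[float, float]:
    """Normal-approximation CI for p1 - p2 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    z = stats.norm.ppf((1 + confidence) / 2)
    se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return (diff - z * se, diff + z * se)

# Treatment (185/5000) minus control (150/5000)
lower, upper = diff_proportions_ci(185, 5000, 150, 5000)
print(f"95% CI for the difference: ({lower:+.4f}, {upper:+.4f})")
```

For these counts the interval hugs zero: the plausible lift runs from essentially nothing to roughly 1.4 percentage points, which is the interval-estimation view of a borderline p-value.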
## Conclusion and Method Selection Guide
Choosing the right confidence interval method matters. Here’s a quick reference:
| Situation | Recommended Method | Why |
|---|---|---|
| General purpose | Wilson | Best coverage, handles edge cases |
| Quick calculation | Agresti-Coull | Simple adjustment, nearly as good as Wilson |
| Large samples, p near 0.5 | Normal (Wald) | Simplest, adequate in ideal conditions |
| Bayesian analysis | Jeffreys | Proper posterior interval |
| Regulatory/exact requirements | Clopper-Pearson (binom_test) | Conservative, guaranteed coverage |
For production code, stick with statsmodels.stats.proportion.proportion_confint() using the Wilson method. It handles edge cases correctly, never produces impossible intervals, and is well-documented.
Further reading:
- statsmodels proportion documentation
- Brown, Cai, and DasGupta (2001): “Interval Estimation for a Binomial Proportion” — the definitive comparison of methods