How to Calculate a Confidence Interval in Python

Key Insights

  • Confidence intervals quantify uncertainty in your estimates—a 95% CI means if you repeated your sampling process 100 times, approximately 95 of those intervals would contain the true population parameter.
  • Use t-distributions for small samples (n < 30) or when population standard deviation is unknown; use z-scores only when you know the population standard deviation or have very large samples.
  • Bootstrap methods provide robust confidence intervals when your data violates normality assumptions, making them essential for real-world datasets that rarely follow textbook distributions.

Introduction to Confidence Intervals

Point estimates lie. When you calculate a sample mean and report it as “the answer,” you’re hiding crucial information about how much that estimate might vary. Confidence intervals fix this by providing a range that likely contains the true population parameter.

A 95% confidence interval doesn’t mean there’s a 95% probability the true value falls within your interval. The true value either is or isn’t in there—it’s fixed. Instead, a 95% CI means your methodology produces intervals that capture the true parameter 95% of the time across repeated sampling.

Higher confidence levels (99%) produce wider intervals. Lower levels (90%) produce narrower ones. The 95% level dominates because it balances precision with reliability, but your choice should depend on the cost of being wrong. Medical trials often use 99%; quick A/B tests might use 90%.
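The repeated-sampling interpretation is easy to verify with a quick simulation. Here's a sketch that draws many samples from a made-up population (mean 50, standard deviation 10, chosen purely for illustration) and counts how often the 95% interval captures the true mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mean, true_std = 50, 10   # illustrative "population" values
n, n_experiments = 25, 1000

covered = 0
for _ in range(n_experiments):
    sample = rng.normal(true_mean, true_std, size=n)
    sem = stats.sem(sample)
    lo, hi = stats.t.interval(confidence=0.95, df=n - 1,
                              loc=sample.mean(), scale=sem)
    if lo <= true_mean <= hi:
        covered += 1

print(f"Coverage: {covered / n_experiments:.1%}")  # typically close to 95%
```

The printed coverage hovers near 95%, which is exactly what the definition promises: it's a property of the procedure, not of any single interval.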

The Math Behind Confidence Intervals

The formula is straightforward:

CI = sample_mean ± (critical_value × standard_error)

The standard error measures how much your sample mean varies from sample to sample. It equals the standard deviation divided by the square root of your sample size. The critical value comes from either the z-distribution or t-distribution, depending on your situation.

Use t-scores when:

  • Your sample size is small (n < 30)
  • You don’t know the population standard deviation (almost always true)

Use z-scores when:

  • You know the population standard deviation
  • Your sample size is very large (n > 100) and you want simplicity

Here’s a manual calculation using NumPy:

import numpy as np
from scipy import stats

# Sample data
data = np.array([23, 25, 28, 22, 26, 24, 27, 25, 29, 24])

# Calculate components
n = len(data)
sample_mean = np.mean(data)
sample_std = np.std(data, ddof=1)  # ddof=1 for sample std
standard_error = sample_std / np.sqrt(n)

# Get t critical value for 95% CI
confidence_level = 0.95
degrees_freedom = n - 1
t_critical = stats.t.ppf((1 + confidence_level) / 2, degrees_freedom)

# Calculate margin of error
margin_of_error = t_critical * standard_error

# Calculate CI bounds
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error

print(f"Sample Mean: {sample_mean:.2f}")
print(f"Standard Error: {standard_error:.2f}")
print(f"t Critical Value: {t_critical:.3f}")
print(f"95% CI: ({ci_lower:.2f}, {ci_upper:.2f})")

This outputs the confidence interval bounds and all intermediate values, which helps when debugging or explaining results to stakeholders.

Confidence Intervals for Means Using SciPy

Manual calculation teaches the concepts, but SciPy handles production code better. The scipy.stats.t.interval() function computes confidence intervals directly:

import numpy as np
from scipy import stats

# Sample data: response times in milliseconds
response_times = np.array([
    120, 135, 128, 142, 131, 125, 138, 127, 133, 129,
    144, 126, 137, 130, 141, 124, 136, 132, 139, 128
])

n = len(response_times)
mean = np.mean(response_times)
sem = stats.sem(response_times)  # Standard error of the mean

# Calculate 95% CI
ci_95 = stats.t.interval(
    confidence=0.95,
    df=n - 1,
    loc=mean,
    scale=sem
)

# Calculate 99% CI for comparison
ci_99 = stats.t.interval(
    confidence=0.99,
    df=n - 1,
    loc=mean,
    scale=sem
)

print(f"Mean response time: {mean:.1f}ms")
print(f"95% CI: ({ci_95[0]:.1f}, {ci_95[1]:.1f})ms")
print(f"99% CI: ({ci_99[0]:.1f}, {ci_99[1]:.1f})ms")

Notice how the 99% interval is wider than the 95% interval. You’re trading precision for confidence.

For large samples (n > 30), the t-distribution approaches the normal distribution, so results become nearly identical to z-based calculations. However, always using t-scores is safer—they’re correct for any sample size.
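You can see the convergence directly by comparing critical values (a quick check, separate from the examples above):

```python
from scipy import stats

z_95 = stats.norm.ppf(0.975)  # z critical value for a 95% CI, ~1.960
print(f"z:       {z_95:.4f}")
for n in (5, 30, 100, 1000):
    t_95 = stats.t.ppf(0.975, df=n - 1)
    print(f"n={n:>4}: t={t_95:.4f}")
```

At n = 5 the t critical value is roughly 2.78, far above 1.96; by n = 1000 the two are nearly indistinguishable. Using t everywhere costs you nothing at large n and protects you at small n.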

Confidence Intervals for Proportions

Proportions require different treatment. When measuring conversion rates, survey responses, or any binary outcome, use binomial proportion confidence intervals.

The statsmodels library provides several methods. The Wilson score interval performs well across different proportion values and sample sizes:

import numpy as np
from statsmodels.stats.proportion import proportion_confint

# Scenario: 47 conversions out of 500 visitors
successes = 47
trials = 500
observed_rate = successes / trials

# Wilson score interval (recommended)
ci_wilson = proportion_confint(
    count=successes,
    nobs=trials,
    alpha=0.05,  # 1 - confidence level
    method='wilson'
)

# Normal approximation (for comparison)
ci_normal = proportion_confint(
    count=successes,
    nobs=trials,
    alpha=0.05,
    method='normal'
)

# Agresti-Coull (good alternative)
ci_agresti = proportion_confint(
    count=successes,
    nobs=trials,
    alpha=0.05,
    method='agresti_coull'
)

print(f"Observed conversion rate: {observed_rate:.2%}")
print(f"Wilson 95% CI: ({ci_wilson[0]:.2%}, {ci_wilson[1]:.2%})")
print(f"Normal 95% CI: ({ci_normal[0]:.2%}, {ci_normal[1]:.2%})")
print(f"Agresti-Coull 95% CI: ({ci_agresti[0]:.2%}, {ci_agresti[1]:.2%})")

Avoid the normal approximation when proportions are near 0 or 1, or when sample sizes are small. Wilson and Agresti-Coull methods handle edge cases better.
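To see the failure mode, try an extreme case: 2 successes out of 40 trials (numbers picked for illustration). The textbook normal approximation, computed by hand here, produces a lower bound below zero, which is impossible for a proportion; Wilson stays in range:

```python
import numpy as np
from statsmodels.stats.proportion import proportion_confint

successes, trials = 2, 40
p_hat = successes / trials

# Textbook normal approximation, computed by hand
se = np.sqrt(p_hat * (1 - p_hat) / trials)
lo_norm, hi_norm = p_hat - 1.96 * se, p_hat + 1.96 * se

# Wilson score interval
lo_w, hi_w = proportion_confint(successes, trials, alpha=0.05, method='wilson')

print(f"Normal: ({lo_norm:.3f}, {hi_norm:.3f})")  # lower bound is negative
print(f"Wilson: ({lo_w:.3f}, {hi_w:.3f})")        # stays inside [0, 1]
```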

Bootstrapping Method for Non-Normal Data

Real-world data often violates normality assumptions. Salaries are skewed. Response times have long tails. User session durations follow exponential-like distributions. Bootstrap confidence intervals handle these situations without parametric assumptions.

The bootstrap works by resampling your data with replacement thousands of times, calculating the statistic of interest for each resample, then using the distribution of those statistics to form confidence intervals:

import numpy as np
from scipy import stats

# Skewed data: user session durations in seconds
np.random.seed(42)
session_durations = np.concatenate([
    np.random.exponential(scale=60, size=80),
    np.random.exponential(scale=300, size=20)  # Heavy users
])

def bootstrap_ci(data, statistic_func, n_bootstrap=10000, confidence=0.95):
    """Calculate bootstrap confidence interval for any statistic."""
    n = len(data)
    bootstrap_stats = np.zeros(n_bootstrap)
    
    for i in range(n_bootstrap):
        # Resample with replacement
        resample = np.random.choice(data, size=n, replace=True)
        bootstrap_stats[i] = statistic_func(resample)
    
    # Percentile method
    alpha = 1 - confidence
    lower = np.percentile(bootstrap_stats, 100 * alpha / 2)
    upper = np.percentile(bootstrap_stats, 100 * (1 - alpha / 2))
    
    return lower, upper, bootstrap_stats

# Bootstrap CI for the mean
mean_ci = bootstrap_ci(session_durations, np.mean)
print(f"Sample mean: {np.mean(session_durations):.1f}s")
print(f"Bootstrap 95% CI for mean: ({mean_ci[0]:.1f}, {mean_ci[1]:.1f})s")

# Bootstrap CI for the median (parametric methods struggle here)
median_ci = bootstrap_ci(session_durations, np.median)
print(f"Sample median: {np.median(session_durations):.1f}s")
print(f"Bootstrap 95% CI for median: ({median_ci[0]:.1f}, {median_ci[1]:.1f})s")

SciPy 1.7+ includes scipy.stats.bootstrap() for a more polished implementation:

from scipy.stats import bootstrap

# Using scipy's bootstrap function
rng = np.random.default_rng(42)
result = bootstrap(
    (session_durations,),
    statistic=np.mean,
    n_resamples=10000,
    confidence_level=0.95,
    random_state=rng
)

print(f"SciPy Bootstrap 95% CI: ({result.confidence_interval.low:.1f}, "
      f"{result.confidence_interval.high:.1f})s")

Visualizing Confidence Intervals

Confidence intervals communicate uncertainty visually. Error bars work for comparing groups; shaded regions work for continuous data.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Generate sample data for three groups
np.random.seed(42)
groups = {
    'Control': np.random.normal(100, 15, 50),
    'Treatment A': np.random.normal(108, 18, 50),
    'Treatment B': np.random.normal(115, 12, 50)
}

# Calculate means and CIs
means = []
ci_errors = []
labels = []

for name, data in groups.items():
    n = len(data)
    mean = np.mean(data)
    sem = stats.sem(data)
    ci = stats.t.interval(0.95, df=n-1, loc=mean, scale=sem)
    
    means.append(mean)
    ci_errors.append(mean - ci[0])  # Symmetric, so use lower bound diff
    labels.append(name)

# Create bar plot with error bars
fig, ax = plt.subplots(figsize=(8, 6))
x_pos = np.arange(len(labels))

bars = ax.bar(x_pos, means, yerr=ci_errors, capsize=8, 
              color=['#3498db', '#e74c3c', '#2ecc71'],
              edgecolor='black', linewidth=1.2)

ax.set_xticks(x_pos)
ax.set_xticklabels(labels)
ax.set_ylabel('Score')
ax.set_title('Group Comparison with 95% Confidence Intervals')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.tight_layout()
plt.savefig('confidence_intervals.png', dpi=150)
plt.show()

For time series or continuous data, shaded regions convey uncertainty effectively:

import numpy as np
import matplotlib.pyplot as plt

# Simulated time series with uncertainty
np.random.seed(42)
x = np.linspace(0, 10, 50)
y_true = 2 * np.sin(x) + 0.5 * x
y_observed = y_true + np.random.normal(0, 0.8, len(x))

# Rolling mean with bootstrap CI
window = 5
y_smooth = np.convolve(y_observed, np.ones(window)/window, mode='valid')
x_smooth = x[window//2:-(window//2)]

# Approximate CI band (simplified for visualization)
ci_width = 1.96 * 0.8 / np.sqrt(window)

fig, ax = plt.subplots(figsize=(10, 6))
ax.fill_between(x_smooth, y_smooth - ci_width, y_smooth + ci_width,
                alpha=0.3, color='blue', label='95% CI')
ax.plot(x_smooth, y_smooth, 'b-', linewidth=2, label='Smoothed mean')
ax.scatter(x, y_observed, alpha=0.5, s=20, color='gray', label='Observations')

ax.set_xlabel('Time')
ax.set_ylabel('Value')
ax.legend()
ax.set_title('Time Series with Confidence Band')
plt.tight_layout()
plt.show()

Practical Considerations and Common Pitfalls

Interpretation mistakes kill credibility. A 95% CI does not mean “95% probability the true value is in this range.” The true value is fixed; your interval either contains it or doesn’t. The 95% refers to the long-run success rate of your interval-generating procedure.

Sample size determines width. Confidence interval width scales with 1/√n. To halve your interval width, you need four times the sample size. This has real budget implications—know the precision you need before collecting data.
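The 1/√n scaling is easy to check numerically (the spread of 10 here is arbitrary):

```python
import numpy as np
from scipy import stats

sigma = 10  # assumed sample standard deviation, for illustration
for n in (25, 100, 400):
    sem = sigma / np.sqrt(n)
    half_width = stats.t.ppf(0.975, df=n - 1) * sem
    print(f"n={n:>3}: half-width = {half_width:.2f}")
```

Each fourfold increase in n roughly halves the half-width (slightly more than halves it, since the t critical value also shrinks as df grows).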

Overlapping intervals don’t mean no difference. Two groups can have overlapping 95% CIs yet still show a statistically significant difference. If you need to compare groups, use proper hypothesis tests or compute a confidence interval for the difference directly.
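A sketch of computing the CI for the difference directly, using the Welch-Satterthwaite approximation for unequal variances (the group data here is simulated for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(100, 15, 50)  # hypothetical control scores
b = rng.normal(108, 15, 50)  # hypothetical treatment scores

diff = b.mean() - a.mean()
va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
se_diff = np.sqrt(va + vb)

# Welch-Satterthwaite degrees of freedom
df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))

lo, hi = stats.t.interval(confidence=0.95, df=df, loc=diff, scale=se_diff)
print(f"Difference: {diff:.1f}, 95% CI: ({lo:.1f}, {hi:.1f})")
```

If this interval excludes zero, the difference is significant at the 5% level, regardless of whether the two groups' individual CIs overlap.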

Non-random samples break everything. Confidence intervals assume random sampling from your population of interest. If your sample is biased—only surveying users who opted in, only measuring during peak hours—no statistical technique fixes that.

Choose confidence levels deliberately. The 95% default is arbitrary. For exploratory analysis, 90% might suffice. For decisions with serious consequences, 99% provides more protection. Match the confidence level to the stakes involved.

Confidence intervals transform single-point estimates into honest assessments of uncertainty. Use them consistently, interpret them correctly, and your analyses will earn the trust they deserve.
