How to Calculate a Confidence Interval for a Mean in Python

Key Insights

  • Confidence intervals quantify uncertainty in your sample mean—a 95% CI means if you repeated your sampling 100 times, approximately 95 of those intervals would contain the true population mean.
  • Use the t-distribution for small samples (n < 30) or when population standard deviation is unknown; use the z-distribution only when you have large samples and know the population standard deviation.
  • SciPy’s scipy.stats.t.interval() handles most real-world cases correctly, while statsmodels provides convenient one-liner methods through DescrStatsW.

Introduction to Confidence Intervals

Point estimates lie. When you calculate a sample mean, you get a single number that pretends to represent the truth. But that number carries uncertainty—uncertainty that confidence intervals make explicit.

A confidence interval gives you a range of plausible values for a population parameter based on your sample data. When you report that the mean response time is 245ms with a 95% confidence interval of [238ms, 252ms], you’re communicating both your best estimate and the precision of that estimate.

The confidence level (90%, 95%, 99%) represents how often the interval construction method captures the true parameter across repeated sampling. A 95% confidence interval doesn’t mean there’s a 95% probability the true mean falls within your specific interval—the true mean is fixed, not random. Instead, it means your method produces intervals that contain the true mean 95% of the time.
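This repeated-sampling interpretation is easy to check empirically. The sketch below (simulated normal data with assumed parameters, not from any real dataset) draws many samples from a known population and counts how often the 95% interval captures the true mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mean, true_std = 100, 15
n_trials, n = 10_000, 20

covered = 0
for _ in range(n_trials):
    sample = rng.normal(true_mean, true_std, size=n)
    low, high = stats.t.interval(
        confidence=0.95, df=n - 1,
        loc=sample.mean(), scale=stats.sem(sample)
    )
    covered += (low <= true_mean <= high)

print(f"Coverage: {covered / n_trials:.3f}")  # should land near 0.95
```

The observed coverage hovers around 0.95, which is exactly what the confidence level promises: a property of the procedure across many samples, not of any single interval.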

Use confidence intervals for means when you need to:

  • Report measurement precision in research or A/B tests
  • Compare group means to determine if differences are meaningful
  • Set acceptable ranges for quality control metrics
  • Communicate uncertainty to stakeholders who need to make decisions

The Math Behind Confidence Intervals

The formula for a confidence interval around a mean is straightforward:

CI = x̄ ± (critical value × standard error)

Where:

  • x̄ is the sample mean
  • Critical value comes from either the z-distribution or t-distribution
  • Standard error equals s/√n (sample standard deviation divided by square root of sample size)

The critical value depends on your confidence level and which distribution you use. For 95% confidence with the z-distribution, it’s 1.96. For the t-distribution, it depends on degrees of freedom (n-1).

Use the t-distribution when:

  • Sample size is small (n < 30)
  • Population standard deviation is unknown (almost always the case)

Use the z-distribution when:

  • Sample size is large (n ≥ 30) AND you know the population standard deviation
  • In practice, this is rare—default to t-distribution
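To see why the t-distribution is the safe default, compare the 95% critical values directly (values computed with SciPy, not quoted from a table):

```python
from scipy import stats

# Two-sided 95% critical value from the z-distribution
z_crit = stats.norm.ppf(0.975)
print(f"z: {z_crit:.3f}")  # 1.960

# t critical values shrink toward z as degrees of freedom grow
for df in [5, 10, 30, 100, 1000]:
    print(f"t (df={df:>4}): {stats.t.ppf(0.975, df):.3f}")
```

The t critical value converges to 1.96 as degrees of freedom increase, which is why the two methods agree for large samples and why defaulting to t costs you nothing.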

Here’s the manual calculation to understand each component:

import numpy as np
from scipy import stats

# Sample data: response times in milliseconds
response_times = np.array([245, 238, 252, 241, 249, 244, 251, 239, 247, 243])

# Step 1: Calculate sample statistics
n = len(response_times)
sample_mean = np.mean(response_times)
sample_std = np.std(response_times, ddof=1)  # ddof=1 for sample std

# Step 2: Calculate standard error
standard_error = sample_std / np.sqrt(n)

# Step 3: Get critical value from t-distribution
confidence_level = 0.95
alpha = 1 - confidence_level
degrees_of_freedom = n - 1
t_critical = stats.t.ppf(1 - alpha/2, degrees_of_freedom)

# Step 4: Calculate margin of error
margin_of_error = t_critical * standard_error

# Step 5: Construct the interval
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error

print(f"Sample size: {n}")
print(f"Sample mean: {sample_mean:.2f}")
print(f"Sample std: {sample_std:.2f}")
print(f"Standard error: {standard_error:.2f}")
print(f"t-critical (95%): {t_critical:.3f}")
print(f"Margin of error: {margin_of_error:.2f}")
print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")

Output:

Sample size: 10
Sample mean: 244.90
Sample std: 4.84
Standard error: 1.53
t-critical (95%): 2.262
Margin of error: 3.46
95% CI: [241.44, 248.36]

Using SciPy for Confidence Intervals

Manual calculation teaches you the mechanics, but SciPy handles the work in production code. The scipy.stats module provides everything you need.

import numpy as np
from scipy import stats

# Sample data
response_times = np.array([245, 238, 252, 241, 249, 244, 251, 239, 247, 243])

# Calculate standard error
sem = stats.sem(response_times)

# Get confidence interval using t-distribution
confidence_level = 0.95
sample_mean = np.mean(response_times)
n = len(response_times)
df = n - 1

# Method 1: Using t.interval()
ci_low, ci_high = stats.t.interval(
    confidence=confidence_level,
    df=df,
    loc=sample_mean,
    scale=sem
)

print(f"Mean: {sample_mean:.2f}")
print(f"Standard Error: {sem:.2f}")
print(f"95% CI: [{ci_low:.2f}, {ci_high:.2f}]")

# For different confidence levels
for conf in [0.90, 0.95, 0.99]:
    low, high = stats.t.interval(confidence=conf, df=df, loc=sample_mean, scale=sem)
    print(f"{int(conf*100)}% CI: [{low:.2f}, {high:.2f}]")

Output:

Mean: 244.90
Standard Error: 1.53
95% CI: [241.44, 248.36]
90% CI: [242.09, 247.71]
95% CI: [241.44, 248.36]
99% CI: [239.93, 249.87]

Notice how higher confidence levels produce wider intervals. You’re trading precision for confidence.

Using Statsmodels for Quick Calculations

Statsmodels offers a cleaner API when you want confidence intervals without thinking about the underlying mechanics.

import numpy as np
from statsmodels.stats.weightstats import DescrStatsW

# Sample data
response_times = np.array([245, 238, 252, 241, 249, 244, 251, 239, 247, 243])

# Create descriptive statistics object
desc_stats = DescrStatsW(response_times)

# One-liner confidence interval
ci_low, ci_high = desc_stats.tconfint_mean(alpha=0.05)  # alpha = 1 - confidence_level

print(f"Mean: {desc_stats.mean:.2f}")
print(f"Std Error: {desc_stats.std_mean:.2f}")
print(f"95% CI: [{ci_low:.2f}, {ci_high:.2f}]")

# Get multiple confidence levels
for alpha in [0.10, 0.05, 0.01]:
    low, high = desc_stats.tconfint_mean(alpha=alpha)
    conf = int((1 - alpha) * 100)
    print(f"{conf}% CI: [{low:.2f}, {high:.2f}]")

The DescrStatsW class also supports weighted statistics, which is useful when your observations have different importance or frequencies.

# Weighted example: some measurements taken multiple times
values = np.array([245, 250, 240])
weights = np.array([5, 3, 2])  # First value observed 5 times, etc.

weighted_stats = DescrStatsW(values, weights=weights)
ci_low, ci_high = weighted_stats.tconfint_mean(alpha=0.05)
print(f"Weighted mean: {weighted_stats.mean:.2f}")
print(f"Weighted 95% CI: [{ci_low:.2f}, {ci_high:.2f}]")

Handling Different Sample Sizes

The distinction between z and t distributions matters most for small samples. Here’s a function that selects the appropriate method:

import numpy as np
from scipy import stats

def confidence_interval(
    data: np.ndarray,
    confidence: float = 0.95,
    population_std: float | None = None
) -> tuple[float, float, str]:
    """
    Calculate confidence interval for the mean.
    
    Returns (lower, upper, method_used)
    """
    n = len(data)
    sample_mean = np.mean(data)
    
    # Use z-distribution only if we know population std AND n >= 30
    if population_std is not None and n >= 30:
        se = population_std / np.sqrt(n)
        z_critical = stats.norm.ppf(1 - (1 - confidence) / 2)
        margin = z_critical * se
        method = "z-distribution"
    else:
        # Default to t-distribution
        se = stats.sem(data)
        df = n - 1
        t_critical = stats.t.ppf(1 - (1 - confidence) / 2, df)
        margin = t_critical * se
        method = f"t-distribution (df={df})"
    
    return (sample_mean - margin, sample_mean + margin, method)


# Test with different sample sizes
np.random.seed(42)

# Small sample
small_sample = np.random.normal(100, 15, size=10)
ci_low, ci_high, method = confidence_interval(small_sample)
print(f"Small sample (n=10): [{ci_low:.2f}, {ci_high:.2f}] using {method}")

# Large sample
large_sample = np.random.normal(100, 15, size=100)
ci_low, ci_high, method = confidence_interval(large_sample)
print(f"Large sample (n=100): [{ci_low:.2f}, {ci_high:.2f}] using {method}")

# Large sample with known population std
ci_low, ci_high, method = confidence_interval(large_sample, population_std=15)
print(f"Large sample (known σ): [{ci_low:.2f}, {ci_high:.2f}] using {method}")

For small samples, the t-distribution has heavier tails, producing wider intervals that account for the additional uncertainty in estimating the population standard deviation.

Visualizing Confidence Intervals

Visualization makes confidence intervals immediately interpretable. Error bars are the standard approach:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Simulated A/B test results: conversion rates across 4 variants
np.random.seed(42)
variants = ['Control', 'Variant A', 'Variant B', 'Variant C']
means = []
errors = []

# Generate sample data for each variant
sample_sizes = [500, 480, 510, 495]
true_rates = [0.12, 0.14, 0.11, 0.15]

for i, (n, rate) in enumerate(zip(sample_sizes, true_rates)):
    # Simulate conversion data (1 = converted, 0 = not converted)
    conversions = np.random.binomial(1, rate, size=n)
    
    sample_mean = np.mean(conversions)
    sem = stats.sem(conversions)
    
    # Calculate 95% CI
    ci_low, ci_high = stats.t.interval(
        confidence=0.95,
        df=n-1,
        loc=sample_mean,
        scale=sem
    )
    
    means.append(sample_mean * 100)  # Convert to percentage
    errors.append((sample_mean - ci_low) * 100)  # Symmetric error

# Create bar chart with error bars
fig, ax = plt.subplots(figsize=(10, 6))

x_pos = np.arange(len(variants))
bars = ax.bar(x_pos, means, yerr=errors, capsize=5, 
              color=['#2ecc71', '#3498db', '#e74c3c', '#9b59b6'],
              edgecolor='black', linewidth=1.2, alpha=0.8)

ax.set_ylabel('Conversion Rate (%)', fontsize=12)
ax.set_xlabel('Variant', fontsize=12)
ax.set_title('A/B Test Results with 95% Confidence Intervals', fontsize=14)
ax.set_xticks(x_pos)
ax.set_xticklabels(variants)
ax.set_ylim(0, 20)

# Add value labels
for i, (bar, mean, err) in enumerate(zip(bars, means, errors)):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + err + 0.5,
            f'{mean:.1f}%', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.savefig('confidence_intervals.png', dpi=150)
plt.show()

A caution on overlapping intervals: two 95% CIs that overlap don’t necessarily mean the difference is insignificant. The proper test examines the CI of the difference between means, not whether the individual CIs overlap.
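One way to test a difference directly is an interval on the difference of means. This sketch uses simulated data and the Welch-Satterthwaite approximation for the degrees of freedom, which is one common choice rather than the only one:

```python
import numpy as np
from scipy import stats

# Simulated conversion data for two variants (assumed rates)
rng = np.random.default_rng(42)
control = rng.binomial(1, 0.12, size=500)
variant = rng.binomial(1, 0.15, size=500)

diff = variant.mean() - control.mean()

# Standard error of the difference combines both groups' standard errors
v1, v2 = stats.sem(control) ** 2, stats.sem(variant) ** 2
se_diff = np.sqrt(v1 + v2)

# Welch-Satterthwaite degrees of freedom
df = (v1 + v2) ** 2 / (v1 ** 2 / (len(control) - 1) + v2 ** 2 / (len(variant) - 1))

t_crit = stats.t.ppf(0.975, df)
print(f"Difference in means: {diff:.4f}")
print(f"95% CI for the difference: "
      f"[{diff - t_crit * se_diff:.4f}, {diff + t_crit * se_diff:.4f}]")
```

If this interval excludes zero, the difference is significant at the 5% level, regardless of whether the individual variant intervals overlap.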

Practical Considerations and Common Pitfalls

Assumptions to verify:

  1. Random sampling: Your sample must represent the population. Convenience samples produce misleading intervals.
  2. Independence: Each observation should be independent. Time-series data or clustered samples violate this.
  3. Normality: For small samples, the underlying data should be approximately normal. For large samples (n > 30), the Central Limit Theorem provides robustness.
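For the normality assumption, one quick screening test is Shapiro-Wilk. This is a sketch, not a guarantee: a non-significant result is absence of evidence rather than proof of normality, and with small samples the test has little power.

```python
import numpy as np
from scipy import stats

response_times = np.array([245, 238, 252, 241, 249, 244, 251, 239, 247, 243])

# Shapiro-Wilk: the null hypothesis is that the data are normally distributed
stat, p_value = stats.shapiro(response_times)
print(f"Shapiro-Wilk statistic: {stat:.3f}, p-value: {p_value:.3f}")

if p_value < 0.05:
    print("Evidence against normality; consider a bootstrap CI instead")
else:
    print("No evidence against normality")
```

A histogram or Q-Q plot alongside the test is usually more informative than the p-value alone.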

Common misinterpretations to avoid:

  • “There’s a 95% probability the true mean is in this interval” — Wrong. The true mean is fixed; your interval either contains it or doesn’t.
  • “95% of the data falls within this interval” — Wrong. That’s a prediction interval, not a confidence interval.
  • “Non-overlapping CIs mean significant difference” — True, and conservatively so, but the converse fails: overlapping CIs can still hide a significant difference.
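The prediction-interval confusion is worth demonstrating with numbers. A prediction interval for a single new observation is much wider than a confidence interval for the mean. A sketch using the response-time data from earlier (the prediction-interval formula shown assumes approximately normal data):

```python
import numpy as np
from scipy import stats

response_times = np.array([245, 238, 252, 241, 249, 244, 251, 239, 247, 243])
n = len(response_times)
mean = response_times.mean()
s = response_times.std(ddof=1)
t_crit = stats.t.ppf(0.975, n - 1)

# Confidence interval: where the population MEAN plausibly lies
ci_margin = t_crit * s / np.sqrt(n)
# Prediction interval: where a single NEW observation plausibly lies
pi_margin = t_crit * s * np.sqrt(1 + 1 / n)

print(f"95% CI for the mean:      [{mean - ci_margin:.1f}, {mean + ci_margin:.1f}]")
print(f"95% prediction interval:  [{mean - pi_margin:.1f}, {mean + pi_margin:.1f}]")
```

The prediction interval is roughly √n times wider because it must cover the spread of individual observations, not just the uncertainty in the mean.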

Quick reference:

Scenario                   Method                          Python Function
Small sample, unknown σ    t-distribution                  scipy.stats.t.interval()
Large sample, unknown σ    t-distribution (safe default)   scipy.stats.t.interval()
Large sample, known σ      z-distribution                  scipy.stats.norm.interval()
Weighted data              t-distribution                  statsmodels.DescrStatsW.tconfint_mean()

Default to the t-distribution. You’ll rarely go wrong, and the difference from z becomes negligible as sample size increases. The extra caution costs you nothing with large samples and protects you with small ones.
