How to Determine Sample Size in Python
Key Insights
- Sample size determination balances statistical power, effect size, and practical constraints—underpowered studies waste resources while overpowered ones are inefficient
- Python's statsmodels library provides robust power analysis tools for means and proportions, but understanding the underlying formulas helps you adapt calculations to non-standard scenarios
- Always inflate your calculated sample size by 10-20% to account for dropout, missing data, and real-world messiness
Why Sample Size Matters
Getting sample size wrong is one of the most expensive mistakes in applied statistics. Too small, and you lack the statistical power to detect real effects—your experiment fails to show significance even when the effect exists. Too large, and you’ve wasted time, money, and potentially exposed more users to an inferior treatment than necessary.
Sample size determination sits at the intersection of statistics and practical decision-making. Before collecting any data, you need to answer: how many observations do I need to detect an effect of a given size with acceptable confidence?
This article walks through the Python tools and techniques for calculating sample sizes across common scenarios: comparing means, testing proportions, and estimating confidence intervals for surveys.
Key Concepts and Formulas
Four parameters drive sample size calculations:
Effect size measures the magnitude of the difference you’re trying to detect. For means, Cohen’s d standardizes this as the difference divided by the standard deviation. For proportions, it’s derived from the difference in rates.
Significance level (α) is your tolerance for false positives—typically 0.05, meaning a 5% chance of concluding there’s an effect when none exists.
Power (1-β) is the probability of detecting a true effect. Convention sets this at 0.80, giving you an 80% chance of finding a real effect.
Variance captures the spread in your data. Higher variance requires larger samples to detect the same effect size.
The basic formula for comparing two means illustrates the relationships:
```python
import math
from scipy import stats

def sample_size_two_means(effect_size, alpha=0.05, power=0.80):
    """
    Calculate sample size per group for a two-sample t-test.
    Uses the normal approximation for simplicity.
    """
    # Critical values for a two-tailed test
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    # Sample size per group
    n = 2 * ((z_alpha + z_beta) / effect_size) ** 2
    return math.ceil(n)

# Detect a medium effect (d=0.5) with 80% power
n_per_group = sample_size_two_means(effect_size=0.5)
print(f"Required sample size per group: {n_per_group}")
# Output: Required sample size per group: 63
```
This formula uses the normal approximation. For exact calculations, especially with smaller samples, you’ll want the t-distribution—which is where statsmodels becomes invaluable.
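A quick check makes the gap concrete. This sketch compares the normal-approximation function above against statsmodels' t-based calculation for the same medium effect:

```python
import math
from scipy import stats
from statsmodels.stats.power import TTestIndPower

d, alpha, power = 0.5, 0.05, 0.80

# Normal approximation: z critical values only
z = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)
n_normal = math.ceil(2 * (z / d) ** 2)

# Exact calculation using the t-distribution
n_exact = math.ceil(
    TTestIndPower().solve_power(effect_size=d, power=power, alpha=alpha)
)

print(f"Normal approximation: {n_normal}, t-distribution: {n_exact}")
# Output: Normal approximation: 63, t-distribution: 64
```

The t-distribution's heavier tails cost one extra observation per group here; the gap grows as samples get smaller.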
Sample Size for Comparing Means
The statsmodels library provides TTestPower and TTestIndPower classes for sample size calculations involving means. These handle one-sample, two-sample independent, and paired designs.
```python
import math
from statsmodels.stats.power import TTestIndPower, TTestPower

# Initialize the power analysis object
power_analysis = TTestIndPower()

# Two-sample independent t-test
# Detect medium effect (d=0.5), 80% power, alpha=0.05
sample_size = power_analysis.solve_power(
    effect_size=0.5,
    power=0.80,
    alpha=0.05,
    ratio=1.0,  # Equal group sizes
    alternative='two-sided'
)
print(f"Two-sample t-test: {math.ceil(sample_size)} per group")
# Output: Two-sample t-test: 64 per group

# Unequal group allocation (2:1 ratio)
sample_size_unequal = power_analysis.solve_power(
    effect_size=0.5,
    power=0.80,
    alpha=0.05,
    ratio=2.0,  # Group 2 is twice as large as Group 1
    alternative='two-sided'
)
print(f"Unequal allocation (1:2): {math.ceil(sample_size_unequal)} in smaller group")
# Output: Unequal allocation (1:2): 48 in smaller group

# One-sample t-test (comparing to a known value)
one_sample_power = TTestPower()
one_sample_n = one_sample_power.solve_power(
    effect_size=0.5,
    power=0.80,
    alpha=0.05,
    alternative='two-sided'
)
print(f"One-sample t-test: {math.ceil(one_sample_n)} observations")
# Output: One-sample t-test: 34 observations
```
The solve_power method is flexible—you can solve for any parameter by leaving it as None. Need to know what power you'll achieve with a fixed sample size? Set nobs1 and leave power=None.
```python
# What power do we have with 50 subjects per group?
achieved_power = power_analysis.solve_power(
    effect_size=0.5,
    nobs1=50,
    alpha=0.05,
    ratio=1.0,
    alternative='two-sided'
)
print(f"Power with n=50 per group: {achieved_power:.2%}")
# Output: Power with n=50 per group: 69.69%
```
Sample Size for Proportions
A/B tests and conversion rate experiments require sample size calculations for proportions. The approach differs because effect size is computed from the proportion difference.
```python
import math
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Scenario: current conversion rate is 10%, and we want to detect
# a lift to 12% (a 20% relative improvement)
baseline_rate = 0.10
expected_rate = 0.12

# Calculate Cohen's h (effect size for proportions)
effect_size = proportion_effectsize(expected_rate, baseline_rate)
print(f"Cohen's h effect size: {effect_size:.4f}")
# Output: Cohen's h effect size: 0.0640

# Calculate required sample size
proportion_power = NormalIndPower()
sample_size = proportion_power.solve_power(
    effect_size=effect_size,
    power=0.80,
    alpha=0.05,
    ratio=1.0,
    alternative='two-sided'
)
print(f"Required sample size per variant: {math.ceil(sample_size)}")
# Output: Required sample size per variant: 3835
```
Notice how detecting a 2 percentage point lift requires nearly 4,000 users per variant. This is why A/B tests on low-traffic sites often fail—they’re dramatically underpowered.
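One way to reason about that constraint is to flip the calculation: fix the traffic you actually have and solve for the smallest lift you could reliably detect. A sketch, assuming a hypothetical 1,000 users per variant (the traffic figure is illustrative):

```python
import numpy as np
from statsmodels.stats.power import NormalIndPower

baseline_rate = 0.10
n_per_variant = 1000  # assumed available traffic

# Solve for the smallest detectable Cohen's h at this sample size
h = NormalIndPower().solve_power(
    effect_size=None,
    nobs1=n_per_variant,
    power=0.80,
    alpha=0.05,
    ratio=1.0,
    alternative='two-sided'
)

# Invert the arcsine transform behind Cohen's h to recover the rate
detectable_rate = np.sin(np.arcsin(np.sqrt(baseline_rate)) + h / 2) ** 2
print(f"Minimum detectable rate: {detectable_rate:.3f}")
# Output: Minimum detectable rate: 0.141
```

With only 1,000 users per variant, you can reliably detect a jump from 10% to roughly 14%—far larger than the 2-point lift most teams hope to find.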
For one-sided tests (you only care if the new variant is better), you can reduce sample size:
```python
# One-sided test: only detect improvements
sample_size_one_sided = proportion_power.solve_power(
    effect_size=effect_size,
    power=0.80,
    alpha=0.05,
    ratio=1.0,
    alternative='larger'
)
print(f"One-sided test: {math.ceil(sample_size_one_sided)} per variant")
# Output: One-sided test: 3021 per variant
```
Sample Size for Surveys and Confidence Intervals
Not all sample size calculations involve hypothesis testing. For surveys and descriptive studies, you often want to estimate a parameter with a specific margin of error.
```python
import math
from scipy import stats

def sample_size_proportion_margin(
    margin_of_error,
    confidence_level=0.95,
    estimated_proportion=0.5,
    population_size=None
):
    """
    Calculate sample size for estimating a proportion with
    a specified margin of error.

    Parameters
    ----------
    margin_of_error : float
        Desired margin of error (e.g., 0.03 for ±3%)
    confidence_level : float
        Confidence level (default 0.95)
    estimated_proportion : float
        Expected proportion (use 0.5 for maximum variance)
    population_size : int, optional
        Finite population size for correction

    Returns
    -------
    int
        Required sample size
    """
    z = stats.norm.ppf(1 - (1 - confidence_level) / 2)
    p = estimated_proportion

    # Initial sample size (infinite population)
    n = (z ** 2 * p * (1 - p)) / (margin_of_error ** 2)

    # Apply finite population correction if specified
    if population_size is not None:
        n = n / (1 + (n - 1) / population_size)

    return math.ceil(n)

# Survey with ±3% margin of error, 95% confidence
n_survey = sample_size_proportion_margin(
    margin_of_error=0.03,
    confidence_level=0.95,
    estimated_proportion=0.5
)
print(f"Survey sample size (±3%): {n_survey}")
# Output: Survey sample size (±3%): 1068

# Same survey but from a company with 5,000 employees
n_finite = sample_size_proportion_margin(
    margin_of_error=0.03,
    confidence_level=0.95,
    estimated_proportion=0.5,
    population_size=5000
)
print(f"With finite population correction: {n_finite}")
# Output: With finite population correction: 880
```
Using 0.5 for the estimated proportion is conservative—it maximizes variance and thus sample size. If you have prior knowledge that the true proportion is near 0.1 or 0.9, you can use that for a smaller required sample.
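To see how much the conservative p = 0.5 assumption costs, this sketch recomputes the infinite-population formula at a few assumed proportions:

```python
import math
from scipy import stats

z = stats.norm.ppf(0.975)  # 95% confidence
margin = 0.03              # ±3% margin of error

# Required n shrinks as the assumed proportion moves away from 0.5
for p in (0.5, 0.3, 0.1):
    n = math.ceil(z ** 2 * p * (1 - p) / margin ** 2)
    print(f"p = {p}: n = {n}")
# Output:
# p = 0.5: n = 1068
# p = 0.3: n = 897
# p = 0.1: n = 385
```

A defensible prior near 0.1 cuts the required sample by almost two-thirds—but if your guess is wrong and the true proportion is closer to 0.5, your margin of error will be wider than promised.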
Power Analysis Visualization
Power curves help stakeholders understand the tradeoffs between sample size and detectable effects. They’re particularly useful when negotiating study parameters with non-technical collaborators.
```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()

# Sample sizes to evaluate
sample_sizes = np.arange(10, 201, 5)

# Effect sizes to compare
effect_sizes = [0.2, 0.5, 0.8]  # Small, medium, large
labels = ['Small (d=0.2)', 'Medium (d=0.5)', 'Large (d=0.8)']
colors = ['#e74c3c', '#3498db', '#2ecc71']

plt.figure(figsize=(10, 6))

for effect_size, label, color in zip(effect_sizes, labels, colors):
    powers = [
        power_analysis.solve_power(
            effect_size=effect_size,
            nobs1=n,
            alpha=0.05,
            ratio=1.0,
            alternative='two-sided'
        )
        for n in sample_sizes
    ]
    plt.plot(sample_sizes, powers, label=label, color=color, linewidth=2)

# Add reference lines
plt.axhline(y=0.80, color='gray', linestyle='--', alpha=0.7, label='80% Power')
plt.axhline(y=0.90, color='gray', linestyle=':', alpha=0.7, label='90% Power')

plt.xlabel('Sample Size per Group', fontsize=12)
plt.ylabel('Statistical Power', fontsize=12)
plt.title('Power Curves for Two-Sample t-Test (α = 0.05)', fontsize=14)
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.ylim(0, 1)
plt.tight_layout()
plt.savefig('power_curves.png', dpi=150)
plt.show()
```
This visualization immediately communicates that detecting small effects requires substantially larger samples. A medium effect needs about 64 per group for 80% power, while a small effect needs nearly 400.
Practical Considerations and Best Practices
Inflate for attrition. Real studies lose participants. If you expect 15% dropout, divide your calculated sample size by 0.85. For a study needing 100 completers, recruit 118.
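That adjustment is a one-liner; a sketch with an assumed 15% dropout rate:

```python
import math

n_required = 100    # completers needed, from the power analysis
dropout_rate = 0.15  # assumed attrition

# Divide by the expected completion rate, then round up
n_recruit = math.ceil(n_required / (1 - dropout_rate))
print(f"Recruit {n_recruit} to end with {n_required} completers")
# Output: Recruit 118 to end with 100 completers
```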
Choose realistic effect sizes. Don’t pick effect sizes to get a convenient sample size. Use pilot data, literature reviews, or the minimum effect that would be practically meaningful. Optimistic effect sizes lead to underpowered studies.
Consider alternatives to statsmodels. The pingouin library offers a cleaner API for common scenarios:
```python
import pingouin as pg

# Two-sample t-test power analysis
n = pg.power_ttest(d=0.5, power=0.8, alpha=0.05, contrast='two-samples')
print(f"Pingouin result: {n:.1f} per group")
# Output: Pingouin result: 63.8 per group
```
Document your assumptions. Sample size calculations depend on assumptions about effect size and variance. When these assumptions are wrong, your study may be underpowered. Record your reasoning so you can evaluate it post-hoc.
Run sensitivity analyses. Calculate sample sizes across a range of plausible effect sizes. Present stakeholders with scenarios: “If the effect is medium, we need 64 per group. If it’s small, we need 394.”
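A sensitivity table like that is a short loop. This sketch sweeps a range of plausible effect sizes (the particular d values are illustrative):

```python
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Required n per group across a range of plausible effect sizes
for d in (0.2, 0.3, 0.5, 0.8):
    n = analysis.solve_power(effect_size=d, power=0.80, alpha=0.05)
    print(f"d = {d}: {math.ceil(n)} per group")
```

Presenting the whole table lets stakeholders choose a budget knowing exactly what it buys in detectable effect.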
Sample size determination isn’t a one-time calculation—it’s an iterative process of balancing statistical rigor against practical constraints. Python’s tools make the calculations straightforward; the hard part is choosing the right inputs.