How to Determine Sample Size in R
Key Insights
- Sample size determination requires balancing four interconnected factors: statistical power, significance level, effect size, and variance—changing any one affects the others
- R’s pwr package provides straightforward functions for most common study designs, but you must specify realistic effect sizes based on domain knowledge or pilot data, not wishful thinking
- Always inflate your calculated sample size by 10-20% to account for attrition, missing data, and real-world messiness that power calculations assume away
Introduction to Sample Size Determination
Running a study with too few participants wastes everyone’s time. You’ll likely fail to detect effects that actually exist, leaving you with inconclusive results and nothing to show for your effort. Running with too many participants wastes resources—money, time, and potentially participant goodwill.
Sample size determination sits at the intersection of statistics and practical planning. Get it right, and you’ve set yourself up for a study that can actually answer your research question. Get it wrong, and you’re either underpowered or burning budget unnecessarily.
Three concepts drive sample size calculations:
Statistical power is the probability of detecting an effect when one truly exists. Convention sets this at 0.80 (80%), meaning you accept a 20% chance of missing a real effect. For critical decisions, bump this to 0.90.
Significance level (α) is your false positive tolerance—the probability of claiming an effect exists when it doesn’t. The standard is 0.05, though some fields use stricter thresholds.
Effect size quantifies how large the phenomenon you’re studying actually is. This is where most researchers struggle because it requires making assumptions before collecting data.
Key Factors Affecting Sample Size
Sample size calculations involve a four-way trade-off. Understanding these relationships prevents the common mistake of plugging in numbers without grasping what they mean.
Desired power (1 - β): Higher power demands larger samples. Moving from 80% to 90% power increases required n by roughly a third. Don’t chase 99% power unless you have unlimited resources.
Significance level (α): Stricter alpha requires more participants. Using α = 0.01 instead of 0.05 increases sample size requirements substantially.
Expected effect size: This is the critical input most researchers underestimate. Small effects require massive samples. Because required n scales with 1/d², a “small” effect (d = 0.2) needs roughly six times the sample size of a “medium” effect (d = 0.5).
Variance: More variable outcomes require larger samples to achieve the same precision. If you’re measuring something noisy, plan accordingly.
The fundamental relationship is straightforward: sample size increases as power increases, as alpha decreases, as effect size decreases, and as variance increases.
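You can see all four relationships in the standard normal-approximation formula for a two-sample comparison, n per group ≈ 2((z₁₋α/₂ + z_power) / d)². A base-R sketch (the function name is mine; the exact t-based calculation used by pwr gives slightly larger n):

```r
# Normal-approximation sample size per group for a two-sample test:
# n ≈ 2 * ((z_{1 - alpha/2} + z_{power}) / d)^2
approx_n_per_group <- function(d, sig.level = 0.05, power = 0.80) {
  z_alpha <- qnorm(1 - sig.level / 2)
  z_beta  <- qnorm(power)
  2 * ((z_alpha + z_beta) / d)^2
}

approx_n_per_group(d = 0.5)                    # ~62.8 per group
approx_n_per_group(d = 0.5, power = 0.90)      # higher power -> larger n
approx_n_per_group(d = 0.5, sig.level = 0.01)  # stricter alpha -> larger n
approx_n_per_group(d = 0.2)                    # smaller effect -> ~392 per group
```

Playing with the arguments makes the trade-offs concrete before you commit to a design.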
Sample Size for Comparing Two Means (t-tests)
The pwr package handles most standard power analyses. Install it once and load it for each session:
install.packages("pwr")
library(pwr)
For comparing two independent groups, use pwr.t.test(). The function uses Cohen’s d as the effect size metric—the standardized mean difference between groups.
# Sample size for independent samples t-test
# Detecting medium effect (d = 0.5), 80% power, α = 0.05
power_result <- pwr.t.test(
  d = 0.5,
  sig.level = 0.05,
  power = 0.80,
  type = "two.sample",
  alternative = "two.sided"
)
print(power_result)
# n = 63.76561 per group
# Round up: need 64 participants per group, 128 total
The output tells you that detecting a medium effect between two groups requires 64 participants per group. Always round up—63.5 participants doesn’t make sense.
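The same figure can be reproduced without installing anything: base R’s stats::power.t.test handles the two-sample case, where delta with sd = 1 corresponds to Cohen’s d:

```r
# Base R equivalent of pwr.t.test for the two-sample case:
# with sd = 1, delta is interpreted as Cohen's d
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)$n
# ~63.77 per group, matching pwr.t.test
```

This is a handy spot-check when you want to verify a pwr result independently.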
For paired designs (repeated measures on the same participants), paired tests are more powerful because they eliminate between-subject variability:
# Paired samples t-test
pwr.t.test(
  d = 0.5,
  sig.level = 0.05,
  power = 0.80,
  type = "paired"
)
# n = 33.36713
# Need 34 participants measured twice
Notice the paired design needs 34 participants in total, compared with 64 per group (128 total) for the independent design. This is why within-subject designs are statistically efficient when feasible.
Cohen’s conventions for d: 0.2 (small), 0.5 (medium), 0.8 (large). But don’t blindly use “medium” because it sounds reasonable. Base effect size on prior research, pilot data, or the minimum effect that would be practically meaningful.
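If you have pilot data, estimating d directly beats picking a convention. A minimal helper (the function name and the pilot vectors are hypothetical):

```r
# Cohen's d from two pilot samples: mean difference over pooled SD
cohens_d <- function(x, y) {
  nx <- length(x); ny <- length(y)
  sd_pooled <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / sd_pooled
}

# Hypothetical pilot measurements
pilot_treatment <- c(5.1, 6.0, 5.6, 6.3, 5.8)
pilot_control   <- c(4.8, 5.2, 5.0, 5.5, 4.9)
cohens_d(pilot_treatment, pilot_control)
```

Feed the resulting estimate into pwr.t.test, but treat pilot-based effect sizes cautiously: small pilots estimate d with wide uncertainty.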
Sample Size for Proportions
A/B testing and clinical trials often compare proportions rather than means. The pwr package offers pwr.2p.test() for two-proportion comparisons.
The effect size here is Cohen’s h, calculated from the arcsine transformation of proportions. Use ES.h() to convert raw proportions to h:
# A/B test: baseline conversion 10%, want to detect increase to 15%
baseline_rate <- 0.10
expected_rate <- 0.15
# Calculate effect size h
effect_h <- ES.h(p1 = expected_rate, p2 = baseline_rate)
print(effect_h)
# h = 0.1519
# Calculate required sample size
ab_power <- pwr.2p.test(
  h = effect_h,
  sig.level = 0.05,
  power = 0.80
)
print(ab_power)
# n = 340.2 per group
# Need 341 per variant, 682 total
That 5 percentage point lift from 10% to 15% requires 682 total users. This often surprises product teams who expect to run a quick test with a few hundred visitors.
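Cohen’s h is simple enough to compute from its definition if you want to sanity-check ES.h (the helper name is mine):

```r
# Cohen's h: the difference of arcsine-transformed proportions,
# which is what pwr::ES.h computes
cohens_h <- function(p1, p2) {
  2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))
}

cohens_h(0.15, 0.10)  # ~0.152
cohens_h(0.60, 0.50)  # ~0.201
```

Note that h depends on where the proportions sit, not just their difference: a 5-point lift near 50% yields a larger h (and smaller required n) than the same lift near 10%.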
For testing a single proportion against a known value:
# Test if proportion differs from 50%
pwr.p.test(
  h = ES.h(p1 = 0.60, p2 = 0.50),
  sig.level = 0.05,
  power = 0.80
)
# n = 193.6, need 194 participants
Sample Size for Correlation and Regression
When your research question involves relationships between continuous variables, use pwr.r.test() for simple correlation:
# Detect correlation of r = 0.30 (medium effect)
pwr.r.test(
  r = 0.30,
  sig.level = 0.05,
  power = 0.80
)
# n = 84.07, need 85 participants
For multiple regression, the effect size is Cohen’s f², calculated from R²:
# f² = R² / (1 - R²)
# For R² = 0.15: f² = 0.15 / 0.85 = 0.176
# Sample size for regression with 4 predictors
# Testing overall model fit (R² = 0.15)
pwr.f2.test(
  u = 4,                   # numerator df (number of predictors)
  f2 = 0.15 / (1 - 0.15), # effect size
  sig.level = 0.05,
  power = 0.80
)
# v ≈ 62.7 (denominator df)
# n = v + u + 1 ≈ 62.7 + 4 + 1 ≈ 68
# Need 68 participants
The output gives you v (denominator degrees of freedom). Total sample size is v + u + 1, where u is the number of predictors.
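The two conversions in this section are worth wrapping as tiny helpers (the names are mine):

```r
# Cohen's f^2 from a model R^2
f2_from_r2 <- function(r2) r2 / (1 - r2)

# Total sample size from pwr.f2.test output:
# round v (denominator df) up, then add u + 1
n_from_f2 <- function(v, u) ceiling(v) + u + 1

f2_from_r2(0.15)  # ~0.176
```

Keeping the v-to-n arithmetic in one place avoids the common mistake of reporting v itself as the required sample size.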
For testing individual predictors while controlling for others, the calculation becomes more nuanced. You’re testing the increment in R² attributable to specific predictors:
# Testing 2 predictors of interest, controlling for 3 covariates
# Expected R² change = 0.05 when adding the 2 predictors
pwr.f2.test(
  u = 2,                  # predictors being tested
  f2 = 0.05 / (1 - 0.20), # assuming full model R² = 0.20
  sig.level = 0.05,
  power = 0.80
)
Sample Size for ANOVA Designs
One-way ANOVA compares means across three or more groups. The effect size is Cohen’s f:
# Three-group comparison, medium effect (f = 0.25)
anova_power <- pwr.anova.test(
  k = 3,  # number of groups
  f = 0.25,
  sig.level = 0.05,
  power = 0.80
)
print(anova_power)
# n = 52.39 per group
# Need 53 per group, 159 total
Cohen’s f conventions: 0.10 (small), 0.25 (medium), 0.40 (large). Convert from eta-squared if needed: f = sqrt(η² / (1 - η²)).
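The eta-squared conversion as a one-line helper (the name is mine):

```r
# Cohen's f from eta-squared: f = sqrt(eta2 / (1 - eta2))
f_from_eta2 <- function(eta2) sqrt(eta2 / (1 - eta2))

f_from_eta2(0.0588)  # ~0.25, i.e. a "medium" effect
```

This is useful when prior studies report η² (as ANOVA papers usually do) rather than f.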
For factorial designs, the pwr package becomes insufficient. You’ll need simulation-based approaches or the Superpower package:
install.packages("Superpower")
library(Superpower)
# 2x2 factorial design
design <- ANOVA_design(
  design = "2b*2b",
  n = 50,
  mu = c(5, 5.5, 5, 6),  # cell means
  sd = 1,
  labelnames = c("Factor_A", "low", "high",
                 "Factor_B", "control", "treatment")
)
# Run power simulation
power_result <- ANOVA_power(design, nsims = 1000)
Practical Considerations and Tools
Power calculations assume perfect data collection. Reality is messier. Build in buffers:
Attrition adjustment: If you expect 15% dropout, divide your calculated n by 0.85:
calculated_n <- 64
attrition_rate <- 0.15
adjusted_n <- ceiling(calculated_n / (1 - attrition_rate))
# adjusted_n = 76
Power curves help visualize the sample size-power trade-off and communicate with stakeholders:
library(ggplot2)
# Generate power curve for t-test
sample_sizes <- seq(20, 150, by = 5)
powers <- sapply(sample_sizes, function(n) {
  pwr.t.test(n = n, d = 0.5, sig.level = 0.05)$power
})
power_data <- data.frame(n = sample_sizes, power = powers)
ggplot(power_data, aes(x = n, y = power)) +
  geom_line(linewidth = 1.2, color = "#2563eb") +
  geom_hline(yintercept = 0.80, linetype = "dashed", color = "#dc2626") +
  geom_vline(xintercept = 64, linetype = "dashed", color = "#059669") +
  annotate("text", x = 70, y = 0.75, label = "n = 64", color = "#059669") +
  labs(
    x = "Sample Size per Group",
    y = "Statistical Power",
    title = "Power Curve for Two-Sample t-test (d = 0.5)"
  ) +
  theme_minimal() +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.2))
This visualization shows diminishing returns—power gains flatten as n increases beyond a certain point.
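The flattening is easy to quantify with base R’s power.t.test: the power bought by five extra participants per group shrinks as n grows (the helper name is mine):

```r
# Marginal power gain from adding 5 participants per group (d = 0.5),
# using base R's stats::power.t.test
power_gain <- function(n, d = 0.5, step = 5) {
  power.t.test(n = n + step, delta = d, sd = 1)$power -
    power.t.test(n = n, delta = d, sd = 1)$power
}

power_gain(30)   # a noticeable gain
power_gain(100)  # a much smaller gain
```

Past the 80% target, each additional participant buys very little power, which is the quantitative argument against over-recruiting.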
For complex designs or non-standard analyses, simulation becomes essential. The simr package handles power analysis for mixed-effects models:
library(simr)
# Fit pilot model, then extend and simulate
# model <- lmer(outcome ~ treatment + (1|subject), data = pilot_data)
# powerSim(model, nsim = 500)
The honest truth about sample size determination: it requires making assumptions you can’t fully justify until after data collection. Use the best available information—prior research, pilot studies, expert judgment—but acknowledge uncertainty. When in doubt, err on the side of larger samples. An overpowered study still produces valid results; an underpowered study produces ambiguity.