How to Perform Bonferroni Correction in R
Key Insights
- Bonferroni correction controls family-wise error rate by dividing your significance threshold (α) by the number of tests, making it simple to implement but potentially too conservative for large test batteries.
- R provides multiple approaches: manual calculation for transparency, p.adjust() for flexibility, and pairwise.t.test() for integrated post-hoc analysis.
- Consider alternatives like Holm’s method (uniformly more powerful) or Benjamini-Hochberg (controls false discovery rate) when running many tests or when Bonferroni’s conservatism causes unacceptable power loss.
The Multiple Comparisons Problem
Every time you run a statistical test at α = 0.05, you accept a 5% chance of a false positive. Run one test, and that’s manageable. Run twenty tests, and you’re almost guaranteed to find something “significant” by chance alone.
The math is straightforward. If each test has a 5% false positive rate, the probability of at least one false positive across n independent tests is:
P(at least one false positive) = 1 - (1 - α)^n
For 20 tests at α = 0.05:
1 - (1 - 0.05)^20
# [1] 0.6415141
That’s a 64% chance of at least one spurious finding. This inflated error rate across a family of tests is called the family-wise error rate (FWER). Controlling it is essential when you need to be confident that all your significant findings are real.
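You can verify this inflation empirically. The sketch below (sample sizes and the replication count are arbitrary illustrative choices, not prescribed values) repeatedly runs 20 t-tests on pure-noise data and counts how often at least one comes back "significant":

```r
# Estimate the FWER by simulation: run 20 t-tests on pure-noise data
# many times and record how often at least one p-value falls below 0.05.
set.seed(123)
n_sims  <- 2000   # number of simulated "studies" (illustrative choice)
n_tests <- 20     # tests per study

at_least_one_fp <- replicate(n_sims, {
  p <- replicate(n_tests, t.test(rnorm(30), rnorm(30))$p.value)
  any(p < 0.05)   # did this family produce at least one false positive?
})

mean(at_least_one_fp)   # should land near 1 - 0.95^20, i.e. about 0.64
```

Every null hypothesis here is true by construction, so any rejection is a false positive.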
What is Bonferroni Correction?
Bonferroni correction is the simplest and most widely known method for controlling FWER. The logic is direct: if you’re running n tests and want to maintain an overall α level, test each individual hypothesis at α/n.
For 20 tests with a desired FWER of 0.05:
adjusted_alpha <- 0.05 / 20
adjusted_alpha
# [1] 0.0025
A p-value must fall below 0.0025 to be considered significant.
When to use Bonferroni:
- You need strict control over false positives (confirmatory research, clinical trials)
- You’re running a small number of planned comparisons
- The consequences of a false positive are severe
Assumptions and limitations:
- Works regardless of whether tests are independent or correlated
- Becomes increasingly conservative as the number of tests grows
- May substantially reduce statistical power, increasing false negatives
The conservatism is a feature, not a bug, when false positives are costly. But it’s a liability when you’re running exploratory analyses or when missing true effects matters.
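To see how quickly that conservatism bites, you can tabulate the per-test threshold for growing families of tests (the test counts below are illustrative choices):

```r
# Per-test significance threshold under Bonferroni as the family grows
alpha   <- 0.05
n_tests <- c(1, 5, 10, 20, 50, 100)
thresholds <- alpha / n_tests
setNames(thresholds, n_tests)
# thresholds: 0.05, 0.01, 0.005, 0.0025, 0.001, 0.0005
```

By 100 tests, a p-value must beat 0.0005, a bar that many real but modest effects will fail to clear.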
Manual Bonferroni Correction in R
Understanding the manual approach builds intuition before using convenience functions. Let’s say you’ve run five t-tests comparing treatment groups and obtained these p-values:
# Raw p-values from five separate t-tests
p_values <- c(0.003, 0.012, 0.048, 0.067, 0.210)
n_tests <- length(p_values)
# Calculate Bonferroni-adjusted alpha
alpha <- 0.05
adjusted_alpha <- alpha / n_tests
adjusted_alpha
# [1] 0.01
# Determine which tests remain significant
significant <- p_values < adjusted_alpha
results <- data.frame(
test = paste0("Test_", 1:n_tests),
p_value = p_values,
significant_raw = p_values < alpha,
significant_bonferroni = significant
)
print(results)
Output:
    test p_value significant_raw significant_bonferroni
1 Test_1   0.003            TRUE                   TRUE
2 Test_2   0.012            TRUE                  FALSE
3 Test_3   0.048            TRUE                  FALSE
4 Test_4   0.067           FALSE                  FALSE
5 Test_5   0.210           FALSE                  FALSE
Three tests were significant at the raw α = 0.05 level. After Bonferroni correction, only one survives. This illustrates both the method’s purpose and its cost.
Alternatively, you can multiply p-values by n instead of dividing alpha:
# Adjusted p-values (capped at 1)
adjusted_p <- pmin(p_values * n_tests, 1)
adjusted_p
# [1] 0.015 0.060 0.240 0.335 1.000
Both approaches are mathematically equivalent. Adjusted p-values are often preferred because they can be directly compared against the original α threshold.
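A quick sanity check, reusing the same five p-values from above, confirms the two formulations make identical decisions:

```r
# Equivalence check: alpha/n threshold vs. n*p adjusted p-values
p_values <- c(0.003, 0.012, 0.048, 0.067, 0.210)
n <- length(p_values)

decision_alpha <- p_values < (0.05 / n)          # compare raw p to alpha/n
decision_adjp  <- pmin(p_values * n, 1) < 0.05   # compare adjusted p to alpha

identical(decision_alpha, decision_adjp)
# [1] TRUE
```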
Using p.adjust() for Bonferroni Correction
R’s p.adjust() function handles multiple comparison corrections cleanly. It supports several methods, including Bonferroni:
# Simulate a realistic scenario: multiple t-tests comparing groups
set.seed(42)
# Generate data: 4 treatment groups, one control
control <- rnorm(30, mean = 100, sd = 15)
treatment_a <- rnorm(30, mean = 105, sd = 15) # Small effect
treatment_b <- rnorm(30, mean = 100, sd = 15) # No effect
treatment_c <- rnorm(30, mean = 112, sd = 15) # Moderate effect
treatment_d <- rnorm(30, mean = 100, sd = 15) # No effect
# Run t-tests comparing each treatment to control
p_values <- c(
t.test(treatment_a, control)$p.value,
t.test(treatment_b, control)$p.value,
t.test(treatment_c, control)$p.value,
t.test(treatment_d, control)$p.value
)
# Apply Bonferroni correction
adjusted_p <- p.adjust(p_values, method = "bonferroni")
# Compare results
comparison <- data.frame(
treatment = c("A", "B", "C", "D"),
raw_p = round(p_values, 4),
bonferroni_p = round(adjusted_p, 4),
raw_sig = p_values < 0.05,
bonferroni_sig = adjusted_p < 0.05
)
print(comparison)
The p.adjust() function multiplies each p-value by the number of tests (capping at 1). This is cleaner than manual calculation and less error-prone.
Available methods in p.adjust():
p.adjust.methods
# [1] "holm" "hochberg" "hommel" "bonferroni" "BH" "BY" "fdr" "none"
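Because the method names come back as a character vector, you can run all of them over the same p-values in a single sapply() call, a convenient way to compare the corrections side by side (the p-values below are the illustrative ones from earlier):

```r
# Apply every available correction method to the same p-values at once;
# the result is a matrix with one column per method.
p_values <- c(0.003, 0.012, 0.048, 0.067, 0.210)
all_methods <- sapply(p.adjust.methods,
                      function(m) p.adjust(p_values, method = m))
round(all_methods, 4)
```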
Bonferroni Correction with pairwise.t.test()
When performing post-hoc pairwise comparisons after ANOVA, pairwise.t.test() integrates the correction directly:
# Use the iris dataset: compare Sepal.Length across species
data(iris)
# First, confirm there's a significant overall effect
anova_result <- aov(Sepal.Length ~ Species, data = iris)
summary(anova_result)
             Df Sum Sq Mean Sq F value Pr(>F)
Species       2  63.21  31.606   119.3 <2e-16 ***
Residuals   147  38.96   0.265
With a significant ANOVA, we proceed to pairwise comparisons:
# Pairwise t-tests with Bonferroni correction
pairwise_results <- pairwise.t.test(
iris$Sepal.Length,
iris$Species,
p.adjust.method = "bonferroni"
)
print(pairwise_results)
Pairwise comparisons using t tests with pooled SD
data: iris$Sepal.Length and iris$Species
           setosa  versicolor
versicolor 1.5e-15 -
virginica  < 2e-16 8.8e-09
P value adjustment method: bonferroni
All pairwise comparisons remain highly significant after correction. Compare this to the unadjusted version:
pairwise.t.test(
iris$Sepal.Length,
iris$Species,
p.adjust.method = "none"
)
In this case, the effects are large enough that correction doesn’t change conclusions. With subtler effects, you’d see differences.
Limitations and Alternatives
Bonferroni’s conservatism becomes problematic as test counts increase. With 100 tests, you need p < 0.0005 for significance. Genuine effects get missed.
Holm’s method (also called Holm-Bonferroni) provides the same FWER control with uniformly greater power. It’s a step-down procedure that applies progressively less stringent thresholds:
# Compare Bonferroni vs. Holm on the same p-values
p_values <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.5)
comparison <- data.frame(
raw_p = p_values,
bonferroni = p.adjust(p_values, method = "bonferroni"),
holm = p.adjust(p_values, method = "holm")
)
comparison$bonf_sig <- comparison$bonferroni < 0.05
comparison$holm_sig <- comparison$holm < 0.05
print(comparison)
  raw_p bonferroni  holm bonf_sig holm_sig
1 0.001      0.007 0.007     TRUE     TRUE
2 0.008      0.056 0.048    FALSE     TRUE
3 0.039      0.273 0.195    FALSE    FALSE
4 0.041      0.287 0.195    FALSE    FALSE
5 0.042      0.294 0.195    FALSE    FALSE
6 0.060      0.420 0.195    FALSE    FALSE
7 0.500      1.000 0.500    FALSE    FALSE
Holm identifies one additional significant result. There’s rarely a reason to prefer Bonferroni over Holm—Holm is never less powerful and controls the same error rate.
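For intuition, Holm's step-down rule can be reproduced by hand: sort the p-values, multiply the i-th smallest by (n - i + 1), then enforce monotonicity with a running maximum. This sketch matches p.adjust(..., method = "holm") on the p-values above:

```r
# Manual Holm adjustment, reproducing p.adjust(method = "holm")
p_values <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.5)
n <- length(p_values)
o <- order(p_values)   # indices from smallest to largest p-value

holm_manual <- numeric(n)
# i-th smallest p-value is multiplied by (n - i + 1); cummax() enforces
# that adjusted p-values never decrease as the raw p-values increase.
holm_manual[o] <- pmin(1, cummax((n - seq_len(n) + 1) * p_values[o]))

all.equal(holm_manual, p.adjust(p_values, method = "holm"))
# [1] TRUE
```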
Benjamini-Hochberg (BH) controls the false discovery rate (FDR) instead of FWER. It’s appropriate when you can tolerate some false positives among your discoveries:
p.adjust(p_values, method = "BH")
# [1] 0.0070 0.0280 0.0588 0.0588 0.0588 0.0700 0.5000
BH is less conservative and better suited for exploratory analyses or high-throughput studies (genomics, neuroimaging).
Tukey’s HSD is designed specifically for pairwise comparisons after ANOVA and accounts for the correlation structure:
TukeyHSD(anova_result)
Summary and Best Practices
Quick reference for Bonferroni in R:
| Task | Function |
|---|---|
| Adjust p-values | p.adjust(p_values, method = "bonferroni") |
| Pairwise comparisons | pairwise.t.test(..., p.adjust.method = "bonferroni") |
| Manual calculation | adjusted_alpha = alpha / n_tests |
Decision guidelines:
- Use Bonferroni when: You have few planned comparisons, false positives are costly, and you want a simple, defensible method.
- Use Holm instead when: You want FWER control but more power. There’s no downside to choosing Holm over Bonferroni.
- Use Benjamini-Hochberg when: You’re running many tests, can tolerate some false discoveries, or are doing exploratory work.
- Use Tukey’s HSD when: You’re doing all pairwise comparisons after a one-way ANOVA.
- Consider no correction when: Tests address genuinely independent scientific questions that wouldn’t be interpreted together.
The multiple comparisons problem is real, but overcorrection is also a problem. Match your correction method to your research context, error tolerance, and the number of tests you’re running.