How to Perform Bonferroni Correction in R

Key Insights

  • Bonferroni correction controls family-wise error rate by dividing your significance threshold (α) by the number of tests, making it simple to implement but potentially too conservative for large test batteries.
  • R provides multiple approaches: manual calculation for transparency, p.adjust() for flexibility, and pairwise.t.test() for integrated post-hoc analysis.
  • Consider alternatives like Holm’s method (uniformly more powerful) or Benjamini-Hochberg (controls false discovery rate) when running many tests or when Bonferroni’s conservatism causes unacceptable power loss.

The Multiple Comparisons Problem

Every time you run a statistical test at α = 0.05, you accept a 5% chance of a false positive. Run one test, and that’s manageable. Run twenty tests, and you’re almost guaranteed to find something “significant” by chance alone.

The math is straightforward. If each test has a 5% false positive rate, the probability of at least one false positive across n independent tests is:

P(at least one false positive) = 1 - (1 - α)^n

For 20 tests at α = 0.05:

1 - (1 - 0.05)^20
# [1] 0.6415141

That’s a 64% chance of at least one spurious finding. This inflated error rate across a family of tests is called the family-wise error rate (FWER). Controlling it is essential when you need to be confident that all your significant findings are real.
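
Plugging other test counts into the same formula shows how fast the error rate climbs; a quick sketch in base R:

```r
# Family-wise error rate as the number of independent tests grows
alpha <- 0.05
n_tests <- c(1, 5, 10, 20, 50, 100)
fwer <- 1 - (1 - alpha)^n_tests
round(fwer, 3)
# [1] 0.050 0.226 0.401 0.642 0.923 0.994
```

By 50 tests, a spurious "significant" result is more likely than not; by 100, it is a near certainty.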

What is Bonferroni Correction?

Bonferroni correction is the simplest and most widely known method for controlling FWER. The logic is direct: if you’re running n tests and want to maintain an overall α level, test each individual hypothesis at α/n.

For 20 tests with a desired FWER of 0.05:

adjusted_alpha <- 0.05 / 20
adjusted_alpha
# [1] 0.0025

A p-value must fall below 0.0025 to be considered significant.

When to use Bonferroni:

  • You need strict control over false positives (confirmatory research, clinical trials)
  • You’re running a small number of planned comparisons
  • The consequences of a false positive are severe

Assumptions and limitations:

  • Works regardless of whether tests are independent or correlated
  • Becomes increasingly conservative as the number of tests grows
  • May substantially reduce statistical power, increasing false negatives

The conservatism is a feature, not a bug, when false positives are costly. But it’s a liability when you’re running exploratory analyses or when missing true effects matters.
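
That trade-off is easy to see by simulation. The sketch below is an illustration under stated assumptions (20 independent null t-tests per family, 500 simulated families): it estimates how often at least one false positive appears with and without the correction.

```r
# Estimate the realized FWER for a family of null tests, by simulation
set.seed(123)
n_tests <- 20
n_sims  <- 500
alpha   <- 0.05

family_error_rate <- function(threshold) {
  mean(replicate(n_sims, {
    p <- replicate(n_tests, t.test(rnorm(30), rnorm(30))$p.value)
    any(p < threshold)   # did this family produce a false positive?
  }))
}

family_error_rate(alpha)            # uncorrected: roughly the theoretical 0.64
family_error_rate(alpha / n_tests)  # Bonferroni: held at or below about 0.05
```

Because every null hypothesis is true here, any rejection is a false positive; the uncorrected rate lands near the 64% computed above, while the Bonferroni threshold keeps it at the nominal level.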

Manual Bonferroni Correction in R

Understanding the manual approach builds intuition before using convenience functions. Let’s say you’ve run five t-tests comparing treatment groups and obtained these p-values:

# Raw p-values from five separate t-tests
p_values <- c(0.003, 0.012, 0.048, 0.067, 0.210)
n_tests <- length(p_values)

# Calculate Bonferroni-adjusted alpha
alpha <- 0.05
adjusted_alpha <- alpha / n_tests
adjusted_alpha
# [1] 0.01

# Determine which tests remain significant
significant <- p_values < adjusted_alpha
results <- data.frame(
  test = paste0("Test_", 1:n_tests),
  p_value = p_values,
  significant_raw = p_values < alpha,
  significant_bonferroni = significant
)
print(results)

Output:

    test p_value significant_raw significant_bonferroni
1 Test_1   0.003            TRUE                   TRUE
2 Test_2   0.012            TRUE                  FALSE
3 Test_3   0.048            TRUE                  FALSE
4 Test_4   0.067           FALSE                  FALSE
5 Test_5   0.210           FALSE                  FALSE

Three tests were significant at the raw α = 0.05 level. After Bonferroni correction, only one survives. This illustrates both the method’s purpose and its cost.

Alternatively, you can multiply p-values by n instead of dividing alpha:

# Adjusted p-values (capped at 1)
adjusted_p <- pmin(p_values * n_tests, 1)
adjusted_p
# [1] 0.015 0.060 0.240 0.335 1.000

Both approaches are mathematically equivalent. Adjusted p-values are often preferred because they can be directly compared against the original α threshold.
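
A quick check confirms the two decision rules agree:

```r
# Same five p-values as above; shrinking alpha and inflating p give identical verdicts
p_values <- c(0.003, 0.012, 0.048, 0.067, 0.210)
n_tests <- length(p_values)
alpha <- 0.05

rule_alpha <- p_values < alpha / n_tests           # compare raw p to alpha/n
rule_p     <- pmin(p_values * n_tests, 1) < alpha  # compare adjusted p to alpha

identical(rule_alpha, rule_p)
# [1] TRUE
```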

Using p.adjust() for Bonferroni Correction

R’s p.adjust() function handles multiple comparison corrections cleanly. It supports several methods, including Bonferroni:

# Simulate a realistic scenario: multiple t-tests comparing groups
set.seed(42)

# Generate data: 4 treatment groups, one control
control <- rnorm(30, mean = 100, sd = 15)
treatment_a <- rnorm(30, mean = 105, sd = 15)  # Small effect
treatment_b <- rnorm(30, mean = 100, sd = 15)  # No effect
treatment_c <- rnorm(30, mean = 112, sd = 15)  # Moderate effect
treatment_d <- rnorm(30, mean = 100, sd = 15)  # No effect

# Run t-tests comparing each treatment to control
p_values <- c(
  t.test(treatment_a, control)$p.value,
  t.test(treatment_b, control)$p.value,
  t.test(treatment_c, control)$p.value,
  t.test(treatment_d, control)$p.value
)

# Apply Bonferroni correction
adjusted_p <- p.adjust(p_values, method = "bonferroni")

# Compare results
comparison <- data.frame(
  treatment = c("A", "B", "C", "D"),
  raw_p = round(p_values, 4),
  bonferroni_p = round(adjusted_p, 4),
  raw_sig = p_values < 0.05,
  bonferroni_sig = adjusted_p < 0.05
)
print(comparison)

The p.adjust() function multiplies each p-value by the number of tests (capping at 1). This is cleaner than manual calculation and less error-prone.
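
You can confirm this against the manual calculation from the previous section:

```r
# p.adjust() reproduces pmin(p * n, 1) exactly
p_values <- c(0.003, 0.012, 0.048, 0.067, 0.210)

manual   <- pmin(p_values * length(p_values), 1)
built_in <- p.adjust(p_values, method = "bonferroni")

all.equal(manual, built_in)
# [1] TRUE
```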

Available methods in p.adjust():

p.adjust.methods
# [1] "holm" "hochberg" "hommel" "bonferroni" "BH" "BY" "fdr" "none"

Bonferroni Correction with pairwise.t.test()

When performing post-hoc pairwise comparisons after ANOVA, pairwise.t.test() integrates the correction directly:

# Use the iris dataset: compare Sepal.Length across species
data(iris)

# First, confirm there's a significant overall effect
anova_result <- aov(Sepal.Length ~ Species, data = iris)
summary(anova_result)
             Df Sum Sq Mean Sq F value Pr(>F)    
Species       2  63.21  31.606   119.3 <2e-16 ***
Residuals   147  38.96   0.265                   

With a significant ANOVA, we proceed to pairwise comparisons:

# Pairwise t-tests with Bonferroni correction
pairwise_results <- pairwise.t.test(
  iris$Sepal.Length,
  iris$Species,
  p.adjust.method = "bonferroni"
)
print(pairwise_results)
	Pairwise comparisons using t tests with pooled SD 

data:  iris$Sepal.Length and iris$Species 

           setosa  versicolor
versicolor 1.5e-15 -         
virginica  < 2e-16 8.8e-09   

P value adjustment method: bonferroni 

All pairwise comparisons remain highly significant after correction. Compare this to the unadjusted version:

pairwise.t.test(
  iris$Sepal.Length,
  iris$Species,
  p.adjust.method = "none"
)

In this case, the effects are large enough that correction doesn’t change conclusions. With subtler effects, you’d see differences.
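
If you need the adjusted p-values programmatically rather than printed, the returned object stores them in a matrix; a brief sketch reusing the same call:

```r
# The pairwise.htest object exposes adjusted p-values via $p.value
data(iris)
pairwise_results <- pairwise.t.test(
  iris$Sepal.Length,
  iris$Species,
  p.adjust.method = "bonferroni"
)

# Lower-triangular matrix of Bonferroni-adjusted p-values (NA above the diagonal)
pairwise_results$p.value

# Which comparisons survive at alpha = 0.05
pairwise_results$p.value < 0.05
```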

Limitations and Alternatives

Bonferroni’s conservatism becomes problematic as test counts increase. With 100 tests, you need p < 0.0005 for significance. Genuine effects get missed.

Holm’s method (also called Holm-Bonferroni) provides the same FWER control with uniformly greater power. It’s a step-down procedure that applies progressively less stringent thresholds:

# Compare Bonferroni vs. Holm on the same p-values
p_values <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.5)

comparison <- data.frame(
  raw_p = p_values,
  bonferroni = p.adjust(p_values, method = "bonferroni"),
  holm = p.adjust(p_values, method = "holm")
)
comparison$bonf_sig <- comparison$bonferroni < 0.05
comparison$holm_sig <- comparison$holm < 0.05

print(comparison)
  raw_p bonferroni  holm bonf_sig holm_sig
1 0.001      0.007 0.007     TRUE     TRUE
2 0.008      0.056 0.048    FALSE     TRUE
3 0.039      0.273 0.195    FALSE    FALSE
4 0.041      0.287 0.195    FALSE    FALSE
5 0.042      0.294 0.195    FALSE    FALSE
6 0.060      0.420 0.195    FALSE    FALSE
7 0.500      1.000 0.500    FALSE    FALSE

Holm identifies one additional significant result. There’s rarely a reason to prefer Bonferroni over Holm—Holm is never less powerful and controls the same error rate.
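
To build intuition for the step-down logic, here is a minimal re-implementation of Holm's adjustment; it mirrors what p.adjust(method = "holm") computes internally (multipliers n, n-1, ..., 1 on the sorted p-values, followed by a monotonicity fix):

```r
# Minimal Holm adjustment: sort, apply decreasing multipliers, enforce monotonicity
holm_adjust <- function(p) {
  n <- length(p)
  o <- order(p)                        # positions of p-values, ascending
  adj <- (n - seq_len(n) + 1) * p[o]   # multipliers n, n-1, ..., 1
  adj <- pmin(cummax(adj), 1)          # adjusted values must be non-decreasing, capped at 1
  adj[order(o)]                        # return to the original order
}

p_values <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.5)
all.equal(holm_adjust(p_values), p.adjust(p_values, method = "holm"))
# [1] TRUE
```

Only the smallest p-value faces the full Bonferroni multiplier; each subsequent one gets a gentler factor, which is where the extra power comes from.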

Benjamini-Hochberg (BH) controls the false discovery rate (FDR) instead of FWER. It’s appropriate when you can tolerate some false positives among your discoveries:

p.adjust(p_values, method = "BH")
# [1] 0.0070 0.0280 0.0588 0.0588 0.0588 0.0700 0.5000

BH is less conservative and better suited for exploratory analyses or high-throughput studies (genomics, neuroimaging).
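
Counting discoveries at α = 0.05 makes the differences concrete for these seven p-values (in this small example BH happens to match Holm's count; with many tests and several true effects, BH typically flags more):

```r
p_values <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.5)

# Number of significant results under each adjustment method
sapply(c("bonferroni", "holm", "BH"), function(m) {
  sum(p.adjust(p_values, method = m) < 0.05)
})
# bonferroni: 1, holm: 2, BH: 2
```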

Tukey’s HSD is designed specifically for pairwise comparisons after ANOVA and accounts for the correlation structure:

TukeyHSD(anova_result)
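
TukeyHSD() also returns a confidence interval for each pairwise difference, not just an adjusted p-value; a sketch assuming the same iris ANOVA fitted above:

```r
# Tukey's HSD on the iris one-way ANOVA; accounts for the all-pairwise structure
data(iris)
anova_result <- aov(Sepal.Length ~ Species, data = iris)

tukey <- TukeyHSD(anova_result)
tukey$Species  # columns: diff, lwr, upr, p adj (one row per species pair)
```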

Summary and Best Practices

Quick reference for Bonferroni in R:

Task                   Function
Adjust p-values        p.adjust(p_values, method = "bonferroni")
Pairwise comparisons   pairwise.t.test(..., p.adjust.method = "bonferroni")
Manual calculation     adjusted_alpha <- alpha / n_tests

Decision guidelines:

  1. Use Bonferroni when: You have few planned comparisons, false positives are costly, and you want a simple, defensible method.

  2. Use Holm instead when: You want FWER control but more power. There’s no downside to choosing Holm over Bonferroni.

  3. Use Benjamini-Hochberg when: You’re running many tests, can tolerate some false discoveries, or are doing exploratory work.

  4. Use Tukey’s HSD when: You’re doing all pairwise comparisons after a one-way ANOVA.

  5. Consider no correction when: Tests address genuinely independent scientific questions that wouldn’t be interpreted together.

The multiple comparisons problem is real, but overcorrection is also a problem. Match your correction method to your research context, error tolerance, and the number of tests you’re running.
