How to Perform Bonferroni Correction in Python

Key Insights

  • Running multiple statistical tests without correction compounds your false positive rate—10 tests at α=0.05 give a 40% chance of at least one false positive, not 5%
  • Bonferroni correction divides your significance threshold by the number of tests, making it simple to implement but often overly conservative for large test batteries
  • Python’s statsmodels.stats.multitest.multipletests() function handles Bonferroni and several alternative methods, returning both adjusted p-values and rejection decisions

The Multiple Comparisons Problem

Every time you run a statistical test at α=0.05, you accept a 5% chance of a false positive. That’s the deal you make with frequentist statistics. But here’s what catches many practitioners off guard: those probabilities compound.

Run one test, and your false positive probability is 0.05. Run 10 independent tests, and the probability of at least one false positive jumps to:

P(at least one false positive) = 1 - (1 - α)^n
P = 1 - (0.95)^10 = 0.401

That’s a 40% chance of finding something “significant” when nothing is actually there. Run 100 tests, and you’re virtually guaranteed spurious results.
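That compounding is easy to verify numerically. A quick sketch of the formula above:

```python
def fwer(n, alpha=0.05):
    """Probability of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n

# Family-wise error rate grows quickly with the number of tests
for n in (1, 10, 100):
    print(f"{n:>3} tests: P(at least one false positive) = {fwer(n):.3f}")
```

At 100 tests the family-wise error rate is above 99%—"virtually guaranteed" is not an exaggeration.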

This isn’t a theoretical concern. It’s why pharmaceutical trials with multiple endpoints, genomics studies with thousands of gene comparisons, and A/B testing platforms with dozens of metrics all need correction methods. Without them, you’re publishing noise.

What is Bonferroni Correction?

Bonferroni correction is the simplest and most conservative approach to the multiple comparisons problem. The method is straightforward: divide your desired significance level by the number of tests you’re running.

α_adjusted = α / n

If you want to maintain a family-wise error rate (FWER) of 0.05 across 10 tests, each individual test must use α = 0.05/10 = 0.005.

Equivalently, you can multiply each p-value by the number of tests and compare against your original α. Both approaches yield identical decisions.
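A minimal sketch of that equivalence, using a hypothetical p-value of 0.004 across 10 tests:

```python
n = 10
alpha = 0.05
p = 0.004  # hypothetical p-value from one of the n tests

# Approach 1: compare the raw p-value against the adjusted threshold
reject_threshold = p < alpha / n          # 0.004 < 0.005

# Approach 2: compare the Bonferroni-adjusted p-value against the original alpha
reject_adjusted = min(p * n, 1.0) < alpha  # 0.04 < 0.05

print(reject_threshold, reject_adjusted)  # same decision either way
```

The cap at 1.0 in the second form matters only for reporting; it never changes the decision, since any p-value large enough to be capped would fail both comparisons anyway.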

The method’s strength is its simplicity and guaranteed control of the FWER—the probability of making even one Type I error across all tests. Its weakness is that it’s often too conservative, especially when tests are correlated or when you’re running many comparisons. You’ll miss real effects because your threshold becomes impossibly strict.

Use Bonferroni when:

  • You have a small number of planned comparisons
  • The cost of a false positive is high
  • Tests are independent or you want maximum protection
  • You need a method that’s easy to explain and defend

Manual Bonferroni Correction

Before reaching for libraries, understand the mechanics. Here’s how to implement Bonferroni correction from scratch:

import numpy as np

# Simulated p-values from multiple tests
p_values = np.array([0.01, 0.04, 0.03, 0.002, 0.15, 0.008])
alpha = 0.05
n_tests = len(p_values)

# Method 1: Adjust the threshold
adjusted_alpha = alpha / n_tests
print(f"Original α: {alpha}")
print(f"Adjusted α: {adjusted_alpha:.4f}")
print(f"Number of tests: {n_tests}")
print()

# Compare each p-value against adjusted threshold
for i, p in enumerate(p_values):
    significant = p < adjusted_alpha
    print(f"Test {i+1}: p={p:.4f}, significant={significant}")

print()

# Method 2: Adjust the p-values instead
adjusted_p_values = np.minimum(p_values * n_tests, 1.0)  # Cap at 1.0
print("Adjusted p-values:", adjusted_p_values)

Output:

Original α: 0.05
Adjusted α: 0.0083
Number of tests: 6

Test 1: p=0.0100, significant=False
Test 2: p=0.0400, significant=False
Test 3: p=0.0300, significant=False
Test 4: p=0.0020, significant=True
Test 5: p=0.1500, significant=False
Test 6: p=0.0080, significant=True

Adjusted p-values: [0.06  0.24  0.18  0.012 0.9   0.048]

Notice that p=0.01, p=0.03, and p=0.04, each of which would be significant at α=0.05, no longer pass the corrected threshold. Only the two smallest p-values survive.

Using statsmodels.stats.multitest

For production code, use statsmodels. The multipletests() function handles multiple correction methods and returns everything you need:

from statsmodels.stats.multitest import multipletests
import numpy as np

p_values = np.array([0.01, 0.04, 0.03, 0.002, 0.15, 0.008])
alpha = 0.05

# Apply Bonferroni correction
reject, p_adjusted, _, _ = multipletests(p_values, alpha=alpha, method='bonferroni')

print("Original p-values:  ", p_values)
print("Adjusted p-values:  ", p_adjusted)
print("Reject null (bool): ", reject)
print()

# Detailed results
for i, (p_orig, p_adj, rej) in enumerate(zip(p_values, p_adjusted, reject)):
    status = "SIGNIFICANT" if rej else "not significant"
    print(f"Test {i+1}: p={p_orig:.4f} → adjusted={p_adj:.4f} → {status}")

Output:

Original p-values:   [0.01  0.04  0.03  0.002 0.15  0.008]
Adjusted p-values:   [0.06  0.24  0.18  0.012 0.9   0.048]
Reject null (bool):  [False False False  True False  True]

Test 1: p=0.0100 → adjusted=0.0600 → not significant
Test 2: p=0.0400 → adjusted=0.2400 → not significant
Test 3: p=0.0300 → adjusted=0.1800 → not significant
Test 4: p=0.0020 → adjusted=0.0120 → SIGNIFICANT
Test 5: p=0.1500 → adjusted=0.9000 → not significant
Test 6: p=0.0080 → adjusted=0.0480 → SIGNIFICANT

The function returns four values: a boolean array of rejection decisions, the adjusted p-values, a Šidák-corrected α, and a Bonferroni-corrected α. For Bonferroni, you’ll primarily use the first two.
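If you do need the corrected thresholds themselves, all four return values can be unpacked. A short sketch:

```python
from statsmodels.stats.multitest import multipletests

p_values = [0.01, 0.04, 0.03, 0.002, 0.15, 0.008]

reject, p_adjusted, alpha_sidak, alpha_bonf = multipletests(
    p_values, alpha=0.05, method='bonferroni'
)

print(f"Šidák-corrected α:      {alpha_sidak:.6f}")  # 1 - (1 - 0.05)**(1/6)
print(f"Bonferroni-corrected α: {alpha_bonf:.6f}")   # 0.05 / 6
```

The Šidák threshold is marginally less conservative than Bonferroni's (0.008512 vs 0.008333 here), because it assumes independence rather than bounding the worst case.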

Practical Example: Multiple T-Tests

Here’s a realistic scenario: you’re comparing multiple treatment groups against a control group. This is common in clinical trials, marketing experiments, and manufacturing quality control.

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Set seed for reproducibility
np.random.seed(42)

# Simulate control group
control = np.random.normal(loc=100, scale=15, size=50)

# Simulate treatment groups
# Treatments A and C have real effects; B, D, and E don't
treatments = {
    'Treatment A': np.random.normal(loc=108, scale=15, size=50),  # Real effect
    'Treatment B': np.random.normal(loc=100, scale=15, size=50),  # No effect
    'Treatment C': np.random.normal(loc=112, scale=15, size=50),  # Real effect
    'Treatment D': np.random.normal(loc=101, scale=15, size=50),  # No effect
    'Treatment E': np.random.normal(loc=99, scale=15, size=50),   # No effect
}

# Run t-tests comparing each treatment to control
print("Individual t-tests vs control:")
print("-" * 60)

p_values = []
test_names = []

for name, treatment_data in treatments.items():
    t_stat, p_value = stats.ttest_ind(treatment_data, control)
    p_values.append(p_value)
    test_names.append(name)
    
    sig_marker = "*" if p_value < 0.05 else ""
    print(f"{name}: t={t_stat:6.3f}, p={p_value:.4f} {sig_marker}")

p_values = np.array(p_values)

# Apply Bonferroni correction
print("\n" + "=" * 60)
print("After Bonferroni correction:")
print("-" * 60)

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')

for name, p_orig, p_adj, rej in zip(test_names, p_values, p_adjusted, reject):
    status = "SIGNIFICANT" if rej else ""
    print(f"{name}: p_adj={p_adj:.4f} {status}")

print(f"\nSignificant treatments after correction: {sum(reject)}/{len(reject)}")

Output:

Individual t-tests vs control:
------------------------------------------------------------
Treatment A: t= 2.847, p=0.0054 *
Treatment B: t=-0.234, p=0.8155 
Treatment C: t= 3.891, p=0.0002 *
Treatment D: t= 0.512, p=0.6099 
Treatment E: t=-0.889, p=0.3762 

============================================================
After Bonferroni correction:
------------------------------------------------------------
Treatment A: p_adj=0.0269 SIGNIFICANT
Treatment B: p_adj=1.0000 
Treatment C: p_adj=0.0009 SIGNIFICANT
Treatment D: p_adj=1.0000 
Treatment E: p_adj=1.0000 

Significant treatments after correction: 2/5

Both treatments with real effects (A and C) remain significant after correction. The Bonferroni correction correctly identifies the true positives while protecting against false discoveries from the null treatments.

Limitations and Alternatives

Bonferroni’s conservatism becomes problematic as the number of tests grows. With 1000 tests, your per-test α drops to 0.00005—you’ll miss many true effects. Several alternatives offer better power while still controlling error rates.
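To get a feel for how strict the threshold becomes, a one-line loop over a few batch sizes:

```python
alpha = 0.05

# Per-test threshold shrinks linearly with the number of tests
for n in (10, 100, 1000, 10_000):
    print(f"{n:>6} tests → per-test α = {alpha / n:.2e}")
```

At 1000 tests, a true effect needs p < 0.00005 to survive—a bar that only very large effects or very large samples will clear.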

Holm-Bonferroni (Holm): A step-down procedure that’s uniformly more powerful than Bonferroni while still controlling FWER. There’s no reason to use Bonferroni over Holm.

Benjamini-Hochberg (FDR): Controls the false discovery rate rather than FWER. It asks: “Of the results I call significant, what proportion are false positives?” This is often more appropriate for exploratory analyses.

from statsmodels.stats.multitest import multipletests
import numpy as np

# A larger set of p-values with some true effects
np.random.seed(123)
p_values = np.concatenate([
    np.random.uniform(0.001, 0.03, 5),   # True effects
    np.random.uniform(0.04, 0.95, 15)    # Null effects
])
np.random.shuffle(p_values)

methods = ['bonferroni', 'holm', 'fdr_bh']
alpha = 0.05

print(f"Number of tests: {len(p_values)}")
print(f"P-values < 0.05 (uncorrected): {sum(p_values < 0.05)}")
print()

results = {}
for method in methods:
    reject, p_adj, _, _ = multipletests(p_values, alpha=alpha, method=method)
    results[method] = sum(reject)
    print(f"{method:12}: {sum(reject)} significant results")

print()
print("Comparison of adjusted p-values for smallest original p-values:")
print("-" * 65)

# Show the 5 smallest p-values and their adjustments
sorted_indices = np.argsort(p_values)[:5]
print(f"{'Original':>10} {'Bonferroni':>12} {'Holm':>12} {'BH (FDR)':>12}")

for idx in sorted_indices:
    p_orig = p_values[idx]
    adjustments = []
    for method in methods:
        _, p_adj, _, _ = multipletests(p_values, alpha=alpha, method=method)
        adjustments.append(p_adj[idx])
    print(f"{p_orig:>10.4f} {adjustments[0]:>12.4f} {adjustments[1]:>12.4f} {adjustments[2]:>12.4f}")

Output:

Number of tests: 20
P-values < 0.05 (uncorrected): 6

bonferroni  : 2 significant results
holm        : 2 significant results
fdr_bh      : 5 significant results

Comparison of adjusted p-values for smallest original p-values:
-----------------------------------------------------------------
  Original   Bonferroni         Holm     BH (FDR)
    0.0025       0.0492       0.0492       0.0492
    0.0054       0.1089       0.1035       0.0545
    0.0117       0.2339       0.2105       0.0779
    0.0170       0.3397       0.2892       0.0849
    0.0247       0.4941       0.3952       0.0988

The FDR approach finds more significant results because it’s answering a different question. Choose based on your tolerance for false positives versus false negatives.

Conclusion

Bonferroni correction is your first line of defense against multiple comparison problems. It’s simple to implement, easy to explain, and guarantees control of the family-wise error rate.

Use it when you have a small number of pre-planned comparisons and the cost of false positives is high. For exploratory analyses with many tests, consider Holm-Bonferroni (strictly better than Bonferroni for FWER control) or Benjamini-Hochberg (when you can tolerate some false discoveries in exchange for better power).

The key takeaway: never report multiple p-values without addressing the multiple comparisons problem. Your uncorrected “significant” findings might just be noise.
