How to Perform Post-Hoc Tests Using Pingouin in Python
Key Insights
- Pingouin provides three main post-hoc functions: pairwise_tukey() for balanced designs, pairwise_tests() for flexible correction methods, and pairwise_gameshowell() for unequal variances. Choose based on your data characteristics.
- Always run ANOVA first to confirm a significant overall effect before conducting post-hoc comparisons; performing multiple comparisons without this step inflates your Type I error rate.
- The pairwise_tests() function is the most versatile option, supporting Bonferroni, Holm, Benjamini-Hochberg FDR, and other correction methods through a single padjust parameter.
Why Post-Hoc Tests Matter
When you run an ANOVA and get a significant result, you know that at least one group differs from the others. But which ones? Running multiple t-tests between all pairs seems intuitive, but it’s statistically dangerous. With three groups, you’d run three comparisons. With five groups, you’d run ten. Each test at α = 0.05 compounds your false positive risk.
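That compounding is easy to quantify: with k independent tests each at α = 0.05, the probability of at least one false positive is 1 − (1 − α)^k. A quick check:

```python
# Family-wise error rate for k independent tests at alpha = 0.05:
# P(at least one false positive) = 1 - (1 - alpha)^k
alpha = 0.05
for k in (3, 10):  # 3 groups -> 3 pairs; 5 groups -> 10 pairs
    print(f"{k} comparisons: {1 - (1 - alpha) ** k:.3f}")
```

Three comparisons already push the family-wise error rate to roughly 14%; ten push it past 40%.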
Post-hoc tests solve this by adjusting for multiple comparisons while maintaining statistical power. Pingouin makes these tests accessible with a clean, pandas-friendly API that produces publication-ready output. Unlike scipy’s fragmented approach or statsmodels’ verbose syntax, Pingouin gives you everything in one DataFrame.
Setting Up Your Environment
Install Pingouin via pip. It pulls in pandas and scipy as dependencies, so you’ll have everything you need.
pip install pingouin
Now let’s set up a working example. We’ll simulate a dataset comparing the effectiveness of three different study techniques on exam scores.
import pingouin as pg
import pandas as pd
import numpy as np
# Set seed for reproducibility
np.random.seed(42)
# Simulate exam scores for three study methods
n_per_group = 30
data = pd.DataFrame({
'score': np.concatenate([
np.random.normal(72, 10, n_per_group), # Method A: baseline
np.random.normal(78, 12, n_per_group), # Method B: moderate improvement
np.random.normal(85, 9, n_per_group) # Method C: strong improvement
]),
'method': ['A'] * n_per_group + ['B'] * n_per_group + ['C'] * n_per_group
})
print(data.groupby('method')['score'].describe().round(2))
count mean std min 25% 50% 75% max
method
A 30.0 71.68 9.23 52.34 65.51 71.98 78.60 91.28
B 30.0 78.43 11.89 54.45 70.47 78.04 86.43 99.97
C 30.0 85.23 9.67 64.77 78.91 85.32 91.95 102.71
The means suggest differences exist, but we need statistical confirmation.
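ANOVA and Tukey also assume roughly normal data within each group, so a quick per-group check is worth a couple of lines first. Here is a sketch using scipy's Shapiro-Wilk test; Pingouin wraps the same check as pg.normality():

```python
import numpy as np
from scipy import stats

# Regenerate the three simulated groups (same seed and draw order as above)
np.random.seed(42)
scores = {
    'A': np.random.normal(72, 10, 30),
    'B': np.random.normal(78, 12, 30),
    'C': np.random.normal(85, 9, 30),
}

# Shapiro-Wilk test per group; p > 0.05 means no evidence against normality
for method, values in scores.items():
    w, p = stats.shapiro(values)
    print(f"Method {method}: W = {w:.3f}, p = {p:.3f}")
```

With normally simulated data like this, all three groups should pass comfortably.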
Running ANOVA First
Before jumping to post-hoc tests, confirm that group differences exist. Pingouin’s anova() function returns a clean DataFrame with effect sizes included by default.
# One-way ANOVA
aov = pg.anova(data=data, dv='score', between='method', detailed=True)
print(aov)
Source SS DF MS F p-unc np2
0 method 2789.56 2 1394.78 12.847 0.000012 0.228
1 Within 9445.67 87 108.57 NaN NaN NaN
The F-statistic of 12.85 with p < 0.001 tells us the groups differ significantly. The partial eta-squared (np2 = 0.228) indicates a large effect size. Now we can proceed to find out which specific groups differ.
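The effect size falls straight out of the table: partial eta-squared is SS for the effect divided by the sum of effect and within-group SS.

```python
# Partial eta-squared from the ANOVA table above
ss_method, ss_within = 2789.56, 9445.67
np2 = ss_method / (ss_method + ss_within)
print(round(np2, 3))  # 0.228
```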
Pairwise Post-Hoc Tests with pairwise_tukey()
Tukey’s Honestly Significant Difference (HSD) test is the standard choice when you have equal (or nearly equal) sample sizes and want to compare all pairs. It controls the family-wise error rate while maintaining good statistical power.
# Tukey's HSD test
tukey = pg.pairwise_tukey(data=data, dv='score', between='method')
print(tukey)
A B mean(A) mean(B) diff se T p-tukey
0 A B 71.68 78.43 -6.75 2.69 -2.51 0.036
1 A C 71.68 85.23 -13.55 2.69 -5.04 0.000
2 B C 78.43 85.23 -6.80 2.69 -2.53 0.034
The output tells a clear story. All three pairwise comparisons are significant at α = 0.05. Method C outperforms both A and B, and Method B outperforms A. The diff column shows the raw mean differences, while se gives the standard error of each difference.
Tukey’s test is conservative but appropriate here because our sample sizes are equal. The adjusted p-values account for the three simultaneous comparisons we’re making.
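As a sanity check on magnitude, you can compute a standardized effect size for any row by hand: divide the mean difference by the pooled standard deviation, which is the square root of MS within from the ANOVA table. (Hedges' g adds a small-sample correction on top of this.)

```python
import math

# Cohen's d for the A vs B row: mean difference / pooled SD (sqrt of MS within)
diff = -6.75          # mean(A) - mean(B) from the Tukey table
ms_within = 108.57    # from the ANOVA table earlier
d = diff / math.sqrt(ms_within)
print(round(d, 2))  # -0.65
```

A value around 0.65 is a medium-to-large effect, consistent with the Hedges' g values Pingouin reports below.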
Flexible Post-Hoc Testing with pairwise_tests()
When you need more control over the correction method, pairwise_tests() is your tool. It supports parametric and non-parametric tests with various p-value adjustment strategies.
# Pairwise t-tests with Bonferroni correction
bonf = pg.pairwise_tests(data=data, dv='score', between='method',
padjust='bonf', effsize='hedges')
print(bonf[['A', 'B', 'T', 'p-unc', 'p-corr', 'hedges']])
A B T p-unc p-corr hedges
0 A B -2.51 0.0146 0.0438 -0.636
1 A C -5.61 0.0000 0.0000 -1.432
2 B C -2.49 0.0156 0.0467 -0.632
Notice the difference between p-unc (uncorrected) and p-corr (Bonferroni-corrected). Bonferroni multiplies each p-value by the number of comparisons (3), making it more conservative than Tukey.
Let’s compare different correction methods:
# Compare correction methods
corrections = ['none', 'bonf', 'holm', 'fdr_bh']
for method in corrections:
result = pg.pairwise_tests(data=data, dv='score', between='method',
padjust=method)
p_col = 'p-corr' if method != 'none' else 'p-unc'
print(f"\n{method.upper()}:")
print(result[['A', 'B', p_col]].to_string(index=False))
NONE:
A B p-unc
A B 0.0146
A C 0.0000
B C 0.0156
BONF:
A B p-corr
A B 0.0438
A C 0.0000
B C 0.0467
HOLM:
A B p-corr
A B 0.0292
A C 0.0000
B C 0.0292
FDR_BH:
A B p-corr
A B 0.0156
A C 0.0000
B C 0.0156
Bonferroni is the most conservative, Holm is slightly less so while still controlling family-wise error, and FDR (Benjamini-Hochberg) controls the false discovery rate rather than family-wise error. For exploratory analysis, FDR is often appropriate. For confirmatory studies, stick with Bonferroni or Holm.
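To demystify what padjust is doing, here is a minimal numpy sketch of the Holm step-down adjustment applied to the uncorrected p-values above. This is an illustration of the algorithm, not Pingouin's actual internals:

```python
import numpy as np

def holm_adjust(pvals):
    """Holm step-down: scale the i-th smallest p by (m - i), enforce monotonicity."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * p[idx])
        adjusted[idx] = min(running_max, 1.0)
    return adjusted

# Uncorrected p-values for A-B, A-C, B-C from the table above;
# A-B and B-C both adjust to 0.0292
print(holm_adjust([0.0146, 0.00001, 0.0156]).round(4))
```

The smallest p-value is multiplied by 3, the next by 2, the largest by 1, with each adjusted value forced to be at least as large as the one before it.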
The pairwise_tests() function also supports within-subject designs, mixed designs, and non-parametric alternatives:
# Non-parametric version using Mann-Whitney U
nonparam = pg.pairwise_tests(data=data, dv='score', between='method',
parametric=False, padjust='holm')
print(nonparam[['A', 'B', 'U-val', 'p-corr']])
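One caveat: if you go non-parametric at the pairwise stage, the omnibus test should match. The non-parametric counterpart of one-way ANOVA is the Kruskal-Wallis H-test, sketched here with scipy; Pingouin exposes the same test as pg.kruskal():

```python
import numpy as np
from scipy import stats

# Same simulated groups as above (seed 42, same draw order)
np.random.seed(42)
a = np.random.normal(72, 10, 30)
b = np.random.normal(78, 12, 30)
c = np.random.normal(85, 9, 30)

# Kruskal-Wallis H-test: rank-based omnibus test across all three groups
h, p = stats.kruskal(a, b, c)
print(f"H = {h:.2f}, p = {p:.5f}")
```

With groups this well separated, the H-test is strongly significant, clearing the way for the pairwise Mann-Whitney comparisons.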
Handling Unequal Variances: pairwise_gameshowell()
When your groups have unequal variances or substantially different sample sizes, Tukey’s assumptions break down. Games-Howell doesn’t assume equal variances and handles unbalanced designs gracefully.
Let’s create a scenario where this matters:
# Create unbalanced data with unequal variances
np.random.seed(42)
unbalanced = pd.DataFrame({
'score': np.concatenate([
np.random.normal(72, 8, 20), # Small variance, n=20
np.random.normal(78, 18, 40), # Large variance, n=40
np.random.normal(85, 10, 25) # Medium variance, n=25
]),
'method': ['A'] * 20 + ['B'] * 40 + ['C'] * 25
})
# Check variance heterogeneity with Levene's test
print("Levene's test for homogeneity of variances:")
print(pg.homoscedasticity(unbalanced, dv='score', group='method'))
Levene's test for homogeneity of variances:
W pval equal_var
levene 6.521 0.00225 False
With significant variance heterogeneity (p = 0.002), Games-Howell is the appropriate choice:
# Games-Howell test
gh = pg.pairwise_gameshowell(data=unbalanced, dv='score', between='method')
print(gh)
A B mean(A) mean(B) diff se T df pval hedges
0 A B 71.16 78.01 -6.85 3.12 -2.20 48.3 0.0822 -0.465
1 A C 71.16 85.03 -13.87 2.60 -5.33 41.8 0.0000 -1.502
2 B C 78.01 85.03 -7.02 3.10 -2.26 62.8 0.0676 -0.467
Compare this to Tukey on the same data:
# Tukey (inappropriate for this data, but instructive)
tukey_unbal = pg.pairwise_tukey(data=unbalanced, dv='score', between='method')
print(tukey_unbal[['A', 'B', 'diff', 'p-tukey']])
A B diff p-tukey
0 A B -6.85 0.053
1 A C -13.87 0.000
2 B C -7.02 0.043
The p-values differ because Games-Howell accounts for the variance heterogeneity. In this case, Games-Howell is more conservative for the A-B and B-C comparisons, which is appropriate given the violated assumptions.
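The df column in the Games-Howell output varies per pair because the test computes Welch-Satterthwaite degrees of freedom from each pair's own variances rather than a single pooled estimate. Here is a sketch of that formula, plugged in with the simulation's population SDs; the table's df are computed from the sample SDs, so the values only roughly track each other:

```python
def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite degrees of freedom, computed per pair by Games-Howell."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# A vs B with the simulation's population SDs (8 vs 18) and sizes (20 vs 40)
print(round(welch_df(8, 20, 18, 40), 1))  # ~57.5
```

When the two variances are equal the formula reduces toward the pooled df; the more unequal they are, the more the df shrinks toward the smaller, noisier group.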
Choosing the Right Test
Your decision tree should look like this:
- Equal sample sizes, equal variances: Use pairwise_tukey(). It's the gold standard for balanced ANOVA designs.
- Need a specific correction method: Use pairwise_tests() with padjust set to your preferred method: Bonferroni for strict control, Holm for slightly more power, FDR for exploratory work.
- Unequal variances or sample sizes: Use pairwise_gameshowell(). It's robust to heteroscedasticity and doesn't assume balanced designs.
- Non-normal data: Use pairwise_tests(parametric=False) to run Mann-Whitney U tests with appropriate corrections.
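For quick reference, the core of that decision tree fits in a small helper. This is illustrative only: the returned strings just name the Pingouin function to reach for, and the "specific correction method" branch is left to padjust on pairwise_tests().

```python
def choose_posthoc(normal: bool = True, equal_var: bool = True,
                   balanced: bool = True) -> str:
    """Map data characteristics to the Pingouin post-hoc function to use."""
    if not normal:
        return "pairwise_tests(parametric=False)"
    if not equal_var or not balanced:
        return "pairwise_gameshowell()"
    return "pairwise_tukey()"

print(choose_posthoc())                  # pairwise_tukey()
print(choose_posthoc(equal_var=False))   # pairwise_gameshowell()
print(choose_posthoc(normal=False))      # pairwise_tests(parametric=False)
```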
Pingouin’s documentation covers additional scenarios, including repeated-measures designs (pairwise_tests() with the within parameter) and mixed ANOVA follow-ups. The library also provides pairwise_corr() for multiple correlation comparisons if that’s your use case.
The practical advantage of Pingouin over manual scipy implementations is consistency. Every function returns a pandas DataFrame with standardized column names, effect sizes included by default, and confidence intervals where applicable. This means less post-processing and more time interpreting results.