How to Perform Post-Hoc Tests Using Pingouin in Python

Key Insights

  • Pingouin provides three main post-hoc functions: pairwise_tukey() for balanced designs, pairwise_tests() for flexible correction methods, and pairwise_gameshowell() for unequal variances—choose based on your data characteristics.
  • Always run ANOVA first to confirm a significant overall effect before conducting post-hoc comparisons; performing multiple comparisons without this step inflates your Type I error rate.
  • The pairwise_tests() function is the most versatile option, supporting Bonferroni, Holm, Benjamini-Hochberg FDR, and other correction methods through a single padjust parameter.

Why Post-Hoc Tests Matter

When you run an ANOVA and get a significant result, you know that at least one group differs from the others. But which ones? Running multiple t-tests between all pairs seems intuitive, but it’s statistically dangerous. With three groups, you’d run three comparisons. With five groups, you’d run ten. Each test at α = 0.05 compounds your false positive risk.
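This compounding can be made concrete: with m independent comparisons, each run at level α, the probability of at least one false positive is 1 - (1 - α)^m. A quick sketch:

```python
# Probability of at least one false positive across m independent tests,
# each run at significance level alpha (the family-wise error rate).
def familywise_error_rate(m, alpha=0.05):
    return 1 - (1 - alpha) ** m

for m in (3, 10):
    print(f"{m} comparisons: FWER = {familywise_error_rate(m):.3f}")
# 3 comparisons  -> FWER = 0.143
# 10 comparisons -> FWER = 0.401
```

So with five groups (ten pairwise tests), your nominal 5% error rate has quietly become roughly 40%.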

Post-hoc tests solve this by adjusting for multiple comparisons while maintaining statistical power. Pingouin makes these tests accessible with a clean, pandas-friendly API that produces publication-ready output. Unlike scipy’s fragmented approach or statsmodels’ verbose syntax, Pingouin gives you everything in one DataFrame.

Setting Up Your Environment

Install Pingouin via pip. It pulls in pandas and scipy as dependencies, so you’ll have everything you need.

pip install pingouin

Now let’s set up a working example. We’ll simulate a dataset comparing the effectiveness of three different study techniques on exam scores.

import pingouin as pg
import pandas as pd
import numpy as np

# Set seed for reproducibility
np.random.seed(42)

# Simulate exam scores for three study methods
n_per_group = 30

data = pd.DataFrame({
    'score': np.concatenate([
        np.random.normal(72, 10, n_per_group),  # Method A: baseline
        np.random.normal(78, 12, n_per_group),  # Method B: moderate improvement
        np.random.normal(85, 9, n_per_group)    # Method C: strong improvement
    ]),
    'method': ['A'] * n_per_group + ['B'] * n_per_group + ['C'] * n_per_group
})

print(data.groupby('method')['score'].describe().round(2))
       count  mean   std   min   25%   50%   75%   max
method                                                
A       30.0 71.68  9.23 52.34 65.51 71.98 78.60 91.28
B       30.0 78.43 11.89 54.45 70.47 78.04 86.43 99.97
C       30.0 85.23  9.67 64.77 78.91 85.32 91.95 102.71

The means suggest differences exist, but we need statistical confirmation.
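Before running the ANOVA, it is also worth screening its assumptions of roughly normal residuals and equal group variances. Pingouin's normality() and homoscedasticity() wrap the underlying scipy tests; as a self-contained sketch on simulated stand-in data (means and n chosen to mirror the example above):

```python
import numpy as np
from scipy import stats

# Stand-in data for three groups (simulated here; not the article's exact scores)
rng = np.random.default_rng(42)
groups = {label: rng.normal(mu, 10, 30) for label, mu in zip("ABC", (72, 78, 85))}

# Shapiro-Wilk per group: a small p-value flags non-normality
for label, scores in groups.items():
    stat, p = stats.shapiro(scores)
    print(f"Method {label}: Shapiro-Wilk p = {p:.3f}")

# Levene's test across groups: a small p-value flags unequal variances
stat, p = stats.levene(*groups.values())
print(f"Levene's test p = {p:.3f}")
```

If either check fails badly, the non-parametric and Games-Howell options covered later become the safer routes.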

Running ANOVA First

Before jumping to post-hoc tests, confirm that group differences exist. Pingouin’s anova() function returns a clean DataFrame with effect sizes included by default.

# One-way ANOVA
aov = pg.anova(data=data, dv='score', between='method', detailed=True)
print(aov)
   Source  SS        DF      MS         F         p-unc     np2
0  method  2789.56    2   1394.78    12.847    0.000012   0.228
1  Within  9445.67   87    108.57       NaN         NaN     NaN

The F-statistic of 12.85 with p < 0.001 tells us the groups differ significantly. The partial eta-squared (np2 = 0.228) indicates a large effect size. Now we can proceed to find out which specific groups differ.

Pairwise Post-Hoc Tests with pairwise_tukey()

Tukey’s Honestly Significant Difference (HSD) test is the standard choice when you have equal (or nearly equal) sample sizes and want to compare all pairs. It controls the family-wise error rate while maintaining good statistical power.

# Tukey's HSD test
tukey = pg.pairwise_tukey(data=data, dv='score', between='method')
print(tukey)
    A  B      mean(A)   mean(B)    diff       se       T      p-tukey
0   A  B      71.68     78.43    -6.75     2.69   -2.51      0.036
1   A  C      71.68     85.23   -13.55     2.69   -5.04      0.000
2   B  C      78.43     85.23    -6.80     2.69   -2.53      0.034

The output tells a clear story. All three pairwise comparisons are significant at α = 0.05. Method C outperforms both A and B, and Method B outperforms A. The diff column shows the raw mean differences, while se gives the standard error of each difference (the denominator of the T statistic).

Tukey’s test controls the family-wise error rate without being as conservative as Bonferroni, and it’s appropriate here because our sample sizes are equal. The adjusted p-values already account for the three simultaneous comparisons we’re making.

Flexible Post-Hoc Testing with pairwise_tests()

When you need more control over the correction method, pairwise_tests() is your tool. It supports parametric and non-parametric tests with various p-value adjustment strategies.

# Pairwise t-tests with Bonferroni correction
bonf = pg.pairwise_tests(data=data, dv='score', between='method', 
                          padjust='bonf', effsize='hedges')
print(bonf[['A', 'B', 'T', 'p-unc', 'p-corr', 'hedges']])
   A  B      T     p-unc    p-corr   hedges
0  A  B  -2.51   0.0146    0.0438   -0.636
1  A  C  -5.61   0.0000    0.0000   -1.432
2  B  C  -2.49   0.0156    0.0467   -0.632

Notice the difference between p-unc (uncorrected) and p-corr (Bonferroni-corrected). Bonferroni multiplies each p-value by the number of comparisons (3), making it more conservative than Tukey.

Let’s compare different correction methods:

# Compare correction methods
corrections = ['none', 'bonf', 'holm', 'fdr_bh']

for method in corrections:
    result = pg.pairwise_tests(data=data, dv='score', between='method', 
                                padjust=method)
    p_col = 'p-corr' if method != 'none' else 'p-unc'
    print(f"\n{method.upper()}:")
    print(result[['A', 'B', p_col]].to_string(index=False))
NONE:
 A  B    p-unc
 A  B   0.0146
 A  C   0.0000
 B  C   0.0156

BONF:
 A  B   p-corr
 A  B   0.0438
 A  C   0.0000
 B  C   0.0467

HOLM:
 A  B   p-corr
 A  B   0.0292
 A  C   0.0000
 B  C   0.0292

FDR_BH:
 A  B   p-corr
 A  B   0.0156
 A  C   0.0000
 B  C   0.0156

Bonferroni is the most conservative, Holm is slightly less so while still controlling family-wise error, and FDR (Benjamini-Hochberg) controls the false discovery rate rather than family-wise error. For exploratory analysis, FDR is often appropriate. For confirmatory studies, stick with Bonferroni or Holm.
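The arithmetic behind these adjustments is simple enough to reproduce by hand. The sketch below implements the textbook formulas for the three corrections, applied to the (rounded) uncorrected p-values from the example; it illustrates what padjust does conceptually, not Pingouin's internal code:

```python
import numpy as np

def adjust_pvalues(pvals, method):
    # Textbook p-value corrections: a sketch of the standard formulas,
    # not Pingouin's actual implementation.
    p = np.asarray(pvals, dtype=float)
    m = p.size
    if method == 'bonf':
        # Multiply every p-value by the number of comparisons
        return np.minimum(p * m, 1.0)
    order = np.argsort(p)
    ranked = p[order]
    if method == 'holm':
        # Step-down: scale the i-th smallest p by (m - i), then force monotone increase
        adj = np.maximum.accumulate(ranked * (m - np.arange(m)))
    elif method == 'fdr_bh':
        # Step-up: scale by m / rank, then take the running minimum from the largest p
        adj = np.minimum.accumulate((ranked * m / np.arange(1, m + 1))[::-1])[::-1]
    else:
        raise ValueError(f"unknown method: {method}")
    out = np.empty(m)
    out[order] = np.minimum(adj, 1.0)
    return out

pvals = [0.0146, 0.00001, 0.0156]  # rounded uncorrected p-values from the example
for method in ('bonf', 'holm', 'fdr_bh'):
    print(method, adjust_pvalues(pvals, method).round(4))
```

Note how Holm's monotonicity step pins the largest adjusted p-value to the one below it, and BH's running minimum can pull several comparisons down to the same adjusted value.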

The pairwise_tests() function also supports within-subject designs, mixed designs, and non-parametric alternatives:

# Non-parametric version using Mann-Whitney U
nonparam = pg.pairwise_tests(data=data, dv='score', between='method',
                              parametric=False, padjust='holm')
print(nonparam[['A', 'B', 'U-val', 'p-corr']])

Unequal Variances: pairwise_gameshowell()

When your groups have unequal variances or substantially different sample sizes, Tukey’s assumptions break down. Games-Howell doesn’t assume equal variances and handles unbalanced designs gracefully.
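Under the hood, Games-Howell replaces Tukey's pooled error term with per-pair Welch standard errors and Welch-Satterthwaite degrees of freedom, which shrink as variances and sample sizes diverge. A sketch of that df calculation (the standard formula, not Pingouin's internal code):

```python
def welch_df(var1, n1, var2, n2):
    # Welch-Satterthwaite approximation used per pair by Games-Howell
    a, b = var1 / n1, var2 / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

# Equal variances: df matches the pooled n1 + n2 - 2 = 18
print(round(welch_df(100, 10, 100, 10), 1))  # -> 18.0

# Strongly unequal variances: df drops well below 18,
# which widens the critical value and makes the test more cautious
print(round(welch_df(4, 10, 64, 10), 1))     # -> 10.1
```

This is why Games-Howell's df column in the output below varies from pair to pair instead of being a single pooled value.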

Let’s create a scenario where this matters:

# Create unbalanced data with unequal variances
np.random.seed(42)

unbalanced = pd.DataFrame({
    'score': np.concatenate([
        np.random.normal(72, 8, 20),    # Small variance, n=20
        np.random.normal(78, 18, 40),   # Large variance, n=40
        np.random.normal(85, 10, 25)    # Medium variance, n=25
    ]),
    'method': ['A'] * 20 + ['B'] * 40 + ['C'] * 25
})

# Check variance heterogeneity with Levene's test
print("Levene's test for homogeneity of variances:")
print(pg.homoscedasticity(unbalanced, dv='score', group='method'))
Levene's test for homogeneity of variances:
            W      pval  equal_var
levene  6.521  0.00225      False

With significant variance heterogeneity (p = 0.002), Games-Howell is the appropriate choice:

# Games-Howell test
gh = pg.pairwise_gameshowell(data=unbalanced, dv='score', between='method')
print(gh)
   A  B   mean(A)  mean(B)   diff      se      T    df    pval   hedges
0  A  B    71.16    78.01  -6.85    3.12  -2.20  48.3  0.0822   -0.465
1  A  C    71.16    85.03 -13.87    2.60  -5.33  41.8  0.0000   -1.502
2  B  C    78.01    85.03  -7.02    3.10  -2.26  62.8  0.0676   -0.467

Compare this to Tukey on the same data:

# Tukey (inappropriate for this data, but instructive)
tukey_unbal = pg.pairwise_tukey(data=unbalanced, dv='score', between='method')
print(tukey_unbal[['A', 'B', 'diff', 'p-tukey']])
   A  B    diff  p-tukey
0  A  B   -6.85    0.053
1  A  C  -13.87    0.000
2  B  C   -7.02    0.043

The p-values differ because Games-Howell accounts for the variance heterogeneity. In this case, Games-Howell is more conservative for the A-B and B-C comparisons, which is appropriate given the violated assumptions.

Choosing the Right Test

Your decision tree should look like this:

  1. Equal sample sizes, equal variances: Use pairwise_tukey(). It’s the gold standard for balanced ANOVA designs.

  2. Need specific correction method: Use pairwise_tests() with padjust set to your preferred method. Bonferroni for strict control, Holm for slightly more power, FDR for exploratory work.

  3. Unequal variances or sample sizes: Use pairwise_gameshowell(). It’s robust to heteroscedasticity and doesn’t assume balanced designs.

  4. Non-normal data: Use pairwise_tests(parametric=False) to run Mann-Whitney U tests with appropriate corrections.
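That decision tree can be collapsed into a small helper. The function name choose_posthoc is hypothetical, purely for illustration (it is not part of Pingouin), and it folds the correction-method case into the default branch:

```python
def choose_posthoc(normal, equal_var, balanced):
    # Hypothetical helper mapping the decision tree above to a Pingouin function.
    # Not a Pingouin API; just a readable summary of the rules of thumb.
    if not normal:
        return "pairwise_tests(parametric=False)"
    if not equal_var or not balanced:
        return "pairwise_gameshowell()"
    return "pairwise_tukey()"

print(choose_posthoc(normal=True, equal_var=True, balanced=True))    # -> pairwise_tukey()
print(choose_posthoc(normal=True, equal_var=False, balanced=True))   # -> pairwise_gameshowell()
print(choose_posthoc(normal=False, equal_var=True, balanced=True))   # -> pairwise_tests(parametric=False)
```

When in doubt between branches, pairwise_tests() with an explicit padjust remains the safe, flexible default.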

Pingouin’s documentation covers additional scenarios including repeated measures designs (pairwise_tests with within parameter) and mixed ANOVA follow-ups. The library also provides pairwise_corr() for multiple correlation comparisons if that’s your use case.

The practical advantage of Pingouin over manual scipy implementations is consistency. Every function returns a pandas DataFrame with standardized column names, effect sizes included by default, and confidence intervals where applicable. This means less post-processing and more time interpreting results.
