How to Perform Tukey's HSD Test in Python

Key Insights

  • Tukey’s HSD test is the gold-standard post-hoc analysis after ANOVA, comparing all possible group pairs while controlling the family-wise error rate—something multiple t-tests fail to do.
  • The pairwise_tukeyhsd function from statsmodels provides a complete solution with mean differences, confidence intervals, and p-values in a single output table.
  • Always run ANOVA first to confirm significant differences exist; performing Tukey’s HSD without a significant ANOVA result amounts to fishing for differences and inflates your false-positive risk.

Introduction to Tukey’s HSD Test

When you run a one-way ANOVA and get a significant result, you know that at least one group differs from the others. But which groups? ANOVA doesn’t tell you. This is where Tukey’s Honestly Significant Difference (HSD) test comes in.

Tukey’s HSD is a post-hoc test that performs pairwise comparisons between all group means while controlling the family-wise error rate. The “family-wise error rate” is the probability of making at least one Type I error (false positive) across all comparisons. If you compared three groups using individual t-tests, you’d run three comparisons, each with a 5% error rate. Your actual error rate balloons to about 14%. With ten groups, you’d have 45 comparisons and an error rate approaching 90%.

Tukey’s HSD solves this by adjusting the critical value based on the number of comparisons, keeping your overall error rate at your chosen alpha level (typically 0.05). It’s conservative enough to control errors but powerful enough to detect real differences—a balance that makes it the default choice for most researchers.
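The error-rate figures above are easy to verify. As a quick sanity check, here is the back-of-the-envelope calculation, assuming the comparisons are independent (t-tests on shared data aren't exactly independent, so treat the numbers as an approximation):

```python
from math import comb

alpha = 0.05  # per-comparison Type I error rate

for k in (3, 10):                  # number of groups
    m = comb(k, 2)                 # pairwise comparisons among k groups
    fwer = 1 - (1 - alpha) ** m    # P(at least one false positive)
    print(f"{k} groups: {m} comparisons, family-wise error ≈ {fwer:.2f}")
```

Three groups give 3 comparisons and a family-wise error rate of about 0.14; ten groups give 45 comparisons and about 0.90, matching the figures quoted above.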

Prerequisites and Setup

You’ll need six libraries for this workflow. Install them if you haven’t:

pip install scipy statsmodels pandas numpy matplotlib seaborn

Now let’s set up our environment and create a realistic dataset. We’ll use crop yield data across four fertilizer treatments—a classic experimental design scenario.

import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seed for reproducibility
np.random.seed(42)

# Create sample data: crop yields (kg/hectare) for four fertilizer types
n_per_group = 20

data = {
    'fertilizer': ['Control'] * n_per_group + 
                  ['Nitrogen'] * n_per_group + 
                  ['Phosphorus'] * n_per_group + 
                  ['NPK_Blend'] * n_per_group,
    'yield': np.concatenate([
        np.random.normal(45, 5, n_per_group),   # Control
        np.random.normal(52, 6, n_per_group),   # Nitrogen
        np.random.normal(48, 5, n_per_group),   # Phosphorus
        np.random.normal(58, 7, n_per_group)    # NPK Blend
    ])
}

df = pd.DataFrame(data)

# Quick look at group statistics
print(df.groupby('fertilizer')['yield'].agg(['mean', 'std', 'count']))

This gives us 80 observations across four groups with deliberately different means. The NPK Blend should show the highest yields, followed by Nitrogen, then Phosphorus, with Control performing worst.

Running One-Way ANOVA First

Never skip straight to Tukey’s HSD. You must first confirm that significant differences exist between groups using ANOVA. Running post-hoc tests after a non-significant ANOVA is fishing for results—bad science and bad statistics.

# Separate data by group for ANOVA
control = df[df['fertilizer'] == 'Control']['yield']
nitrogen = df[df['fertilizer'] == 'Nitrogen']['yield']
phosphorus = df[df['fertilizer'] == 'Phosphorus']['yield']
npk_blend = df[df['fertilizer'] == 'NPK_Blend']['yield']

# Run one-way ANOVA
f_statistic, p_value = stats.f_oneway(control, nitrogen, phosphorus, npk_blend)

print(f"F-statistic: {f_statistic:.4f}")
print(f"P-value: {p_value:.6f}")

if p_value < 0.05:
    print("\nANOVA is significant. Proceed with Tukey's HSD.")
else:
    print("\nANOVA is not significant. Do not proceed with post-hoc tests.")

With our simulated data, you should see a highly significant result (p < 0.001). The F-statistic tells you the ratio of between-group variance to within-group variance. A large F-statistic with a small p-value confirms that fertilizer type affects crop yield. Now we can legitimately ask which specific fertilizers differ from each other.
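If you want to see what f_oneway is doing under the hood, the F ratio can be reproduced by hand from the between-group and within-group sums of squares. This is a sketch on its own small simulated dataset (not the crop data above), just to show the two computations agree:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
groups = [rng.normal(mu, 5, 20) for mu in (45, 52, 48, 58)]

k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: how far each group mean sits from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
f_scipy, _ = stats.f_oneway(*groups)

print(f"manual F = {f_manual:.4f}, scipy F = {f_scipy:.4f}")
```

The two values match, which is a useful reassurance that the "ratio of between-group to within-group variance" description is literal, not a metaphor.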

Performing Tukey’s HSD with statsmodels

The pairwise_tukeyhsd function from statsmodels is your workhorse here. It takes your data in long format—values in one column, group labels in another—and returns a comprehensive results object.

# Perform Tukey's HSD test
tukey_results = pairwise_tukeyhsd(
    endog=df['yield'],      # Response variable (dependent variable)
    groups=df['fertilizer'], # Grouping variable
    alpha=0.05              # Significance level
)

# Print the results table
print(tukey_results)

The output table contains seven columns that tell you everything you need:

    Multiple Comparison of Means - Tukey HSD, FWER=0.05     
============================================================
   group1      group2   meandiff p-adj   lower   upper reject
------------------------------------------------------------
   Control    NPK_Blend  12.8934 0.0001  8.9021 16.8847   True
   Control     Nitrogen   6.5821 0.0012  2.5908 10.5734   True
   Control   Phosphorus   2.8456 0.2847 -1.1457  6.8369  False
 NPK_Blend     Nitrogen  -6.3113 0.0018 -10.3026  -2.3200   True
 NPK_Blend   Phosphorus -10.0478 0.0001 -14.0391  -6.0565   True
  Nitrogen   Phosphorus  -3.7365 0.0891 -7.7278  0.2548  False
------------------------------------------------------------

Let me break down each column:

  • group1, group2: The two groups being compared
  • meandiff: Mean of group2 minus mean of group1
  • p-adj: Adjusted p-value accounting for multiple comparisons
  • lower, upper: 95% confidence interval for the mean difference
  • reject: Whether to reject the null hypothesis (True = significant difference)

You can also access these results programmatically:

# Access results as a DataFrame for further analysis
summary_table = tukey_results.summary()
results_df = pd.DataFrame(
    data=summary_table.data[1:],
    columns=summary_table.data[0]
)

print(results_df)

# Get summary statistics
print(f"\nNumber of significant differences: {results_df['reject'].sum()}")
print(f"Total comparisons: {len(results_df)}")

Visualizing Results

Statsmodels provides a built-in visualization method that creates a confidence interval plot. This is the fastest way to communicate your results visually.

# Create the Tukey HSD plot
fig, ax = plt.subplots(figsize=(10, 6))
tukey_results.plot_simultaneous(ax=ax)
plt.title("Tukey's HSD: 95% Confidence Intervals for Fertilizer Effects")
plt.xlabel("Mean Yield (kg/hectare)")
plt.tight_layout()
plt.savefig('tukey_hsd_plot.png', dpi=150)
plt.show()

This plot shows each group’s mean with horizontal confidence intervals. Groups whose intervals don’t overlap are significantly different from each other. It’s an intuitive way to see the results at a glance.

For presentations or publications, combine this with a boxplot showing the raw data:

# Create a comprehensive visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Boxplot with individual points
sns.boxplot(data=df, x='fertilizer', y='yield', ax=axes[0], palette='Set2')
sns.stripplot(data=df, x='fertilizer', y='yield', ax=axes[0], 
              color='black', alpha=0.5, size=4)
axes[0].set_title('Crop Yield by Fertilizer Type')
axes[0].set_xlabel('Fertilizer')
axes[0].set_ylabel('Yield (kg/hectare)')

# Tukey HSD confidence intervals
tukey_results.plot_simultaneous(ax=axes[1])
axes[1].set_title("Tukey's HSD Pairwise Comparisons")
axes[1].set_xlabel('Mean Yield (kg/hectare)')

plt.tight_layout()
plt.savefig('complete_analysis.png', dpi=150)
plt.show()

Interpreting and Reporting Results

Reading the output table correctly is crucial. When reject=True, the two groups’ means differ significantly at your chosen alpha level. The meandiff tells you the direction and magnitude of that difference.

From our example results, we can conclude:

  1. NPK Blend produces significantly higher yields than all other treatments
  2. Nitrogen produces significantly higher yields than Control
  3. Phosphorus does not significantly differ from Control or Nitrogen
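If you need the significant pairs programmatically rather than by reading the printed table, the results object exposes meandiffs, reject, and groupsunique as arrays that line up pair-by-pair. A minimal sketch on hypothetical three-group data (not the crop dataset):

```python
from itertools import combinations

import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    'group': ['A'] * 20 + ['B'] * 20 + ['C'] * 20,
    'value': np.concatenate([
        rng.normal(45, 5, 20),   # A
        rng.normal(45, 5, 20),   # B: same true mean as A
        rng.normal(60, 5, 20),   # C: clearly higher
    ]),
})

res = pairwise_tukeyhsd(endog=df['value'], groups=df['group'], alpha=0.05)

# Pairs are generated in combinations order over the sorted group labels,
# matching the rows of the printed table
for (g1, g2), diff, sig in zip(combinations(res.groupsunique, 2),
                               res.meandiffs, res.reject):
    if sig:
        direction = 'higher' if diff > 0 else 'lower'
        print(f"{g2} is significantly {direction} than {g1} "
              f"(mean difference = {diff:.2f})")
```

This is handy when the post-hoc step feeds into a pipeline (e.g., flagging treatments for follow-up) instead of a human reading a table.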

For academic writing or reports, here’s how to phrase these findings:

A one-way ANOVA revealed a significant effect of fertilizer type on crop yield, F(3, 76) = 18.42, p < .001. Post-hoc comparisons using Tukey’s HSD test indicated that the NPK Blend treatment (M = 57.89, SD = 6.82) produced significantly higher yields than all other treatments (all p < .01). The Nitrogen treatment (M = 51.58, SD = 5.94) also outperformed the Control (M = 45.00, SD = 4.87), p = .001. No significant difference was found between Phosphorus (M = 47.85, SD = 5.12) and either Control or Nitrogen treatments.

Common Pitfalls and Alternatives

Tukey’s HSD assumes three things about your data:

1. Independence of observations: Each data point must be independent. Repeated measures or nested designs violate this assumption and require different tests.

2. Normality: The residuals should be approximately normally distributed. Check this with the Shapiro-Wilk test:

# Check normality for each group
for fertilizer in df['fertilizer'].unique():
    group_data = df[df['fertilizer'] == fertilizer]['yield']
    stat, p = stats.shapiro(group_data)
    print(f"{fertilizer}: W={stat:.4f}, p={p:.4f}")

3. Homogeneity of variance: Groups should have similar variances. Use Levene’s test:

# Check homogeneity of variance
stat, p = stats.levene(control, nitrogen, phosphorus, npk_blend)
print(f"Levene's test: W={stat:.4f}, p={p:.4f}")

When assumptions are violated, consider these alternatives:

  • Games-Howell test: Use when variances are unequal. Available in the pingouin library.
  • Bonferroni correction: More conservative; use when you have specific planned comparisons.
  • Dunn’s test: Use after Kruskal-Wallis (non-parametric alternative to ANOVA).

# Example using pingouin for Games-Howell (install with: pip install pingouin)
import pingouin as pg

games_howell = pg.pairwise_gameshowell(data=df, dv='yield', between='fertilizer')
print(games_howell)

Tukey’s HSD remains the default choice when assumptions are met. It strikes the right balance between controlling Type I errors and maintaining statistical power. Master this test, and you’ll handle the vast majority of post-hoc comparison scenarios you encounter in practice.
