How to Perform Tukey's HSD Test in R
When your ANOVA returns a significant p-value, you know that at least one group differs from the others. But which ones? Running multiple t-tests introduces a serious problem: each test carries a 5%...
Key Insights
- Tukey’s HSD test controls the family-wise error rate when comparing all pairs of group means, making it the go-to post-hoc test after a significant one-way ANOVA
- Base R’s
TukeyHSD()function handles most use cases, butagricolae::HSD.test()provides compact letter displays that are essential for publication-ready figures - Always verify ANOVA assumptions (normality, homogeneity of variance) before running Tukey’s test—violations can lead to inflated Type I error rates
Introduction to Tukey’s HSD Test
When your ANOVA returns a significant p-value, you know that at least one group differs from the others. But which ones? Running multiple t-tests introduces a serious problem: each test carries a 5% chance of false positives, and those errors compound quickly. With just 5 groups, you’d run 10 pairwise comparisons, inflating your actual error rate to nearly 40%.
Tukey’s Honestly Significant Difference (HSD) test solves this by controlling the family-wise error rate. It uses the studentized range distribution to calculate a single critical value that applies to all pairwise comparisons simultaneously. The result: you maintain your desired alpha level (typically 0.05) across all comparisons, not per comparison.
Use Tukey’s HSD when you want to compare all possible pairs of group means after a significant one-way ANOVA. If you’re only interested in specific planned comparisons, other methods like Dunnett’s test (comparing treatments to a control) may be more appropriate and more powerful.
Prerequisites and Data Setup
Base R includes everything you need for basic Tukey’s HSD analysis. However, two additional packages extend functionality significantly:
# Base R provides TukeyHSD() - no installation needed
# For compact letter displays and additional features
install.packages("agricolae")
# For generalized linear hypothesis testing
install.packages("multcomp")
# For visualization
install.packages("ggplot2")
# Load packages
library(agricolae)
library(multcomp)
library(ggplot2)
Let’s create a realistic dataset: crop yields (in tons per hectare) from a randomized experiment testing four fertilizer treatments.
# Set seed for reproducibility
set.seed(42)
# Create experimental data
fertilizer_data <- data.frame(
treatment = factor(rep(c("Control", "Nitrogen", "Phosphorus", "NPK"), each = 8)),
yield = c(
rnorm(8, mean = 4.2, sd = 0.5), # Control
rnorm(8, mean = 5.1, sd = 0.6), # Nitrogen
rnorm(8, mean = 4.8, sd = 0.5), # Phosphorus
rnorm(8, mean = 6.3, sd = 0.7) # NPK (combined)
)
)
# Verify structure
str(fertilizer_data)
head(fertilizer_data)
# Quick summary by group
aggregate(yield ~ treatment, data = fertilizer_data,
FUN = function(x) c(mean = mean(x), sd = sd(x)))
Running ANOVA First
Tukey’s HSD is a post-hoc test—it only makes sense after ANOVA indicates significant differences exist. Running it without first confirming ANOVA significance is statistically inappropriate and wastes your time.
# Fit one-way ANOVA model
aov_model <- aov(yield ~ treatment, data = fertilizer_data)
# View ANOVA table
summary(aov_model)
Output:
Df Sum Sq Mean Sq F value Pr(>F)
treatment 3 21.89 7.296 21.77 1.07e-07 ***
Residuals 28 9.38 0.335
Before proceeding, verify ANOVA assumptions:
# Check normality of residuals
shapiro.test(residuals(aov_model))
# Check homogeneity of variance
bartlett.test(yield ~ treatment, data = fertilizer_data)
# Visual diagnostics
par(mfrow = c(2, 2))
plot(aov_model)
par(mfrow = c(1, 1))
If Shapiro-Wilk p > 0.05 and Bartlett’s test p > 0.05, assumptions are reasonably met. With our significant ANOVA result (p < 0.001), we proceed to Tukey’s test.
Performing Tukey’s HSD Test
The TukeyHSD() function takes an aov object and returns pairwise comparisons:
# Run Tukey's HSD test
tukey_result <- TukeyHSD(aov_model, conf.level = 0.95)
# View results
print(tukey_result)
Output:
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = yield ~ treatment, data = fertilizer_data)
$treatment
diff lwr upr p adj
Nitrogen-Control 0.9284937 0.2678498 1.5891376 0.0032767
NPK-Control 2.0855498 1.4249059 2.7461937 0.0000000
Phosphorus-Control 0.5765498 -0.0840941 1.2371937 0.1029505
NPK-Nitrogen 1.1570561 0.4964122 1.8177000 0.0002698
Phosphorus-Nitrogen -0.3519439 -1.0125878 0.3087000 0.4846986
Phosphorus-NPK -1.5090000 -2.1696439 -0.8483561 0.0000042
Understanding the output columns:
- diff: Mean difference between groups (row minus column)
- lwr/upr: Lower and upper bounds of the 95% confidence interval
- p adj: Adjusted p-value controlling family-wise error rate
Interpretation: NPK fertilizer significantly outperforms all other treatments. Nitrogen beats the control, but phosphorus alone doesn’t differ significantly from the control. The confidence intervals that don’t contain zero correspond to significant differences.
Visualizing Results
Base R provides a quick visualization method:
# Built-in plot
plot(tukey_result, las = 1, col = "steelblue")
This creates a confidence interval plot where intervals not crossing zero indicate significant differences. For publication-quality figures, use ggplot2:
# Extract Tukey results into a data frame
tukey_df <- as.data.frame(tukey_result$treatment)
tukey_df$comparison <- rownames(tukey_df)
# Create publication-ready plot
ggplot(tukey_df, aes(x = reorder(comparison, diff), y = diff)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.2, linewidth = 0.8) +
geom_point(aes(color = `p adj` < 0.05), size = 3) +
scale_color_manual(values = c("TRUE" = "firebrick", "FALSE" = "gray40"),
labels = c("Not Significant", "Significant"),
name = "p < 0.05") +
coord_flip() +
labs(
x = "Pairwise Comparison",
y = "Difference in Mean Yield (tons/ha)",
title = "Tukey's HSD: Fertilizer Treatment Comparisons"
) +
theme_minimal(base_size = 12) +
theme(legend.position = "bottom")
Alternative Packages and Methods
agricolae: Compact Letter Display
The agricolae package produces grouped letters—groups sharing a letter don’t differ significantly. This format is standard in agricultural and biological research:
# HSD.test from agricolae
hsd_agricolae <- HSD.test(aov_model, "treatment", group = TRUE)
# View grouped results
print(hsd_agricolae$groups)
Output:
yield groups
NPK 6.298 a
Nitrogen 5.141 b
Phosphorus 4.789 bc
Control 4.213 c
NPK (group “a”) differs from all others. Nitrogen (group “b”) overlaps with phosphorus (“bc”), meaning they don’t differ significantly. This format works perfectly for annotating bar charts:
# Create summary for plotting
group_summary <- aggregate(yield ~ treatment, data = fertilizer_data,
FUN = function(x) c(mean = mean(x), se = sd(x)/sqrt(length(x))))
group_summary <- do.call(data.frame, group_summary)
names(group_summary) <- c("treatment", "mean", "se")
# Add letter groups
group_summary$letters <- c("c", "b", "bc", "a") # Match order
# Bar plot with significance letters
ggplot(group_summary, aes(x = treatment, y = mean)) +
geom_col(fill = "steelblue", alpha = 0.8) +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width = 0.2) +
geom_text(aes(y = mean + se + 0.2, label = letters), size = 5) +
labs(x = "Treatment", y = "Yield (tons/ha)") +
theme_minimal()
multcomp: Maximum Flexibility
The multcomp package handles complex designs and custom contrasts:
# General linear hypothesis approach
glht_result <- glht(aov_model, linfct = mcp(treatment = "Tukey"))
summary(glht_result)
# Compact letter display
cld(glht_result)
Use multcomp when you need custom contrast matrices or when working with generalized linear models.
Common Pitfalls and Best Practices
Handling Unbalanced Designs
Tukey’s HSD assumes equal sample sizes. With unbalanced data, use Tukey-Kramer (the default in R’s TukeyHSD() actually implements this adjustment automatically). For severely unbalanced designs, consider Games-Howell:
# Games-Howell via rstatix package
library(rstatix)
games_howell_test(fertilizer_data, yield ~ treatment)
Choosing the Right Post-Hoc Test
- Tukey’s HSD: All pairwise comparisons, equal importance
- Dunnett’s: Compare treatments to a single control
- Bonferroni: Conservative, works with any comparison set
- Scheffé: Most conservative, allows complex contrasts
Reporting Results
For academic papers, report the test statistic, degrees of freedom, and adjusted p-values:
“Post-hoc analysis using Tukey’s HSD revealed that NPK fertilizer produced significantly higher yields than the control (mean difference = 2.09 tons/ha, 95% CI [1.42, 2.75], p < 0.001), nitrogen alone (mean difference = 1.16 tons/ha, p < 0.001), and phosphorus alone (mean difference = 1.51 tons/ha, p < 0.001).”
Always report effect sizes alongside p-values. The mean differences and confidence intervals from Tukey’s output serve this purpose directly.
Final Recommendations
Start with TukeyHSD() for quick analysis and interpretation. Switch to agricolae::HSD.test() when you need compact letter displays for figures. Reserve multcomp::glht() for complex experimental designs or when you need compatibility with mixed models. Whatever method you choose, always verify your ANOVA assumptions first—no post-hoc test can rescue conclusions drawn from violated assumptions.