Kruskal-Wallis Test in R: Step-by-Step Guide

Key Insights

  • The Kruskal-Wallis test is your go-to method when comparing three or more independent groups and your data violates ANOVA’s normality assumption—it ranks your data instead of using raw values, making it robust to outliers and skewed distributions.
  • A significant Kruskal-Wallis result only tells you that differences exist somewhere; you must follow up with Dunn’s test (not pairwise Wilcoxon tests) to identify which specific groups differ.
  • Always report effect size alongside p-values—epsilon-squared gives you practical significance, not just statistical significance.

Introduction to the Kruskal-Wallis Test

The Kruskal-Wallis test is the non-parametric alternative to one-way ANOVA. When your data doesn’t meet normality assumptions or you’re working with ordinal scales, this rank-based test becomes essential.

Use the Kruskal-Wallis test when you have:

  • Three or more independent groups to compare
  • A continuous or ordinal dependent variable
  • Data that fails normality tests or contains significant outliers
  • Small sample sizes where normality is difficult to assess

The test works by ranking all observations across groups, then comparing whether the mean ranks differ significantly between groups. The null hypothesis states that all groups come from populations with identical distributions.
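To make the ranking mechanics concrete, here is a hand-rolled sketch of the H statistic on a tiny made-up dataset (chosen with no ties, so no tie correction is needed). kruskal.test() applies the same formula, plus a tie correction:

```r
# Hand-rolled H statistic on toy data: rank everything together, then
# compare each group's mean rank to the overall mean rank (n + 1) / 2
scores <- c(3, 5, 4, 8, 9, 7, 2, 6, 1)
groups <- rep(c("A", "B", "C"), each = 3)

r <- rank(scores)                      # pool and rank all observations
n <- length(scores)
mean_ranks <- tapply(r, groups, mean)  # mean rank per group
n_i <- tapply(r, groups, length)       # group sizes

# H = 12 / (n(n+1)) * sum( n_i * (mean_rank_i - (n+1)/2)^2 )
H <- 12 / (n * (n + 1)) * sum(n_i * (mean_ranks - (n + 1) / 2)^2)
H  # matches kruskal.test(scores ~ factor(groups))$statistic here
```

With these ties-free toy data the result agrees exactly with kruskal.test(); with ties, the built-in function divides H by a correction factor.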

Three assumptions must hold:

  1. Independence: Observations within and between groups are independent
  2. Scale: The dependent variable is at least ordinal
  3. Distribution shape: Groups have similarly shaped distributions (though not necessarily normal)

That third assumption often gets overlooked. Strictly, Kruskal-Wallis tests whether observations in one group tend to be larger than in another; only when the groups share similarly shaped distributions can you read a significant result as a difference in medians rather than a difference in shape.

Preparing Your Data in R

Your data must be in long format—one column for the grouping variable and one for the measured values. Here’s how to set up a proper dataset and check assumptions:

# Create sample dataset: pain scores across three treatment groups
set.seed(42)
treatment_data <- data.frame(
  treatment = rep(c("Drug_A", "Drug_B", "Placebo"), each = 25),
  pain_score = c(
    sample(1:5, 25, replace = TRUE, prob = c(0.1, 0.2, 0.3, 0.25, 0.15)),
    sample(1:5, 25, replace = TRUE, prob = c(0.3, 0.35, 0.2, 0.1, 0.05)),
    sample(1:5, 25, replace = TRUE, prob = c(0.15, 0.2, 0.25, 0.25, 0.15))
  )
)

# Convert treatment to factor
treatment_data$treatment <- factor(treatment_data$treatment)

# Check structure
str(treatment_data)
head(treatment_data)

Before running Kruskal-Wallis, verify that ANOVA assumptions are violated (justifying the non-parametric approach):

# Test normality within each group using Shapiro-Wilk
by(treatment_data$pain_score, treatment_data$treatment, shapiro.test)

# Alternative: test all groups at once with tapply
tapply(treatment_data$pain_score, treatment_data$treatment, function(x) {
  shapiro.test(x)$p.value
})

If any group shows p < 0.05 on Shapiro-Wilk, you have evidence against normality. For ordinal data like Likert scales, normality testing is arguably unnecessary—ordinal data inherently violates interval-level assumptions required by ANOVA.

Running the Kruskal-Wallis Test

The kruskal.test() function is built into base R. The syntax mirrors other statistical tests:

# Run Kruskal-Wallis test
kw_result <- kruskal.test(pain_score ~ treatment, data = treatment_data)
print(kw_result)

Output interpretation:

	Kruskal-Wallis rank sum test

data:  pain_score by treatment
Kruskal-Wallis chi-squared = 8.7234, df = 2, p-value = 0.0127

The output gives you:

  • Kruskal-Wallis chi-squared (H-statistic): The test statistic. Higher values indicate greater differences between group ranks.
  • df: Degrees of freedom, equal to number of groups minus one.
  • p-value: Probability of observing this result if the null hypothesis were true.

With p = 0.0127, we reject the null hypothesis at α = 0.05. At least one group differs from the others. But which ones? The Kruskal-Wallis test doesn’t tell you—that requires post-hoc analysis.
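Each of these numbers is accessible from the returned htest object, which is handy for inline reporting. A minimal self-contained sketch with toy data:

```r
# kruskal.test() returns an htest list; pull out the pieces for reporting
# (toy data here so the snippet stands alone)
kw <- kruskal.test(list(c(1, 2, 3), c(4, 5, 6), c(2, 3, 4)))

unname(kw$statistic)  # the H value (chi-squared approximation)
unname(kw$parameter)  # df = number of groups - 1
kw$p.value            # compare against your alpha
```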

Post-Hoc Analysis with Dunn’s Test

A significant Kruskal-Wallis result demands pairwise comparisons, and Dunn's test is designed for exactly this: it reuses the pooled ranks from the omnibus test. Don't substitute pairwise Wilcoxon tests; they re-rank each pair from scratch, discarding the pooled ranking the omnibus result was based on.

Install and use the dunn.test package:

# Install if needed
# install.packages("dunn.test")

library(dunn.test)

# Run Dunn's test with Bonferroni correction
dunn_result <- dunn.test(
  treatment_data$pain_score,
  treatment_data$treatment,
  method = "bonferroni"
)

Alternatively, use FSA::dunnTest() for a tidier output:

# install.packages("FSA")
library(FSA)

dunn_fsa <- dunnTest(
  pain_score ~ treatment,
  data = treatment_data,
  method = "holm"  # Holm is less conservative than Bonferroni
)

print(dunn_fsa)

Output example:

        Comparison      Z  P.unadj   P.adj
1  Drug_A - Drug_B  2.847   0.0044  0.0132
2 Drug_A - Placebo  0.892   0.3724  0.3724
3 Drug_B - Placebo -1.955   0.0506  0.1012

Choose your adjustment method wisely:

  • Bonferroni: Most conservative, multiplies p-values by number of comparisons
  • Holm: Less conservative, maintains family-wise error rate with more power
  • BH (Benjamini-Hochberg): Controls false discovery rate, best for exploratory analysis

For confirmatory research, stick with Holm. For exploratory work, BH is acceptable.
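The difference between these corrections is easy to see with base R's p.adjust(), applied here to the three unadjusted p-values from the example output above:

```r
# Compare adjustment methods on the unadjusted p-values shown above
p_raw <- c(0.0044, 0.3724, 0.0506)

p.adjust(p_raw, method = "bonferroni")  # each p multiplied by 3, capped at 1
p.adjust(p_raw, method = "holm")        # step-down: harshest penalty on smallest p
p.adjust(p_raw, method = "BH")          # controls the false discovery rate
```

The Holm line reproduces the P.adj column of the Dunn's test output above, since that run used method = "holm".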

Calculating Effect Size

P-values tell you whether an effect exists; effect size tells you whether it matters. For Kruskal-Wallis, epsilon-squared (ε²) is the standard effect size measure:

# Manual calculation of epsilon-squared
H <- kw_result$statistic
n <- nrow(treatment_data)
epsilon_squared <- H / (n - 1)

cat("Epsilon-squared:", round(epsilon_squared, 4), "\n")

Interpretation guidelines (from Tomczak & Tomczak, 2014):

  • ε² < 0.01: Negligible
  • 0.01 ≤ ε² < 0.06: Small
  • 0.06 ≤ ε² < 0.14: Medium
  • ε² ≥ 0.14: Large
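These cutoffs are easy to encode in a small helper (a convenience sketch, not a standard function), so the magnitude label travels with the number:

```r
# Map epsilon-squared values to Tomczak & Tomczak (2014) magnitude labels
epsilon_label <- function(eps2) {
  cut(eps2,
      breaks = c(-Inf, 0.01, 0.06, 0.14, Inf),
      labels = c("negligible", "small", "medium", "large"),
      right = FALSE)  # intervals are [lower, upper), matching the guidelines
}

epsilon_label(c(0.005, 0.03, 0.09, 0.20))
```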

For a cleaner approach, use rstatix:

# install.packages("rstatix")
library(rstatix)

effect_size <- treatment_data %>%
  kruskal_effsize(pain_score ~ treatment)

print(effect_size)

This returns eta-squared based on the H statistic (η²[H]) with a magnitude label; it is closely related to epsilon-squared and interpreted on similar cutoffs.

Visualizing Results

Box plots are the standard visualization for group comparisons. Add significance annotations with ggpubr:

library(ggplot2)
library(ggpubr)

# Create boxplot with Kruskal-Wallis p-value
ggplot(treatment_data, aes(x = treatment, y = pain_score, fill = treatment)) +
  geom_boxplot(alpha = 0.7, outlier.shape = 21) +
  geom_jitter(width = 0.2, alpha = 0.5, size = 2) +
  stat_compare_means(method = "kruskal.test", label.y = 5.5) +
  labs(
    title = "Pain Scores by Treatment Group",
    x = "Treatment",
    y = "Pain Score (1-5)",
    caption = "Kruskal-Wallis test with Dunn's post-hoc"
  ) +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_fill_brewer(palette = "Set2")

For pairwise comparisons on the plot, one caveat: ggpubr has no built-in Dunn's test, so stat_compare_means() uses pairwise Wilcoxon tests for the brackets. Treat them as approximate visual guides and report Dunn's adjusted p-values in your text:

# Define comparisons
my_comparisons <- list(
  c("Drug_A", "Drug_B"),
  c("Drug_A", "Placebo"),
  c("Drug_B", "Placebo")
)

ggplot(treatment_data, aes(x = treatment, y = pain_score, fill = treatment)) +
  geom_boxplot(alpha = 0.7) +
  stat_compare_means(
    comparisons = my_comparisons,
    method = "wilcox.test",
    p.adjust.method = "holm",
    label = "p.signif"
  ) +
  stat_compare_means(method = "kruskal.test", label.y = 6.5) +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")

Complete Worked Example

Here’s a full reproducible analysis comparing customer satisfaction scores across three store locations:

# Load required packages
library(tidyverse)
library(FSA)
library(rstatix)
library(ggpubr)

# Generate realistic satisfaction data (1-10 scale)
set.seed(123)
satisfaction_data <- data.frame(
  store = factor(rep(c("Downtown", "Mall", "Suburban"), each = 40)),
  satisfaction = c(
    round(rnorm(40, mean = 7.2, sd = 1.5)),
    round(rnorm(40, mean = 6.1, sd = 2.0)),
    round(rnorm(40, mean = 7.8, sd = 1.2))
  )
) %>%
  mutate(satisfaction = pmin(pmax(satisfaction, 1), 10))  # Bound to 1-10

# Step 1: Descriptive statistics
satisfaction_data %>%
  group_by(store) %>%
  summarise(
    n = n(),
    median = median(satisfaction),
    IQR = IQR(satisfaction),
    mean_rank = mean(rank(satisfaction))
  )

# Step 2: Check normality (justifying non-parametric approach)
satisfaction_data %>%
  group_by(store) %>%
  summarise(
    shapiro_p = shapiro.test(satisfaction)$p.value
  )

# Step 3: Run Kruskal-Wallis test
kw_test <- kruskal.test(satisfaction ~ store, data = satisfaction_data)
print(kw_test)

# Step 4: Effect size
eff_size <- satisfaction_data %>%
  kruskal_effsize(satisfaction ~ store)
print(eff_size)

# Step 5: Post-hoc Dunn's test (only if KW is significant)
if (kw_test$p.value < 0.05) {
  dunn_results <- dunnTest(
    satisfaction ~ store,
    data = satisfaction_data,
    method = "holm"
  )
  print(dunn_results)
}

# Step 6: Visualization
ggplot(satisfaction_data, aes(x = store, y = satisfaction, fill = store)) +
  geom_boxplot(alpha = 0.8, outlier.alpha = 0) +
  geom_jitter(width = 0.15, alpha = 0.4, size = 1.5) +
  stat_compare_means(
    method = "kruskal.test",
    label.x = 1.5,
    label.y = 11
  ) +
  labs(
    title = "Customer Satisfaction by Store Location",
    subtitle = paste0("Effect size (η²[H]) = ", round(eff_size$effsize, 3)),
    x = "Store Location",
    y = "Satisfaction Score"
  ) +
  theme_minimal() +
  theme(legend.position = "none") +
  coord_cartesian(ylim = c(0, 12))

This script produces a complete analysis: descriptive statistics, assumption checking, the omnibus test, effect size, post-hoc comparisons, and publication-ready visualization. Adapt the variable names and you have a template for any Kruskal-Wallis analysis.
