Wilcoxon Signed-Rank Test in R: Step-by-Step Guide

Key Insights

  • The Wilcoxon signed-rank test is your go-to method when paired data violates normality assumptions—it tests whether the median difference between paired observations equals zero without requiring normally distributed differences.
  • Always use paired = TRUE in wilcox.test() for matched samples; forgetting this parameter runs a completely different test (Mann-Whitney U) and invalidates your analysis.
  • Calculate effect size manually using r = Z/√N since R doesn’t provide it automatically—this metric is essential for practical significance beyond just statistical significance.

Introduction to the Wilcoxon Signed-Rank Test

The Wilcoxon signed-rank test is a non-parametric statistical test that serves as the robust alternative to the paired t-test. Developed by Frank Wilcoxon in 1945, it tests whether the median difference between paired observations equals zero. Unlike its parametric counterpart, it doesn’t assume your data follows a normal distribution.

Use this test when you have:

  • Paired or matched samples: Before/after measurements, matched case-control studies, or repeated measures on the same subjects
  • Non-normal distributions: When your difference scores are skewed, have outliers, or fail normality tests
  • Ordinal data: When measurements are ranked rather than continuous
  • Small sample sizes: When you can’t rely on the Central Limit Theorem to save your t-test

The test works by ranking the absolute differences between pairs, then comparing the sum of positive ranks against negative ranks. If there’s no treatment effect, these sums should be roughly equal.
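To make the mechanics concrete, here is a minimal sketch that computes the positive-rank sum by hand, using the same before/after pain scores introduced later in this guide; the result matches the V statistic that wilcox.test() reports:

```r
# Hand-compute the signed-rank statistic for the pain-score example
before <- c(7, 8, 6, 9, 5, 8, 7, 6, 9, 8, 7, 6, 8, 7, 9)
after  <- c(5, 6, 5, 7, 4, 5, 6, 4, 6, 5, 5, 4, 6, 5, 7)

d <- before - after          # paired differences
d <- d[d != 0]               # zero differences are dropped
r <- rank(abs(d))            # rank absolute differences (ties get average ranks)
V <- sum(r[d > 0])           # sum of ranks belonging to positive differences

cat("Positive-rank sum (V):", V, "\n")
```

If there were no treatment effect, roughly half the ranks would attach to negative differences; here every difference is positive, so V takes its maximum value.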

Assumptions and Prerequisites

Before running the test, verify these assumptions:

  1. Paired observations: Each observation in one group has a corresponding observation in the other group
  2. Independence: The pairs themselves are independent of each other
  3. Symmetric distribution of differences: The differences between pairs should be roughly symmetric around the median (not necessarily normal, but symmetric)
  4. Continuous or ordinal measurement scale: The data should be at least ordinal

Compare this to the paired t-test, which additionally requires normally distributed differences. The Wilcoxon test trades some statistical power for robustness—a worthwhile exchange when normality is questionable.

Let’s set up our R environment:

# Base R includes stats package by default
# Install visualization packages if needed
if (!require("ggplot2")) install.packages("ggplot2")
if (!require("dplyr")) install.packages("dplyr")
if (!require("tidyr")) install.packages("tidyr")  # used later for pivot_longer()

library(ggplot2)
library(dplyr)

Preparing Your Data

The Wilcoxon signed-rank test expects paired observations. You can structure your data in two ways: two separate vectors or a data frame with columns for each condition.

Let’s create a realistic scenario: measuring pain scores (1-10 scale) before and after a new treatment for 15 patients.

# Set seed for reproducibility
set.seed(42)

# Create sample dataset: pain scores before and after treatment
pain_data <- data.frame(
  patient_id = 1:15,
  before = c(7, 8, 6, 9, 5, 8, 7, 6, 9, 8, 7, 6, 8, 7, 9),
  after = c(5, 6, 5, 7, 4, 5, 6, 4, 6, 5, 5, 4, 6, 5, 7)
)

# Calculate differences
pain_data$difference <- pain_data$after - pain_data$before

# View the data
print(pain_data)

Handle missing values carefully. The Wilcoxon test requires complete pairs:

# Check for missing values
sum(is.na(pain_data$before))
sum(is.na(pain_data$after))

# Remove incomplete pairs if necessary
pain_data_clean <- pain_data %>%
  filter(!is.na(before) & !is.na(after))

# Verify data structure
str(pain_data_clean)
summary(pain_data_clean)

Before proceeding, check the symmetry assumption by examining the distribution of differences:

# Quick visual check for symmetry
hist(pain_data$difference, 
     main = "Distribution of Differences",
     xlab = "After - Before",
     col = "steelblue")

# Descriptive statistics
cat("Mean difference:", mean(pain_data$difference), "\n")
cat("Median difference:", median(pain_data$difference), "\n")
cat("Skewness check - if mean ≈ median, distribution is roughly symmetric\n")

Running the Test in R

The wilcox.test() function handles the Wilcoxon signed-rank test. The critical parameter is paired = TRUE—without it, you’re running the Mann-Whitney U test instead.

# Basic Wilcoxon signed-rank test
result <- wilcox.test(
  pain_data$before, 
  pain_data$after, 
  paired = TRUE,
  conf.int = TRUE
)

print(result)

Key parameters you should know:

# Full specification with all important parameters
result_full <- wilcox.test(
  pain_data$before,
  pain_data$after,
  paired = TRUE,
  alternative = "two.sided",  # or "less", "greater"
  conf.int = TRUE,            # request confidence interval
  conf.level = 0.95,          # confidence level
  exact = NULL                # NULL lets R decide; TRUE/FALSE forces choice
)

print(result_full)

The alternative parameter specifies your hypothesis:

  • "two.sided": Tests if median difference ≠ 0 (default)
  • "less": Tests if median difference < 0
  • "greater": Tests if median difference > 0

For our pain study, we expect treatment to reduce pain, so before scores should exceed after scores. Since before is passed as the first argument, alternative = "greater" tests whether before - after > 0:

# One-sided test: does treatment reduce pain?
result_onesided <- wilcox.test(
  pain_data$before,
  pain_data$after,
  paired = TRUE,
  alternative = "greater",  # before > after
  conf.int = TRUE
)

print(result_onesided)

Interpreting Results

The output contains several components. Let’s break them down:

# Extract and interpret components
cat("Test statistic (V):", result$statistic, "\n")
cat("P-value:", result$p.value, "\n")
cat("Confidence interval:", result$conf.int[1], "to", result$conf.int[2], "\n")
cat("Pseudomedian:", result$estimate, "\n")

V statistic: This is the sum of positive ranks. Under the null hypothesis (no difference), V should be close to n(n+1)/4. Large deviations indicate a significant effect.
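A quick sanity check compares the observed V against this null expectation. A sketch for the 15 pain-score pairs, where every difference is positive, so V sits at its maximum:

```r
n <- 15                          # number of non-zero paired differences
null_V <- n * (n + 1) / 4        # expected V under the null hypothesis
max_V  <- n * (n + 1) / 2        # V when every difference is positive

cat("Expected V under H0:", null_V, "\n")   # 60
cat("Maximum possible V:", max_V, "\n")     # 120
```

The observed V of 120 is as far from the null expectation of 60 as the data allow, which is why the p-value is so small.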

P-value: The probability of observing results this extreme if the null hypothesis were true. With α = 0.05, p < 0.05 indicates statistical significance.

Confidence interval: The Hodges-Lehmann estimate provides a confidence interval for the pseudomedian of differences—a robust measure of central tendency.
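Under the hood, the pseudomedian reported by wilcox.test() is the Hodges-Lehmann estimator: the median of all pairwise Walsh averages of the differences. A minimal sketch, using the before - after differences from the example:

```r
# Hodges-Lehmann pseudomedian: median of Walsh averages (d[i] + d[j]) / 2, i <= j
d <- c(2, 2, 1, 2, 1, 3, 1, 2, 3, 3, 2, 2, 2, 2, 2)  # before - after
walsh <- outer(d, d, "+") / 2
walsh <- walsh[upper.tri(walsh, diag = TRUE)]         # keep each pair once

cat("Pseudomedian:", median(walsh), "\n")             # 2
```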

R doesn’t provide effect size automatically, so calculate it manually:

# Calculate effect size (r = Z / sqrt(N))
# Back-transform the two-sided p-value to an approximate Z-score
z_score <- qnorm(result$p.value / 2)  # lower-tail quantile at p/2

# Number of pairs
n_pairs <- nrow(pain_data)

# Effect size r
effect_size_r <- abs(z_score) / sqrt(n_pairs)

cat("Effect size (r):", round(effect_size_r, 3), "\n")

# Interpretation guidelines (Cohen's conventions adapted)
interpret_r <- function(r) {
  if (r < 0.1) return("negligible")
  if (r < 0.3) return("small")
  if (r < 0.5) return("medium")
  return("large")
}

cat("Interpretation:", interpret_r(effect_size_r), "effect\n")

Visualizing Paired Differences

Visualizations strengthen your statistical findings. Here are two effective approaches:

# 1. Boxplot of differences
ggplot(pain_data, aes(x = "", y = difference)) +
  geom_boxplot(fill = "steelblue", alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red", linewidth = 1) +
  geom_jitter(width = 0.1, alpha = 0.5, size = 2) +
  labs(
    title = "Distribution of Pain Score Differences",
    subtitle = "After Treatment - Before Treatment",
    y = "Difference in Pain Score",
    x = ""
  ) +
  theme_minimal() +
  theme(axis.text.x = element_blank())

The paired line plot shows individual trajectories:

# 2. Paired line plot (spaghetti plot)
pain_long <- pain_data %>%
  tidyr::pivot_longer(
    cols = c(before, after),
    names_to = "timepoint",
    values_to = "pain_score"
  ) %>%
  mutate(timepoint = factor(timepoint, levels = c("before", "after")))

ggplot(pain_long, aes(x = timepoint, y = pain_score, group = patient_id)) +
  geom_line(alpha = 0.5, color = "gray50") +
  geom_point(aes(color = timepoint), size = 3) +
  stat_summary(
    aes(group = 1),
    fun = median,
    geom = "line",
    color = "red",
    linewidth = 1.5,
    linetype = "dashed"
  ) +
  scale_color_manual(values = c("before" = "coral", "after" = "steelblue")) +
  labs(
    title = "Individual Pain Score Trajectories",
    subtitle = "Red dashed line shows median trajectory",
    x = "Timepoint",
    y = "Pain Score"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Common Pitfalls and Best Practices

Handling ties: When multiple differences share the same absolute value, R assigns them average ranks. With ties present, wilcox.test() cannot compute an exact p-value at all; it falls back to the normal approximation and issues a warning:

# For large samples with ties, use normal approximation
result_approx <- wilcox.test(
  pain_data$before,
  pain_data$after,
  paired = TRUE,
  exact = FALSE,  # Force normal approximation
  correct = TRUE  # Apply continuity correction
)

print(result_approx)

Zero differences: Pairs with identical values (difference = 0) are excluded from analysis. If you have many zeros, consider whether your measurement is sensitive enough.
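It helps to count zeros before running the test so you know how many pairs will be dropped. A sketch, with one hypothetical tied pair appended to the example differences:

```r
# One tied pair (difference of 0) added for illustration
diffs <- c(2, 2, 1, 2, 1, 3, 1, 2, 3, 3, 2, 2, 2, 2, 0)

n_zero <- sum(diffs == 0)
cat("Pairs excluded (zero difference):", n_zero, "\n")  # 1
cat("Effective sample size:", sum(diffs != 0), "\n")    # 14
```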

Sample size considerations:

  • Small samples (roughly n < 20): prefer exact p-values (exact = TRUE), provided there are no ties
  • Larger samples or ties present: the normal approximation is acceptable
  • With the default exact = NULL, R computes an exact p-value only when there are fewer than 50 values and no ties or zero differences
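You can see the tie behavior directly: forcing exact = TRUE on data with tied absolute differences makes R warn and fall back to the normal approximation anyway. A sketch on the example data:

```r
before <- c(7, 8, 6, 9, 5, 8, 7, 6, 9, 8, 7, 6, 8, 7, 9)
after  <- c(5, 6, 5, 7, 4, 5, 6, 4, 6, 5, 5, 4, 6, 5, 7)

# Tied absolute differences are present, so R warns
# ("cannot compute exact p-value with ties") and approximates anyway
result_exact <- wilcox.test(before, after, paired = TRUE, exact = TRUE)
print(result_exact)
```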

Reporting standards: Include these elements in your write-up:

# Generate a complete report
report_wilcoxon <- function(test_result, n, effect_r) {
  cat("Wilcoxon Signed-Rank Test Results\n")
  cat("==================================\n")
  cat(sprintf("V = %.0f, p %s %.3f\n", 
              test_result$statistic,
              ifelse(test_result$p.value < 0.001, "<", "="),
              max(test_result$p.value, 0.001)))
  cat(sprintf("Effect size r = %.2f\n", effect_r))
  cat(sprintf("95%% CI for pseudomedian: [%.2f, %.2f]\n",
              test_result$conf.int[1], test_result$conf.int[2]))
  cat(sprintf("N pairs = %d\n", n))
}

report_wilcoxon(result, nrow(pain_data), effect_size_r)

When to use the paired t-test instead: If your differences are normally distributed (check with Shapiro-Wilk test) and you have no extreme outliers, the paired t-test has more statistical power. The Wilcoxon test is your safety net, not your default choice.

# Test normality of differences
shapiro.test(pain_data$difference)
# If p > 0.05, normality assumption holds; consider paired t-test
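When normality holds, the comparison is easy to run. A sketch of the paired t-test on the same example data:

```r
before <- c(7, 8, 6, 9, 5, 8, 7, 6, 9, 8, 7, 6, 8, 7, 9)
after  <- c(5, 6, 5, 7, 4, 5, 6, 4, 6, 5, 5, 4, 6, 5, 7)

# Parametric counterpart: tests whether the mean difference is zero
t_result <- t.test(before, after, paired = TRUE)
print(t_result)
```

With well-behaved differences, the two tests usually agree on significance; the t-test simply buys a little extra power.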

The Wilcoxon signed-rank test is a workhorse for paired data analysis. Master it, and you’ll have a reliable tool for situations where parametric assumptions fail—which happens more often than most analysts admit.
