How to Perform the Wilcoxon Signed-Rank Test in R
Key Insights
- The Wilcoxon signed-rank test is your go-to method when paired data violates normality assumptions: it compares the median of differences rather than means and requires only rough symmetry of the difference distribution, not normality.
- R’s wilcox.test() function handles the heavy lifting, but you must set paired = TRUE and understand when to use exact versus asymptotic p-values for reliable results.
- Effect size matters more than p-values for practical interpretation—always calculate and report the rank-biserial correlation (r = Z/√N) alongside your test statistics.
Introduction to the Wilcoxon Signed-Rank Test
The Wilcoxon signed-rank test is a non-parametric statistical method for comparing two related samples. When your paired data doesn’t meet the normality requirements of a paired t-test, this test provides a robust alternative that focuses on ranks rather than raw values.
Frank Wilcoxon introduced this test in 1945, and it remains one of the most widely used non-parametric methods in applied research. The test works by ranking the absolute differences between paired observations, then comparing the sum of positive ranks to the sum of negative ranks. If there’s no systematic difference between conditions, these sums should be roughly equal.
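To make the mechanics concrete, here is the ranking procedure carried out by hand on a small made-up sample (the data and variable names are my own, purely for illustration):

```r
# Tiny illustrative sample (hypothetical values)
before <- c(10, 14, 9, 16, 12)
after  <- c(12, 13, 15, 20, 11)

d   <- after - before   # paired differences: 2, -1, 6, 4, -1
d   <- d[d != 0]        # zero differences are dropped before ranking
rnk <- rank(abs(d))     # rank the absolute differences (ties get average ranks)

sum(rnk[d > 0])         # sum of positive ranks: 12
sum(rnk[d < 0])         # sum of negative ranks: 3
```

With n = 5 non-zero differences, each sum would sit near n(n+1)/4 = 7.5 if there were no systematic difference; here the positive ranks clearly dominate.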
The key distinction from its parametric cousin: while the paired t-test compares means and assumes normally distributed differences, the Wilcoxon signed-rank test compares the median of differences and only assumes symmetry in the difference distribution. This makes it considerably more robust to outliers and skewed data.
When to Use This Test
Choose the Wilcoxon signed-rank test when you have paired or matched observations and at least one of these conditions applies:
Non-normal distributions: Your difference scores fail normality tests or show obvious skewness in visual inspection. This is especially common with small samples where the Central Limit Theorem can’t save you.
Ordinal data: You’re working with Likert scales, rankings, or other ordered categorical data where arithmetic means don’t make conceptual sense.
Small sample sizes: With fewer than 30 pairs, normality assumptions become critical for t-tests. The Wilcoxon test sidesteps this concern entirely.
Outlier sensitivity: Your data contains extreme values that would unduly influence mean-based tests.
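The outlier point is easy to demonstrate; the numbers below are purely illustrative:

```r
# One subject improves massively; the other nine improve modestly
before <- c(10, 12, 11, 14, 13, 15, 12, 16, 14, 10)
after  <- before + c(1, 2, 3, 4, 5, 6, 7, 8, 9, 80)

t.test(after, before, paired = TRUE)$p.value       # outlier inflates the SD: p > 0.05
wilcox.test(after, before, paired = TRUE)$p.value  # rank-based: p below 0.01
```

Every subject improved, yet the t-test misses it because one extreme difference blows up the standard error; the rank-based test is barely affected.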
Real-world applications abound. Clinical trials comparing patient symptoms before and after treatment. Educational research measuring student performance pre- and post-intervention. User experience studies comparing task completion times across interface versions. Any scenario where you’re measuring the same subjects twice fits this framework.
Assumptions and Requirements
The Wilcoxon signed-rank test isn’t assumption-free—it just has different assumptions than parametric alternatives:
- Paired observations: Each observation in one group must have a corresponding observation in the other group.
- Independence: The pairs themselves must be independent of each other.
- Continuous or ordinal measurement: The dependent variable must be at least ordinal.
- Symmetric difference distribution: The distribution of differences should be roughly symmetric around the median (not necessarily normal).
Before running the test, verify whether a non-parametric approach is actually necessary:
# Generate example paired data
set.seed(42)
before <- c(23, 45, 67, 32, 54, 78, 43, 29, 61, 38, 52, 47)
after <- c(28, 51, 69, 38, 62, 75, 55, 34, 70, 45, 58, 53)
# Calculate differences
differences <- after - before
# Test normality of differences using Shapiro-Wilk
shapiro_result <- shapiro.test(differences)
print(shapiro_result)
# Visual inspection with histogram and Q-Q plot
par(mfrow = c(1, 2))
hist(differences, main = "Distribution of Differences",
xlab = "After - Before", col = "steelblue")
qqnorm(differences)
qqline(differences, col = "red", lwd = 2)
If the Shapiro-Wilk p-value exceeds 0.05 and your Q-Q plot looks reasonably linear, you could use a paired t-test. Otherwise, proceed with Wilcoxon.
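That decision rule can be wrapped in a small helper; choose_paired_test is a hypothetical name, not a standard function:

```r
# Sketch: pick a paired test from the Shapiro-Wilk check described above
choose_paired_test <- function(x, y, alpha = 0.05) {
  d <- x - y
  if (shapiro.test(d)$p.value > alpha) "paired t-test" else "Wilcoxon signed-rank test"
}

# Same example vectors as above
before <- c(23, 45, 67, 32, 54, 78, 43, 29, 61, 38, 52, 47)
after  <- c(28, 51, 69, 38, 62, 75, 55, 34, 70, 45, 58, 53)
choose_paired_test(after, before)
```

Treat this as a screening aid, not a substitute for actually looking at the plots.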
Performing the Test in R
R’s built-in wilcox.test() function handles the Wilcoxon signed-rank test. The critical parameter is paired = TRUE:
# Basic two-tailed test
result <- wilcox.test(after, before, paired = TRUE)
print(result)
# Alternative: the same data stored in long format, reshaped to wide for the paired call
df <- data.frame(
subject = rep(1:12, 2),
time = rep(c("before", "after"), each = 12),
score = c(before, after)
)
# Reshape for paired comparison
library(tidyr)
df_wide <- pivot_wider(df, names_from = time, values_from = score)
wilcox.test(df_wide$after, df_wide$before, paired = TRUE)
For directional hypotheses, specify the alternative parameter:
# One-tailed test: after scores are greater than before
result_greater <- wilcox.test(after, before,
paired = TRUE,
alternative = "greater")
print(result_greater)
# One-tailed test: after scores are less than before
result_less <- wilcox.test(after, before,
paired = TRUE,
alternative = "less")
print(result_less)
To obtain confidence intervals for the pseudo-median of differences:
# Include confidence interval
result_ci <- wilcox.test(after, before,
paired = TRUE,
conf.int = TRUE,
conf.level = 0.95)
print(result_ci)
Interpreting Results
The test output includes several components that require careful interpretation:
# Run test with confidence interval
result <- wilcox.test(after, before, paired = TRUE, conf.int = TRUE)
# Extract individual components
v_statistic <- result$statistic
p_value <- result$p.value
pseudo_median <- result$estimate
conf_interval <- result$conf.int
cat("V statistic:", v_statistic, "\n")
cat("P-value:", round(p_value, 4), "\n")
cat("Pseudo-median of differences:", round(pseudo_median, 2), "\n")
cat("95% CI:", round(conf_interval[1], 2), "to", round(conf_interval[2], 2), "\n")
The V statistic represents the sum of positive ranks. Under the null hypothesis (no difference), V should equal approximately n(n+1)/4, where n is the number of non-zero differences.
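You can confirm this by recomputing the positive-rank sum directly from the example vectors used earlier:

```r
before <- c(23, 45, 67, 32, 54, 78, 43, 29, 61, 38, 52, 47)
after  <- c(28, 51, 69, 38, 62, 75, 55, 34, 70, 45, 58, 53)

d   <- after - before
d   <- d[d != 0]        # drop zero differences
rnk <- rank(abs(d))
sum(rnk[d > 0])         # manual V = 76

# Matches the statistic reported by wilcox.test (which warns about ties here)
wilcox.test(after, before, paired = TRUE)$statistic
```

With n = 12 non-zero differences, the null expectation is 12 × 13 / 4 = 39, so V = 76 leans strongly positive.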
The p-value indicates the probability of observing a V statistic as extreme as yours if the null hypothesis were true. Standard threshold: reject null if p < 0.05.
The pseudo-median (Hodges-Lehmann estimate) represents the median of all pairwise averages of differences—a robust measure of central tendency.
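You can reproduce it from the Walsh averages yourself (same example vectors; when ties force the asymptotic method, wilcox.test estimates this value numerically, so agreement is close rather than exact):

```r
before <- c(23, 45, 67, 32, 54, 78, 43, 29, 61, 38, 52, 47)
after  <- c(28, 51, 69, 38, 62, 75, 55, 34, 70, 45, 58, 53)

d <- after - before
s <- outer(d, d, "+") / 2               # (d_i + d_j) / 2 for every pair
walsh <- s[lower.tri(s, diag = TRUE)]   # keep each pair once (i <= j)
median(walsh)                           # Hodges-Lehmann pseudo-median

wilcox.test(after, before, paired = TRUE, conf.int = TRUE)$estimate
```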
Effect size calculation is essential for practical significance:
# Calculate effect size (r = Z / sqrt(N))
# First, get the Z approximation
n <- length(differences[differences != 0]) # Exclude zero differences
z_value <- qnorm(p_value / 2) # Two-tailed conversion
effect_size_r <- abs(z_value) / sqrt(n)
cat("Effect size (r):", round(effect_size_r, 3), "\n")
# Interpretation guidelines
# r = 0.1: small effect
# r = 0.3: medium effect
# r = 0.5: large effect
interpret_effect <- function(r) {
if (r < 0.1) return("negligible")
if (r < 0.3) return("small")
if (r < 0.5) return("medium")
return("large")
}
cat("Effect interpretation:", interpret_effect(effect_size_r), "\n")
Complete Worked Example
Let’s analyze a realistic scenario: a pharmaceutical company testing whether a new anxiety medication reduces symptom scores. Twelve patients complete an anxiety inventory before treatment and again after 8 weeks.
# Patient anxiety scores (higher = more anxiety)
patient_data <- data.frame(
patient_id = 1:12,
pre_treatment = c(42, 58, 35, 67, 51, 73, 48, 39, 62, 55, 44, 69),
post_treatment = c(35, 49, 32, 54, 43, 61, 41, 35, 51, 48, 38, 58)
)
# Calculate differences
patient_data$difference <- patient_data$post_treatment - patient_data$pre_treatment
# Display the data
print(patient_data)
# Summary statistics
cat("\n--- Descriptive Statistics ---\n")
cat("Pre-treatment median:", median(patient_data$pre_treatment), "\n")
cat("Post-treatment median:", median(patient_data$post_treatment), "\n")
cat("Median difference:", median(patient_data$difference), "\n")
Visualize the paired data:
# Create visualization
par(mfrow = c(1, 2))
# Paired line plot
plot(rep(1, 12), patient_data$pre_treatment,
xlim = c(0.5, 2.5), ylim = c(25, 80),
xlab = "", ylab = "Anxiety Score",
main = "Individual Patient Trajectories",
xaxt = "n", pch = 16, col = "coral")
points(rep(2, 12), patient_data$post_treatment, pch = 16, col = "steelblue")
segments(rep(1, 12), patient_data$pre_treatment,
rep(2, 12), patient_data$post_treatment, col = "gray60")
axis(1, at = c(1, 2), labels = c("Pre", "Post"))
legend("topright", legend = c("Pre", "Post"),
pch = 16, col = c("coral", "steelblue"))
# Boxplot of differences
boxplot(patient_data$difference,
main = "Distribution of Differences",
ylab = "Post - Pre Score",
col = "lightgreen")
abline(h = 0, lty = 2, col = "red")
Check assumptions and run the test:
# Check normality of differences
shapiro_test <- shapiro.test(patient_data$difference)
cat("\n--- Normality Check ---\n")
cat("Shapiro-Wilk p-value:", round(shapiro_test$p.value, 4), "\n")
cat("Conclusion:", ifelse(shapiro_test$p.value < 0.05,
"Non-normal - use Wilcoxon",
"Approximately normal"), "\n")
# Perform Wilcoxon signed-rank test
# Hypothesis: post-treatment scores are LOWER than pre-treatment
wilcox_result <- wilcox.test(patient_data$post_treatment,
patient_data$pre_treatment,
paired = TRUE,
alternative = "less",
conf.int = TRUE,
exact = FALSE) # Use normal approximation
print(wilcox_result)
# Calculate effect size
n_pairs <- sum(patient_data$difference != 0)
z_score <- qnorm(wilcox_result$p.value)
effect_r <- abs(z_score) / sqrt(n_pairs)
cat("\n--- Effect Size ---\n")
cat("r =", round(effect_r, 3), "(", interpret_effect(effect_r), "effect)\n")
Common Pitfalls and Best Practices
Handling ties: When multiple differences share the same absolute value, R assigns average ranks by default. This is generally appropriate, but be aware it affects the exact p-value calculation:
# Data with ties
data_with_ties <- data.frame(
before = c(10, 15, 20, 25, 30, 35),
after = c(15, 20, 25, 30, 35, 40) # All differences = 5
)
# This will produce a warning about ties
result_ties <- wilcox.test(data_with_ties$after, data_with_ties$before,
paired = TRUE, exact = TRUE)
Handling zero differences: Pairs with identical values contribute no information and are excluded from analysis. With many zeros, consider whether your measurement is sensitive enough:
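A minimal sketch of the zero-difference behavior, using made-up numbers:

```r
# Two of six pairs show no change at all
before <- c(10, 12, 14, 16, 18, 20)
after  <- c(10, 13, 14, 18, 17, 24)

d <- after - before   # 0, 1, 0, 2, -1, 4
sum(d != 0)           # effective sample size is 4, not 6

# Zeros are dropped before ranking; R warns that the exact
# p-value cannot be computed with zeroes present
wilcox.test(after, before, paired = TRUE)
```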
# Control exact vs asymptotic calculation
# exact = TRUE: Exact p-value (fails with ties)
# exact = FALSE: Normal approximation
# correct = TRUE: Continuity correction (default)
result_controlled <- wilcox.test(after, before,
paired = TRUE,
exact = FALSE, # Use approximation
correct = TRUE) # Apply continuity correction
Reporting standards: For publication, include the test statistic, sample size, p-value, effect size, and confidence interval:
# Generate publication-ready summary
report_wilcoxon <- function(x, y, alpha = 0.05) {
result <- wilcox.test(x, y, paired = TRUE, conf.int = TRUE, exact = FALSE)
n <- sum((x - y) != 0)
z <- qnorm(result$p.value / 2)
r <- abs(z) / sqrt(n)
cat(sprintf(
"Wilcoxon signed-rank test: V = %.0f, n = %d, p %s %.3f, r = %.2f, 95%% CI [%.2f, %.2f]",
result$statistic, n,
ifelse(result$p.value < 0.001, "<", "="),
ifelse(result$p.value < 0.001, 0.001, result$p.value),
r, result$conf.int[1], result$conf.int[2]
))
}
report_wilcoxon(patient_data$post_treatment, patient_data$pre_treatment)
The Wilcoxon signed-rank test belongs in every analyst’s toolkit. It’s robust, well-understood, and appropriate for the messy real-world data that rarely conforms to textbook assumptions. Master its implementation in R, and you’ll have a reliable method for paired comparisons regardless of your data’s distributional quirks.