Mann-Whitney U Test in R: Step-by-Step Guide
Key Insights
- The Mann-Whitney U test compares two independent groups without assuming normal distributions, making it your go-to choice when A/B test data is skewed or contains outliers.
- Always report effect size (rank-biserial correlation) alongside p-values—statistical significance tells you whether an effect exists, but effect size tells you whether it matters.
- R’s wilcox.test() function handles the heavy lifting, but you need to understand the exact vs. asymptotic calculation trade-offs to get accurate p-values for your sample size.
Introduction to the Mann-Whitney U Test
The Mann-Whitney U test (also called the Wilcoxon rank-sum test) answers a simple question: do two independent groups differ in their central tendency? It’s the non-parametric cousin of the independent samples t-test, but instead of comparing means, it compares the ranks of observations.
Here’s when you should reach for it:
- Your data violates normality assumptions (common with response times, revenue, or engagement metrics)
- You have ordinal data (satisfaction ratings, Likert scales)
- Your sample size is small and you can’t rely on the Central Limit Theorem
- You have significant outliers you don’t want to remove
In practice, this test appears constantly in A/B testing. Response times are almost never normally distributed—they’re right-skewed with a long tail of slow responses. Conversion values follow similar patterns. The Mann-Whitney U test handles these gracefully.
Assumptions and Prerequisites
The Mann-Whitney U test has fewer assumptions than the t-test, but it still has some:
- Independence: Observations must be independent within and between groups
- Ordinal or continuous data: The outcome variable must be at least ordinal
- Similar distribution shapes: Both groups should have similarly shaped distributions (though not necessarily normal)
That third assumption often gets overlooked. If one group is heavily skewed and the other is symmetric, the test becomes harder to interpret. It’s no longer purely a test of central tendency.
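One informal way to check the shape assumption is to compare the skewness of the two groups. A minimal sketch, with a hand-rolled `skewness` helper (to avoid extra packages) and illustrative simulated groups rather than the analysis data:

```r
# Quick shape check: similar skewness in both groups supports
# interpreting Mann-Whitney as a test of central tendency
skewness <- function(x) {
  m <- mean(x)
  s <- sd(x)
  mean(((x - m) / s)^3)  # sample skewness (simple moment estimator)
}

set.seed(1)
group_a <- rlnorm(50, meanlog = 5, sdlog = 0.8)
group_b <- rlnorm(50, meanlog = 5.2, sdlog = 0.8)

skewness(group_a)
skewness(group_b)
# Both values strongly positive and similar in magnitude:
# the groups share a right-skewed shape
```

If one value were near zero and the other strongly positive, you would interpret the test as comparing distributions generally, not centers specifically.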
Let’s confirm that Mann-Whitney is the right choice over a t-test by checking normality:
# Generate some realistic response time data (typically right-skewed)
set.seed(42)
version_a <- rlnorm(50, meanlog = 5, sdlog = 0.8) # Log-normal distribution
version_b <- rlnorm(50, meanlog = 5.2, sdlog = 0.8)
# Test normality with Shapiro-Wilk
shapiro.test(version_a)
# W = 0.8234, p-value = 2.1e-06
shapiro.test(version_b)
# W = 0.8456, p-value = 8.3e-06
# Both p-values < 0.05, so we reject normality
# Mann-Whitney U test is appropriate here
When Shapiro-Wilk returns p < 0.05, your data significantly deviates from normality. That’s your signal to use Mann-Whitney.
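Shapiro-Wilk is worth pairing with a visual check; a Q-Q plot makes the skew obvious at a glance. A short sketch, re-simulating `version_a` as above so it runs standalone:

```r
# Visual normality check to accompany Shapiro-Wilk
set.seed(42)
version_a <- rlnorm(50, meanlog = 5, sdlog = 0.8)  # same simulation as above

qqnorm(version_a, main = "Q-Q Plot: Version A Response Times")
qqline(version_a, col = "red")
# Points bowing away from the reference line in the upper tail
# indicate right skew
```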
Preparing Your Data in R
R expects your data in one of two formats: separate vectors for each group, or a single data frame with a grouping variable. The data frame approach is cleaner and scales better.
# Create a realistic A/B test dataset
set.seed(123)
ab_test_data <- data.frame(
user_id = 1:100,
version = rep(c("control", "treatment"), each = 50),
response_time_ms = c(
rlnorm(50, meanlog = 6.5, sdlog = 0.7), # Control group
rlnorm(50, meanlog = 6.3, sdlog = 0.7) # Treatment (faster)
)
)
# Check the structure
str(ab_test_data)
# 'data.frame': 100 obs. of 3 variables:
# $ user_id : int 1 2 3 4 5 ...
# $ version : chr "control" "control" ...
# $ response_time_ms: num 892 445 1205 ...
# Ensure grouping variable is a factor with correct reference level
ab_test_data$version <- factor(ab_test_data$version,
levels = c("control", "treatment"))
# Quick summary by group
aggregate(response_time_ms ~ version, data = ab_test_data,
FUN = function(x) c(median = median(x),
mean = mean(x),
sd = sd(x)))
Notice I’m using the median in the summary. For non-parametric tests, the median is your measure of central tendency, not the mean.
Running the Test with wilcox.test()
R’s wilcox.test() function handles the Mann-Whitney U test. There are two syntax options:
# Method 1: Formula notation (preferred)
result <- wilcox.test(response_time_ms ~ version, data = ab_test_data)
print(result)
# Wilcoxon rank sum test with continuity correction
#
# data: response_time_ms by version
# W = 1587, p-value = 0.0312
# alternative hypothesis: true location shift is not equal to 0
# Method 2: Separate vectors
control <- ab_test_data$response_time_ms[ab_test_data$version == "control"]
treatment <- ab_test_data$response_time_ms[ab_test_data$version == "treatment"]
wilcox.test(control, treatment)
Key parameters you should know:
# Exact p-value (better for small samples, slower for large)
wilcox.test(response_time_ms ~ version, data = ab_test_data, exact = TRUE)
# Without continuity correction (matches some other software)
wilcox.test(response_time_ms ~ version, data = ab_test_data, correct = FALSE)
# One-sided test (treatment is less than control)
wilcox.test(response_time_ms ~ version, data = ab_test_data,
alternative = "less")
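One caveat with `exact = TRUE`: R cannot compute an exact p-value when the data contain ties, and it silently falls back to the normal approximation with a warning. Continuous measurements rarely tie, but rounding creates ties quickly. A sketch with simulated data (coarse rounding to the nearest 100 ms guarantees ties among 100 observations):

```r
# Ties example: coarse rounding of continuous data creates tied ranks
set.seed(123)
times <- rlnorm(100, meanlog = 6.5, sdlog = 0.7)
grp   <- rep(c("control", "treatment"), each = 50)

# Round to the nearest 100 ms -- many observations now share a value
rounded <- round(times, -2)
anyDuplicated(rounded) > 0  # ties are present

# exact = TRUE now warns ("cannot compute exact p-value with ties")
# and R uses the normal approximation instead
wilcox.test(rounded ~ grp, exact = TRUE)
```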
Despite the name, R’s W statistic (1587 in our example) is the Mann-Whitney U computed for the first group: its rank sum minus n1(n1 + 1)/2, not the raw rank sum itself. The p-value (0.0312) tells us the groups differ significantly at α = 0.05.
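You can verify the relationship between R’s W and the rank-based U directly. A quick check, with data simulated here so the snippet runs standalone:

```r
# Confirm: W from wilcox.test() equals the first group's rank sum
# minus n1*(n1+1)/2
set.seed(123)
x <- rlnorm(50, meanlog = 6.5, sdlog = 0.7)  # first group
y <- rlnorm(50, meanlog = 6.3, sdlog = 0.7)  # second group

W <- wilcox.test(x, y)$statistic

ranks    <- rank(c(x, y))            # rank the pooled sample
U_manual <- sum(ranks[1:50]) - 50 * 51 / 2

unname(W) == U_manual  # TRUE
```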
Calculating and Reporting Effect Size
A p-value of 0.0312 tells you the difference is unlikely due to chance. It doesn’t tell you if the difference matters. For that, you need effect size.
The standard effect size for Mann-Whitney U is the rank-biserial correlation (r), which ranges from -1 to 1:
# Manual calculation of rank-biserial correlation
n1 <- sum(ab_test_data$version == "control")
n2 <- sum(ab_test_data$version == "treatment")
U <- unname(result$statistic)  # drop the "W" name for clean printing
# r = 1 - (2U)/(n1*n2)
r_manual <- 1 - (2 * U) / (n1 * n2)
print(r_manual)
# 0.2652
# Using rstatix for cleaner output (note: wilcox_effsize() computes
# r = Z / sqrt(N), which is close to but not identical to rank-biserial r)
library(rstatix)
effect_size <- wilcox_effsize(ab_test_data, response_time_ms ~ version)
print(effect_size)
# .y. group1 group2 effsize n1 n2 magnitude
# response_time_ms control treatment 0.265 50 50 small
Interpret effect sizes using these guidelines:
- |r| < 0.1: negligible
- 0.1 ≤ |r| < 0.3: small
- 0.3 ≤ |r| < 0.5: medium
- |r| ≥ 0.5: large
Our effect size of 0.265 is small—statistically significant but practically modest.
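A useful companion to r is the common-language effect size (probability of superiority), U/(n1·n2): the probability that a randomly chosen observation from the first group exceeds one from the second. A minimal sketch with simulated data:

```r
# Common-language effect size: P(random obs from x > random obs from y)
set.seed(123)
x <- rlnorm(50, meanlog = 6.5, sdlog = 0.7)
y <- rlnorm(50, meanlog = 6.3, sdlog = 0.7)

U <- unname(wilcox.test(x, y)$statistic)

cles <- U / (50 * 50)  # proportion of (x, y) pairs where x > y
cles
# 0.5 means no difference; the further from 0.5, the stronger the effect
```

This framing is often easier to explain to stakeholders than a correlation: “in about X% of head-to-head comparisons, the control was slower.”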
Visualizing the Results
Box plots and violin plots work well for non-parametric comparisons. Here’s how to create publication-ready visualizations:
library(ggplot2)
library(ggpubr)
# Basic box plot with jittered points
p1 <- ggplot(ab_test_data, aes(x = version, y = response_time_ms, fill = version)) +
geom_boxplot(alpha = 0.7, outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.4, size = 1.5) +
scale_fill_manual(values = c("#E69F00", "#56B4E9")) +
labs(
x = "App Version",
y = "Response Time (ms)",
title = "Response Time by App Version"
) +
theme_minimal() +
theme(legend.position = "none")
# Add statistical annotation
p2 <- p1 +
stat_compare_means(method = "wilcox.test",
label = "p.format",
label.x = 1.5,
label.y = max(ab_test_data$response_time_ms) * 1.1)
print(p2)
# Violin plot alternative (shows distribution shape)
ggplot(ab_test_data, aes(x = version, y = response_time_ms, fill = version)) +
geom_violin(alpha = 0.7, trim = FALSE) +
geom_boxplot(width = 0.15, fill = "white", alpha = 0.8) +
scale_fill_manual(values = c("#E69F00", "#56B4E9")) +
stat_compare_means(method = "wilcox.test", label.y = 2500) +
labs(x = "App Version", y = "Response Time (ms)") +
theme_minimal() +
theme(legend.position = "none")
Complete Worked Example
Let’s put everything together with a realistic scenario: comparing page load times between two CDN configurations.
# ============================================
# Complete Mann-Whitney U Test Analysis
# Scenario: Comparing CDN performance
# ============================================
library(ggplot2)
library(ggpubr)
library(rstatix)
library(dplyr)   # for %>%, group_by(), summarise()
# 1. Load and prepare data
set.seed(2024)
cdn_data <- data.frame(
request_id = 1:200,
cdn = factor(rep(c("current", "new_provider"), each = 100),
levels = c("current", "new_provider")),
load_time_ms = c(
rlnorm(100, meanlog = 5.8, sdlog = 0.6), # Current CDN
rlnorm(100, meanlog = 5.5, sdlog = 0.55) # New provider (faster)
)
)
# 2. Exploratory analysis
summary_stats <- cdn_data %>%
group_by(cdn) %>%
summarise(
n = n(),
median = median(load_time_ms),
iqr = IQR(load_time_ms),
mean = mean(load_time_ms),
sd = sd(load_time_ms)
)
print(summary_stats)
# 3. Check normality assumption
normality_check <- cdn_data %>%
group_by(cdn) %>%
summarise(
shapiro_stat = shapiro.test(load_time_ms)$statistic,
shapiro_p = shapiro.test(load_time_ms)$p.value
)
print(normality_check)
# Both p < 0.05, confirming non-normality
# 4. Run Mann-Whitney U test
mw_result <- wilcox.test(load_time_ms ~ cdn, data = cdn_data,
exact = FALSE, correct = TRUE)
print(mw_result)
# 5. Calculate effect size
effect <- wilcox_effsize(cdn_data, load_time_ms ~ cdn)
print(effect)
# 6. Create visualization
final_plot <- ggplot(cdn_data, aes(x = cdn, y = load_time_ms, fill = cdn)) +
geom_violin(alpha = 0.6, trim = FALSE) +
geom_boxplot(width = 0.15, fill = "white", alpha = 0.9) +
scale_fill_manual(values = c("#D55E00", "#009E73"),
labels = c("Current CDN", "New Provider")) +
scale_x_discrete(labels = c("Current CDN", "New Provider")) +
annotate("text", x = 1.5, y = max(cdn_data$load_time_ms) * 0.95,
label = sprintf("Mann-Whitney U: p = %.4f\nEffect size (r) = %.3f (%s)",
mw_result$p.value, effect$effsize, effect$magnitude),
size = 3.5) +
labs(
x = NULL,
y = "Page Load Time (ms)",
title = "CDN Performance Comparison",
subtitle = "New provider shows significantly faster load times"
) +
theme_minimal(base_size = 12) +
theme(legend.position = "none")
print(final_plot)
# 7. Report results
cat("\n=== RESULTS SUMMARY ===\n")
cat(sprintf("Current CDN: Median = %.1f ms (IQR: %.1f)\n",
summary_stats$median[1], summary_stats$iqr[1]))
cat(sprintf("New Provider: Median = %.1f ms (IQR: %.1f)\n",
summary_stats$median[2], summary_stats$iqr[2]))
cat(sprintf("Mann-Whitney U = %.0f, p = %.4f\n",
mw_result$statistic, mw_result$p.value))
cat(sprintf("Effect size (r) = %.3f (%s)\n",
effect$effsize, effect$magnitude))
The output gives you everything needed for a technical report: descriptive statistics with medians and IQRs (appropriate for non-parametric data), the test statistic and p-value, effect size with interpretation, and a publication-ready visualization.
When reporting these results, write something like: “Page load times were significantly lower with the new CDN provider (Mdn = 245.3 ms) compared to the current CDN (Mdn = 312.8 ms), U = 3847, p < .001, r = 0.42 (medium effect).”
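If you report results like this often, a small formatting helper keeps the wording consistent. A sketch (the `report_mw` function name is my own, and the inputs mirror the objects produced earlier):

```r
# Hypothetical helper: assemble a report sentence from test results
report_mw <- function(mw, med1, med2, r, magnitude) {
  sprintf(
    "Group medians were %.1f ms vs %.1f ms, U = %.0f, p = %.3f, r = %.2f (%s effect).",
    med1, med2, unname(mw$statistic), mw$p.value, r, magnitude
  )
}

set.seed(123)
x <- rlnorm(50, meanlog = 6.5, sdlog = 0.7)
y <- rlnorm(50, meanlog = 6.3, sdlog = 0.7)
mw <- wilcox.test(x, y)

cat(report_mw(mw, median(x), median(y), 0.27, "small"), "\n")
```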
That’s the Mann-Whitney U test from start to finish. Use it whenever your data doesn’t meet t-test assumptions, report effect sizes alongside p-values, and let the visualization tell the story your statistics support.