How to Perform the Mann-Whitney U Test in Python
Key Insights
- The Mann-Whitney U test compares two independent groups without assuming normal distribution, making it the go-to alternative when your data violates t-test assumptions or uses ordinal scales.
- Always report effect size (rank-biserial correlation) alongside the p-value—statistical significance tells you whether an effect exists, but effect size tells you whether it matters.
- SciPy’s mannwhitneyu() defaulted to a one-sided test in older versions, so explicitly set alternative='two-sided' to avoid incorrect conclusions.
Introduction to the Mann-Whitney U Test
The Mann-Whitney U test (also called the Wilcoxon rank-sum test) answers a straightforward question: do two independent groups differ in their central tendency? Unlike the independent samples t-test, it doesn’t assume your data follows a normal distribution. Instead, it works with ranks, making it robust to outliers and suitable for ordinal data.
Here’s the core idea: combine both groups, rank all observations from lowest to highest, then check whether one group’s ranks are systematically higher than the other’s. If treatment group participants consistently rank higher than control group participants, something meaningful is happening.
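To make that core idea concrete, here is a minimal sketch with two small made-up groups (the values are hypothetical), computing the ranks and a U statistic by hand:

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical toy data for two independent groups
control = np.array([12, 15, 11, 14])
treatment = np.array([18, 16, 20, 13])

# Combine both groups and rank lowest to highest (ties would get averaged ranks)
combined = np.concatenate([control, treatment])
ranks = rankdata(combined)
control_ranks = ranks[:len(control)]
treatment_ranks = ranks[len(control):]

# U for the treatment group: its rank sum minus the minimum possible rank sum
n_t = len(treatment)
u_treatment = treatment_ranks.sum() - n_t * (n_t + 1) / 2

print(f"Control rank sum: {control_ranks.sum()}")      # 12.0
print(f"Treatment rank sum: {treatment_ranks.sum()}")  # 24.0
print(f"U for treatment: {u_treatment}")               # 14.0
```

The treatment group's higher rank sum (24 vs. 12) is exactly the kind of systematic separation the test quantifies.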
Use the Mann-Whitney U test when:
- Your data isn’t normally distributed (and transformation doesn’t help)
- You’re working with ordinal data (Likert scales, rankings)
- Sample sizes are small and you can’t invoke the Central Limit Theorem
- You have significant outliers you don’t want to remove
The test trades some statistical power for flexibility. With truly normal data, the t-test will detect smaller effects. With heavily skewed or outlier-prone data, however, the Mann-Whitney U test can be substantially more powerful than the t-test.
Assumptions and Prerequisites
The Mann-Whitney U test has four key assumptions. Violating them doesn’t always invalidate your results, but you need to understand the implications.
1. Independent samples: Observations in one group can’t influence observations in the other. This is non-negotiable—violate it and your results are meaningless.
2. Ordinal or continuous data: The test requires data that can be meaningfully ranked. Nominal categories don’t work.
3. Similar distribution shapes: This assumption is often misunderstood. The test doesn’t require identical distributions, but if the shapes differ dramatically, you’re testing for any distributional difference, not just location shift.
4. Independence within groups: Each observation must be independent of others within its own group. No repeated measures, no clustering.
Before running the test, check your data distribution:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Sample data: response times in milliseconds
group_a = np.array([245, 312, 278, 356, 289, 401, 267, 334, 298, 445])
group_b = np.array([189, 234, 201, 267, 223, 298, 178, 245, 212, 289])
# Visual inspection with histograms
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(group_a, bins=6, edgecolor='black', alpha=0.7)
axes[0].set_title('Group A Distribution')
axes[0].set_xlabel('Response Time (ms)')
axes[1].hist(group_b, bins=6, edgecolor='black', alpha=0.7)
axes[1].set_title('Group B Distribution')
axes[1].set_xlabel('Response Time (ms)')
plt.tight_layout()
plt.show()
# Shapiro-Wilk test for normality
stat_a, p_a = stats.shapiro(group_a)
stat_b, p_b = stats.shapiro(group_b)
print(f"Group A: Shapiro-Wilk statistic={stat_a:.4f}, p={p_a:.4f}")
print(f"Group B: Shapiro-Wilk statistic={stat_b:.4f}, p={p_b:.4f}")
# If p < 0.05, reject normality assumption
A non-significant Shapiro-Wilk test doesn’t prove normality—it just fails to disprove it. With small samples, the test has low power. Visual inspection often tells you more than the p-value.
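For that visual inspection, Q-Q plots are often more informative than histograms. Here is a sketch using the same sample arrays as above; probplot also returns the fitted line's correlation coefficient, a rough numeric summary of how straight the points fall:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

group_a = np.array([245, 312, 278, 356, 289, 401, 267, 334, 298, 445])
group_b = np.array([189, 234, 201, 267, 223, 298, 178, 245, 212, 289])

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# probplot draws the Q-Q plot and returns (slope, intercept, r) for the fit line
_, (slope_a, intercept_a, r_a) = stats.probplot(group_a, dist="norm", plot=axes[0])
axes[0].set_title("Group A Q-Q Plot")
_, (slope_b, intercept_b, r_b) = stats.probplot(group_b, dist="norm", plot=axes[1])
axes[1].set_title("Group B Q-Q Plot")
plt.tight_layout()
plt.show()

# Closer to 1 means the points track the reference line more closely
print(f"Fit correlation: A={r_a:.3f}, B={r_b:.3f}")
```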
Setting Up Your Environment
You need three libraries for a complete Mann-Whitney U analysis:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
# Optional: for more detailed statistical output
# pip install pingouin
import pingouin as pg
Let’s create a realistic dataset to work with throughout this article:
# Simulating a drug trial: reaction times for treatment vs. placebo
np.random.seed(42)
# Treatment group: generally faster reactions (skewed distribution)
treatment = np.random.exponential(scale=250, size=35) + 150
# Placebo group: slower reactions (also skewed)
placebo = np.random.exponential(scale=300, size=30) + 180
# Create a DataFrame for easier manipulation
df = pd.DataFrame({
    'reaction_time': np.concatenate([treatment, placebo]),
    'group': ['treatment'] * len(treatment) + ['placebo'] * len(placebo)
})
print(df.groupby('group')['reaction_time'].describe())
Performing the Test with SciPy
The scipy.stats.mannwhitneyu() function is your primary tool. Here’s how to use it correctly:
from scipy.stats import mannwhitneyu
# Extract groups
treatment_times = df[df['group'] == 'treatment']['reaction_time']
placebo_times = df[df['group'] == 'placebo']['reaction_time']
# Perform the test
# IMPORTANT: Always specify 'alternative' explicitly
statistic, p_value = mannwhitneyu(
    treatment_times,
    placebo_times,
    alternative='two-sided',  # or 'less', 'greater'
    method='auto'             # 'exact', 'asymptotic', or 'auto'
)
print(f"U statistic: {statistic:.2f}")
print(f"P-value: {p_value:.4f}")
Understanding the parameters:
- alternative: Specifies your hypothesis. Use 'two-sided' unless you have a strong directional prediction made before seeing the data.
- method: Controls the p-value calculation. 'exact' computes the exact distribution (slow for large samples), 'asymptotic' uses a normal approximation, and 'auto' chooses based on sample size and the presence of ties.
Interpreting the U statistic:
The U statistic represents the number of times an observation from group 1 ranks higher than an observation from group 2 (ties count as one-half). The maximum possible U equals n₁ × n₂ (where n₁ and n₂ are the sample sizes). A U near this maximum or near zero suggests strong group separation; a U near half the maximum suggests heavy overlap.
# Calculate maximum possible U for context
n1, n2 = len(treatment_times), len(placebo_times)
max_u = n1 * n2
print(f"Sample sizes: n1={n1}, n2={n2}")
print(f"Maximum possible U: {max_u}")
print(f"Observed U: {statistic:.2f}")
print(f"U as proportion of maximum: {statistic/max_u:.2%}")
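A useful sanity check on this definition: the U from mannwhitneyu(x, y) and the U from mannwhitneyu(y, x) always sum to n₁ × n₂, because the two statistics are complementary. A small sketch with simulated data:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=20)
y = rng.normal(0.5, 1.0, size=25)

u_xy, _ = mannwhitneyu(x, y, alternative='two-sided')
u_yx, _ = mannwhitneyu(y, x, alternative='two-sided')

# Complementary statistics: U1 + U2 = n1 * n2
print(u_xy + u_yx)  # 500.0 for n1=20, n2=25
```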
Calculating Effect Size
P-values tell you whether an effect exists; effect size tells you whether it matters. For the Mann-Whitney U test, the rank-biserial correlation (r) is the standard effect size measure.
The rank-biserial correlation ranges from -1 to +1:
- 0: No difference between groups
- ±0.1: Small effect
- ±0.3: Medium effect
- ±0.5: Large effect
def rank_biserial_correlation(u_statistic, n1, n2):
    """
    Calculate rank-biserial correlation from the U statistic.
    Formula: r = 1 - (2U)/(n1*n2)
    or equivalently: r = (2U)/(n1*n2) - 1 (depending on which U is used)
    """
    # SciPy returns U for the first group, so with this formula a positive r
    # means the first group tends to have LOWER values than the second
    r = 1 - (2 * u_statistic) / (n1 * n2)
    return r
# Calculate effect size
r = rank_biserial_correlation(statistic, n1, n2)
print(f"Rank-biserial correlation: {r:.3f}")
# Interpret the effect size
if abs(r) < 0.1:
    interpretation = "negligible"
elif abs(r) < 0.3:
    interpretation = "small"
elif abs(r) < 0.5:
    interpretation = "medium"
else:
    interpretation = "large"
print(f"Effect size interpretation: {interpretation}")
Alternatively, use the pingouin library for automatic effect size calculation:
import pingouin as pg
# pingouin provides a more complete output
results = pg.mwu(treatment_times, placebo_times, alternative='two-sided')
print(results)
Complete Practical Example
Let’s walk through a complete analysis. Scenario: you’re comparing user task completion times between a new interface design and the existing design.
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
# Set random seed for reproducibility
np.random.seed(123)
# Simulate task completion times (seconds)
# New design: hypothesized to be faster
new_design = np.array([
    45, 52, 38, 67, 41, 55, 48, 62, 39, 71,
    44, 58, 36, 49, 53, 42, 61, 47, 56, 40,
    51, 43, 59, 37, 54
])
# Old design: baseline
old_design = np.array([
    58, 72, 61, 85, 67, 79, 54, 91, 63, 77,
    69, 82, 56, 74, 88, 65, 71, 59, 83, 68,
    76, 64, 80, 57, 73, 66, 78, 62
])
# Step 1: Visualize the distributions
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
# Histograms
axes[0].hist(new_design, bins=8, alpha=0.7, label='New Design', color='steelblue')
axes[0].hist(old_design, bins=8, alpha=0.7, label='Old Design', color='coral')
axes[0].set_xlabel('Completion Time (seconds)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution Comparison')
axes[0].legend()
# Box plots
axes[1].boxplot([new_design, old_design], labels=['New', 'Old'])
axes[1].set_ylabel('Completion Time (seconds)')
axes[1].set_title('Box Plot Comparison')
# Strip plot with jitter
for i, (data, label) in enumerate([(new_design, 'New'), (old_design, 'Old')]):
    x = np.random.normal(i + 1, 0.04, size=len(data))
    axes[2].scatter(x, data, alpha=0.6, label=label)
axes[2].set_xticks([1, 2])
axes[2].set_xticklabels(['New', 'Old'])
axes[2].set_ylabel('Completion Time (seconds)')
axes[2].set_title('Individual Data Points')
plt.tight_layout()
plt.show()
# Step 2: Check normality
print("=" * 50)
print("NORMALITY ASSESSMENT")
print("=" * 50)
_, p_new = stats.shapiro(new_design)
_, p_old = stats.shapiro(old_design)
print(f"New Design Shapiro-Wilk p-value: {p_new:.4f}")
print(f"Old Design Shapiro-Wilk p-value: {p_old:.4f}")
# Step 3: Descriptive statistics
print("\n" + "=" * 50)
print("DESCRIPTIVE STATISTICS")
print("=" * 50)
print(f"New Design: Median={np.median(new_design):.1f}, "
      f"IQR={np.percentile(new_design, 75) - np.percentile(new_design, 25):.1f}")
print(f"Old Design: Median={np.median(old_design):.1f}, "
      f"IQR={np.percentile(old_design, 75) - np.percentile(old_design, 25):.1f}")
# Step 4: Perform Mann-Whitney U test
print("\n" + "=" * 50)
print("MANN-WHITNEY U TEST RESULTS")
print("=" * 50)
u_stat, p_value = stats.mannwhitneyu(
    new_design,
    old_design,
    alternative='two-sided'
)
n1, n2 = len(new_design), len(old_design)
r = 1 - (2 * u_stat) / (n1 * n2)
print(f"U statistic: {u_stat:.2f}")
print(f"P-value: {p_value:.6f}")
print(f"Rank-biserial correlation (r): {r:.3f}")
# Step 5: Interpret and report
print("\n" + "=" * 50)
print("INTERPRETATION")
print("=" * 50)
alpha = 0.05
if p_value < alpha:
    print(f"Result: Statistically significant (p < {alpha})")
    print("The two designs differ significantly in task completion time.")
else:
    print(f"Result: Not statistically significant (p >= {alpha})")
    print("No significant difference detected between designs.")
print(f"\nEffect size: {abs(r):.3f} ({'large' if abs(r) >= 0.5 else 'medium' if abs(r) >= 0.3 else 'small'})")
# APA-style reporting
print("\n" + "=" * 50)
print("APA-STYLE REPORT")
print("=" * 50)
p_str = "p < .001" if p_value < 0.001 else f"p = {p_value:.3f}"
print(f"A Mann-Whitney U test indicated that task completion times were "
      f"significantly {'lower' if np.median(new_design) < np.median(old_design) else 'higher'} "
      f"for the new design (Mdn = {np.median(new_design):.1f}) than the old design "
      f"(Mdn = {np.median(old_design):.1f}), U = {u_stat:.2f}, {p_str}, r = {r:.2f}.")
Common Pitfalls and Best Practices
Handling tied values: When observations have identical values, they receive averaged ranks. SciPy handles this automatically, but extensive ties reduce the test’s discriminating power. If more than 10-15% of your data consists of ties, consider whether your measurement precision is adequate.
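As a quick diagnostic, you can measure what fraction of your pooled observations are tied. The helper below is a sketch (tie_proportion is not a library function):

```python
import numpy as np

def tie_proportion(*groups):
    """Fraction of pooled observations whose value occurs more than once."""
    combined = np.concatenate(groups)
    _, counts = np.unique(combined, return_counts=True)
    tied = counts[counts > 1].sum()
    return tied / combined.size

# Hypothetical example: 5 of the 9 pooled values share a value with another
a = np.array([1, 2, 2, 3, 4])
b = np.array([2, 3, 5, 6])
print(f"{tie_proportion(a, b):.1%}")  # 55.6%
```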
One-tailed vs. two-tailed tests: Only use one-tailed tests when you have a directional hypothesis specified before data collection. Post-hoc switching to one-tailed testing to achieve significance is p-hacking.
Sample size considerations: The Mann-Whitney U test works with small samples, but power suffers. With fewer than 20 observations per group, only large effects will be detectable. Use power analysis to determine required sample sizes:
# Rough power calculation for Mann-Whitney U
# For 80% power with medium effect (r=0.3), you need approximately:
# n per group ≈ 67 for two-tailed test at alpha=0.05
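One way to arrive at that rough number is to size a two-sample t-test and inflate it by the Mann-Whitney test's asymptotic relative efficiency under normality (3/π ≈ 0.955). This is a back-of-the-envelope sketch, not an exact power analysis, and it uses Cohen's d = 0.5 as the medium-effect benchmark:

```python
import math
from scipy.stats import norm

def mwu_sample_size(d, alpha=0.05, power=0.80):
    """Approximate n per group for Mann-Whitney via the t-test formula and ARE."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-tailed critical value
    z_beta = norm.ppf(power)
    n_ttest = 2 * (z_alpha + z_beta) ** 2 / d ** 2
    # Inflate by the asymptotic relative efficiency, 3/pi ~ 0.955
    return math.ceil(n_ttest / (3 / math.pi))

# Medium effect lands in the mid-60s per group, consistent with the ~67 above
print(mwu_sample_size(0.5))
```

For serious planning, a simulation-based power analysis tailored to your expected distributions is more trustworthy than this approximation.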
Reporting standards: Always report:
- Sample sizes for both groups
- Medians (not means) and IQR or range
- U statistic
- Exact p-value (or p < .001 if very small)
- Effect size with interpretation
Don’t confuse statistical and practical significance: A p-value of 0.001 with r = 0.08 means you’ve reliably detected a trivially small effect. The effect size matters more for decision-making than the p-value.