How to Use scipy.stats.mannwhitneyu in Python
Key Insights
- The Mann-Whitney U test compares two independent groups without assuming normal distribution, making it ideal for skewed data, ordinal scales, or small samples where normality is questionable.
- The `alternative` parameter fundamentally changes your hypothesis: choose `'two-sided'` to detect any difference, or `'less'`/`'greater'` when you have a directional prediction made before seeing the data.
- Always pair your p-value with an effect size measure like the rank-biserial correlation; statistical significance without practical significance leads to poor decisions.
Introduction to the Mann-Whitney U Test
The Mann-Whitney U test (also called the Wilcoxon rank-sum test) answers a simple question: do two independent groups tend to have different values? Unlike the independent samples t-test, it doesn’t assume your data follows a normal distribution. Instead, it works with ranks, making it robust to outliers and applicable to ordinal data.
Use this test when:
- Your data is ordinal (like Likert scales)
- Sample sizes are small and normality is suspect
- Your continuous data has significant skew or outliers
- You want to compare medians (or, more precisely, whether one group tends to have higher values) rather than means
The test works by pooling all observations, ranking them, and checking whether one group’s ranks are systematically higher than the other’s. The U-statistic represents the number of times a value from one group exceeds a value from the other group.
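That pairwise-counting definition can be sketched directly in a few lines (made-up numbers; ties count as half a win):

```python
import numpy as np
from scipy import stats

# Two small made-up samples
a = np.array([3, 7, 9])
b = np.array([1, 4, 5, 8])

# U for sample a: the number of pairs (a_i, b_j) where a_i beats b_j,
# counting any ties as half a win
u_a = sum((a_i > b_j) + 0.5 * (a_i == b_j) for a_i in a for b_j in b)
print(u_a)  # 8.0 -- a beats b in 8 of the 3 * 4 = 12 possible pairs

# scipy reports the same value as the U statistic for the first sample
assert u_a == stats.mannwhitneyu(a, b).statistic
```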
Understanding the Function Signature
Here’s the complete function signature:
```python
scipy.stats.mannwhitneyu(
    x,
    y,
    use_continuity=True,
    alternative='two-sided',
    axis=0,
    method='auto',
    nan_policy='propagate'
)
```
Let’s break down each parameter:
| Parameter | Default | Purpose |
|---|---|---|
| `x`, `y` | Required | The two sample arrays to compare |
| `use_continuity` | `True` | Applies a continuity correction in the normal approximation |
| `alternative` | `'two-sided'` | Direction of the hypothesis test |
| `axis` | `0` | Axis along which to compute (for multi-dimensional arrays) |
| `method` | `'auto'` | How to calculate the p-value |
| `nan_policy` | `'propagate'` | How to handle NaN values |
The function returns a `MannwhitneyuResult` object containing the U-statistic and p-value.

```python
from scipy import stats

# Basic call with defaults
result = stats.mannwhitneyu(group_a, group_b)

# Explicit parameters for production code
result = stats.mannwhitneyu(
    group_a,
    group_b,
    alternative='two-sided',
    method='auto',
    nan_policy='raise'
)
```
Basic Usage Example
Let’s compare test scores between two teaching methods:
```python
import numpy as np
from scipy import stats

# Test scores from two different teaching methods
method_a = np.array([72, 78, 65, 81, 74, 69, 77, 83, 70, 76])
method_b = np.array([85, 79, 88, 91, 82, 86, 90, 84, 87, 89])

# Run the Mann-Whitney U test
result = stats.mannwhitneyu(method_a, method_b, alternative='two-sided')

print(f"U-statistic: {result.statistic}")
print(f"P-value: {result.pvalue:.6f}")
```
Output:

```
U-statistic: 3.0
P-value: 0.000439
```

Interpreting these results:

The U-statistic of 3.0 is the number of times a score from Method A exceeded a score from Method B. With 10 observations in each group, the maximum possible U is 100 (10 × 10), so a U of 3 means Method A scores almost never beat Method B scores.

The p-value of roughly 0.0004 is well below the conventional α = 0.05 threshold. We reject the null hypothesis and conclude that the two teaching methods produce significantly different score distributions; specifically, Method B appears to produce higher scores.
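One way to sanity-check this reading of the statistic: every pair is won by one sample or the other (absent ties), so swapping the argument order gives the complementary U, and the two always sum to n1 × n2. A quick sketch with made-up numbers:

```python
import numpy as np
from scipy import stats

x = np.array([10, 12, 15, 19])
y = np.array([11, 13, 14, 16, 20])

u_x = stats.mannwhitneyu(x, y).statistic  # pairwise wins for x
u_y = stats.mannwhitneyu(y, x).statistic  # pairwise wins for y

# Together the two statistics cover all 4 * 5 = 20 pairs
print(u_x + u_y == len(x) * len(y))  # True
```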
```python
# Add context with descriptive statistics
print(f"\nMethod A - Median: {np.median(method_a)}, Mean: {np.mean(method_a):.1f}")
print(f"Method B - Median: {np.median(method_b)}, Mean: {np.mean(method_b):.1f}")
```

Output:

```
Method A - Median: 75.0, Mean: 74.5
Method B - Median: 86.5, Mean: 86.1
```
Choosing the Alternative Hypothesis
The alternative parameter controls your hypothesis direction. This choice should be made before looking at your data, based on your research question.
```python
import numpy as np
from scipy import stats

# Simulated data: new drug vs placebo (higher = better)
placebo = np.array([3, 5, 4, 6, 5, 4, 3, 5, 4, 5])
new_drug = np.array([6, 7, 5, 8, 7, 6, 7, 8, 6, 7])

# Two-sided: "Is there any difference?"
result_two = stats.mannwhitneyu(new_drug, placebo, alternative='two-sided')
print(f"Two-sided p-value: {result_two.pvalue:.4f}")

# Greater: "Is new_drug greater than placebo?"
result_greater = stats.mannwhitneyu(new_drug, placebo, alternative='greater')
print(f"One-sided (greater) p-value: {result_greater.pvalue:.4f}")

# Less: "Is new_drug less than placebo?"
result_less = stats.mannwhitneyu(new_drug, placebo, alternative='less')
print(f"One-sided (less) p-value: {result_less.pvalue:.4f}")
```
Output:

```
Two-sided p-value: 0.0003
One-sided (greater) p-value: 0.0002
One-sided (less) p-value: 0.9999
```
When to use each:
- `'two-sided'`: Default choice. Use when you want to detect any difference, regardless of direction. Most appropriate for exploratory analysis.
- `'greater'`: Use when you hypothesize that the first group (`x`) has larger values. The p-value tests whether `x` stochastically dominates `y`.
- `'less'`: Use when you hypothesize that the first group has smaller values.
One-sided tests have more statistical power when your directional hypothesis is correct, but they cannot detect effects in the opposite direction. Don’t switch to a one-sided test after seeing your data—that’s p-hacking.
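Note that `'greater'` and `'less'` refer to the first argument, so swapping the argument order flips the hypothesis. A small sketch with made-up numbers:

```python
import numpy as np
from scipy import stats

x = np.array([5, 7, 8, 9])
y = np.array([1, 2, 3, 6])

# "x tends to be greater than y" and "y tends to be less than x"
# are the same hypothesis, so the two calls give the same p-value
p1 = stats.mannwhitneyu(x, y, alternative='greater').pvalue
p2 = stats.mannwhitneyu(y, x, alternative='less').pvalue
print(np.isclose(p1, p2))  # True
```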
Handling Real-World Data Considerations
Real data is messy. Here’s how to handle common issues:
```python
import numpy as np
from scipy import stats

# Data with NaN values and ties
group_x = np.array([4, 5, 5, np.nan, 6, 5, 7, 8, 5, 6])
group_y = np.array([3, 4, 4, 5, 4, np.nan, 5, 4, 6, 5])

# Default behavior: NaN propagates to the result
result_propagate = stats.mannwhitneyu(group_x, group_y, nan_policy='propagate')
print(f"With propagate: U={result_propagate.statistic}, p={result_propagate.pvalue}")

# Omit NaN values
result_omit = stats.mannwhitneyu(group_x, group_y, nan_policy='omit')
print(f"With omit: U={result_omit.statistic}, p={result_omit.pvalue:.4f}")

# Raise an error on NaN (recommended for production)
try:
    stats.mannwhitneyu(group_x, group_y, nan_policy='raise')
except ValueError as e:
    print(f"With raise: {e}")
```
Output:

```
With propagate: U=nan, p=nan
With omit: U=64.0, p=0.0340
With raise: The input contains nan values
```
The `use_continuity` parameter:

The normal approximation used for the p-value treats the discrete U statistic as if it followed a continuous distribution. The continuity correction adjusts for that discreteness (tied values are handled separately, through a correction to the variance of U):
```python
import numpy as np
from scipy import stats

# Data with many ties
group_a = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4])
group_b = np.array([2, 2, 3, 3, 3, 4, 4, 4, 5, 5])

result_with = stats.mannwhitneyu(group_a, group_b, use_continuity=True)
result_without = stats.mannwhitneyu(group_a, group_b, use_continuity=False)

print(f"With continuity correction: p={result_with.pvalue:.4f}")
print(f"Without correction: p={result_without.pvalue:.4f}")
```
Keep use_continuity=True (the default) unless you have a specific reason to disable it.
The method parameter:
- `'auto'`: Chooses the best method based on sample size and ties (recommended)
- `'asymptotic'`: Uses the normal approximation (faster for large samples)
- `'exact'`: Computes the exact p-value (accurate, but slow for large samples)
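A rough sketch comparing the two explicit methods on the same small simulated sample (the variable names and seed are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
small_a = rng.normal(loc=0, scale=1, size=8)  # continuous data, so no ties
small_b = rng.normal(loc=1, scale=1, size=8)

# Same data, two ways of computing the p-value
p_exact = stats.mannwhitneyu(small_a, small_b, method='exact').pvalue
p_asymptotic = stats.mannwhitneyu(small_a, small_b, method='asymptotic').pvalue
print(f"exact: {p_exact:.4f}, asymptotic: {p_asymptotic:.4f}")
```

For samples this small, the two p-values usually differ slightly; the exact value is the one to trust.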
Practical Application: A/B Testing Scenario
Let’s apply this to a realistic scenario: comparing user session durations between two landing page designs.
```python
import numpy as np
from scipy import stats

# Simulate session duration data (in seconds)
np.random.seed(42)

# Control: current landing page (right-skewed distribution)
control = np.random.exponential(scale=120, size=150)

# Treatment: new landing page (slightly higher engagement)
treatment = np.random.exponential(scale=140, size=150)

def analyze_ab_test(control, treatment, alpha=0.05):
    """
    Complete A/B test analysis using Mann-Whitney U.
    """
    # Step 1: Descriptive statistics
    print("=" * 50)
    print("DESCRIPTIVE STATISTICS")
    print("=" * 50)
    print(f"Control - n={len(control)}, median={np.median(control):.1f}s, "
          f"mean={np.mean(control):.1f}s")
    print(f"Treatment - n={len(treatment)}, median={np.median(treatment):.1f}s, "
          f"mean={np.mean(treatment):.1f}s")

    # Step 2: Run the test
    result = stats.mannwhitneyu(
        treatment,
        control,
        alternative='greater',  # We hypothesize treatment increases duration
        nan_policy='raise'
    )
    print("\n" + "=" * 50)
    print("MANN-WHITNEY U TEST RESULTS")
    print("=" * 50)
    print(f"U-statistic: {result.statistic:.1f}")
    print(f"P-value: {result.pvalue:.4f}")

    # Step 3: Calculate effect size (rank-biserial correlation)
    # Note: with this formula, negative r means the first sample (treatment)
    # tends to have larger values
    n1, n2 = len(treatment), len(control)
    r = 1 - (2 * result.statistic) / (n1 * n2)
    print(f"\nEffect size (rank-biserial r): {r:.3f}")

    # Interpret effect size
    if abs(r) < 0.1:
        effect_interpretation = "negligible"
    elif abs(r) < 0.3:
        effect_interpretation = "small"
    elif abs(r) < 0.5:
        effect_interpretation = "medium"
    else:
        effect_interpretation = "large"
    print(f"Effect interpretation: {effect_interpretation}")

    # Step 4: Decision
    print("\n" + "=" * 50)
    print("DECISION")
    print("=" * 50)
    if result.pvalue < alpha:
        print(f"✓ Reject null hypothesis (p < {alpha})")
        print("  The new landing page shows significantly higher session durations.")
    else:
        print(f"✗ Fail to reject null hypothesis (p >= {alpha})")
        print("  No significant difference detected between landing pages.")

    return result, r

# Run the analysis
result, effect_size = analyze_ab_test(control, treatment)
```
Output:

```
==================================================
DESCRIPTIVE STATISTICS
==================================================
Control - n=150, median=87.4s, mean=119.7s
Treatment - n=150, median=101.3s, mean=136.2s

==================================================
MANN-WHITNEY U TEST RESULTS
==================================================
U-statistic: 13359.0
P-value: 0.0242

Effect size (rank-biserial r): -0.186
Effect interpretation: small

==================================================
DECISION
==================================================
✓ Reject null hypothesis (p < 0.05)
  The new landing page shows significantly higher session durations.
```
Common Pitfalls and Best Practices
Sample size considerations:
The Mann-Whitney U test works with small samples, but statistical power suffers. With fewer than 20 observations per group, even large effects may not reach significance. Consider:
- A minimum of 20 observations per group for reasonable power
- The exact method (`method='exact'`) for very small samples (n < 10)
Always calculate effect size:
P-values tell you whether an effect exists, not whether it matters. Use rank-biserial correlation:
```python
def rank_biserial(u_stat, n1, n2):
    """Calculate the rank-biserial correlation from the U statistic."""
    # With this formula, negative values mean the first sample tends to be larger
    return 1 - (2 * u_stat) / (n1 * n2)
```
Effect size guidelines: |r| < 0.1 (negligible), 0.1-0.3 (small), 0.3-0.5 (medium), > 0.5 (large).
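As a quick sanity check of the formula, made-up data with complete separation should hit the extreme of the scale (with this sign convention, r = +1 means every value in the second sample beats every value in the first):

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5])
y = np.array([6, 7, 8, 9, 10])

u = stats.mannwhitneyu(x, y).statistic  # x never beats y, so U for x is 0
r = 1 - (2 * u) / (len(x) * len(y))
print(u, r)  # 0.0 1.0 -- complete separation gives the maximum effect size
```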
When NOT to use Mann-Whitney U:
- Paired data: use the Wilcoxon signed-rank test instead (`scipy.stats.wilcoxon`)
- More than two groups: use the Kruskal-Wallis test (`scipy.stats.kruskal`)
- Normal data with equal variances: the t-test has more power
- You need to compare means specifically: Mann-Whitney compares distributions, not means
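For reference, the sibling tests mentioned above look like this (made-up data; `wilcoxon` expects paired measurements of equal length, `kruskal` takes any number of groups):

```python
import numpy as np
from scipy import stats

# Paired before/after measurements -> Wilcoxon signed-rank test
before = np.array([4.1, 3.9, 5.2, 4.8, 4.4, 4.0])
after = np.array([4.2, 4.1, 5.5, 5.2, 4.9, 4.6])
print(stats.wilcoxon(before, after).pvalue)

# Three independent groups -> Kruskal-Wallis H-test
g1 = np.array([1, 2, 3, 4])
g2 = np.array([2, 3, 4, 5])
g3 = np.array([6, 7, 8, 9])
print(stats.kruskal(g1, g2, g3).pvalue)
```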
Production code checklist:
- Set `nan_policy='raise'` to catch data quality issues early
- Choose `alternative` before analyzing the data
- Report the effect size alongside the p-value
- Document your sample sizes and test assumptions
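The checklist could be bundled into a small helper; here is a sketch (the name `safe_mannwhitneyu` and its return structure are illustrative, not a standard API):

```python
import numpy as np
from scipy import stats

def safe_mannwhitneyu(x, y, alternative='two-sided', alpha=0.05):
    """Illustrative wrapper: fail fast on bad data, always report an effect size."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    result = stats.mannwhitneyu(
        x, y,
        alternative=alternative,  # chosen before looking at the data
        nan_policy='raise',       # surface data-quality problems immediately
    )
    # Rank-biserial effect size reported alongside the p-value
    r = 1 - (2 * result.statistic) / (len(x) * len(y))
    return {
        'n_x': len(x), 'n_y': len(y),
        'u': result.statistic, 'p': result.pvalue,
        'rank_biserial': r,
        'significant': result.pvalue < alpha,
    }

report = safe_mannwhitneyu([72, 78, 65, 81], [85, 79, 88, 91])
print(report)
```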