How to Use scipy.stats.mannwhitneyu in Python
Key Insights
- The Mann-Whitney U test compares two independent groups without assuming normal distribution, making it ideal for skewed data, ordinal scales, or small samples where normality is questionable.
- The `alternative` parameter fundamentally changes your hypothesis: choose `'two-sided'` to detect any difference, or `'less'`/`'greater'` when you have a directional prediction made before seeing the data.
- Always pair your p-value with an effect size measure like the rank-biserial correlation; statistical significance without practical significance leads to poor decisions.
Introduction to the Mann-Whitney U Test
The Mann-Whitney U test (also called the Wilcoxon rank-sum test) answers a simple question: do two independent groups tend to have different values? Unlike the independent samples t-test, it doesn’t assume your data follows a normal distribution. Instead, it works with ranks, making it robust to outliers and applicable to ordinal data.
Use this test when:
- Your data is ordinal (like Likert scales)
- Sample sizes are small and normality is suspect
- Your continuous data has significant skew or outliers
- You want to compare medians (or, more precisely, whether one group tends to have higher values) rather than means
The test works by pooling all observations, ranking them, and checking whether one group’s ranks are systematically higher than the other’s. The U-statistic represents the number of times a value from one group exceeds a value from the other group.
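That pairwise-counting definition can be sketched directly in a few lines (made-up numbers; ties count as half a win):

```python
import numpy as np
from scipy import stats

# Two small made-up samples
a = np.array([3, 7, 9])
b = np.array([1, 4, 5, 8])

# U for sample a: the number of pairs (a_i, b_j) where a_i beats b_j,
# counting any ties as half a win
u_a = sum((a_i > b_j) + 0.5 * (a_i == b_j) for a_i in a for b_j in b)
print(u_a)  # 8.0 -- a beats b in 8 of the 3 * 4 = 12 possible pairs

# scipy reports the same value as the U statistic for the first sample
assert u_a == stats.mannwhitneyu(a, b).statistic
```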
Understanding the Function Signature
Here’s the complete function signature:
```python
scipy.stats.mannwhitneyu(
    x,
    y,
    use_continuity=True,
    alternative='two-sided',
    axis=0,
    method='auto',
    nan_policy='propagate'
)
```
Let’s break down each parameter:
| Parameter | Default | Purpose |
|---|---|---|
| `x`, `y` | Required | The two sample arrays to compare |
| `use_continuity` | `True` | Applies a continuity correction in the normal approximation |
| `alternative` | `'two-sided'` | Direction of the hypothesis test |
| `axis` | `0` | Axis along which to compute (for multi-dimensional arrays) |
| `method` | `'auto'` | How to calculate the p-value |
| `nan_policy` | `'propagate'` | How to handle NaN values |
The function returns a `MannwhitneyuResult` object containing the U-statistic and p-value.

```python
from scipy import stats

# Basic call with defaults
result = stats.mannwhitneyu(group_a, group_b)

# Explicit parameters for production code
result = stats.mannwhitneyu(
    group_a,
    group_b,
    alternative='two-sided',
    method='auto',
    nan_policy='raise'
)
```
Basic Usage Example
Let’s compare test scores between two teaching methods:
```python
import numpy as np
from scipy import stats

# Test scores from two different teaching methods
method_a = np.array([72, 78, 65, 81, 74, 69, 77, 83, 70, 76])
method_b = np.array([85, 79, 88, 91, 82, 86, 90, 84, 87, 89])

# Run the Mann-Whitney U test
result = stats.mannwhitneyu(method_a, method_b, alternative='two-sided')

print(f"U-statistic: {result.statistic}")
print(f"P-value: {result.pvalue:.6f}")
```
Output:

```
U-statistic: 3.0
P-value: 0.000439
```

Interpreting these results:

The U-statistic of 3.0 is the number of times a score from Method A exceeded a score from Method B. With 10 observations in each group, the maximum possible U is 100 (10 × 10), so a U of 3 means Method A scores almost never beat Method B scores.

The p-value of roughly 0.0004 is well below the conventional α = 0.05 threshold. We reject the null hypothesis and conclude that the two teaching methods produce significantly different score distributions; specifically, Method B appears to produce higher scores.
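One way to sanity-check this reading of the statistic: every pair is won by one sample or the other (absent ties), so swapping the argument order gives the complementary U, and the two always sum to n1 × n2. A quick sketch with made-up numbers:

```python
import numpy as np
from scipy import stats

x = np.array([10, 12, 15, 19])
y = np.array([11, 13, 14, 16, 20])

u_x = stats.mannwhitneyu(x, y).statistic  # pairwise wins for x
u_y = stats.mannwhitneyu(y, x).statistic  # pairwise wins for y

# Together the two statistics cover all 4 * 5 = 20 pairs
print(u_x + u_y == len(x) * len(y))  # True
```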
```python
# Add context with descriptive statistics
print(f"\nMethod A - Median: {np.median(method_a)}, Mean: {np.mean(method_a):.1f}")
print(f"Method B - Median: {np.median(method_b)}, Mean: {np.mean(method_b):.1f}")
```

Output:

```
Method A - Median: 75.0, Mean: 74.5
Method B - Median: 86.5, Mean: 86.1
```
Choosing the Alternative Hypothesis
The alternative parameter controls your hypothesis direction. This choice should be made before looking at your data, based on your research question.
```python
import numpy as np
from scipy import stats

# Simulated data: new drug vs placebo (higher = better)
placebo = np.array([3, 5, 4, 6, 5, 4, 3, 5, 4, 5])
new_drug = np.array([6, 7, 5, 8, 7, 6, 7, 8, 6, 7])

# Two-sided: "Is there any difference?"
result_two = stats.mannwhitneyu(new_drug, placebo, alternative='two-sided')
print(f"Two-sided p-value: {result_two.pvalue:.4f}")

# Greater: "Is new_drug greater than placebo?"
result_greater = stats.mannwhitneyu(new_drug, placebo, alternative='greater')
print(f"One-sided (greater) p-value: {result_greater.pvalue:.4f}")

# Less: "Is new_drug less than placebo?"
result_less = stats.mannwhitneyu(new_drug, placebo, alternative='less')
print(f"One-sided (less) p-value: {result_less.pvalue:.4f}")
```
Output:

```
Two-sided p-value: 0.0003
One-sided (greater) p-value: 0.0002
One-sided (less) p-value: 0.9999
```
When to use each:
- `'two-sided'`: Default choice. Use when you want to detect any difference, regardless of direction. Most appropriate for exploratory analysis.
- `'greater'`: Use when you hypothesize that the first group (`x`) has larger values. The p-value tests whether `x` stochastically dominates `y`.
- `'less'`: Use when you hypothesize that the first group has smaller values.
One-sided tests have more statistical power when your directional hypothesis is correct, but they cannot detect effects in the opposite direction. Don’t switch to a one-sided test after seeing your data—that’s p-hacking.
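Note that `'greater'` and `'less'` refer to the first argument, so swapping the argument order flips the hypothesis. A small sketch with made-up numbers:

```python
import numpy as np
from scipy import stats

x = np.array([5, 7, 8, 9])
y = np.array([1, 2, 3, 6])

# "x tends to be greater than y" and "y tends to be less than x"
# are the same hypothesis, so the two calls give the same p-value
p1 = stats.mannwhitneyu(x, y, alternative='greater').pvalue
p2 = stats.mannwhitneyu(y, x, alternative='less').pvalue
print(np.isclose(p1, p2))  # True
```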
Handling Real-World Data Considerations
Real data is messy. Here’s how to handle common issues:
```python
import numpy as np
from scipy import stats

# Data with NaN values and ties
group_x = np.array([4, 5, 5, np.nan, 6, 5, 7, 8, 5, 6])
group_y = np.array([3, 4, 4, 5, 4, np.nan, 5, 4, 6, 5])

# Default behavior: NaN propagates to the result
result_propagate = stats.mannwhitneyu(group_x, group_y, nan_policy='propagate')
print(f"With propagate: U={result_propagate.statistic}, p={result_propagate.pvalue}")

# Omit NaN values
result_omit = stats.mannwhitneyu(group_x, group_y, nan_policy='omit')
print(f"With omit: U={result_omit.statistic}, p={result_omit.pvalue:.4f}")

# Raise an error on NaN (recommended for production)
try:
    stats.mannwhitneyu(group_x, group_y, nan_policy='raise')
except ValueError as e:
    print(f"With raise: {e}")
```
Output:

```
With propagate: U=nan, p=nan
With omit: U=64.0, p=0.0340
With raise: The input contains nan values
```
The `use_continuity` parameter:

The normal approximation used for the p-value treats the discrete U statistic as if it followed a continuous distribution. The continuity correction adjusts for that discreteness (tied values are handled separately, through a correction to the variance of U):
```python
import numpy as np
from scipy import stats

# Data with many ties
group_a = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4])
group_b = np.array([2, 2, 3, 3, 3, 4, 4, 4, 5, 5])

result_with = stats.mannwhitneyu(group_a, group_b, use_continuity=True)
result_without = stats.mannwhitneyu(group_a, group_b, use_continuity=False)

print(f"With continuity correction: p={result_with.pvalue:.4f}")
print(f"Without correction: p={result_without.pvalue:.4f}")
```
Keep use_continuity=True (the default) unless you have a specific reason to disable it.
The method parameter:
- `'auto'`: Chooses the best method based on sample size and ties (recommended)
- `'asymptotic'`: Uses the normal approximation (faster for large samples)
- `'exact'`: Computes the exact p-value (accurate, but slow for large samples)
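A rough sketch comparing the two explicit methods on the same small simulated sample (the variable names and seed are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
small_a = rng.normal(loc=0, scale=1, size=8)  # continuous data, so no ties
small_b = rng.normal(loc=1, scale=1, size=8)

# Same data, two ways of computing the p-value
p_exact = stats.mannwhitneyu(small_a, small_b, method='exact').pvalue
p_asymptotic = stats.mannwhitneyu(small_a, small_b, method='asymptotic').pvalue
print(f"exact: {p_exact:.4f}, asymptotic: {p_asymptotic:.4f}")
```

For samples this small, the two p-values usually differ slightly; the exact value is the one to trust.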
Practical Application: A/B Testing Scenario
Let’s apply this to a realistic scenario: comparing user session durations between two landing page designs.
```python
import numpy as np
from scipy import stats

# Simulate session duration data (in seconds)
np.random.seed(42)

# Control: current landing page (right-skewed distribution)
control = np.random.exponential(scale=120, size=150)

# Treatment: new landing page (slightly higher engagement)
treatment = np.random.exponential(scale=140, size=150)

def analyze_ab_test(control, treatment, alpha=0.05):
    """
    Complete A/B test analysis using Mann-Whitney U.
    """
    # Step 1: Descriptive statistics
    print("=" * 50)
    print("DESCRIPTIVE STATISTICS")
    print("=" * 50)
    print(f"Control - n={len(control)}, median={np.median(control):.1f}s, "
          f"mean={np.mean(control):.1f}s")
    print(f"Treatment - n={len(treatment)}, median={np.median(treatment):.1f}s, "
          f"mean={np.mean(treatment):.1f}s")

    # Step 2: Run the test
    result = stats.mannwhitneyu(
        treatment,
        control,
        alternative='greater',  # We hypothesize treatment increases duration
        nan_policy='raise'
    )
    print("\n" + "=" * 50)
    print("MANN-WHITNEY U TEST RESULTS")
    print("=" * 50)
    print(f"U-statistic: {result.statistic:.1f}")
    print(f"P-value: {result.pvalue:.4f}")

    # Step 3: Calculate effect size (rank-biserial correlation)
    # Note: with this formula, negative r means the first sample (treatment)
    # tends to have larger values
    n1, n2 = len(treatment), len(control)
    r = 1 - (2 * result.statistic) / (n1 * n2)
    print(f"\nEffect size (rank-biserial r): {r:.3f}")

    # Interpret effect size
    if abs(r) < 0.1:
        effect_interpretation = "negligible"
    elif abs(r) < 0.3:
        effect_interpretation = "small"
    elif abs(r) < 0.5:
        effect_interpretation = "medium"
    else:
        effect_interpretation = "large"
    print(f"Effect interpretation: {effect_interpretation}")

    # Step 4: Decision
    print("\n" + "=" * 50)
    print("DECISION")
    print("=" * 50)
    if result.pvalue < alpha:
        print(f"✓ Reject null hypothesis (p < {alpha})")
        print("  The new landing page shows significantly higher session durations.")
    else:
        print(f"✗ Fail to reject null hypothesis (p >= {alpha})")
        print("  No significant difference detected between landing pages.")

    return result, r

# Run the analysis
result, effect_size = analyze_ab_test(control, treatment)
```
Output:

```
==================================================
DESCRIPTIVE STATISTICS
==================================================
Control - n=150, median=87.4s, mean=119.7s
Treatment - n=150, median=101.3s, mean=136.2s

==================================================
MANN-WHITNEY U TEST RESULTS
==================================================
U-statistic: 13359.0
P-value: 0.0242

Effect size (rank-biserial r): -0.186
Effect interpretation: small

==================================================
DECISION
==================================================
✓ Reject null hypothesis (p < 0.05)
  The new landing page shows significantly higher session durations.
```
Common Pitfalls and Best Practices
Sample size considerations:
The Mann-Whitney U test works with small samples, but statistical power suffers. With fewer than 20 observations per group, even large effects may not reach significance. Consider:
- A minimum of 20 observations per group for reasonable power
- The exact method (`method='exact'`) for very small samples (n < 10)
Always calculate effect size:
P-values tell you whether an effect exists, not whether it matters. Use rank-biserial correlation:
```python
def rank_biserial(u_stat, n1, n2):
    """Calculate the rank-biserial correlation from the U statistic."""
    # With this formula, negative values mean the first sample tends to be larger
    return 1 - (2 * u_stat) / (n1 * n2)
```
Effect size guidelines: |r| < 0.1 (negligible), 0.1-0.3 (small), 0.3-0.5 (medium), > 0.5 (large).
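As a quick sanity check of the formula, made-up data with complete separation should hit the extreme of the scale (with this sign convention, r = +1 means every value in the second sample beats every value in the first):

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5])
y = np.array([6, 7, 8, 9, 10])

u = stats.mannwhitneyu(x, y).statistic  # x never beats y, so U for x is 0
r = 1 - (2 * u) / (len(x) * len(y))
print(u, r)  # 0.0 1.0 -- complete separation gives the maximum effect size
```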
When NOT to use Mann-Whitney U:
- Paired data: use the Wilcoxon signed-rank test instead (`scipy.stats.wilcoxon`)
- More than two groups: use the Kruskal-Wallis test (`scipy.stats.kruskal`)
- Normal data with equal variances: the t-test has more power
- You need to compare means specifically: Mann-Whitney compares distributions, not means
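For reference, the sibling tests mentioned above look like this (made-up data; `wilcoxon` expects paired measurements of equal length, `kruskal` takes any number of groups):

```python
import numpy as np
from scipy import stats

# Paired before/after measurements -> Wilcoxon signed-rank test
before = np.array([4.1, 3.9, 5.2, 4.8, 4.4, 4.0])
after = np.array([4.2, 4.1, 5.5, 5.2, 4.9, 4.6])
print(stats.wilcoxon(before, after).pvalue)

# Three independent groups -> Kruskal-Wallis H-test
g1 = np.array([1, 2, 3, 4])
g2 = np.array([2, 3, 4, 5])
g3 = np.array([6, 7, 8, 9])
print(stats.kruskal(g1, g2, g3).pvalue)
```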
Production code checklist:
- Set `nan_policy='raise'` to catch data quality issues early
- Choose `alternative` before analyzing the data
- Report the effect size alongside the p-value
- Document your sample sizes and test assumptions
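The checklist could be bundled into a small helper; here is a sketch (the name `safe_mannwhitneyu` and its return structure are illustrative, not a standard API):

```python
import numpy as np
from scipy import stats

def safe_mannwhitneyu(x, y, alternative='two-sided', alpha=0.05):
    """Illustrative wrapper: fail fast on bad data, always report an effect size."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    result = stats.mannwhitneyu(
        x, y,
        alternative=alternative,  # chosen before looking at the data
        nan_policy='raise',       # surface data-quality problems immediately
    )
    # Rank-biserial effect size reported alongside the p-value
    r = 1 - (2 * result.statistic) / (len(x) * len(y))
    return {
        'n_x': len(x), 'n_y': len(y),
        'u': result.statistic, 'p': result.pvalue,
        'rank_biserial': r,
        'significant': result.pvalue < alpha,
    }

report = safe_mannwhitneyu([72, 78, 65, 81], [85, 79, 88, 91])
print(report)
```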