How to Use scipy.stats.wilcoxon in Python
Key Insights
- The Wilcoxon signed-rank test is your go-to method when paired data violates normality assumptions—it tests whether the median difference between pairs equals zero without requiring normally distributed differences.
- Use `alternative='two-sided'` for exploratory analysis and switch to `'greater'` or `'less'` only when you have a directional hypothesis established before seeing the data.
- Always check for ties and zero differences in your data; the `zero_method` parameter significantly affects results with small samples, and `mode='exact'` gives more reliable p-values when n < 25.
Introduction to the Wilcoxon Signed-Rank Test
The Wilcoxon signed-rank test solves a common problem: you have paired measurements, but your data doesn’t meet the normality assumptions required by the paired t-test. Maybe you’re comparing user engagement before and after a feature change, or measuring patient outcomes pre- and post-treatment. The differences between pairs are skewed, contain outliers, or come from ordinal scales.
Unlike the paired t-test, which assumes differences follow a normal distribution, the Wilcoxon test works with ranks. It asks: are positive differences systematically larger (or smaller) than negative differences? This makes it robust to outliers and appropriate for ordinal data.
Here’s when to reach for scipy.stats.wilcoxon:
- Your paired differences aren’t normally distributed (check with Shapiro-Wilk)
- You have ordinal data (ratings, rankings, Likert scales)
- Outliers exist and you can’t justify removing them
- Sample sizes are small and normality is hard to verify
The trade-off? You lose some statistical power compared to the t-test when data actually is normal. In practice, this power loss is often minimal—around 5%—making the Wilcoxon test a safe default for paired comparisons.
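One way to operationalize this choice is to let a normality check on the paired differences pick the test. A minimal sketch using simulated skewed data (the data and threshold here are illustrative, not a universal recipe):

```python
import numpy as np
from scipy.stats import shapiro, ttest_rel, wilcoxon

rng = np.random.default_rng(0)
before = rng.exponential(scale=10, size=30) + 50    # right-skewed baseline
after = before - rng.exponential(scale=2, size=30)  # small skewed improvement

# If the paired differences pass a normality check, the t-test buys a
# little extra power; otherwise fall back to the Wilcoxon test.
diffs = before - after
if shapiro(diffs).pvalue >= 0.05:
    stat, p = ttest_rel(before, after)
    test_name = "paired t-test"
else:
    stat, p = wilcoxon(before, after)
    test_name = "Wilcoxon signed-rank"
print(f"{test_name}: p = {p:.4f}")
```

Pre-registering this decision rule (rather than trying both tests and reporting the better p-value) keeps the analysis honest.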
Function Syntax and Parameters
The scipy.stats.wilcoxon function packs considerable flexibility into its interface:
from scipy.stats import wilcoxon
result = wilcoxon(
    x,                        # First sample or differences
    y=None,                   # Second sample (optional)
    zero_method='wilcox',     # How to handle zero differences
    correction=False,         # Continuity correction for normal approx
    alternative='two-sided',  # 'two-sided', 'greater', or 'less'
    mode='auto',              # 'auto', 'exact', or 'approx'
    # Note: 'mode' is renamed to 'method' in newer SciPy releases
)
Key parameters explained:
- `x` and `y`: Pass either the differences directly in `x`, or two paired samples in `x` and `y`. The function computes `x - y` internally.
- `alternative`: Controls the hypothesis direction. Use `'greater'` to test if the median difference is positive, `'less'` for negative.
- `zero_method`: Critical for handling ties at zero. Options are `'wilcox'` (discard zeros), `'pratt'` (include zeros in ranking), and `'zsplit'` (split zero ranks between positive and negative).
- `mode`: Determines p-value calculation. `'exact'` computes the exact distribution (slow for large n), `'approx'` uses the normal approximation.
from scipy.stats import wilcoxon
import numpy as np
# Quick demonstration of parameter options
before = np.array([85, 90, 78, 92, 88, 76, 95, 89])
after = np.array([88, 92, 80, 91, 90, 79, 96, 91])
# Two-sided test (default)
result = wilcoxon(before, after, alternative='two-sided')
print(f"Statistic: {result.statistic}, p-value: {result.pvalue:.4f}")
One-Sample Test: Testing Against a Hypothesized Median
When you have a single sample and want to test whether its median differs from a hypothesized value, compute differences from that value and pass them to wilcoxon:
import numpy as np
from scipy.stats import wilcoxon
# Customer satisfaction scores (1-10 scale)
# Null hypothesis: median satisfaction equals 7 (neutral)
satisfaction_scores = np.array([8, 6, 9, 7, 8, 5, 9, 8, 7, 6, 8, 9, 7, 8, 6])
# Compute differences from hypothesized median
hypothesized_median = 7
differences = satisfaction_scores - hypothesized_median
# Test if median differs from 7
result = wilcoxon(differences, alternative='two-sided')
print(f"Test statistic: {result.statistic}")
print(f"P-value: {result.pvalue:.4f}")
# One-sided: test if median is greater than 7
result_greater = wilcoxon(differences, alternative='greater')
print(f"One-sided p-value (greater): {result_greater.pvalue:.4f}")
This approach works well for before/after studies where you’re testing whether the treatment effect differs from zero—just pass the raw differences without specifying a hypothesized value.
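A quick equivalence check with made-up scores: passing the two paired samples gives exactly the same result as passing their precomputed differences, since the function forms `x - y` internally either way.

```python
import numpy as np
from scipy.stats import wilcoxon

before = np.array([8.2, 6.1, 9.4, 7.3, 8.8, 5.2, 9.9, 8.5, 6.7, 7.9])
after = np.array([7.0, 6.8, 8.1, 7.9, 7.4, 6.1, 8.2, 7.7, 7.8, 7.4])

# Two samples vs. their differences: identical test either way
res_pairs = wilcoxon(before, after)
res_diffs = wilcoxon(before - after)
print(res_pairs.pvalue, res_diffs.pvalue)
```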
Two-Sample Paired Test: Comparing Related Groups
The most common use case involves comparing two related measurements. Here’s a realistic scenario: measuring task completion time before and after a UI redesign:
import numpy as np
from scipy.stats import wilcoxon, shapiro
# Task completion times in seconds (paired by user)
np.random.seed(42)
# Simulating realistic data: times are often right-skewed
before_redesign = np.array([45, 62, 38, 71, 55, 48, 89, 52, 67, 43,
58, 75, 41, 63, 50, 82, 47, 69, 54, 61])
after_redesign = np.array([42, 58, 35, 65, 51, 45, 78, 48, 61, 40,
54, 68, 38, 59, 47, 73, 44, 62, 50, 56])
# First, check if differences are normally distributed
differences = before_redesign - after_redesign
shapiro_stat, shapiro_p = shapiro(differences)
print(f"Shapiro-Wilk test for normality: p = {shapiro_p:.4f}")
# Normality rejected or questionable? Use Wilcoxon
result = wilcoxon(before_redesign, after_redesign, alternative='greater')
print(f"\nWilcoxon signed-rank test:")
print(f"Statistic: {result.statistic}")
print(f"P-value: {result.pvalue:.4f}")
if result.pvalue < 0.05:
    print("Result: Significant reduction in task completion time")
else:
    print("Result: No significant difference detected")
Note the alternative='greater' parameter: we’re testing whether before - after > 0, meaning times decreased after the redesign.
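To make the direction convention concrete, here is a small check with contrived numbers where every `x` exceeds its pair: `'greater'` should produce the small p-value, `'less'` a large one.

```python
import numpy as np
from scipy.stats import wilcoxon

x = np.array([10.0, 12.0, 11.0, 13.0, 12.5, 11.5, 14.0, 10.5])
y = x - np.array([1.0, 0.5, 2.0, 1.5, 0.7, 1.2, 0.9, 1.7])  # x always larger

# 'greater' tests whether x - y tends to be positive
p_greater = wilcoxon(x, y, alternative='greater').pvalue
p_less = wilcoxon(x, y, alternative='less').pvalue
print(f"greater: {p_greater:.4f}, less: {p_less:.4f}")
```

If you get this backwards, the test silently answers the wrong question, so a sanity check like this is cheap insurance.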
Interpreting Results
The test returns two values that require careful interpretation:
import numpy as np
from scipy.stats import wilcoxon
# Response times (ms) for same users on old vs new system
old_system = np.array([234, 256, 198, 287, 245, 312, 223, 267, 289, 241])
new_system = np.array([198, 234, 187, 256, 223, 278, 201, 245, 267, 219])
result = wilcoxon(old_system, new_system, alternative='two-sided')
print(f"Test Statistic (W): {result.statistic}")
print(f"P-value: {result.pvalue:.4f}")
# Calculate effect size: rank-biserial correlation
differences = old_system - new_system
n = len(differences[differences != 0])
# With alternative='two-sided', result.statistic is the smaller rank sum,
# so this r is a magnitude in [0, 1]; read the direction off the sign of
# the median difference.
# r = 1 - 2W / (n(n+1)/2)
W = result.statistic
r = 1 - (2 * W) / (n * (n + 1) / 2)
print(f"Rank-biserial correlation (effect size): {r:.3f}")
# Decision logic
alpha = 0.05
if result.pvalue < alpha:
    if r > 0.5:
        print("Large effect: Strong evidence of improvement")
    elif r > 0.3:
        print("Medium effect: Moderate evidence of improvement")
    else:
        print("Small effect: Weak but significant improvement")
else:
    print("No statistically significant difference")
The statistic represents the smaller of the sum of positive ranks and the sum of negative ranks. A very small value suggests the differences consistently lean in one direction. The rank-biserial correlation provides a standardized effect size, interpretable like a correlation coefficient: roughly 0.1 (small), 0.3 (medium), 0.5 (large).
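The rank sums behind the statistic can be reproduced by hand with `scipy.stats.rankdata`, which is a useful check when learning the test. The numbers below are made up so that both positive and negative differences appear:

```python
import numpy as np
from scipy.stats import wilcoxon, rankdata

# Made-up paired data with differences in both directions
old = np.array([234, 256, 198, 287, 245, 312, 223, 267, 289, 241])
new = np.array([198, 270, 187, 256, 250, 278, 201, 245, 267, 219])
d = old - new

# Rank the absolute differences (ties get average ranks), then split
# the total rank sum by the sign of each difference
ranks = rankdata(np.abs(d))
t_plus = ranks[d > 0].sum()
t_minus = ranks[d < 0].sum()

result = wilcoxon(old, new)  # two-sided by default
print(f"T+ = {t_plus}, T- = {t_minus}, statistic = {result.statistic}")
```

The reported two-sided statistic matches the smaller of the two rank sums, and T+ + T- always equals n(n+1)/2 over the nonzero differences.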
Handling Edge Cases and Common Pitfalls
Zero differences and ties require explicit handling. The zero_method parameter controls this:
import numpy as np
from scipy.stats import wilcoxon
# Data with zeros and ties
before = np.array([5, 5, 6, 7, 5, 8, 6, 5, 7, 6])
after = np.array([5, 6, 6, 8, 5, 7, 7, 5, 8, 6])
differences = before - after
print(f"Differences: {differences}")
print(f"Zero differences: {np.sum(differences == 0)}")
# Compare zero_method options
for method in ['wilcox', 'pratt', 'zsplit']:
    try:
        result = wilcoxon(before, after, zero_method=method)
        print(f"\nzero_method='{method}':")
        print(f"  Statistic: {result.statistic}, p-value: {result.pvalue:.4f}")
    except ValueError as e:
        print(f"\nzero_method='{method}': {e}")
# For small samples, use exact mode
small_before = np.array([5, 7, 6, 8, 5])
small_after = np.array([6, 8, 7, 9, 6])
exact_result = wilcoxon(small_before, small_after, mode='exact')
approx_result = wilcoxon(small_before, small_after, mode='approx')
print(f"\nSmall sample (n=5):")
print(f" Exact p-value: {exact_result.pvalue:.4f}")
print(f" Approximate p-value: {approx_result.pvalue:.4f}")
Rules of thumb:
- Use `mode='exact'` when n < 25
- Use `zero_method='pratt'` when zeros are meaningful (not measurement error)
- Use `zero_method='wilcox'` (the default) when zeros indicate no true difference
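These rules can be bundled into a small helper. `paired_wilcoxon` below is a hypothetical convenience wrapper, not part of SciPy; it also papers over the `mode`/`method` rename across SciPy versions:

```python
import numpy as np
from scipy.stats import wilcoxon

def paired_wilcoxon(x, y, zeros_meaningful=False, alternative='two-sided'):
    """Hypothetical wrapper applying the rules of thumb above."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = np.count_nonzero(x - y)          # nonzero pairs drive the choice
    zero_method = 'pratt' if zeros_meaningful else 'wilcox'
    how = 'exact' if n < 25 else 'auto'  # exact for small n, else default
    try:
        # Newer SciPy spells the parameter 'method'
        return wilcoxon(x, y, zero_method=zero_method,
                        alternative=alternative, method=how)
    except TypeError:
        # Older SciPy spells it 'mode'
        return wilcoxon(x, y, zero_method=zero_method,
                        alternative=alternative, mode=how)

before = [5.0, 7.5, 6.2, 8.1, 5.5, 9.3, 4.8, 7.7]
after = [6.1, 7.9, 7.5, 9.9, 6.1, 8.8, 5.7, 8.4]
res = paired_wilcoxon(before, after)
print(f"statistic={res.statistic}, p={res.pvalue:.4f}")
```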
Practical Example: A/B Testing with Non-Normal Data
Here’s a complete analysis pipeline for comparing user session durations in a paired A/B test:
import numpy as np
from scipy.stats import wilcoxon, shapiro
import matplotlib.pyplot as plt
# Session durations (seconds) - same users exposed to both variants
np.random.seed(123)
n_users = 50
# Simulating right-skewed session data (realistic for web analytics)
control_sessions = np.random.exponential(scale=120, size=n_users) + 30
treatment_sessions = np.random.exponential(scale=140, size=n_users) + 35
differences = treatment_sessions - control_sessions
# Step 1: Visualize the paired differences
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].hist(differences, bins=15, edgecolor='black', alpha=0.7)
axes[0].axvline(x=0, color='red', linestyle='--', label='No difference')
axes[0].set_xlabel('Difference (Treatment - Control)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Paired Differences')
axes[0].legend()
axes[1].boxplot([control_sessions, treatment_sessions], labels=['Control', 'Treatment'])
axes[1].set_ylabel('Session Duration (s)')
axes[1].set_title('Session Durations by Variant')
# Paired plot
for i in range(min(20, n_users)):  # Show first 20 pairs
    axes[2].plot([0, 1], [control_sessions[i], treatment_sessions[i]],
                 'o-', alpha=0.5, color='steelblue')
axes[2].set_xticks([0, 1])
axes[2].set_xticklabels(['Control', 'Treatment'])
axes[2].set_ylabel('Session Duration (s)')
axes[2].set_title('Paired Observations (first 20)')
plt.tight_layout()
plt.savefig('wilcoxon_analysis.png', dpi=150)
plt.close()
# Step 2: Test normality of differences
shapiro_stat, shapiro_p = shapiro(differences)
print("=" * 50)
print("A/B Test Analysis: Session Duration")
print("=" * 50)
print(f"\nNormality check (Shapiro-Wilk): p = {shapiro_p:.4f}")
print(f"Normality assumption: {'Violated' if shapiro_p < 0.05 else 'Satisfied'}")
# Step 3: Run Wilcoxon test
result = wilcoxon(treatment_sessions, control_sessions,
alternative='greater', mode='exact')
print(f"\nWilcoxon Signed-Rank Test Results:")
print(f" H0: Median difference = 0")
print(f" Ha: Treatment sessions > Control sessions")
print(f" Test statistic: {result.statistic:.1f}")
print(f" P-value: {result.pvalue:.4f}")
# Step 4: Effect size
# With alternative='greater', result.statistic is the sum of positive
# ranks (T+), not the smaller rank sum, so the signed rank-biserial
# correlation is r = (T+ - T-) / S = 2*T+ / S - 1, where S = n(n+1)/2.
n_nonzero = np.sum(differences != 0)
S = n_nonzero * (n_nonzero + 1) / 2
r = 2 * result.statistic / S - 1
print(f" Effect size (rank-biserial r): {r:.3f}")
# Step 5: Practical summary
median_diff = np.median(differences)
pct_improved = np.mean(differences > 0) * 100
print(f"\nPractical Summary:")
print(f" Median improvement: {median_diff:.1f} seconds")
print(f" Users with longer treatment sessions: {pct_improved:.1f}%")
print(f" Recommendation: {'Deploy treatment' if result.pvalue < 0.05 else 'No action'}")
This pipeline gives you everything needed for a production analysis: visualization, normality checking, hypothesis testing, effect size calculation, and actionable recommendations. The Wilcoxon test handles the non-normal session duration data appropriately, giving you reliable inference without parametric assumptions.