How to Use scipy.stats.wilcoxon in Python
Key Insights
- The Wilcoxon signed-rank test is your go-to method when paired data violates normality assumptions—it tests whether the median difference between pairs equals zero without requiring normally distributed differences.
- Use `alternative='two-sided'` for exploratory analysis and switch to `'greater'` or `'less'` only when you have a directional hypothesis established before seeing the data.
- Always check for ties and zero differences in your data; the `zero_method` parameter significantly affects results with small samples, and `mode='exact'` gives more reliable p-values when n < 25.
Introduction to the Wilcoxon Signed-Rank Test
The Wilcoxon signed-rank test solves a common problem: you have paired measurements, but your data doesn’t meet the normality assumptions required by the paired t-test. Maybe you’re comparing user engagement before and after a feature change, or measuring patient outcomes pre- and post-treatment. The differences between pairs are skewed, contain outliers, or come from ordinal scales.
Unlike the paired t-test, which assumes differences follow a normal distribution, the Wilcoxon test works with ranks. It asks: are positive differences systematically larger (or smaller) than negative differences? This makes it robust to outliers and appropriate for ordinal data.
Here’s when to reach for scipy.stats.wilcoxon:
- Your paired differences aren’t normally distributed (check with Shapiro-Wilk)
- You have ordinal data (ratings, rankings, Likert scales)
- Outliers exist and you can’t justify removing them
- Sample sizes are small and normality is hard to verify
The trade-off? You lose some statistical power compared to the t-test when data actually is normal. In practice, this power loss is often minimal—around 5%—making the Wilcoxon test a safe default for paired comparisons.
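One way to operationalize this choice is to let a normality check on the paired differences pick the test. A minimal sketch using simulated skewed data (the data and threshold here are illustrative, not a universal recipe):

```python
import numpy as np
from scipy.stats import shapiro, ttest_rel, wilcoxon

rng = np.random.default_rng(0)
before = rng.exponential(scale=10, size=30) + 50    # right-skewed baseline
after = before - rng.exponential(scale=2, size=30)  # small skewed improvement

# If the paired differences pass a normality check, the t-test buys a
# little extra power; otherwise fall back to the Wilcoxon test.
diffs = before - after
if shapiro(diffs).pvalue >= 0.05:
    stat, p = ttest_rel(before, after)
    test_name = "paired t-test"
else:
    stat, p = wilcoxon(before, after)
    test_name = "Wilcoxon signed-rank"
print(f"{test_name}: p = {p:.4f}")
```

Pre-registering this decision rule (rather than trying both tests and reporting the better p-value) keeps the analysis honest.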
Function Syntax and Parameters
The scipy.stats.wilcoxon function packs considerable flexibility into its interface:
from scipy.stats import wilcoxon
result = wilcoxon(
    x,                        # First sample or differences
    y=None,                   # Second sample (optional)
    zero_method='wilcox',     # How to handle zero differences
    correction=False,         # Continuity correction for normal approx
    alternative='two-sided',  # 'two-sided', 'greater', or 'less'
    mode='auto',              # 'auto', 'exact', or 'approx'
    # Note: 'mode' is renamed to 'method' in newer SciPy releases
)
Key parameters explained:
- `x` and `y`: Pass either the differences directly in `x`, or two paired samples in `x` and `y`. The function computes `x - y` internally.
- `alternative`: Controls the hypothesis direction. Use `'greater'` to test if the median difference is positive, `'less'` for negative.
- `zero_method`: Critical for handling ties at zero. Options are `'wilcox'` (discard zeros), `'pratt'` (include zeros in ranking), and `'zsplit'` (split zero ranks between positive and negative).
- `mode`: Determines p-value calculation. `'exact'` computes the exact distribution (slow for large n), `'approx'` uses the normal approximation.
from scipy.stats import wilcoxon
import numpy as np
# Quick demonstration of parameter options
before = np.array([85, 90, 78, 92, 88, 76, 95, 89])
after = np.array([88, 92, 80, 91, 90, 79, 96, 91])
# Two-sided test (default)
result = wilcoxon(before, after, alternative='two-sided')
print(f"Statistic: {result.statistic}, p-value: {result.pvalue:.4f}")
One-Sample Test: Testing Against a Hypothesized Median
When you have a single sample and want to test whether its median differs from a hypothesized value, compute differences from that value and pass them to wilcoxon:
import numpy as np
from scipy.stats import wilcoxon
# Customer satisfaction scores (1-10 scale)
# Null hypothesis: median satisfaction equals 7 (neutral)
satisfaction_scores = np.array([8, 6, 9, 7, 8, 5, 9, 8, 7, 6, 8, 9, 7, 8, 6])
# Compute differences from hypothesized median
hypothesized_median = 7
differences = satisfaction_scores - hypothesized_median
# Test if median differs from 7
result = wilcoxon(differences, alternative='two-sided')
print(f"Test statistic: {result.statistic}")
print(f"P-value: {result.pvalue:.4f}")
# One-sided: test if median is greater than 7
result_greater = wilcoxon(differences, alternative='greater')
print(f"One-sided p-value (greater): {result_greater.pvalue:.4f}")
This approach works well for before/after studies where you’re testing whether the treatment effect differs from zero—just pass the raw differences without specifying a hypothesized value.
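A quick equivalence check with made-up scores: passing the two paired samples gives exactly the same result as passing their precomputed differences, since the function forms `x - y` internally either way.

```python
import numpy as np
from scipy.stats import wilcoxon

before = np.array([8.2, 6.1, 9.4, 7.3, 8.8, 5.2, 9.9, 8.5, 6.7, 7.9])
after = np.array([7.0, 6.8, 8.1, 7.9, 7.4, 6.1, 8.2, 7.7, 7.8, 7.4])

# Two samples vs. their differences: identical test either way
res_pairs = wilcoxon(before, after)
res_diffs = wilcoxon(before - after)
print(res_pairs.pvalue, res_diffs.pvalue)
```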
Two-Sample Paired Test: Comparing Related Groups
The most common use case involves comparing two related measurements. Here’s a realistic scenario: measuring task completion time before and after a UI redesign:
import numpy as np
from scipy.stats import wilcoxon, shapiro
# Task completion times in seconds (paired by user)
np.random.seed(42)
# Simulating realistic data: times are often right-skewed
before_redesign = np.array([45, 62, 38, 71, 55, 48, 89, 52, 67, 43,
58, 75, 41, 63, 50, 82, 47, 69, 54, 61])
after_redesign = np.array([42, 58, 35, 65, 51, 45, 78, 48, 61, 40,
54, 68, 38, 59, 47, 73, 44, 62, 50, 56])
# First, check if differences are normally distributed
differences = before_redesign - after_redesign
shapiro_stat, shapiro_p = shapiro(differences)
print(f"Shapiro-Wilk test for normality: p = {shapiro_p:.4f}")
# Normality rejected or questionable? Use Wilcoxon
result = wilcoxon(before_redesign, after_redesign, alternative='greater')
print(f"\nWilcoxon signed-rank test:")
print(f"Statistic: {result.statistic}")
print(f"P-value: {result.pvalue:.4f}")
if result.pvalue < 0.05:
    print("Result: Significant reduction in task completion time")
else:
    print("Result: No significant difference detected")
Note the alternative='greater' parameter: we’re testing whether before - after > 0, meaning times decreased after the redesign.
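To make the direction convention concrete, here is a small check with contrived numbers where every `x` exceeds its pair: `'greater'` should produce the small p-value, `'less'` a large one.

```python
import numpy as np
from scipy.stats import wilcoxon

x = np.array([10.0, 12.0, 11.0, 13.0, 12.5, 11.5, 14.0, 10.5])
y = x - np.array([1.0, 0.5, 2.0, 1.5, 0.7, 1.2, 0.9, 1.7])  # x always larger

# 'greater' tests whether x - y tends to be positive
p_greater = wilcoxon(x, y, alternative='greater').pvalue
p_less = wilcoxon(x, y, alternative='less').pvalue
print(f"greater: {p_greater:.4f}, less: {p_less:.4f}")
```

If you get this backwards, the test silently answers the wrong question, so a sanity check like this is cheap insurance.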
Interpreting Results
The test returns two values that require careful interpretation:
import numpy as np
from scipy.stats import wilcoxon
# Response times (ms) for same users on old vs new system
old_system = np.array([234, 256, 198, 287, 245, 312, 223, 267, 289, 241])
new_system = np.array([198, 234, 187, 256, 223, 278, 201, 245, 267, 219])
result = wilcoxon(old_system, new_system, alternative='two-sided')
print(f"Test Statistic (W): {result.statistic}")
print(f"P-value: {result.pvalue:.4f}")
# Calculate effect size: rank-biserial correlation
differences = old_system - new_system
n = len(differences[differences != 0])
# With alternative='two-sided', result.statistic is the smaller rank sum,
# so this r is a magnitude in [0, 1]; read the direction off the sign of
# the median difference.
# r = 1 - 2W / (n(n+1)/2)
W = result.statistic
r = 1 - (2 * W) / (n * (n + 1) / 2)
print(f"Rank-biserial correlation (effect size): {r:.3f}")
# Decision logic
alpha = 0.05
if result.pvalue < alpha:
    if r > 0.5:
        print("Large effect: Strong evidence of improvement")
    elif r > 0.3:
        print("Medium effect: Moderate evidence of improvement")
    else:
        print("Small effect: Weak but significant improvement")
else:
    print("No statistically significant difference")
The statistic represents the smaller of the sum of positive ranks and the sum of negative ranks. A very small value suggests the differences consistently lean in one direction. The rank-biserial correlation provides a standardized effect size, interpretable like a correlation coefficient: roughly 0.1 (small), 0.3 (medium), 0.5 (large).
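The rank sums behind the statistic can be reproduced by hand with `scipy.stats.rankdata`, which is a useful check when learning the test. The numbers below are made up so that both positive and negative differences appear:

```python
import numpy as np
from scipy.stats import wilcoxon, rankdata

# Made-up paired data with differences in both directions
old = np.array([234, 256, 198, 287, 245, 312, 223, 267, 289, 241])
new = np.array([198, 270, 187, 256, 250, 278, 201, 245, 267, 219])
d = old - new

# Rank the absolute differences (ties get average ranks), then split
# the total rank sum by the sign of each difference
ranks = rankdata(np.abs(d))
t_plus = ranks[d > 0].sum()
t_minus = ranks[d < 0].sum()

result = wilcoxon(old, new)  # two-sided by default
print(f"T+ = {t_plus}, T- = {t_minus}, statistic = {result.statistic}")
```

The reported two-sided statistic matches the smaller of the two rank sums, and T+ + T- always equals n(n+1)/2 over the nonzero differences.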
Handling Edge Cases and Common Pitfalls
Zero differences and ties require explicit handling. The zero_method parameter controls this:
import numpy as np
from scipy.stats import wilcoxon
# Data with zeros and ties
before = np.array([5, 5, 6, 7, 5, 8, 6, 5, 7, 6])
after = np.array([5, 6, 6, 8, 5, 7, 7, 5, 8, 6])
differences = before - after
print(f"Differences: {differences}")
print(f"Zero differences: {np.sum(differences == 0)}")
# Compare zero_method options
for method in ['wilcox', 'pratt', 'zsplit']:
    try:
        result = wilcoxon(before, after, zero_method=method)
        print(f"\nzero_method='{method}':")
        print(f"  Statistic: {result.statistic}, p-value: {result.pvalue:.4f}")
    except ValueError as e:
        print(f"\nzero_method='{method}': {e}")
# For small samples, use exact mode
small_before = np.array([5, 7, 6, 8, 5])
small_after = np.array([6, 8, 7, 9, 6])
exact_result = wilcoxon(small_before, small_after, mode='exact')
approx_result = wilcoxon(small_before, small_after, mode='approx')
print(f"\nSmall sample (n=5):")
print(f" Exact p-value: {exact_result.pvalue:.4f}")
print(f" Approximate p-value: {approx_result.pvalue:.4f}")
Rules of thumb:
- Use `mode='exact'` when n < 25
- Use `zero_method='pratt'` when zeros are meaningful (not measurement error)
- Use `zero_method='wilcox'` (the default) when zeros indicate no true difference
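These rules can be bundled into a small helper. `paired_wilcoxon` below is a hypothetical convenience wrapper, not part of SciPy; it also papers over the `mode`/`method` rename across SciPy versions:

```python
import numpy as np
from scipy.stats import wilcoxon

def paired_wilcoxon(x, y, zeros_meaningful=False, alternative='two-sided'):
    """Hypothetical wrapper applying the rules of thumb above."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = np.count_nonzero(x - y)          # nonzero pairs drive the choice
    zero_method = 'pratt' if zeros_meaningful else 'wilcox'
    how = 'exact' if n < 25 else 'auto'  # exact for small n, else default
    try:
        # Newer SciPy spells the parameter 'method'
        return wilcoxon(x, y, zero_method=zero_method,
                        alternative=alternative, method=how)
    except TypeError:
        # Older SciPy spells it 'mode'
        return wilcoxon(x, y, zero_method=zero_method,
                        alternative=alternative, mode=how)

before = [5.0, 7.5, 6.2, 8.1, 5.5, 9.3, 4.8, 7.7]
after = [6.1, 7.9, 7.5, 9.9, 6.1, 8.8, 5.7, 8.4]
res = paired_wilcoxon(before, after)
print(f"statistic={res.statistic}, p={res.pvalue:.4f}")
```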
Practical Example: A/B Testing with Non-Normal Data
Here’s a complete analysis pipeline for comparing user session durations in a paired A/B test:
import numpy as np
from scipy.stats import wilcoxon, shapiro
import matplotlib.pyplot as plt
# Session durations (seconds) - same users exposed to both variants
np.random.seed(123)
n_users = 50
# Simulating right-skewed session data (realistic for web analytics)
control_sessions = np.random.exponential(scale=120, size=n_users) + 30
treatment_sessions = np.random.exponential(scale=140, size=n_users) + 35
differences = treatment_sessions - control_sessions
# Step 1: Visualize the paired differences
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].hist(differences, bins=15, edgecolor='black', alpha=0.7)
axes[0].axvline(x=0, color='red', linestyle='--', label='No difference')
axes[0].set_xlabel('Difference (Treatment - Control)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Paired Differences')
axes[0].legend()
axes[1].boxplot([control_sessions, treatment_sessions], labels=['Control', 'Treatment'])
axes[1].set_ylabel('Session Duration (s)')
axes[1].set_title('Session Durations by Variant')
# Paired plot
for i in range(min(20, n_users)):  # Show first 20 pairs
    axes[2].plot([0, 1], [control_sessions[i], treatment_sessions[i]],
                 'o-', alpha=0.5, color='steelblue')
axes[2].set_xticks([0, 1])
axes[2].set_xticklabels(['Control', 'Treatment'])
axes[2].set_ylabel('Session Duration (s)')
axes[2].set_title('Paired Observations (first 20)')
plt.tight_layout()
plt.savefig('wilcoxon_analysis.png', dpi=150)
plt.close()
# Step 2: Test normality of differences
shapiro_stat, shapiro_p = shapiro(differences)
print("=" * 50)
print("A/B Test Analysis: Session Duration")
print("=" * 50)
print(f"\nNormality check (Shapiro-Wilk): p = {shapiro_p:.4f}")
print(f"Normality assumption: {'Violated' if shapiro_p < 0.05 else 'Satisfied'}")
# Step 3: Run Wilcoxon test
result = wilcoxon(treatment_sessions, control_sessions,
alternative='greater', mode='exact')
print(f"\nWilcoxon Signed-Rank Test Results:")
print(f" H0: Median difference = 0")
print(f" Ha: Treatment sessions > Control sessions")
print(f" Test statistic: {result.statistic:.1f}")
print(f" P-value: {result.pvalue:.4f}")
# Step 4: Effect size
# With alternative='greater', result.statistic is the sum of positive
# ranks (T+), not the smaller rank sum, so the signed rank-biserial
# correlation is r = (T+ - T-) / S = 2*T+ / S - 1, where S = n(n+1)/2.
n_nonzero = np.sum(differences != 0)
S = n_nonzero * (n_nonzero + 1) / 2
r = 2 * result.statistic / S - 1
print(f" Effect size (rank-biserial r): {r:.3f}")
# Step 5: Practical summary
median_diff = np.median(differences)
pct_improved = np.mean(differences > 0) * 100
print(f"\nPractical Summary:")
print(f" Median improvement: {median_diff:.1f} seconds")
print(f" Users with longer treatment sessions: {pct_improved:.1f}%")
print(f" Recommendation: {'Deploy treatment' if result.pvalue < 0.05 else 'No action'}")
This pipeline gives you everything needed for a production analysis: visualization, normality checking, hypothesis testing, effect size calculation, and actionable recommendations. The Wilcoxon test handles the non-normal session duration data appropriately, giving you reliable inference without parametric assumptions.