How to Perform the Wilcoxon Signed-Rank Test in Python
Key Insights
- The Wilcoxon signed-rank test is your go-to method when comparing paired samples that violate normality assumptions—it’s more robust than the paired t-test and preserves statistical power for non-normal distributions.
- Always check for symmetry in the distribution of differences, not just non-normality; this is the test’s core assumption that many practitioners overlook.
- Effect size matters more than p-values for practical significance—use the matched-pairs rank biserial correlation to quantify the magnitude of your findings.
Introduction to the Wilcoxon Signed-Rank Test
The Wilcoxon signed-rank test is a non-parametric statistical test that compares two related samples. Think of it as the paired t-test’s distribution-free cousin. While the paired t-test assumes your data follows a normal distribution, the Wilcoxon test makes no such assumption—it works with ranks instead of raw values.
You should reach for this test in three scenarios: when your paired data clearly violates normality assumptions, when you’re working with ordinal data (like Likert scales), or when you have small sample sizes where normality is difficult to verify.
Real-world applications are everywhere. A/B testing where you measure the same users before and after a feature change. Clinical trials comparing patient outcomes pre- and post-treatment. UX studies measuring task completion times with different interface designs. Any time you have paired observations and can’t trust normality, the Wilcoxon signed-rank test is your tool.
Assumptions and Requirements
Before running the test, verify these assumptions:
- Paired samples: Each observation in one group must have a corresponding observation in the other group.
- Continuous or ordinal dependent variable: The differences between pairs must be rankable.
- Symmetrical distribution of differences: This is the critical assumption. The differences don’t need to be normal, but they should be roughly symmetric around the median.
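A quick way to probe the symmetry assumption is to look at the skewness of the differences; values near zero are reassuring. Here is a sketch on hypothetical paired data:

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements (any paired arrays work here)
before = np.array([245, 312, 278, 401, 289, 356, 267, 445, 298, 334])
after = np.array([198, 287, 245, 356, 234, 312, 223, 398, 256, 289])
differences = before - after

# Skewness near 0 suggests the differences are roughly symmetric;
# large |skewness| (say, above 1) argues for the sign test instead
skew = stats.skew(differences)
print(f"Skewness of differences: {skew:.3f}")

# Comparing mean and median is another cheap symmetry check
print(f"Mean: {differences.mean():.1f}, Median: {np.median(differences):.1f}")
```

A histogram of the differences is a good visual complement to these numbers.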
Let’s check whether our data warrants using Wilcoxon over a paired t-test:
```python
import numpy as np
from scipy import stats

# Sample data: response times (ms) before and after optimization
before = np.array([245, 312, 278, 401, 289, 356, 267, 445, 298, 334,
                   512, 287, 369, 421, 276, 398, 315, 467, 289, 352])
after = np.array([198, 287, 245, 356, 234, 312, 223, 398, 256, 289,
                  467, 245, 323, 378, 234, 356, 278, 412, 245, 312])

# Calculate differences
differences = before - after

# Test normality of differences using Shapiro-Wilk
stat, p_value = stats.shapiro(differences)
print(f"Shapiro-Wilk test statistic: {stat:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("Differences are NOT normally distributed → Use Wilcoxon")
else:
    print("Differences appear normally distributed → Paired t-test is valid")
```
Even if Shapiro-Wilk doesn’t reject normality, you might still prefer Wilcoxon for robustness against outliers or when your sample size is small (n < 30).
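That robustness is easy to see with a synthetic sketch (the data and the 200 ms outlier below are invented for illustration): a single extreme pair can wash out the paired t-test while barely moving the Wilcoxon result.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(100, 10, 20)
y = x - 5 + rng.normal(0, 3, 20)   # consistent ~5-unit improvement per pair

# Inject a single extreme outlier into one pair
y_out = y.copy()
y_out[0] = x[0] + 200   # one pair swings hugely in the opposite direction

t_clean = stats.ttest_rel(x, y).pvalue
w_clean = stats.wilcoxon(x, y).pvalue
t_out = stats.ttest_rel(x, y_out).pvalue
w_out = stats.wilcoxon(x, y_out).pvalue

# The outlier inflates the variance the t-test depends on; the Wilcoxon
# test only sees it as one more rank, so its p-value stays small
print(f"Without outlier: t-test p={t_clean:.4f}, Wilcoxon p={w_clean:.4f}")
print(f"With outlier:    t-test p={t_out:.4f}, Wilcoxon p={w_out:.4f}")
```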
Implementing the Test with SciPy
SciPy’s wilcoxon() function handles the heavy lifting. Here’s the basic implementation:
```python
from scipy.stats import wilcoxon

# Perform the Wilcoxon signed-rank test
statistic, p_value = wilcoxon(before, after)
print(f"Test statistic (W): {statistic:.4f}")
print(f"P-value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print(f"Result: Reject null hypothesis (p < {alpha})")
    print("There is a significant difference between before and after.")
else:
    print(f"Result: Fail to reject null hypothesis (p >= {alpha})")
```
Understanding the key parameters:
- `zero_method`: How to handle pairs with zero difference. Options are `'wilcox'` (discard), `'pratt'` (include in ranking), or `'zsplit'` (split ranks between positive and negative). Default is `'wilcox'`.
- `correction`: Apply a continuity correction for the normal approximation. Set to `True` for small samples.
- `alternative`: `'two-sided'` (default), `'greater'`, or `'less'`.
```python
# More explicit call with parameters
statistic, p_value = wilcoxon(
    before,
    after,
    zero_method='wilcox',
    correction=True,
    alternative='two-sided'
)
```
Interpreting Results
The test statistic W represents the smaller of the sum of positive ranks and the sum of negative ranks. A smaller W indicates a larger difference between groups.
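To see where W comes from, here is the computation done by hand on a tiny made-up sample; this mirrors what SciPy does internally with the default `zero_method='wilcox'`:

```python
import numpy as np
from scipy import stats

x = np.array([12, 15, 9, 20, 14])
y = np.array([10, 18, 9, 15, 11])

d = x - y      # differences: [2, -3, 0, 5, 3]
d = d[d != 0]  # drop zero differences (default 'wilcox' behavior)

ranks = stats.rankdata(np.abs(d))  # rank the absolute differences (ties share ranks)
w_plus = ranks[d > 0].sum()        # sum of ranks for positive differences
w_minus = ranks[d < 0].sum()       # sum of ranks for negative differences

W = min(w_plus, w_minus)
print(f"W+ = {w_plus}, W- = {w_minus}, W = {W}")  # W+ = 7.5, W- = 2.5, W = 2.5
```

Note how the tied absolute differences (the two 3s) share the average rank 2.5.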
But p-values alone don’t tell you how meaningful the difference is. You need effect size:
```python
def wilcoxon_effect_size(x, y):
    """
    Calculate the matched-pairs rank biserial correlation
    (effect size for the Wilcoxon signed-rank test).

    Returns:
        r: Effect size (-1 to 1)
        interpretation: Verbal description of effect magnitude
    """
    differences = x - y
    # Remove zero differences
    differences = differences[differences != 0]
    # Get ranks of absolute differences
    abs_diff = np.abs(differences)
    ranks = stats.rankdata(abs_diff)
    # Sum of ranks for positive and negative differences
    r_plus = np.sum(ranks[differences > 0])
    r_minus = np.sum(ranks[differences < 0])
    # Matched-pairs rank biserial correlation
    r = (r_plus - r_minus) / (r_plus + r_minus)
    # Interpret effect size (using Cohen's conventions adapted for r)
    abs_r = abs(r)
    if abs_r < 0.1:
        interpretation = "negligible"
    elif abs_r < 0.3:
        interpretation = "small"
    elif abs_r < 0.5:
        interpretation = "medium"
    else:
        interpretation = "large"
    return r, interpretation

# Calculate and display
effect_size, interpretation = wilcoxon_effect_size(before, after)
print(f"Effect size (r): {effect_size:.4f}")
print(f"Interpretation: {interpretation} effect")
```
A positive effect size indicates that the first sample tends to have larger values; negative indicates the opposite.
One-Tailed vs Two-Tailed Tests
Use a one-tailed test when you have a directional hypothesis before collecting data. Don’t switch to one-tailed after seeing your results—that’s p-hacking.
```python
# Hypothesis: Optimization REDUCES response time (before > after)
stat, p_greater = wilcoxon(before, after, alternative='greater')
print(f"One-tailed (greater): p = {p_greater:.4f}")
print("Tests if 'before' values tend to be greater than 'after'")

# Hypothesis: Optimization INCREASES response time (before < after)
stat, p_less = wilcoxon(before, after, alternative='less')
print(f"One-tailed (less): p = {p_less:.4f}")
print("Tests if 'before' values tend to be less than 'after'")

# Two-tailed (any difference)
stat, p_two = wilcoxon(before, after, alternative='two-sided')
print(f"Two-tailed: p = {p_two:.4f}")
print("Tests if there's any difference in either direction")
```
Note that because the null distribution of W is symmetric, p_two ≈ 2 * min(p_greater, p_less).
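This relationship is easy to verify empirically; the synthetic paired data below is purely illustrative:

```python
import numpy as np
from scipy.stats import wilcoxon

# Synthetic paired data: y sits about 1 unit below x on average
rng = np.random.default_rng(0)
x = rng.normal(10, 2, 30)
y = x - rng.normal(1, 1, 30)

_, p_two = wilcoxon(x, y, alternative='two-sided')
_, p_greater = wilcoxon(x, y, alternative='greater')
_, p_less = wilcoxon(x, y, alternative='less')

# The two-sided p-value doubles the smaller one-sided tail
print(f"p_two = {p_two:.6f}")
print(f"2 * min(one-sided) = {2 * min(p_greater, p_less):.6f}")
```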
Complete Practical Example
Let’s walk through a realistic scenario: measuring API response times before and after a caching optimization.
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

# Simulate realistic API response times (milliseconds)
# Before: log-normal distribution (common for response times)
n_samples = 50
before_times = np.random.lognormal(mean=5.5, sigma=0.4, size=n_samples)
# After: reduced times with some noise
after_times = before_times * np.random.uniform(0.6, 0.9, size=n_samples)

# Round to realistic precision
before_times = np.round(before_times, 1)
after_times = np.round(after_times, 1)

print("=" * 60)
print("API Response Time Analysis: Before vs After Caching")
print("=" * 60)

# Step 1: Descriptive statistics
print("\n1. DESCRIPTIVE STATISTICS")
print("-" * 40)
print(f"{'Metric':<20} {'Before':>12} {'After':>12}")
print(f"{'Mean (ms)':<20} {np.mean(before_times):>12.1f} {np.mean(after_times):>12.1f}")
print(f"{'Median (ms)':<20} {np.median(before_times):>12.1f} {np.median(after_times):>12.1f}")
print(f"{'Std Dev':<20} {np.std(before_times):>12.1f} {np.std(after_times):>12.1f}")

# Step 2: Check normality of differences
print("\n2. NORMALITY CHECK (Shapiro-Wilk on differences)")
print("-" * 40)
differences = before_times - after_times
shapiro_stat, shapiro_p = stats.shapiro(differences)
print(f"Shapiro-Wilk statistic: {shapiro_stat:.4f}")
print(f"P-value: {shapiro_p:.4f}")
print(f"Conclusion: {'Non-normal' if shapiro_p < 0.05 else 'Normal'} distribution")

# Step 3: Perform Wilcoxon signed-rank test
print("\n3. WILCOXON SIGNED-RANK TEST")
print("-" * 40)
statistic, p_value = stats.wilcoxon(
    before_times,
    after_times,
    alternative='greater'  # We expect before > after
)
print(f"Test statistic (W): {statistic:.1f}")
print(f"P-value (one-tailed): {p_value:.6f}")

alpha = 0.05
if p_value < alpha:
    print(f"✓ Significant at α = {alpha}: Caching reduced response times")
else:
    print(f"✗ Not significant at α = {alpha}")

# Step 4: Effect size
print("\n4. EFFECT SIZE")
print("-" * 40)
diff_nonzero = differences[differences != 0]
ranks = stats.rankdata(np.abs(diff_nonzero))
r_plus = np.sum(ranks[diff_nonzero > 0])
r_minus = np.sum(ranks[diff_nonzero < 0])
effect_r = (r_plus - r_minus) / (r_plus + r_minus)
print(f"Matched-pairs rank biserial r: {effect_r:.4f}")
print(f"Effect magnitude: {'Large' if abs(effect_r) > 0.5 else 'Medium' if abs(effect_r) > 0.3 else 'Small'}")

# Step 5: Visualization
fig, axes = plt.subplots(1, 3, figsize=(14, 5))

# Box plot comparison
axes[0].boxplot([before_times, after_times], labels=['Before', 'After'])
axes[0].set_ylabel('Response Time (ms)')
axes[0].set_title('Response Time Distribution')
axes[0].grid(axis='y', alpha=0.3)

# Paired observations plot
for i in range(len(before_times)):
    axes[1].plot([0, 1], [before_times[i], after_times[i]],
                 'o-', color='steelblue', alpha=0.4, markersize=4)
axes[1].set_xticks([0, 1])
axes[1].set_xticklabels(['Before', 'After'])
axes[1].set_ylabel('Response Time (ms)')
axes[1].set_title('Paired Observations')
axes[1].grid(axis='y', alpha=0.3)

# Distribution of differences
axes[2].hist(differences, bins=15, edgecolor='black', alpha=0.7)
axes[2].axvline(x=0, color='red', linestyle='--', label='No difference')
axes[2].axvline(x=np.median(differences), color='green', linestyle='-',
                label=f'Median: {np.median(differences):.1f}ms')
axes[2].set_xlabel('Difference (Before - After) in ms')
axes[2].set_ylabel('Frequency')
axes[2].set_title('Distribution of Differences')
axes[2].legend()

plt.tight_layout()
plt.savefig('wilcoxon_analysis.png', dpi=150, bbox_inches='tight')
plt.show()

print("\n5. CONCLUSION")
print("-" * 40)
median_improvement = np.median(differences)
pct_improvement = (median_improvement / np.median(before_times)) * 100
print(f"Median improvement: {median_improvement:.1f}ms ({pct_improvement:.1f}%)")
```
Common Pitfalls and Best Practices
Handling ties and zero differences: When pairs have identical values (zero difference), the default behavior discards them. This is usually fine, but if you have many zeros, consider zero_method='pratt' to include them in the analysis.
```python
# Check for zero differences
zero_count = np.sum(before == after)
print(f"Pairs with zero difference: {zero_count}")

if zero_count > len(before) * 0.1:  # More than 10% zeros
    print("Warning: Many zero differences. Consider using zero_method='pratt'")
    stat, p = wilcoxon(before, after, zero_method='pratt')
```
Sample size considerations: The Wilcoxon test needs at least 5 non-zero differences to produce meaningful results. For very small samples (n < 20), request exact p-values with method='exact' (spelled mode='exact' in older SciPy releases).
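A sketch of the exact versus approximate computation on made-up data; note the exact method requires no ties and no zero differences, and the `method` keyword assumes a recent SciPy (older releases used `mode`):

```python
import numpy as np
from scipy.stats import wilcoxon

# Small sample with distinct, non-zero differences (invented values)
x = np.array([5.1, 4.8, 6.2, 5.9, 7.0, 5.5, 6.8, 4.9, 6.1, 5.3])
y = np.array([4.6, 4.9, 5.5, 5.3, 6.1, 5.1, 6.0, 4.6, 5.9, 5.15])

# 'exact' enumerates the exact null distribution of W;
# 'approx' uses the normal approximation
stat_exact, p_exact = wilcoxon(x, y, method='exact')
stat_approx, p_approx = wilcoxon(x, y, method='approx')

print(f"Exact:  W = {stat_exact}, p = {p_exact:.4f}")
print(f"Approx: W = {stat_approx}, p = {p_approx:.4f}")
```

For samples this small the exact p-value is the one to report; the normal approximation is only an asymptotic convenience.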
Choosing between tests:
- Use paired t-test when differences are normally distributed and you want maximum power.
- Use Wilcoxon signed-rank when differences are symmetric but non-normal, or when you have ordinal data.
- Use sign test when you can’t assume symmetry of differences—it only considers the direction of change, not magnitude.
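The three options can be compared side by side on the same synthetic data. SciPy has no dedicated sign-test function, but a binomial test on the direction counts is equivalent (a sketch):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(50, 5, 25)
y = x - rng.normal(2, 3, 25)   # paired values, shifted down by ~2 on average

# Paired t-test: assumes normally distributed differences
t_p = stats.ttest_rel(x, y).pvalue

# Wilcoxon signed-rank: assumes symmetric differences
w_p = stats.wilcoxon(x, y).pvalue

# Sign test: only the direction of each change matters; equivalent to a
# binomial test on the number of positive differences
d = x - y
d = d[d != 0]
sign_p = stats.binomtest(int(np.sum(d > 0)), n=len(d), p=0.5).pvalue

print(f"Paired t-test p: {t_p:.4f}")
print(f"Wilcoxon p:      {w_p:.4f}")
print(f"Sign test p:     {sign_p:.4f}")
```

Because the sign test discards magnitude information, it typically yields the largest p-value of the three when the other tests' assumptions hold.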
The Wilcoxon signed-rank test strikes a balance between the assumptions of the paired t-test and the simplicity of the sign test. It’s robust, widely applicable, and should be in every data scientist’s toolkit for paired comparisons.