How to Perform the Kruskal-Wallis Test in Python
Key Insights
- The Kruskal-Wallis test is your go-to method when comparing three or more groups with non-normal data or ordinal measurements—it ranks all observations and compares median ranks rather than means.
- A significant Kruskal-Wallis result only tells you that differences exist somewhere; you must follow up with Dunn’s test or similar post-hoc analysis to identify which specific groups differ.
- Always visualize your data with box plots before and after testing—statistical significance means nothing without understanding the practical magnitude of group differences.
Introduction to the Kruskal-Wallis Test
The Kruskal-Wallis test is the non-parametric equivalent of one-way ANOVA. When your data violates normality assumptions or you’re working with ordinal scales (like survey ratings), this test becomes essential.
Use Kruskal-Wallis when you need to compare three or more independent groups and at least one of these conditions applies:
- Your data isn’t normally distributed
- Sample sizes are small (making normality hard to verify)
- You’re working with ordinal data (rankings, Likert scales)
- You have significant outliers you can’t justify removing
The test works by ranking all observations across all groups, then comparing whether the mean ranks differ significantly between groups. Strictly speaking, it tests the null hypothesis that all groups come from the same distribution; when the group distributions share a similar shape, this amounts to testing whether the population medians are equal.
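To make the ranking mechanics concrete, here is a small sketch (the scores are made up, chosen with no ties so no tie correction is needed) that computes H by hand from the rank sums and checks it against scipy.stats.kruskal:

```python
import numpy as np
from scipy import stats

# Made-up scores for three small groups (all values distinct, so no ties)
a = [6.5, 7.1, 8.0, 7.7]
b = [5.2, 5.9, 6.1]
c = [8.3, 8.9, 9.4, 9.0]

ranks = stats.rankdata(np.concatenate([a, b, c]))  # rank 1 = smallest value
n = ranks.size
per_group = np.split(ranks, np.cumsum([len(a), len(b)]))  # ranks back in group order

# H = 12 / (N(N+1)) * sum(R_i^2 / n_i) - 3(N+1), with R_i the rank sum of group i
h_manual = 12 / (n * (n + 1)) * sum(r.sum() ** 2 / r.size for r in per_group) - 3 * (n + 1)
h_scipy, _ = stats.kruskal(a, b, c)
print(f"manual H = {h_manual:.4f}, scipy H = {h_scipy:.4f}")  # both 8.9091
```

Because the test only ever sees ranks, any monotonic transformation of the raw scores (log, square root) leaves H unchanged.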
Three key assumptions must hold:
- Independence: Observations within and between groups are independent
- Ordinal scale: Data must be at least ordinal (rankable)
- Similar shapes: Group distributions should have similar shapes (though not necessarily normal)
That third assumption often gets overlooked. If distributions have vastly different shapes, a significant result might reflect shape differences rather than location differences.
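One lightweight way to screen for that assumption before testing is to compare each group's skewness and interquartile range; large mismatches suggest the shapes differ. This is a sketch on simulated data (the group names and distributions are illustrative, not from the dataset below):

```python
import numpy as np
from scipy import stats

# Simulated groups: two roughly symmetric, one deliberately right-skewed
rng = np.random.default_rng(0)
groups = {
    'A': rng.normal(50, 10, 40),
    'B': rng.exponential(scale=10, size=40) + 40,  # right-skewed
    'C': rng.normal(55, 10, 40),
}
for name, values in groups.items():
    iqr = np.subtract(*np.percentile(values, [75, 25]))
    print(f"{name}: skew = {stats.skew(values):+.2f}, IQR = {iqr:.2f}")
```

Numeric summaries like these complement, rather than replace, the visual check with box and violin plots.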
Prerequisites and Setup
You’ll need a few standard libraries. SciPy handles the main test, while scikit-posthocs provides post-hoc comparisons:
import numpy as np
import pandas as pd
from scipy import stats
import scikit_posthocs as sp
import matplotlib.pyplot as plt
import seaborn as sns
# Set random seed for reproducibility
np.random.seed(42)
# Create sample dataset: treatment effectiveness scores (0-100)
# Three treatment groups with intentionally non-normal distributions
treatment_a = np.concatenate([
    np.random.normal(65, 10, 25),
    np.random.normal(85, 5, 5)  # Some high responders
])
treatment_b = np.random.exponential(scale=15, size=30) + 40
treatment_c = np.random.normal(75, 8, 35)
# Clip to realistic range
treatment_a = np.clip(treatment_a, 0, 100)
treatment_b = np.clip(treatment_b, 0, 100)
treatment_c = np.clip(treatment_c, 0, 100)
Install scikit-posthocs if you haven’t already:
pip install scikit-posthocs
Performing the Test with SciPy
The scipy.stats.kruskal() function makes execution straightforward. Pass your groups as separate arrays:
# Perform Kruskal-Wallis test
h_statistic, p_value = stats.kruskal(treatment_a, treatment_b, treatment_c)
print(f"Kruskal-Wallis H-statistic: {h_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
# Interpretation
alpha = 0.05
if p_value < alpha:
    print(f"\nResult: Reject null hypothesis (p < {alpha})")
    print("At least one group differs significantly from the others.")
else:
    print(f"\nResult: Fail to reject null hypothesis (p >= {alpha})")
    print("No significant difference detected between groups.")
The H-statistic approximately follows a chi-square distribution with k-1 degrees of freedom (where k is the number of groups); the approximation is generally considered reliable once each group has at least five observations. Larger H values indicate greater differences between group ranks.
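Because H is referred to a chi-square distribution, you can recover the p-value directly from the statistic. The H value below is an illustrative number, not one computed from the sample data above:

```python
from scipy import stats

h = 8.91  # illustrative H-statistic
k = 3     # number of groups
p = stats.chi2.sf(h, df=k - 1)  # survival function = 1 - CDF
print(f"p = {p:.4f}")  # p = 0.0116
```

This is exactly what stats.kruskal does internally after computing H from the ranks (plus a tie correction when ties are present).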
A common mistake: treating a significant p-value as proof that all groups differ. It only means at least one group is different—you don’t know which ones yet.
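Nor does a significant p-value say anything about practical magnitude. A common convention for quantifying it (an assumption on my part, not something SciPy reports) is the epsilon-squared or eta-squared effect size derived from H; the H and sample-size values below are illustrative:

```python
def kruskal_effect_size(h_statistic, n_total, n_groups):
    """Effect sizes for Kruskal-Wallis, using the commonly cited conventions
    (assumed here): epsilon^2 = H / (n - 1), eta^2 = (H - k + 1) / (n - k)."""
    epsilon_sq = h_statistic / (n_total - 1)
    eta_sq = (h_statistic - n_groups + 1) / (n_total - n_groups)
    return epsilon_sq, eta_sq

# Illustrative numbers: H = 8.91 from 95 observations in 3 groups
eps, eta = kruskal_effect_size(h_statistic=8.91, n_total=95, n_groups=3)
print(f"epsilon^2 = {eps:.3f}, eta^2 = {eta:.3f}")
```

Values near 0.01, 0.06, and 0.14 are often read as small, medium, and large, though those cutoffs are rules of thumb rather than fixed thresholds.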
Preparing Real-World Data
Real data rarely comes in separate arrays. Here’s how to handle typical DataFrame structures:
# Simulate loading data from CSV
data = pd.DataFrame({
    'patient_id': range(1, 96),
    'treatment': ['A'] * 30 + ['B'] * 30 + ['C'] * 35,
    'effectiveness_score': np.concatenate([treatment_a, treatment_b, treatment_c])
})
# Preview the data
print(data.groupby('treatment')['effectiveness_score'].describe())
# Method 1: Extract arrays using groupby
groups = [group['effectiveness_score'].values
          for name, group in data.groupby('treatment')]
h_stat, p_val = stats.kruskal(*groups)
print(f"\nH-statistic: {h_stat:.4f}, p-value: {p_val:.4f}")
# Method 2: Direct extraction by condition
group_a = data[data['treatment'] == 'A']['effectiveness_score']
group_b = data[data['treatment'] == 'B']['effectiveness_score']
group_c = data[data['treatment'] == 'C']['effectiveness_score']
h_stat, p_val = stats.kruskal(group_a, group_b, group_c)
Unequal group sizes aren’t a problem—Kruskal-Wallis handles them naturally. However, extremely unbalanced designs (like 10 vs 10 vs 200) can affect power and interpretation.
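Missing values are another routine wrinkle in real data: with the default settings, NaNs in a group will typically propagate into the result as nan, so drop them before testing. A minimal sketch with a hypothetical frame (the column names and scores are made up):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical frame with one missing score
df = pd.DataFrame({
    'treatment': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'score': [61.0, 58.5, np.nan, 72.3, 70.1, 69.8, 55.0, 57.2, 54.1],
})

clean = df.dropna(subset=['score'])  # drop rows with missing scores
groups = [g['score'].to_numpy() for _, g in clean.groupby('treatment')]
h, p = stats.kruskal(*groups)
print(f"n = {len(clean)}, H = {h:.4f}, p = {p:.4f}")
```

Dropping rows rather than filling them keeps the ranks honest; imputed values would receive ranks they never earned.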
Post-Hoc Analysis with Dunn’s Test
A significant Kruskal-Wallis result demands follow-up testing. Dunn’s test performs pairwise comparisons while controlling for multiple testing:
# Perform Dunn's test with Bonferroni correction
dunn_results = sp.posthoc_dunn(
    data,
    val_col='effectiveness_score',
    group_col='treatment',
    p_adjust='bonferroni'
)
print("Dunn's Test Results (Bonferroni-corrected p-values):")
print(dunn_results.round(4))
# Identify significant pairs
alpha = 0.05
significant_pairs = []
for i in range(len(dunn_results.columns)):
    for j in range(i + 1, len(dunn_results.columns)):
        group1 = dunn_results.columns[i]
        group2 = dunn_results.columns[j]
        p = dunn_results.iloc[i, j]
        if p < alpha:
            significant_pairs.append((group1, group2, p))
print(f"\nSignificant pairwise differences (p < {alpha}):")
for g1, g2, p in significant_pairs:
    print(f"  {g1} vs {g2}: p = {p:.4f}")
The Bonferroni correction is conservative—it reduces false positives but may miss real differences. Alternatives include:
- Holm: Less conservative, still controls family-wise error rate
- Benjamini-Hochberg: Controls false discovery rate, more powerful
# Compare different correction methods
for method in ['bonferroni', 'holm', 'fdr_bh']:
    results = sp.posthoc_dunn(
        data, val_col='effectiveness_score',
        group_col='treatment', p_adjust=method
    )
    print(f"\n{method.upper()} correction:")
    print(results.round(4))
Visualizing Results
Numbers without visuals tell an incomplete story. Box plots reveal distribution characteristics that summary statistics hide:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Box plot
sns.boxplot(data=data, x='treatment', y='effectiveness_score', ax=axes[0])
axes[0].set_title('Treatment Effectiveness by Group')
axes[0].set_xlabel('Treatment')
axes[0].set_ylabel('Effectiveness Score')
# Add median annotations
medians = data.groupby('treatment')['effectiveness_score'].median()
for i, treatment in enumerate(['A', 'B', 'C']):
    axes[0].annotate(f'Mdn: {medians[treatment]:.1f}',
                     xy=(i, medians[treatment]),
                     xytext=(i + 0.2, medians[treatment] + 3),
                     fontsize=9)
# Violin plot for distribution shape
sns.violinplot(data=data, x='treatment', y='effectiveness_score', ax=axes[1])
axes[1].set_title('Distribution Shape by Treatment')
axes[1].set_xlabel('Treatment')
axes[1].set_ylabel('Effectiveness Score')
plt.tight_layout()
plt.savefig('kruskal_wallis_visualization.png', dpi=150)
plt.show()
For publication-ready figures, add significance annotations:
def add_significance_bracket(ax, x1, x2, y, p_value, height=2):
    """Add significance bracket between two x positions."""
    # Determine significance stars
    if p_value < 0.001:
        sig_text = '***'
    elif p_value < 0.01:
        sig_text = '**'
    elif p_value < 0.05:
        sig_text = '*'
    else:
        sig_text = 'ns'
    ax.plot([x1, x1, x2, x2], [y, y + height, y + height, y], 'k-', lw=1.5)
    ax.text((x1 + x2) / 2, y + height, sig_text, ha='center', va='bottom', fontsize=12)
# Create annotated plot
fig, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(data=data, x='treatment', y='effectiveness_score', ax=ax)
# Add significance brackets for significant pairs
y_max = data['effectiveness_score'].max()
add_significance_bracket(ax, 0, 1, y_max + 5, dunn_results.loc['A', 'B'])
add_significance_bracket(ax, 1, 2, y_max + 12, dunn_results.loc['B', 'C'])
ax.set_ylim(top=y_max + 25)
ax.set_title('Treatment Effectiveness with Statistical Comparisons')
plt.tight_layout()
plt.show()
Complete Workflow Example
Here’s a production-ready script combining everything. This analyzes customer satisfaction scores across three regional offices:
"""
Complete Kruskal-Wallis Analysis Pipeline
Analyzes customer satisfaction across regional offices
"""
import numpy as np
import pandas as pd
from scipy import stats
import scikit_posthocs as sp
import matplotlib.pyplot as plt
import seaborn as sns
def run_kruskal_wallis_analysis(data, value_col, group_col, alpha=0.05):
    """
    Complete Kruskal-Wallis analysis with post-hoc tests and visualization.

    Parameters
    ----------
    data : pd.DataFrame
        Input data with value and group columns
    value_col : str
        Name of the column containing measurements
    group_col : str
        Name of the column containing group labels
    alpha : float
        Significance level (default 0.05)

    Returns
    -------
    dict
        Results containing statistics, p-values, and post-hoc comparisons
    """
    results = {}

    # Step 1: Descriptive statistics
    print("=" * 60)
    print("DESCRIPTIVE STATISTICS")
    print("=" * 60)
    desc_stats = data.groupby(group_col)[value_col].agg([
        'count', 'median', 'mean', 'std', 'min', 'max'
    ]).round(2)
    print(desc_stats)
    results['descriptive'] = desc_stats

    # Step 2: Check normality within each group (Shapiro-Wilk test)
    print("\n" + "=" * 60)
    print("NORMALITY TESTS (Shapiro-Wilk)")
    print("=" * 60)
    for group_name, group_data in data.groupby(group_col):
        stat, p = stats.shapiro(group_data[value_col])
        normality = "Normal" if p >= alpha else "Non-normal"
        print(f"{group_name}: W={stat:.4f}, p={p:.4f} ({normality})")

    # Step 3: Kruskal-Wallis test
    print("\n" + "=" * 60)
    print("KRUSKAL-WALLIS TEST")
    print("=" * 60)
    groups = [group[value_col].values for _, group in data.groupby(group_col)]
    h_stat, p_value = stats.kruskal(*groups)
    print(f"H-statistic: {h_stat:.4f}")
    print(f"P-value: {p_value:.6f}")
    print(f"Degrees of freedom: {len(groups) - 1}")
    results['h_statistic'] = h_stat
    results['p_value'] = p_value

    if p_value < alpha:
        print(f"\n>>> SIGNIFICANT (p < {alpha}): Groups differ significantly")

        # Step 4: Post-hoc analysis
        print("\n" + "=" * 60)
        print("POST-HOC ANALYSIS (Dunn's Test, Bonferroni)")
        print("=" * 60)
        posthoc = sp.posthoc_dunn(
            data, val_col=value_col, group_col=group_col, p_adjust='bonferroni'
        )
        print(posthoc.round(4))
        results['posthoc'] = posthoc

        # Identify significant pairs
        print("\nSignificant pairwise comparisons:")
        group_names = posthoc.columns.tolist()
        for i, g1 in enumerate(group_names):
            for g2 in group_names[i + 1:]:
                p = posthoc.loc[g1, g2]
                if p < alpha:
                    print(f"  {g1} vs {g2}: p = {p:.4f} *")
    else:
        print(f"\n>>> NOT SIGNIFICANT (p >= {alpha}): No evidence of group differences")
        results['posthoc'] = None

    # Step 5: Visualization
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    sns.boxplot(data=data, x=group_col, y=value_col, ax=axes[0], palette='Set2')
    axes[0].set_title(f'Distribution by {group_col}\n(Kruskal-Wallis p={p_value:.4f})')
    sns.violinplot(data=data, x=group_col, y=value_col, ax=axes[1], palette='Set2')
    axes[1].set_title(f'Density by {group_col}')
    plt.tight_layout()
    plt.savefig('kruskal_wallis_results.png', dpi=150, bbox_inches='tight')
    plt.show()

    return results
# Generate sample data: Customer satisfaction by region
np.random.seed(123)
satisfaction_data = pd.DataFrame({
    'region': ['North'] * 45 + ['South'] * 50 + ['West'] * 40,
    'satisfaction': np.concatenate([
        np.random.choice([3, 4, 4, 5, 5, 5], 45),  # North: higher satisfaction
        np.random.choice([2, 3, 3, 4, 4], 50),     # South: moderate
        np.random.choice([1, 2, 2, 3, 3, 4], 40)   # West: lower satisfaction
    ])
})
# Run the analysis
results = run_kruskal_wallis_analysis(
    satisfaction_data,
    value_col='satisfaction',
    group_col='region'
)
This script produces comprehensive output: descriptive statistics, normality checks, the main test, post-hoc comparisons, and publication-quality visualizations. Adapt the run_kruskal_wallis_analysis function for your specific datasets—it handles the complete workflow while remaining readable and maintainable.