How to Perform the Friedman Test in Python

Key Insights

  • The Friedman test is your go-to non-parametric method when comparing three or more related groups with ordinal data or when normality assumptions fail—think of it as the robust cousin of repeated measures ANOVA.
  • A significant Friedman test only tells you that differences exist somewhere; you must follow up with post-hoc tests like Nemenyi to identify which specific groups differ.
  • Python’s scipy.stats.friedmanchisquare() makes implementation straightforward, but the function expects each group as a separate array—not the long-format data you might be used to from R.

Introduction to the Friedman Test

The Friedman test solves a specific problem: comparing three or more related groups when your data doesn’t meet the assumptions required for repeated measures ANOVA. Named after economist Milton Friedman (yes, that Milton Friedman), this non-parametric test ranks observations within each block (typically a subject) and analyzes whether the rank distributions differ across conditions.

You’ll reach for the Friedman test when you have repeated measurements on the same subjects—like testing user satisfaction across three app versions, measuring pain levels under different treatments, or comparing algorithm performance across multiple datasets. It’s particularly valuable when dealing with ordinal data (Likert scales, rankings) or when your continuous data shows clear departures from normality.

The test works by converting raw values to ranks within each subject, then comparing the sum of ranks across conditions. If one condition consistently ranks higher than others, the test will detect this pattern.
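To see the mechanics, here is a minimal sketch (toy data and array names are my own) that ranks within each subject and then sums the ranks per condition:

```python
import numpy as np
from scipy import stats

# Toy data: 5 subjects (rows) measured under 3 conditions (columns)
scores = np.array([
    [7, 5, 3],
    [8, 6, 4],
    [6, 4, 3],
    [9, 7, 5],
    [7, 5, 4],
])

# Rank within each subject (row); the lowest value gets rank 1
ranks = np.apply_along_axis(stats.rankdata, 1, scores)

# Sum the ranks per condition (column) -- this is what Friedman compares
rank_sums = ranks.sum(axis=0)
print(rank_sums)  # the third condition is always lowest, so it gets the smallest sum
```

Here every subject orders the conditions the same way, so the rank sums separate as far as they possibly can.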

Assumptions and Requirements

Before running the Friedman test, verify these conditions:

Related samples: The same subjects must appear in all conditions. This is the “repeated measures” aspect—you’re measuring the same people, items, or experimental units multiple times.

At least ordinal data: Your measurements need a meaningful order. Nominal categories won’t work here.

Three or more groups: For two related groups, use the Wilcoxon signed-rank test instead.

No interaction between subjects and conditions: Each subject’s response in one condition shouldn’t influence their response in another (beyond the treatment effect you’re measuring).

What happens if you violate these assumptions? Using independent samples instead of related ones inflates Type I error rates. With only two groups, you’re using the wrong test entirely. The test is relatively robust to violations of the ordinal requirement, but using it on nominal data produces meaningless results.
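For the two-group case, the Wilcoxon signed-rank test is the drop-in choice. A minimal sketch with made-up paired scores:

```python
from scipy import stats

# Hypothetical paired measurements for 5 subjects
before = [6, 7, 8, 9, 10]
after = [5, 5, 5, 5, 5]

# Two related groups: Wilcoxon signed-rank, not Friedman
stat, p = stats.wilcoxon(before, after)
print(f"W = {stat}, p = {p:.4f}")
```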

Setting Up Your Python Environment

You need three core libraries (NumPy, pandas, SciPy), plus optional packages for post-hoc analysis and visualization:

import numpy as np
import pandas as pd
from scipy import stats

# For post-hoc tests (install with: pip install scikit-posthocs)
import scikit_posthocs as sp

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns

Let’s create a sample dataset representing a common scenario: 12 patients rate their pain levels (0-10 scale) under three different treatments.

# Pain ratings for 12 patients across 3 treatments (fixed values for reproducibility)
data = {
    'patient_id': range(1, 13),
    'treatment_a': [7, 8, 6, 9, 7, 8, 6, 7, 8, 9, 7, 8],
    'treatment_b': [5, 6, 4, 7, 5, 6, 5, 4, 6, 7, 5, 6],
    'treatment_c': [3, 4, 3, 5, 4, 3, 4, 3, 4, 5, 3, 4]
}

df = pd.DataFrame(data)
print(df.head())

This gives us a wide-format DataFrame where each row is a patient and each column represents their pain rating under a different treatment.

Performing the Friedman Test with SciPy

The scipy.stats.friedmanchisquare() function expects each group as a separate array. Here’s the basic implementation:

# Extract the three treatment columns as separate arrays
treatment_a = df['treatment_a'].values
treatment_b = df['treatment_b'].values
treatment_c = df['treatment_c'].values

# Perform the Friedman test
statistic, p_value = stats.friedmanchisquare(treatment_a, treatment_b, treatment_c)

print(f"Friedman Test Statistic: {statistic:.4f}")
print(f"P-value: {p_value:.6f}")

# Interpret the result
alpha = 0.05
if p_value < alpha:
    print(f"\nResult: Significant difference detected (p < {alpha})")
else:
    print(f"\nResult: No significant difference detected (p >= {alpha})")

The test statistic follows a chi-squared distribution with k-1 degrees of freedom (where k is the number of groups). A larger statistic indicates greater differences between group rankings.
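You can verify this relationship directly: SciPy's reported p-value is exactly the chi-squared survival function evaluated at the statistic with k-1 degrees of freedom. A sketch with small made-up samples:

```python
from scipy import stats

# Three related samples for 5 subjects (toy values)
a = [7, 8, 6, 9, 7]
b = [5, 6, 4, 7, 5]
c = [3, 4, 3, 5, 4]

stat, p = stats.friedmanchisquare(a, b, c)

# Recompute the p-value from the chi-squared distribution with k-1 df
k = 3
p_manual = stats.chi2.sf(stat, df=k - 1)
print(f"{p:.6f} vs {p_manual:.6f}")  # identical
```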

For our pain data, you’ll see a highly significant result because Treatment C consistently produces lower pain ratings than Treatments A and B. But the Friedman test only tells us that at least one group differs—not which specific pairs are different.
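One calling-convention convenience before moving on: with many condition columns, you can unpack them into friedmanchisquare() instead of naming each array. A sketch re-creating the pain DataFrame from above:

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    'treatment_a': [7, 8, 6, 9, 7, 8, 6, 7, 8, 9, 7, 8],
    'treatment_b': [5, 6, 4, 7, 5, 6, 5, 4, 6, 7, 5, 6],
    'treatment_c': [3, 4, 3, 5, 4, 3, 4, 3, 4, 5, 3, 4],
})

cols = ['treatment_a', 'treatment_b', 'treatment_c']

# Each column becomes one positional argument
statistic, p_value = stats.friedmanchisquare(*(df[c] for c in cols))
print(f"chi2 = {statistic:.2f}, p = {p_value:.6f}")
```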

Post-Hoc Analysis with Nemenyi Test

When the Friedman test is significant, you need post-hoc pairwise comparisons. The Nemenyi test is the most common choice—it’s essentially the non-parametric equivalent of Tukey’s HSD.

# Reshape data for scikit-posthocs (needs long format)
df_long = df.melt(
    id_vars=['patient_id'],
    value_vars=['treatment_a', 'treatment_b', 'treatment_c'],
    var_name='treatment',
    value_name='pain_rating'
)

# Perform Nemenyi post-hoc test
nemenyi_results = sp.posthoc_nemenyi_friedman(
    df_long,
    y_col='pain_rating',
    block_col='patient_id',
    group_col='treatment',
    melted=True
)

print("Nemenyi Post-Hoc Test Results (p-values):")
print(nemenyi_results.round(4))

The output is a symmetric matrix of p-values for each pairwise comparison. Values below your alpha threshold indicate significantly different pairs.
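Scanning that matrix by eye gets tedious with more groups. A small helper can list the significant pairs; the p-value matrix below is hand-built for illustration, but it has the same symmetric shape as the Nemenyi output:

```python
import pandas as pd
from itertools import combinations

# Example symmetric p-value matrix, shaped like the Nemenyi output
groups = ['treatment_a', 'treatment_b', 'treatment_c']
pvals = pd.DataFrame(
    [[1.000, 0.040, 0.001],
     [0.040, 1.000, 0.120],
     [0.001, 0.120, 1.000]],
    index=groups, columns=groups,
)

# Walk the upper triangle once and keep pairs below alpha
alpha = 0.05
significant_pairs = [
    (g1, g2, pvals.loc[g1, g2])
    for g1, g2 in combinations(groups, 2)
    if pvals.loc[g1, g2] < alpha
]
print(significant_pairs)
```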

An alternative is the Conover test, which is more powerful but also more liberal:

# Conover post-hoc test (more powerful alternative)
conover_results = sp.posthoc_conover_friedman(
    df_long,
    y_col='pain_rating',
    block_col='patient_id',
    group_col='treatment',
    melted=True
)

print("\nConover Post-Hoc Test Results (p-values):")
print(conover_results.round(4))

Visualizing the Results

Good visualization communicates your findings more effectively than tables alone. Start with box plots to show the distribution of each group:

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Box plot of treatments
ax1 = axes[0]
df_long.boxplot(column='pain_rating', by='treatment', ax=ax1)
ax1.set_title('Pain Ratings by Treatment')
ax1.set_xlabel('Treatment')
ax1.set_ylabel('Pain Rating (0-10)')
plt.suptitle('')  # Remove automatic title

# Heatmap of post-hoc p-values
ax2 = axes[1]
sns.heatmap(
    nemenyi_results,
    annot=True,
    fmt='.3f',
    cmap='RdYlGn_r',
    center=0.05,
    ax=ax2,
    vmin=0,
    vmax=1
)
ax2.set_title('Nemenyi Test P-values\n(Green = Significant)')

plt.tight_layout()
plt.savefig('friedman_results.png', dpi=150, bbox_inches='tight')
plt.show()

For publication-quality figures, consider a critical difference diagram, which shows mean ranks and connects groups that don’t significantly differ. The mean rank of each treatment is its key ingredient:

# Calculate mean ranks for each treatment
def calculate_mean_ranks(df, treatments):
    """Calculate mean ranks across subjects for each treatment."""
    ranks_df = df[treatments].rank(axis=1)
    return ranks_df.mean()

treatments = ['treatment_a', 'treatment_b', 'treatment_c']
mean_ranks = calculate_mean_ranks(df, treatments)

print("Mean Ranks:")
for treatment, rank in mean_ranks.items():
    print(f"  {treatment}: {rank:.2f}")

Real-World Example and Interpretation

Let’s work through a complete example with realistic data. Imagine you’re analyzing user satisfaction ratings (1-5 scale) for three versions of a mobile app, collected from 15 users who tested all versions:

# User satisfaction data (1-5 scale)
app_data = {
    'user_id': range(1, 16),
    'version_1': [3, 2, 4, 3, 2, 3, 4, 2, 3, 3, 2, 4, 3, 2, 3],
    'version_2': [4, 3, 4, 4, 3, 4, 5, 3, 4, 4, 3, 4, 4, 3, 4],
    'version_3': [4, 4, 5, 5, 4, 4, 5, 4, 5, 4, 4, 5, 5, 4, 4]
}

app_df = pd.DataFrame(app_data)

# Step 1: Descriptive statistics
print("Descriptive Statistics:")
print(app_df[['version_1', 'version_2', 'version_3']].describe().round(2))

# Step 2: Friedman test
v1 = app_df['version_1'].values
v2 = app_df['version_2'].values
v3 = app_df['version_3'].values

stat, p = stats.friedmanchisquare(v1, v2, v3)
n = len(app_df)
k = 3

print(f"\nFriedman Test Results:")
print(f"  χ²({k-1}) = {stat:.3f}, p = {p:.4f}")

# Step 3: Effect size (Kendall's W)
kendall_w = stat / (n * (k - 1))
print(f"  Kendall's W = {kendall_w:.3f}")

# Step 4: Post-hoc analysis
app_long = app_df.melt(
    id_vars=['user_id'],
    value_vars=['version_1', 'version_2', 'version_3'],
    var_name='version',
    value_name='satisfaction'
)

posthoc = sp.posthoc_nemenyi_friedman(
    app_long,
    y_col='satisfaction',
    block_col='user_id',
    group_col='version',
    melted=True
)

print("\nPost-Hoc Comparisons (Nemenyi p-values):")
print(posthoc.round(4))

How to report these results: “A Friedman test indicated significant differences in user satisfaction across the three app versions, χ²(2) = 26.75, p < .001, Kendall’s W = 0.89. Post-hoc Nemenyi tests revealed that Version 1 (Mdn = 3) received significantly lower ratings than both Version 2 (Mdn = 4, p < .05) and Version 3 (Mdn = 4, p < .001). Versions 2 and 3 did not differ significantly (p > .05).”

Kendall’s W provides an effect size measure ranging from 0 (no agreement in rankings) to 1 (perfect agreement). Values above 0.7 indicate strong effects.
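As a quick sanity check on that range, a toy dataset in which every subject orders the conditions identically should give W = 1 (a sketch with tie-free made-up values):

```python
from scipy import stats

# Every subject ranks the conditions the same way -> perfect agreement
a = [7, 8, 6, 9, 7]
b = [5, 6, 4, 7, 5]
c = [3, 4, 3, 5, 4]

stat, p = stats.friedmanchisquare(a, b, c)

# Kendall's W from the Friedman statistic
n, k = 5, 3
kendall_w = stat / (n * (k - 1))
print(kendall_w)  # -> 1.0
```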

The Friedman test is a reliable workhorse for repeated measures designs with non-normal data. Master it, and you’ll have a robust tool for situations where parametric assumptions fail—which happens more often than most researchers admit.
