How to Perform a Repeated Measures ANOVA in Python

Key Insights

Repeated measures ANOVA is your go-to test when the same subjects are measured across multiple conditions or time points, offering more statistical power than between-subjects designs by controlling for individual differences.
Sphericity is the critical assumption that trips up most analysts—always check it with Mauchly’s test and apply Greenhouse-Geisser corrections when violated.
The pingouin library makes repeated measures ANOVA in Python straightforward, handling sphericity corrections automatically and providing effect sizes out of the box.

When to Use Repeated Measures ANOVA

Standard one-way ANOVA compares means across independent groups—different people in each condition. Repeated measures ANOVA handles a fundamentally different scenario: the same subjects measured multiple times.

Consider a clinical trial testing a new anxiety medication. You measure patients’ anxiety scores at baseline, after 2 weeks, after 4 weeks, and after 8 weeks. Each patient provides four data points. A standard ANOVA would ignore the fact that measurements from the same person are correlated, wasting statistical power and violating independence assumptions.

Repeated measures ANOVA accounts for this within-subject correlation. By partitioning out individual differences, it becomes more sensitive to treatment effects. This is why repeated measures designs are popular in psychology, medicine, and any field where longitudinal tracking matters.
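
To make the power argument concrete, here is a minimal sketch on simulated data. The condition means, subject offsets, and noise levels are assumptions chosen for illustration. It compares the F-statistic from a between-subjects ANOVA, which leaves individual differences in the error term, with a hand-computed repeated measures F, which removes them.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 30, 4  # subjects, time points (illustrative values)

# Rows = subjects, columns = time points.
subject_offset = rng.normal(0, 8, size=(n, 1))   # stable individual differences
condition_means = np.array([65, 55, 45, 40])     # assumed population means
scores = condition_means + subject_offset + rng.normal(0, 5, size=(n, k))

# Between-subjects ANOVA: treats the four columns as independent groups.
f_between, p_between = stats.f_oneway(*scores.T)

# Repeated measures ANOVA: partition the total sum of squares.
grand_mean = scores.mean()
ss_conditions = n * ((scores.mean(axis=0) - grand_mean) ** 2).sum()
ss_subjects = k * ((scores.mean(axis=1) - grand_mean) ** 2).sum()
ss_total = ((scores - grand_mean) ** 2).sum()
ss_error = ss_total - ss_conditions - ss_subjects  # residual after removing subjects

df_conditions, df_error = k - 1, (k - 1) * (n - 1)
f_within = (ss_conditions / df_conditions) / (ss_error / df_error)
p_within = stats.f.sf(f_within, df_conditions, df_error)

print(f"Between-subjects F:  {f_between:.1f}")
print(f"Repeated measures F: {f_within:.1f}")
```

Because the subject-level variance is moved out of the error term, the repeated measures F is larger for the same condition means whenever subjects differ consistently from one another.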

Use repeated measures ANOVA when:

The same subjects are measured under all conditions
You have three or more measurement points (use a paired t-test for two)
Your dependent variable is continuous
You want to detect changes within subjects over time or conditions

Assumptions and Prerequisites

Before running the analysis, verify these assumptions:

Sphericity: The variances of differences between all pairs of conditions should be equal. This is the repeated measures equivalent of homogeneity of variance. Violations inflate Type I error rates.

Normality: The dependent variable should be approximately normally distributed within each condition (more precisely, the model residuals should be). Repeated measures ANOVA is reasonably robust to mild violations with larger samples.

No significant outliers: Extreme values can distort results, especially with smaller samples.
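
A minimal sketch of how the normality and outlier checks might be run with SciPy and pandas. The simulated data, the Shapiro-Wilk test, and the 1.5 × IQR rule are illustrative choices, not the only valid ones.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical wide-format data: one row per subject, one column per time point.
wide = pd.DataFrame({
    'baseline': rng.normal(65, 10, 30),
    'week2': rng.normal(55, 12, 30),
    'week4': rng.normal(45, 11, 30),
    'week8': rng.normal(40, 10, 30),
})

checks = []
for condition in wide.columns:
    values = wide[condition]
    # Shapiro-Wilk: a small p-value suggests a departure from normality.
    w_stat, p_normal = stats.shapiro(values)
    # Flag values more than 1.5 * IQR beyond the quartiles.
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    n_outliers = int(((values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)).sum())
    checks.append({'condition': condition, 'shapiro_W': w_stat,
                   'shapiro_p': p_normal, 'n_outliers': n_outliers})

assumption_table = pd.DataFrame(checks)
print(assumption_table.round(3))
```

Flagged values are candidates for inspection, not automatic exclusions; a Q-Q plot per condition is a useful complement to the formal test.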

Let’s test sphericity using Mauchly’s test with pingouin:

import pandas as pd
import pingouin as pg
import numpy as np

# Create sample dataset: anxiety scores across 4 time points
np.random.seed(42)
n_subjects = 30

data = pd.DataFrame({
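    # Labels must follow the same order as the concatenated scores below:
    # all baseline rows first, then week2, week4, week8.
    'subject': np.tile(range(1, n_subjects + 1), 4),
    'time': np.repeat(['baseline', 'week2', 'week4', 'week8'], n_subjects),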
    'anxiety': np.concatenate([
        np.random.normal(65, 10, n_subjects),  # baseline
        np.random.normal(55, 12, n_subjects),  # week 2
        np.random.normal(45, 11, n_subjects),  # week 4
        np.random.normal(40, 10, n_subjects),  # week 8
    ])
})

# Mauchly's test for sphericity
# Note: pingouin's rm_anova automatically computes this
spher, W, chi2, dof, pval = pg.sphericity(
    data, dv='anxiety', subject='subject', within='time'
)

print(f"Sphericity assumption met: {spher}")
print(f"Mauchly's W: {W:.3f}")
print(f"Chi-square: {chi2:.3f}, p-value: {pval:.3f}")

If pval < 0.05, sphericity is violated and you’ll need corrections (covered below).
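
To see what Mauchly’s test is evaluating, one can compute the variance of the difference scores for every pair of conditions directly. The sketch below uses simulated wide-format data (an assumption for illustration); under sphericity these variances should be roughly equal.

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n_subjects = 30
subject_effect = rng.normal(0, 8, n_subjects)  # shared across time points

# Hypothetical wide-format scores: one column per time point.
wide = pd.DataFrame({
    name: mean + subject_effect + rng.normal(0, 5, n_subjects)
    for name, mean in [('baseline', 65), ('week2', 55), ('week4', 45), ('week8', 40)]
})

# Variance of the difference scores for each pair of conditions.
diff_variances = {
    f"{a} - {b}": (wide[a] - wide[b]).var(ddof=1)
    for a, b in combinations(wide.columns, 2)
}

for pair, variance in diff_variances.items():
    print(f"{pair:>18}: {variance:6.1f}")

ratio = max(diff_variances.values()) / min(diff_variances.values())
print(f"Largest / smallest variance ratio: {ratio:.2f}")
```

There is no fixed cutoff for this ratio; it is an informal diagnostic, and Mauchly’s test provides the formal check.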

Preparing Your Data

The rm_anova() and pairwise_tests() calls in this article, which name dv, within, and subject columns, expect long format data, where each row represents one observation. Many datasets arrive in wide format, with each condition as a separate column.

# Wide format example (common in spreadsheets)
wide_data = pd.DataFrame({
    'subject': range(1, 31),
    'baseline': np.random.normal(65, 10, 30),
    'week2': np.random.normal(55, 12, 30),
    'week4': np.random.normal(45, 11, 30),
    'week8': np.random.normal(40, 10, 30)
})

print("Wide format:")
print(wide_data.head())

# Convert to long format using pandas.melt()
long_data = pd.melt(
    wide_data,
    id_vars=['subject'],
    value_vars=['baseline', 'week2', 'week4', 'week8'],
    var_name='time',
    value_name='anxiety'
)

print("\nLong format:")
print(long_data.head(8))

The long format output will have columns: subject, time, and anxiety. Each subject appears in four rows, one for each time point. This structure is essential for pingouin to correctly identify the repeated measures structure.
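
Before running the analysis, it can help to confirm that the long-format table is balanced, meaning every subject has exactly one observation per time point. A minimal sketch with pandas, using a small made-up dataset:

```python
import pandas as pd

# Small illustrative dataset in long format (values are made up).
long_data = pd.DataFrame({
    'subject': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'time': ['baseline', 'week2', 'week4'] * 3,
    'anxiety': [64, 55, 47, 70, 58, 49, 61, 50, 44],
})

# Count observations per subject-by-time cell; a balanced design has exactly 1 everywhere.
cell_counts = long_data.groupby(['subject', 'time']).size().unstack(fill_value=0)
is_balanced = bool((cell_counts == 1).all().all())

# Check for missing values in the dependent variable.
has_missing = bool(long_data['anxiety'].isna().any())

print(cell_counts)
print(f"Balanced: {is_balanced}, missing values: {has_missing}")

# Pivoting back to wide format is a quick round-trip check.
wide_again = long_data.pivot(index='subject', columns='time', values='anxiety')
print(wide_again)
```

A cell count of 0 points to a missing measurement and a count above 1 to a duplicated row; both are worth resolving before fitting the model.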

Running the Analysis with Pingouin

Now for the actual analysis. The rm_anova() function handles everything:

import pandas as pd
import pingouin as pg
import numpy as np

# Recreate our dataset with controlled randomness
np.random.seed(42)
n_subjects = 30

# Simulate decreasing anxiety over treatment period
subject_baseline = np.random.normal(0, 8, n_subjects)  # individual differences

data = pd.DataFrame({
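    # Labels must follow the same order as the concatenated scores below,
    # so that each subject keeps their own baseline offset at every time point.
    'subject': np.tile(range(1, n_subjects + 1), 4),
    'time': np.repeat(['baseline', 'week2', 'week4', 'week8'], n_subjects),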
    'anxiety': np.concatenate([
        65 + subject_baseline + np.random.normal(0, 5, n_subjects),
        55 + subject_baseline + np.random.normal(0, 5, n_subjects),
        45 + subject_baseline + np.random.normal(0, 5, n_subjects),
        40 + subject_baseline + np.random.normal(0, 5, n_subjects),
    ])
})

# Run repeated measures ANOVA
rm_results = pg.rm_anova(
    data=data,
    dv='anxiety',
    within='time',
    subject='subject',
    correction=True,  # always report the Greenhouse-Geisser columns
    detailed=True
)

print(rm_results.to_string())

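The output includes the following columns (names as of pingouin 0.5.x; the sphericity-related columns are only reported when a correction is requested or triggered by Mauchly’s test):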

F: The F-statistic testing whether time points differ significantly
p-unc: Uncorrected p-value
ng2: Generalized eta-squared (effect size)
eps: Epsilon value for sphericity (Greenhouse-Geisser)
p-GG-corr: Greenhouse-Geisser corrected p-value
sphericity: Whether sphericity assumption is met
W-spher: Mauchly’s W statistic

Interpreting effect sizes (η², conventional benchmarks rather than strict cutoffs):

0.01 = small effect
0.06 = medium effect
0.14 = large effect

A significant result (p < 0.05) tells you that at least one time point differs from the others, but not which ones. That requires post-hoc tests.
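
The effect size can also be reproduced from the sums of squares, which clarifies what it measures. The sketch below uses simulated data (illustrative values) and the standard one-way within-subjects formulas for partial and generalized eta-squared; it is a hand computation, not a description of pingouin's internals.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 30, 4
# Simulated scores (rows = subjects, columns = time points); values are illustrative.
scores = (np.array([65, 55, 45, 40])
          + rng.normal(0, 8, size=(n, 1))
          + rng.normal(0, 5, size=(n, k)))

grand_mean = scores.mean()
ss_conditions = n * ((scores.mean(axis=0) - grand_mean) ** 2).sum()
ss_subjects = k * ((scores.mean(axis=1) - grand_mean) ** 2).sum()
ss_error = ((scores - grand_mean) ** 2).sum() - ss_conditions - ss_subjects

# Partial eta-squared leaves between-subject variance out of the denominator.
partial_eta_sq = ss_conditions / (ss_conditions + ss_error)
# Generalized eta-squared keeps it in, so it can never exceed the partial value.
generalized_eta_sq = ss_conditions / (ss_conditions + ss_subjects + ss_error)

print(f"Partial eta-squared:     {partial_eta_sq:.3f}")
print(f"Generalized eta-squared: {generalized_eta_sq:.3f}")
```

Because the two measures use different denominators, it is worth stating which one is being reported alongside the benchmark used to interpret it.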

Handling Sphericity Violations

When Mauchly’s test is significant (sphericity violated), the standard F-test becomes liberal—you’ll get too many false positives. Two corrections adjust the degrees of freedom:

Greenhouse-Geisser (GG): More conservative. Use when epsilon < 0.75 or when in doubt.

Huynh-Feldt (HF): Less conservative. Use when epsilon > 0.75.

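rm_anova() reports the Greenhouse-Geisser epsilon and, when correction=True, the corresponding corrected p-value. It does not return a Huynh-Feldt corrected p-value; the Huynh-Feldt epsilon can be computed separately with pg.epsilon(..., correction='hf').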

# The rm_anova output already includes corrections
# Access specific corrected p-values:

rm_results = pg.rm_anova(
    data=data,
    dv='anxiety',
    within='time',
    subject='subject',
    correction=True  # Default is 'auto': correct only if Mauchly's test is significant
)

# Extract epsilon and corrected p-values
epsilon = rm_results['eps'].values[0]
p_gg = rm_results['p-GG-corr'].values[0]

print(f"Epsilon (Greenhouse-Geisser): {epsilon:.3f}")
print(f"Corrected p-value: {p_gg:.4f}")

# Decision rule
if epsilon < 0.75:
    print("Epsilon < 0.75: Report Greenhouse-Geisser corrected values")
else:
    print("Epsilon >= 0.75: Huynh-Feldt correction acceptable")

Always report which correction you used and why. When sphericity holds (Mauchly’s p > 0.05), report the uncorrected values.
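
For readers who want to see where the corrections come from, the sketch below computes both epsilons and the corrected p-values by hand with NumPy and SciPy. The simulated data (with deliberately unequal noise per time point) is an assumption for illustration, and the formulas are the standard textbook ones, not a copy of any library's implementation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, k = 30, 4
# Simulated scores with unequal noise per time point, so sphericity is strained.
noise_sd = np.array([3, 5, 8, 12])
scores = (np.array([65, 55, 45, 40])
          + rng.normal(0, 8, size=(n, 1))
          + rng.normal(0, 1, size=(n, k)) * noise_sd)

# Greenhouse-Geisser epsilon from the double-centered covariance matrix.
cov = np.cov(scores, rowvar=False)
centered = (cov - cov.mean(axis=0, keepdims=True)
            - cov.mean(axis=1, keepdims=True) + cov.mean())
eps_gg = np.trace(centered) ** 2 / ((k - 1) * (centered ** 2).sum())

# Huynh-Feldt epsilon, derived from the GG estimate and capped at 1.
eps_hf = min(1.0, (n * (k - 1) * eps_gg - 2)
             / ((k - 1) * (n - 1 - (k - 1) * eps_gg)))

# Uncorrected F from the usual sums of squares.
grand_mean = scores.mean()
ss_conditions = n * ((scores.mean(axis=0) - grand_mean) ** 2).sum()
ss_subjects = k * ((scores.mean(axis=1) - grand_mean) ** 2).sum()
ss_error = ((scores - grand_mean) ** 2).sum() - ss_conditions - ss_subjects
df1, df2 = k - 1, (k - 1) * (n - 1)
f_value = (ss_conditions / df1) / (ss_error / df2)

# Both corrections keep F unchanged and shrink the degrees of freedom by epsilon.
p_unc = stats.f.sf(f_value, df1, df2)
p_gg = stats.f.sf(f_value, eps_gg * df1, eps_gg * df2)
p_hf = stats.f.sf(f_value, eps_hf * df1, eps_hf * df2)

print(f"eps GG = {eps_gg:.3f}, eps HF = {eps_hf:.3f}")
print(f"p uncorrected = {p_unc:.3g}, p GG = {p_gg:.3g}, p HF = {p_hf:.3g}")
```

Epsilon ranges from 1 / (k - 1) (maximal violation) to 1 (sphericity holds), and the Huynh-Feldt estimate is never smaller than the Greenhouse-Geisser one, which is why it is the less conservative of the two.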

Post-Hoc Pairwise Comparisons

A significant omnibus test demands follow-up comparisons. Use pairwise_tests() (named pairwise_ttests() in older pingouin releases) with an appropriate correction for multiple comparisons:

# Pairwise t-tests with Bonferroni correction
posthoc = pg.pairwise_tests(
    data=data,
    dv='anxiety',
    within='time',
    subject='subject',
    padjust='bonf',  # Bonferroni correction
    effsize='cohen'  # Include Cohen's d
)

print(posthoc[['A', 'B', 'T', 'p-unc', 'p-corr', 'cohen']].to_string())

This compares all six pairs of time points: baseline vs. week2, baseline vs. week4, baseline vs. week8, week2 vs. week4, week2 vs. week8, and week4 vs. week8. The p-corr column shows Bonferroni-adjusted p-values.

Alternative corrections include:

'holm': Holm-Bonferroni (less conservative, recommended)
'fdr_bh': Benjamini-Hochberg false discovery rate
'none': No correction (not recommended)
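
To show why Holm is less conservative than Bonferroni, here is a minimal sketch that applies both adjustments by hand to paired t-tests on simulated data. The data and the helper name holm_adjust are hypothetical and introduced only for illustration.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 30
labels = ['baseline', 'week2', 'week4', 'week8']
scores = (np.array([65, 55, 45, 40])
          + rng.normal(0, 8, size=(n, 1))
          + rng.normal(0, 5, size=(n, 4)))

# Paired t-test for every pair of time points.
pairs = list(combinations(range(len(labels)), 2))
p_raw = np.array([stats.ttest_rel(scores[:, i], scores[:, j]).pvalue
                  for i, j in pairs])


def holm_adjust(p_values):
    """Holm step-down: scale the i-th smallest p by (m - i), keep monotone, cap at 1."""
    p_values = np.asarray(p_values, dtype=float)
    m = len(p_values)
    order = np.argsort(p_values)
    scaled = p_values[order] * (m - np.arange(m))
    adjusted_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted  # restore the original ordering
    return adjusted


p_bonf = np.minimum(p_raw * len(p_raw), 1.0)  # Bonferroni: multiply every p by m
p_holm = holm_adjust(p_raw)

for (i, j), raw, bonf, holm in zip(pairs, p_raw, p_bonf, p_holm):
    print(f"{labels[i]:>8} vs {labels[j]:<6} raw={raw:.3g} bonf={bonf:.3g} holm={holm:.3g}")
```

Holm multiplies only the smallest p-value by the full number of comparisons and uses progressively smaller multipliers afterwards, so its adjusted values are never larger than Bonferroni's.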

Cohen’s d interpretation:

0.2 = small
0.5 = medium
0.8 = large
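
Note that several standardized mean differences exist for paired data, and they can give noticeably different values. The sketch below, on simulated data (illustrative values), contrasts a pooled-SD Cohen's d with d_z, which divides by the standard deviation of the paired differences. Which variant a given library reports is worth confirming in its documentation.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 30
subject_effect = rng.normal(0, 8, n)  # shared by both measurements
baseline = 65 + subject_effect + rng.normal(0, 5, n)
week8 = 40 + subject_effect + rng.normal(0, 5, n)

mean_diff = baseline.mean() - week8.mean()

# Pooled-SD Cohen's d: mean difference over the pooled standard deviation.
pooled_sd = np.sqrt((baseline.var(ddof=1) + week8.var(ddof=1)) / 2)
d_pooled = mean_diff / pooled_sd

# d_z: mean difference over the SD of the paired differences.
d_z = mean_diff / (baseline - week8).std(ddof=1)

print(f"Cohen's d (pooled SD): {d_pooled:.2f}")
print(f"d_z (SD of differences): {d_z:.2f}")
```

The small/medium/large benchmarks were proposed for the pooled-SD form, so applying them to d_z can overstate an effect when the repeated measurements are strongly correlated.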

Visualizing Results

Clear visualization communicates your findings effectively. A line plot with error bars shows the trajectory across time points:

import matplotlib.pyplot as plt
import seaborn as sns

# Calculate means and approximate 95% confidence intervals
# (normal approximation; these are between-subject intervals)
summary = data.groupby('time')['anxiety'].agg(['mean', 'std', 'count'])
summary['se'] = summary['std'] / np.sqrt(summary['count'])
summary['ci95'] = 1.96 * summary['se']

# Reorder time points correctly
time_order = ['baseline', 'week2', 'week4', 'week8']
summary = summary.reindex(time_order)

# Create the plot
fig, ax = plt.subplots(figsize=(10, 6))

# Line plot with individual subject lines (light gray)
for subject in data['subject'].unique():
    subject_data = data[data['subject'] == subject]
    subject_data = subject_data.set_index('time').reindex(time_order)
    ax.plot(time_order, subject_data['anxiety'].values, 
            color='gray', alpha=0.2, linewidth=0.5)

# Mean line with error bars
ax.errorbar(
    x=time_order,
    y=summary['mean'],
    yerr=summary['ci95'],
    marker='o',
    markersize=10,
    linewidth=2,
    capsize=5,
    capthick=2,
    color='#2563eb',
    label='Mean ± 95% CI'
)

ax.set_xlabel('Time Point', fontsize=12)
ax.set_ylabel('Anxiety Score', fontsize=12)
ax.set_title('Anxiety Scores Across Treatment Period', fontsize=14)
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('rm_anova_results.png', dpi=150)
plt.show()

Best practices for repeated measures visualizations:

Show individual trajectories in light gray to reveal variability
Use error bars representing 95% confidence intervals or standard errors
Order conditions logically (chronologically for time series)
Connect the means with lines to emphasize the repeated nature
Consider violin plots or box plots for distributional information

For publication-quality figures, also consider adding significance brackets between time points that showed significant pairwise differences.
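
One caveat on the error bars: intervals computed from each condition's raw standard error include between-subject variability, which the repeated measures test removes. A common alternative is a within-subject interval based on the Cousineau normalization with Morey's correction. The sketch below is a minimal version on simulated data (illustrative values), not part of the plotting code above.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(7)
n, k = 30, 4
time_order = ['baseline', 'week2', 'week4', 'week8']
wide = pd.DataFrame(
    np.array([65, 55, 45, 40])
    + rng.normal(0, 8, size=(n, 1))
    + rng.normal(0, 5, size=(n, k)),
    columns=time_order,
)

t_crit = stats.t.ppf(0.975, df=n - 1)

# Conventional (between-subject) CI half-widths.
ci_between = t_crit * wide.std(ddof=1) / np.sqrt(n)

# Cousineau normalization: remove each subject's mean, add back the grand mean.
normalized = wide.sub(wide.mean(axis=1), axis=0) + wide.values.mean()
# Morey correction: rescale for the variance removed by the normalization.
morey = np.sqrt(k / (k - 1))
ci_within = t_crit * morey * normalized.std(ddof=1) / np.sqrt(n)

ci_table = pd.DataFrame({
    'mean': wide.mean(),
    'ci_between': ci_between,
    'ci_within': ci_within,
})
print(ci_table.round(2))
```

The condition means are unchanged by the normalization; only the interval widths differ, and the within-subject intervals are the ones that correspond to the comparisons the repeated measures ANOVA actually tests.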

Putting It All Together

Here’s a complete workflow you can adapt:

import pandas as pd
import pingouin as pg
import numpy as np

# 1. Load and prepare data (long format)
# 2. Check assumptions
spher_result = pg.sphericity(data, dv='anxiety', subject='subject', within='time')

# 3. Run ANOVA
results = pg.rm_anova(data=data, dv='anxiety', within='time', 
                       subject='subject', detailed=True)

# 4. Report corrected p-value if sphericity violated
if not spher_result[0]:
    p_value = results['p-GG-corr'].values[0]
    correction = "Greenhouse-Geisser"
else:
    p_value = results['p-unc'].values[0]
    correction = "none"

# 5. Post-hoc tests if significant
if p_value < 0.05:
    posthoc = pg.pairwise_tests(data=data, dv='anxiety', within='time',
                                 subject='subject', padjust='holm')

Repeated measures ANOVA is a powerful tool when your experimental design involves tracking the same subjects across conditions. Master the assumptions, apply corrections when needed, and always follow up significant omnibus tests with pairwise comparisons. The pingouin library handles the heavy lifting—your job is interpreting results correctly and communicating them clearly.