How to Perform ANOVA Using Pingouin in Python

Key Insights

  • Pingouin provides a pandas-native API that returns complete ANOVA results—including effect sizes—in a single DataFrame, eliminating the fragmented workflow common with scipy and statsmodels.
  • The library handles repeated measures ANOVA with automatic sphericity correction, a feature that typically requires manual implementation in other Python statistics packages.
  • Built-in assumption testing functions (pg.normality(), pg.homoscedasticity()) integrate directly into your analysis pipeline, making it trivial to validate your statistical approach before interpreting results.

Introduction to ANOVA and Pingouin

Analysis of Variance (ANOVA) remains one of the most widely used statistical methods for comparing means across multiple groups. Whether you’re analyzing experimental treatment effects, comparing performance across different conditions, or evaluating survey responses by demographic segments, ANOVA provides a principled approach to determining whether observed differences are statistically meaningful.

Python offers several options for ANOVA: scipy.stats provides basic functionality, statsmodels offers more comprehensive tools, and various specialized packages exist for specific use cases. Pingouin stands out for a simple reason—it was designed by researchers who were tired of wrestling with fragmented outputs and non-intuitive APIs.

Pingouin returns results as pandas DataFrames with all relevant statistics included by default. You get F-statistics, p-values, degrees of freedom, and effect sizes (eta-squared, omega-squared) in a single call. No more combining outputs from multiple functions or manually calculating effect sizes. The API reads like plain English, and the documentation is exceptional.

Installation and Setup

Install Pingouin via pip:

pip install pingouin

For this tutorial, we’ll work with both synthetic data and realistic examples. Here’s the standard setup:

import pingouin as pg
import pandas as pd
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Create sample dataset: exam scores across teaching methods
n_per_group = 30
data = pd.DataFrame({
    'student_id': range(n_per_group * 3),
    'method': ['traditional'] * n_per_group + 
              ['interactive'] * n_per_group + 
              ['online'] * n_per_group,
    'score': np.concatenate([
        np.random.normal(72, 10, n_per_group),  # traditional
        np.random.normal(78, 12, n_per_group),  # interactive
        np.random.normal(75, 11, n_per_group)   # online
    ])
})

print(data.groupby('method')['score'].describe())

This creates a dataset simulating exam scores from students taught using three different methods. The group means differ slightly, and we want to determine if those differences are statistically significant.

One-Way ANOVA

One-way ANOVA tests whether the means of three or more independent groups differ significantly. The null hypothesis states that all group means are equal.

# Perform one-way ANOVA
aov = pg.anova(data=data, dv='score', between='method', detailed=True)
print(aov)

The output DataFrame includes:

       Source        SS  DF     MS     F  p-unc   np2
0      method    547.2   2  273.6  2.31  0.105  0.05
1      Within  10312.4  87  118.5     -      -     -

Key columns to interpret:

  • F: The F-statistic (ratio of between-group variance to within-group variance)
  • p-unc: Uncorrected p-value (compare against your alpha, typically 0.05)
  • np2: Partial eta-squared, an effect size measure (0.01 = small, 0.06 = medium, 0.14 = large)

In this example, p = 0.105 exceeds our threshold, so we fail to reject the null hypothesis—the teaching methods don’t produce significantly different scores.
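As a sanity check, partial eta-squared for a one-way design can be recomputed by hand from the sums of squares in the table above:

```python
# Recompute partial eta-squared from the ANOVA table's sums of squares.
# For a one-way design, np2 = SS_effect / (SS_effect + SS_within).
ss_method = 547.2    # between-group sum of squares (from the table above)
ss_within = 10312.4  # within-group (error) sum of squares

np2 = ss_method / (ss_method + ss_within)
print(round(np2, 3))  # ~0.05, a small effect by the benchmarks listed above
```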

For a more concise output without the detailed breakdown:

# Simplified output
aov_simple = pg.anova(data=data, dv='score', between='method')
print(aov_simple[['Source', 'F', 'p-unc', 'np2']])

Two-Way and N-Way ANOVA

Real experiments often involve multiple factors. Two-way ANOVA examines the effects of two independent variables and their interaction.

# Extended dataset with experience level
np.random.seed(42)
data_2way = pd.DataFrame({
    'score': np.random.normal(75, 10, 180),
    'method': np.tile(['traditional', 'interactive', 'online'], 60),
    'experience': np.repeat(['novice', 'experienced'], 90)
})

# Add realistic effects
data_2way.loc[data_2way['method'] == 'interactive', 'score'] += 5
data_2way.loc[data_2way['experience'] == 'experienced', 'score'] += 8
# Interaction: experienced students benefit more from interactive method
data_2way.loc[(data_2way['method'] == 'interactive') & 
              (data_2way['experience'] == 'experienced'), 'score'] += 4

# Two-way ANOVA
aov_2way = pg.anova(data=data_2way, dv='score', between=['method', 'experience'])
print(aov_2way)

The output now includes rows for each main effect and the interaction:

#          Source          SS  DF        MS       F     p-unc    np2
# 0        method      1842.3   2    921.15   10.12    0.0001  0.104
# 1    experience      3156.7   1   3156.70   34.68    0.0000  0.166
# 2  method * experience   412.5   2    206.25    2.27    0.1067  0.025
# 3      Residual     15841.2 174     91.04     NaN       NaN    NaN

A significant interaction (p < 0.05) would indicate that the effect of teaching method depends on experience level. In this case, the interaction approaches but doesn’t reach significance.
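A quick way to see what an interaction pattern looks like is to tabulate the cell means directly. Here is a minimal sketch using a small hypothetical dataset with the same column names as data_2way (the numbers are illustrative, not the simulated values above):

```python
import pandas as pd

# Hypothetical scores for a method x experience design
df = pd.DataFrame({
    'method': ['traditional', 'interactive'] * 4,
    'experience': ['novice'] * 4 + ['experienced'] * 4,
    'score': [70, 75, 72, 77, 78, 90, 80, 92],
})

# A pivot of cell means makes the interaction visible: the
# interactive-vs-traditional gap is larger for experienced students
cell_means = df.pivot_table(values='score', index='experience',
                            columns='method', aggfunc='mean')
print(cell_means)
print(cell_means['interactive'] - cell_means['traditional'])
```

If the gap between methods were the same at both experience levels, the two difference values would be equal and there would be no interaction.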

Repeated Measures ANOVA

When the same subjects are measured under multiple conditions, you need repeated measures ANOVA to account for the correlation between measurements.

# Repeated measures data: same students tested at three time points
np.random.seed(42)
n_subjects = 25

rm_data = pd.DataFrame({
    'subject': np.tile(range(n_subjects), 3),
    'time': np.repeat(['baseline', 'week4', 'week8'], n_subjects),
    'performance': np.concatenate([
        np.random.normal(50, 8, n_subjects),      # baseline
        np.random.normal(58, 8, n_subjects),      # week 4
        np.random.normal(65, 8, n_subjects)       # week 8
    ])
})

# Repeated measures ANOVA
rm_aov = pg.rm_anova(data=rm_data, dv='performance', within='time', 
                     subject='subject', detailed=True)
print(rm_aov)

Pingouin automatically computes Mauchly’s test for sphericity and applies corrections when needed:

# Key output columns include:
# - sphericity: Boolean indicating whether the sphericity assumption holds
# - W-spher / p-spher: Mauchly's W statistic and its p-value
# - p-GG-corr: Greenhouse-Geisser corrected p-value

When sphericity is violated (Mauchly's test significant, i.e. W deviates meaningfully from 1), report the corrected p-value. The Greenhouse-Geisser correction is more conservative; the Huynh-Feldt correction is slightly more liberal.

Post-Hoc Tests and Pairwise Comparisons

A significant ANOVA tells you that at least one group differs, but not which groups differ from which. Post-hoc tests answer that question while controlling for multiple comparisons.

# Create data with clear group differences
np.random.seed(42)
posthoc_data = pd.DataFrame({
    'group': np.repeat(['A', 'B', 'C', 'D'], 25),
    'value': np.concatenate([
        np.random.normal(50, 5, 25),
        np.random.normal(55, 5, 25),
        np.random.normal(50, 5, 25),
        np.random.normal(62, 5, 25)
    ])
})

# First confirm ANOVA is significant
print(pg.anova(data=posthoc_data, dv='value', between='group'))

# Tukey's HSD - controls familywise error rate
tukey = pg.pairwise_tukey(data=posthoc_data, dv='value', between='group')
print(tukey[['A', 'B', 'mean(A)', 'mean(B)', 'diff', 'p-tukey']])

Tukey’s HSD is appropriate when comparing all pairs. For more flexibility, use pairwise_tests():

# Pairwise t-tests with multiple comparison correction
pairwise = pg.pairwise_tests(data=posthoc_data, dv='value', between='group',
                              padjust='bonf')  # Bonferroni correction
print(pairwise[['A', 'B', 'T', 'p-unc', 'p-corr', 'hedges']])

Available correction methods include 'bonf' (Bonferroni), 'holm', 'fdr_bh' (Benjamini-Hochberg), and others. The hedges column provides effect sizes for each comparison.

Assumptions and Diagnostics

ANOVA assumes normality within groups and homogeneity of variance across groups. Pingouin makes testing these assumptions straightforward.

# Test normality for each group
normality = pg.normality(data=posthoc_data, dv='value', group='group')
print(normality)

# Test homogeneity of variance (Levene's test)
homoscedasticity = pg.homoscedasticity(data=posthoc_data, dv='value', group='group')
print(homoscedasticity)

When assumptions are violated, consider alternatives:

# Welch's ANOVA - robust to unequal variances
welch = pg.welch_anova(data=posthoc_data, dv='value', between='group')
print(welch)

# Non-parametric alternative: Kruskal-Wallis
kruskal = pg.kruskal(data=posthoc_data, dv='value', between='group')
print(kruskal)

A practical workflow combines these checks:

def robust_anova(data, dv, between, alpha=0.05):
    """Perform ANOVA with automatic assumption checking."""
    
    # Check normality
    norm = pg.normality(data, dv=dv, group=between)
    normality_ok = norm['pval'].min() > alpha
    
    # Check homoscedasticity
    homo = pg.homoscedasticity(data, dv=dv, group=between)
    variance_ok = homo['pval'].values[0] > alpha
    
    if normality_ok and variance_ok:
        print("Assumptions met: using standard ANOVA")
        return pg.anova(data, dv=dv, between=between)
    elif normality_ok and not variance_ok:
        print("Unequal variances: using Welch's ANOVA")
        return pg.welch_anova(data, dv=dv, between=between)
    else:
        print("Non-normal data: using Kruskal-Wallis")
        return pg.kruskal(data, dv=dv, between=between)

result = robust_anova(posthoc_data, 'value', 'group')
print(result)

Pingouin transforms ANOVA from a multi-step process requiring careful coordination of different functions into a streamlined workflow. The pandas-native design means results integrate naturally with your data analysis pipeline, and the comprehensive output eliminates the need to manually compute effect sizes or hunt for degrees of freedom across multiple objects. For anyone doing statistical analysis in Python, it deserves serious consideration as your primary tool.
