How to Perform a Two-Way ANOVA in Python

Key Insights

  • Two-way ANOVA tests the effects of two categorical independent variables on a continuous outcome, revealing both main effects and interaction effects that one-way ANOVA cannot detect.
  • Always verify assumptions (normality, homogeneity of variance) before interpreting results—violated assumptions lead to unreliable conclusions and wasted analysis time.
  • Use statsmodels for detailed control over sum of squares types and model specification; use pingouin when you need quick, readable output with built-in effect sizes.

Introduction to Two-Way ANOVA

Two-way ANOVA extends the classic one-way ANOVA by allowing you to test the effects of two categorical independent variables (factors) on a continuous dependent variable simultaneously. More importantly, it reveals whether these factors interact—something you’d completely miss by running separate one-way ANOVAs.

Consider this scenario: you’re analyzing whether diet type (low-carb, low-fat, Mediterranean) and exercise frequency (none, moderate, intense) affect weight loss. One-way ANOVA could tell you if diet matters or if exercise matters, but it can’t answer the critical question: does the effect of diet depend on exercise level? That’s an interaction effect, and it’s often where the real insights hide.

Two-way ANOVA gives you three pieces of information:

  1. Main effect of Factor A: Does diet type affect weight loss (ignoring exercise)?
  2. Main effect of Factor B: Does exercise frequency affect weight loss (ignoring diet)?
  3. Interaction effect (A × B): Does the effect of diet depend on exercise level?

When the interaction is significant, you need to interpret main effects cautiously—the story becomes more nuanced than “diet works” or “exercise works.”

Prerequisites and Setup

You’ll need a handful of libraries. pandas and numpy handle data manipulation, scipy provides assumption tests, statsmodels gives you the full ANOVA machinery, pingouin offers a cleaner alternative with built-in effect sizes, and seaborn with matplotlib covers the plots.

import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import pingouin as pg
import seaborn as sns
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

# Create sample dataset: weight loss study
n_per_group = 20

data = {
    'diet': np.repeat(['low_carb', 'low_fat', 'mediterranean'], n_per_group * 3),
    'exercise': np.tile(np.repeat(['none', 'moderate', 'intense'], n_per_group), 3),
    'weight_loss': np.concatenate([
        # Low carb: works better with exercise
        np.random.normal(3, 1.5, n_per_group),   # none
        np.random.normal(7, 1.5, n_per_group),   # moderate
        np.random.normal(12, 1.5, n_per_group),  # intense
        # Low fat: moderate effect, less interaction
        np.random.normal(2, 1.5, n_per_group),   # none
        np.random.normal(5, 1.5, n_per_group),   # moderate
        np.random.normal(7, 1.5, n_per_group),   # intense
        # Mediterranean: consistent effect regardless of exercise
        np.random.normal(5, 1.5, n_per_group),   # none
        np.random.normal(6, 1.5, n_per_group),   # moderate
        np.random.normal(7, 1.5, n_per_group),   # intense
    ])
}

df = pd.DataFrame(data)
print(df.groupby(['diet', 'exercise'])['weight_loss'].describe().round(2))

This dataset has 180 observations (20 per cell in a 3×3 design). I’ve deliberately built in an interaction: low-carb diet shows dramatically better results with intense exercise, while Mediterranean diet performs consistently regardless of exercise level.
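Whether the design is balanced determines which sum-of-squares type is appropriate later, so it’s worth checking cell counts explicitly. A quick sketch, rebuilding only the factor columns from above:

```python
import numpy as np
import pandas as pd

n_per_group = 20
# Factor columns only, matching the layout of the dataset above
df = pd.DataFrame({
    'diet': np.repeat(['low_carb', 'low_fat', 'mediterranean'], n_per_group * 3),
    'exercise': np.tile(np.repeat(['none', 'moderate', 'intense'], n_per_group), 3),
})

# One row per diet, one column per exercise level, values are cell sizes
cell_counts = df.groupby(['diet', 'exercise']).size().unstack()
print(cell_counts)

# Balanced design: exactly one distinct cell size
print(cell_counts.stack().nunique() == 1)  # → True
```

If this prints False, skip ahead to the Type III sum-of-squares discussion before trusting the ANOVA table.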

Data Preparation and Assumptions

Two-way ANOVA assumes normality of residuals, homogeneity of variance across groups, and independence of observations. Independence comes from your study design—you can’t test it statistically. The other two you can and should verify.

# Check normality within each group using Shapiro-Wilk test
print("Normality tests (Shapiro-Wilk) by group:")
print("-" * 50)

for diet in df['diet'].unique():
    for exercise in df['exercise'].unique():
        subset = df[(df['diet'] == diet) & (df['exercise'] == exercise)]['weight_loss']
        stat, p = stats.shapiro(subset)
        status = "OK" if p > 0.05 else "VIOLATION"
        print(f"{diet:15} + {exercise:10}: p = {p:.4f} [{status}]")

# Levene's test for homogeneity of variance
# Create group labels for the test
df['group'] = df['diet'] + '_' + df['exercise']
groups = [group['weight_loss'].values for name, group in df.groupby('group')]

levene_stat, levene_p = stats.levene(*groups)
print(f"\nLevene's test for homogeneity of variance:")
print(f"Statistic = {levene_stat:.4f}, p = {levene_p:.4f}")
print(f"Assumption {'met' if levene_p > 0.05 else 'violated'}")

# Visual inspection with box plots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Box plot by diet
sns.boxplot(data=df, x='diet', y='weight_loss', ax=axes[0])
axes[0].set_title('Weight Loss by Diet Type')
axes[0].set_ylabel('Weight Loss (kg)')

# Box plot by exercise
sns.boxplot(data=df, x='exercise', y='weight_loss', ax=axes[1])
axes[1].set_title('Weight Loss by Exercise Level')
axes[1].set_ylabel('Weight Loss (kg)')

plt.tight_layout()
plt.savefig('assumption_boxplots.png', dpi=150)
plt.show()

If normality is violated in some groups but sample sizes are equal and reasonably large (n > 15 per cell), ANOVA is robust enough to proceed. For severe violations, consider a data transformation (a log transform often helps with right-skewed data) or a rank-based alternative such as the Scheirer–Ray–Hare test; the Kruskal–Wallis test only handles a single factor, so it can’t replace a two-way design on its own.
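As a sketch of the transformation route, here is a hypothetical right-skewed group (lognormal draws, not data from the study above) checked with Shapiro–Wilk before and after a log transform:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical right-skewed group that would trouble a normality check
skewed = rng.lognormal(mean=1.0, sigma=0.8, size=20)

# Shapiro-Wilk on the raw and on the log-transformed values
stat_raw, p_raw = stats.shapiro(skewed)
stat_log, p_log = stats.shapiro(np.log(skewed))

print(f"raw: p = {p_raw:.4f}")
print(f"log: p = {p_log:.4f}")
```

If you transform, remember to interpret the results (means, differences) on the transformed scale.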

Performing Two-Way ANOVA with statsmodels

The statsmodels library uses R-style formulas. The * operator includes both main effects and the interaction. Use + if you only want main effects (rarely what you want).

# Fit the two-way ANOVA model
# C() explicitly treats variables as categorical
model = ols('weight_loss ~ C(diet) * C(exercise)', data=df).fit()

# Type II sum of squares (recommended for balanced designs;
# note that anova_lm defaults to Type I, so pass typ=2 explicitly)
anova_table_type2 = anova_lm(model, typ=2)
print("Two-Way ANOVA Results (Type II SS)")
print("=" * 60)
print(anova_table_type2.round(4))

Output interpretation:

  • sum_sq: Sum of squares—variance explained by each factor
  • df: Degrees of freedom
  • F: F-statistic—ratio of between-group variance to within-group variance
  • PR(>F): p-value—probability of seeing this F-value if the null hypothesis were true

# Type III sum of squares (preferred for unbalanced designs)
# Type III is only meaningful with sum-to-zero contrasts, so refit
# with Sum coding instead of the default treatment coding
model_sum = ols('weight_loss ~ C(diet, Sum) * C(exercise, Sum)', data=df).fit()
anova_table_type3 = anova_lm(model_sum, typ=3)
print("\nTwo-Way ANOVA Results (Type III SS)")
print("=" * 60)
print(anova_table_type3.round(4))

Type II vs. Type III: Use Type II for balanced designs (equal sample sizes per cell). Use Type III for unbalanced designs or when you specifically care about each factor’s effect after controlling for all other factors. Type III is more conservative and is the default in most statistical software, including SPSS and SAS.

A significant interaction (p < 0.05 for C(diet):C(exercise)) means the effect of diet depends on exercise level. When this happens, interpret main effects with caution—the interaction tells the more complete story.
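When the interaction is significant, a common follow-up is a simple-effects analysis: test one factor separately at each level of the other. A sketch using scipy’s f_oneway, rebuilding the same simulated dataset so it runs on its own (Bonferroni-adjust the three p-values if you report them):

```python
import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(42)
n = 20
# Same simulated layout as the dataset above: 3 diets x 3 exercise levels
df = pd.DataFrame({
    'diet': np.repeat(['low_carb', 'low_fat', 'mediterranean'], n * 3),
    'exercise': np.tile(np.repeat(['none', 'moderate', 'intense'], n), 3),
    'weight_loss': np.concatenate(
        [np.random.normal(m, 1.5, n) for m in [3, 7, 12, 2, 5, 7, 5, 6, 7]]
    ),
})

# Simple effect of diet at each exercise level: one one-way ANOVA per slice
simple_effects = {}
for level in ['none', 'moderate', 'intense']:
    sub = df[df['exercise'] == level]
    groups = [g['weight_loss'].to_numpy() for _, g in sub.groupby('diet')]
    f_stat, p = stats.f_oneway(*groups)
    simple_effects[level] = p
    print(f"diet effect at exercise={level}: F = {f_stat:.2f}, p = {p:.3g}")
```

Here the diet effect is strongest at the intense exercise level, which is exactly what the built-in interaction predicts.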

Alternative: Using Pingouin for Simplified Analysis

Pingouin provides cleaner syntax and automatically calculates effect sizes, which statsmodels doesn’t include by default.

# Two-way ANOVA with pingouin
aov = pg.anova(data=df, 
               dv='weight_loss', 
               between=['diet', 'exercise'],
               detailed=True)

print("Two-Way ANOVA Results (Pingouin)")
print("=" * 70)
print(aov.round(4))

Pingouin’s output includes:

  • np2 (partial eta-squared): Effect size showing proportion of variance explained
  • η² interpretation: Small (0.01), Medium (0.06), Large (0.14)
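Those benchmarks can be wrapped in a small helper for labeling effect sizes in reports (the function name here is hypothetical, not part of pingouin):

```python
def eta_squared_label(es: float) -> str:
    """Map a (partial) eta-squared value to Cohen's rough benchmark label."""
    if es >= 0.14:
        return 'large'
    if es >= 0.06:
        return 'medium'
    if es >= 0.01:
        return 'small'
    return 'negligible'

print(eta_squared_label(0.20))  # → large
print(eta_squared_label(0.03))  # → small
```
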

# Request plain eta-squared instead of the default partial eta-squared
aov_eta = pg.anova(data=df,
                   dv='weight_loss',
                   between=['diet', 'exercise'],
                   effsize='n2')  # effsize must be 'np2' (default) or 'n2'

print("\nEffect sizes (eta-squared):")
for idx, row in aov_eta.iterrows():
    source = row['Source']
    if source != 'Residual':
        print(f"  {source}: n2 = {row['n2']:.4f}")

Pingouin is my recommendation for exploratory analysis and reports where you need effect sizes. Use statsmodels when you need more control over model specification or Type III sums of squares.

Post-Hoc Analysis and Interpretation

A significant main effect tells you that at least one group differs from another—but not which groups differ. Post-hoc tests answer that question while controlling for multiple comparisons.

# Tukey HSD for diet main effect
print("Post-hoc comparisons for Diet (Tukey HSD)")
print("=" * 60)

tukey_diet = pairwise_tukeyhsd(df['weight_loss'], df['diet'], alpha=0.05)
print(tukey_diet)

# Tukey HSD for exercise main effect
print("\nPost-hoc comparisons for Exercise (Tukey HSD)")
print("=" * 60)

tukey_exercise = pairwise_tukeyhsd(df['weight_loss'], df['exercise'], alpha=0.05)
print(tukey_exercise)

# For interaction effects, compare all cell combinations
print("\nPost-hoc comparisons for all group combinations")
print("=" * 60)

tukey_all = pairwise_tukeyhsd(df['weight_loss'], df['group'], alpha=0.05)

# Filter to show only significant differences
# Convert the results to a DataFrame (summary() is the public API)
summary_data = tukey_all.summary().data
tukey_df = pd.DataFrame(summary_data[1:], columns=summary_data[0])
significant = tukey_df[tukey_df['reject'] == True]
print(f"\nSignificant pairwise differences ({len(significant)} of {len(tukey_df)}):")
print(significant.to_string(index=False))

The reject column tells you whether to reject the null hypothesis of no difference between those two groups. The meandiff column shows the actual difference in means.

Visualizing Results

Interaction plots are essential for understanding two-way ANOVA results. Non-parallel lines indicate an interaction.

# Interaction plot
from statsmodels.graphics.factorplots import interaction_plot

fig, ax = plt.subplots(figsize=(10, 6))

interaction_plot(x=df['exercise'], 
                 trace=df['diet'],
                 response=df['weight_loss'],
                 colors=['#e74c3c', '#3498db', '#2ecc71'],
                 markers=['o', 's', '^'],
                 ms=10,
                 ax=ax)

ax.set_xlabel('Exercise Level', fontsize=12)
ax.set_ylabel('Mean Weight Loss (kg)', fontsize=12)
ax.set_title('Interaction Plot: Diet × Exercise', fontsize=14)
ax.legend(title='Diet Type', loc='upper left')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('interaction_plot.png', dpi=150)
plt.show()

# Grouped box plot showing all combinations
fig, ax = plt.subplots(figsize=(12, 6))

sns.boxplot(data=df, x='exercise', y='weight_loss', hue='diet',
            palette='Set2', ax=ax)

ax.set_xlabel('Exercise Level', fontsize=12)
ax.set_ylabel('Weight Loss (kg)', fontsize=12)
ax.set_title('Weight Loss by Diet and Exercise Level', fontsize=14)
ax.legend(title='Diet Type')

plt.tight_layout()
plt.savefig('grouped_boxplot.png', dpi=150)
plt.show()

In our example, you’ll see that low-carb diet lines slope steeply upward with exercise intensity, while Mediterranean diet remains relatively flat. That’s the interaction visualized—and it’s far more informative than any p-value alone.
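The lines in the interaction plot are just cell means, so tabulating them makes the slopes concrete. A sketch, rebuilding the same simulated dataset so it runs standalone:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
n = 20
# Same simulated layout as the dataset above
df = pd.DataFrame({
    'diet': np.repeat(['low_carb', 'low_fat', 'mediterranean'], n * 3),
    'exercise': np.tile(np.repeat(['none', 'moderate', 'intense'], n), 3),
    'weight_loss': np.concatenate(
        [np.random.normal(m, 1.5, n) for m in [3, 7, 12, 2, 5, 7, 5, 6, 7]]
    ),
})

# Cell means behind the interaction plot: each row is one trace
means = (df.groupby(['diet', 'exercise'])['weight_loss']
           .mean()
           .unstack()[['none', 'moderate', 'intense']]
           .round(2))
print(means)

# Range of each diet's means across exercise levels: the slope of its trace
print((means.max(axis=1) - means.min(axis=1)).round(2))
```

The low-carb row spans roughly nine kilograms across exercise levels while the Mediterranean row spans about two, which is the interaction in numeric form.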

Two-way ANOVA is a powerful tool, but remember: statistical significance doesn’t equal practical significance. Always report effect sizes, visualize your results, and interpret findings in the context of your domain. A statistically significant 0.5 kg difference in weight loss might be meaningless in practice, while a non-significant 3 kg difference with high variance might warrant further investigation with a larger sample.
