ANOVA in R: Step-by-Step Guide

Key Insights

ANOVA tests whether group means differ significantly, but always verify assumptions first—normality violations matter less with larger samples, but heterogeneous variances will invalidate your results
A significant ANOVA p-value only tells you that some groups differ; you must run post-hoc tests like Tukey HSD to identify which specific pairs are different
Effect size (eta-squared) matters more than p-values for practical interpretation—a statistically significant result with η² < 0.01 rarely has real-world importance

Introduction to ANOVA

Analysis of Variance (ANOVA) answers a straightforward question: do the means of three or more groups differ significantly? While a t-test compares two groups, ANOVA handles multiple groups without inflating your Type I error rate from running multiple pairwise comparisons.

The technique works by partitioning total variance into between-group variance (differences due to group membership) and within-group variance (random variation within groups). When between-group variance substantially exceeds within-group variance, you have evidence that group membership matters.

ANOVA requires three assumptions:

Independence: Observations are independent of each other
Normality: The residuals follow a normal distribution
Homogeneity of variance: All groups have equal variances (homoscedasticity)

One-way ANOVA examines one factor (e.g., treatment type), while two-way ANOVA examines two factors and their potential interaction (e.g., treatment type and dosage level). We’ll cover both, starting with the simpler one-way case.

Setting Up Your Data

ANOVA requires data in long format—one row per observation with a column identifying group membership. R’s built-in datasets work well for learning. Let’s use ToothGrowth, which contains guinea pig tooth length measurements across different vitamin C delivery methods and doses.

# Load and inspect the data
data(ToothGrowth)

# Examine structure
str(ToothGrowth)
# 'data.frame':	60 obs. of  3 variables:
#  $ len : num  4.2 11.5 7.3 5.8 6.4 ...
#  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 ...
#  $ dose: num  0.5 0.5 0.5 0.5 0.5 ...

# Summary statistics
summary(ToothGrowth)

# Convert dose to factor for ANOVA
ToothGrowth$dose <- factor(ToothGrowth$dose)

# Check group sizes
table(ToothGrowth$dose)
# 0.5   1   2 
#  20  20  20

Balanced designs (equal group sizes) are more robust to assumption violations. With 20 observations per dose level, we have a balanced design. Always verify your data structure before proceeding—ANOVA on continuous variables treated as factors produces meaningless results.

Checking ANOVA Assumptions

Never skip assumption checking. Violations can invalidate your conclusions entirely.

Normality Testing

The Shapiro-Wilk test assesses whether data deviates significantly from normality. Test residuals, not raw data:

# Fit the model first
model <- aov(len ~ dose, data = ToothGrowth)

# Test normality of residuals
shapiro.test(residuals(model))
# Shapiro-Wilk normality test
# W = 0.98463, p-value = 0.6694

A p-value above 0.05 suggests no significant departure from normality. Visual inspection with Q-Q plots provides additional insight:

library(ggplot2)

# Q-Q plot of residuals
ggplot(data.frame(resid = residuals(model)), aes(sample = resid)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  labs(title = "Q-Q Plot of ANOVA Residuals",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_minimal()

Points following the diagonal line indicate normality. Minor deviations at the tails are acceptable, especially with sample sizes above 30 per group.

Homogeneity of Variance

Levene’s test from the car package tests whether group variances are equal:

library(car)

leveneTest(len ~ dose, data = ToothGrowth)
# Levene's Test for Homogeneity of Variance
#       Df F value Pr(>F)
# group  2  0.6461 0.5281
#       57

A non-significant result (p > 0.05) indicates homogeneous variances. Boxplots provide visual confirmation:

ggplot(ToothGrowth, aes(x = dose, y = len, fill = dose)) +
  geom_boxplot() +
  labs(title = "Tooth Length by Vitamin C Dose",
       x = "Dose (mg/day)",
       y = "Tooth Length") +
  theme_minimal() +
  theme(legend.position = "none")

Similar box heights indicate similar variances. If Levene’s test is significant, consider Welch’s ANOVA (oneway.test()) which doesn’t assume equal variances.

Running One-Way ANOVA

With assumptions verified, run the ANOVA using aov():

model <- aov(len ~ dose, data = ToothGrowth)
summary(model)
#             Df Sum Sq Mean Sq F value   Pr(>F)    
# dose         2   2426    1213   67.42 9.53e-16 ***
# Residuals   57   1026      18                     
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpreting the Output

The ANOVA table contains critical information:

Df (Degrees of Freedom): For the factor, df = number of groups - 1. For residuals, df = total observations - number of groups.
Sum Sq (Sum of Squares): Between-group SS (2426) represents variance explained by dose; residual SS (1026) represents unexplained variance.
Mean Sq (Mean Square): Sum of squares divided by degrees of freedom.
F value: Ratio of between-group mean square to within-group mean square. Larger values indicate greater group differences.
Pr(>F): The p-value. Here, 9.53e-16 is far below 0.05.

The interpretation: Vitamin C dose significantly affects tooth length (F(2, 57) = 67.42, p < 0.001). But which doses differ from each other? ANOVA doesn’t tell you—that requires post-hoc testing.

Post-Hoc Testing

A significant ANOVA p-value indicates that at least one group mean differs from the others. To identify specific differences, use Tukey’s Honest Significant Difference (HSD) test, which controls for multiple comparisons:

tukey_result <- TukeyHSD(model)
print(tukey_result)
#   Tukey multiple comparisons of means
#     95% family-wise confidence level
# 
# Fit: aov(formula = len ~ dose, data = ToothGrowth)
# 
# $dose
#         diff       lwr       upr     p adj
# 1-0.5  9.130  5.901805 12.358195 0.0000000
# 2-0.5 15.495 12.266805 18.723195 0.0000000
# 2-1    6.365  3.136805  9.593195 0.0000425

Each row compares two groups:

diff: Mean difference between groups
lwr/upr: 95% confidence interval bounds
p adj: Adjusted p-value (controlling for multiple comparisons)

All three pairwise comparisons are significant (p adj < 0.05). The 2mg dose produces teeth 15.5 units longer than 0.5mg, and 6.4 units longer than 1mg.

Visualize these differences:

plot(tukey_result, las = 1)

Confidence intervals not crossing zero indicate significant differences. This plot immediately shows all dose levels differ significantly from each other.

Two-Way ANOVA

When you have two factors, two-way ANOVA examines main effects and interactions. The ToothGrowth dataset includes supplement type (OJ vs VC), making it perfect for this extension:

# Two-way ANOVA with interaction
model_2way <- aov(len ~ supp * dose, data = ToothGrowth)
summary(model_2way)
#             Df Sum Sq Mean Sq F value   Pr(>F)    
# supp         1  205.4   205.4  15.572 0.000231 ***
# dose         2 2426.4  1213.2  92.000 < 2e-16 ***
# supp:dose    2  108.3    54.2   4.107 0.021860 *  
# Residuals   54  712.1    13.2

Interpreting Interaction Effects

The supp:dose interaction term is significant (p = 0.022), meaning the effect of dose depends on supplement type. When interactions are significant, interpret main effects cautiously—they represent average effects that may not hold across all levels of the other factor.

Visualize the interaction:

ggplot(ToothGrowth, aes(x = dose, y = len, color = supp, group = supp)) +
  stat_summary(fun = mean, geom = "point", size = 3) +
  stat_summary(fun = mean, geom = "line") +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.1) +
  labs(title = "Interaction Between Dose and Supplement Type",
       x = "Dose (mg/day)",
       y = "Mean Tooth Length",
       color = "Supplement") +
  theme_minimal()

Non-parallel lines indicate interaction. Here, orange juice (OJ) shows stronger effects at lower doses, but the supplements converge at the 2mg dose.

Reporting Results

Academic and professional reports require specific formatting. Here’s how to write up your ANOVA results:

Calculating Effect Size

P-values indicate statistical significance, but effect size indicates practical importance. Eta-squared (η²) represents the proportion of variance explained by the factor:

# Manual calculation
ss_effect <- summary(model)[[1]]["dose", "Sum Sq"]
ss_total <- sum(summary(model)[[1]][, "Sum Sq"])
eta_squared <- ss_effect / ss_total

print(paste("Eta-squared:", round(eta_squared, 3)))
# [1] "Eta-squared: 0.703"

# Using the effectsize package (recommended)
library(effectsize)
eta_squared(model)
# Parameter | Eta2 |       95% CI
# -------------------------------
# dose      | 0.70 | [0.57, 0.79]

Interpretation guidelines for η²:

Small: 0.01
Medium: 0.06
Large: 0.14

Our η² of 0.70 is very large—dose explains 70% of variance in tooth length.

Writing the Results

A complete ANOVA report includes:

A one-way ANOVA revealed a statistically significant effect of vitamin C dose on tooth length, F(2, 57) = 67.42, p < .001, η² = .70. Post-hoc comparisons using Tukey’s HSD indicated that all three dose levels differed significantly from each other (all p < .001). Mean tooth length increased from 10.6 (SD = 4.5) at 0.5mg to 19.7 (SD = 4.4) at 1mg to 26.1 (SD = 3.8) at 2mg.

For two-way ANOVA, report main effects and interactions:

A two-way ANOVA examined the effects of supplement type and dose on tooth length. There was a significant main effect of dose, F(2, 54) = 92.00, p < .001, and supplement type, F(1, 54) = 15.57, p < .001. The interaction between dose and supplement was also significant, F(2, 54) = 4.11, p = .022, indicating that the effect of dose varied by supplement type.

Practical Recommendations

After running hundreds of ANOVAs, here’s my advice:

Always visualize first. Boxplots reveal outliers, skewness, and variance differences faster than any statistical test. I’ve caught data entry errors this way that would have invalidated entire analyses.

Don’t obsess over normality. ANOVA is robust to normality violations with sample sizes above 20-30 per group. Heterogeneous variances are the bigger threat—if Levene’s test fails, use oneway.test(len ~ dose, data = ToothGrowth, var.equal = FALSE) for Welch’s ANOVA.

Report effect sizes. A p-value of 0.001 with η² = 0.02 means you detected a real but trivial effect. Conversely, p = 0.06 with η² = 0.15 might indicate a meaningful effect that your sample couldn’t reliably detect.

Consider alternatives. When assumptions fail badly, Kruskal-Wallis (kruskal.test()) provides a non-parametric alternative that doesn’t assume normality or equal variances.

ANOVA remains a workhorse of statistical analysis. Master these fundamentals, and you’ll handle the vast majority of group comparison problems you encounter.