How to Perform a Repeated Measures ANOVA in R

Key Insights

  • Repeated measures ANOVA requires your data in long format with one row per observation, and the rstatix package provides the cleanest interface for running the analysis in R
  • Sphericity violations are common and consequential—always check Mauchly’s test and apply Greenhouse-Geisser corrections when epsilon falls below 0.75
  • Post-hoc pairwise comparisons must use paired t-tests with p-value adjustments; forgetting the paired = TRUE argument is one of the most common mistakes

Introduction to Repeated Measures ANOVA

Repeated measures ANOVA is your go-to analysis when you’ve measured the same subjects multiple times under different conditions or across time points. Unlike between-subjects ANOVA, which compares different groups of people, repeated measures designs use each participant as their own control. This approach dramatically increases statistical power because you eliminate individual differences as a source of variance.

Common use cases include measuring reaction times across three difficulty levels, tracking patient outcomes before, during, and after treatment, or comparing user preferences across multiple product designs. Any time your research question involves “How does this outcome change across conditions within the same individuals?”, repeated measures ANOVA is likely appropriate.

The analysis comes with three key assumptions. First, sphericity—the variances of differences between all pairs of conditions should be equal. This is the repeated measures equivalent of homogeneity of variance, and violations are common. Second, the residuals should be approximately normally distributed. Third, you shouldn’t have significant outliers that could distort your results. We’ll address how to check and handle each of these.

Preparing Your Data

Repeated measures ANOVA in R requires data in long format: one row per observation, with a column identifying the subject and another identifying the condition. Most researchers collect data in wide format (one row per subject, conditions as columns), so reshaping is usually necessary.

Let’s create a realistic example dataset. Imagine we’re studying how cognitive load affects response time. We measured 20 participants’ reaction times (in milliseconds) under low, medium, and high cognitive load conditions:

library(tidyverse)
library(rstatix)

# Create sample data in wide format (as you might receive it)
set.seed(42)
n_subjects <- 20

data_wide <- tibble(
  subject_id = factor(1:n_subjects),
  low_load = rnorm(n_subjects, mean = 450, sd = 50),
  medium_load = rnorm(n_subjects, mean = 520, sd = 55),
  high_load = rnorm(n_subjects, mean = 610, sd = 60)
)

# Convert to long format for analysis
data_long <- data_wide %>%
  pivot_longer(
    cols = c(low_load, medium_load, high_load),
    names_to = "condition",
    values_to = "reaction_time"
  ) %>%
  mutate(
    condition = factor(condition, 
                       levels = c("low_load", "medium_load", "high_load"))
  )

head(data_long, 9)

The pivot_longer() function transforms your wide data into the required structure. The names_to argument specifies what to call the new column containing your condition labels, and values_to names the column for your outcome variable. Setting the factor levels explicitly ensures your conditions appear in a logical order in output tables.

Checking Assumptions

Before running the ANOVA, verify your assumptions. Start with outlier detection using the identify_outliers() function from rstatix:

# Check for extreme outliers by condition
data_long %>%
  group_by(condition) %>%
  identify_outliers(reaction_time)

Values flagged as extreme outliers (beyond 3 × IQR) warrant investigation. Consider whether they represent data entry errors or genuine extreme responses.
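If an extreme value turns out to be a genuine data problem, one defensible option is to exclude that subject entirely; dropping a single observation would leave the repeated measures design unbalanced. A minimal sketch, continuing from the data_long object created above:

```r
library(tidyverse)
library(rstatix)

# Collect subjects with at least one extreme outlier in any condition
extreme <- data_long %>%
  group_by(condition) %>%
  identify_outliers(reaction_time) %>%
  filter(is.extreme)

# Exclude those subjects entirely, keeping the design balanced
data_clean <- data_long %>%
  filter(!subject_id %in% extreme$subject_id)
```

Document any exclusions and the rule you applied; rerunning the analysis with and without the flagged subjects is a useful sensitivity check.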

Next, check normality. For repeated measures, we care about normality of residuals within each condition:

# Shapiro-Wilk test by condition
data_long %>%
  group_by(condition) %>%
  shapiro_test(reaction_time)

# Visual inspection with QQ plots
ggplot(data_long, aes(sample = reaction_time)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~condition) +
  theme_minimal() +
  labs(title = "QQ Plots by Condition")

Shapiro-Wilk p-values above 0.05 suggest normality isn’t violated. However, with sample sizes above 30, the visual QQ plot becomes more informative than the formal test, which can flag trivial deviations as significant.

Sphericity gets tested as part of the ANOVA procedure itself, which we’ll cover next.

Running the Analysis with rstatix

The rstatix package provides a clean, pipe-friendly interface for repeated measures ANOVA. The anova_test() function handles the analysis with intuitive syntax:

# Run repeated measures ANOVA
rm_anova <- data_long %>%
  anova_test(
    dv = reaction_time,          # Dependent variable
    wid = subject_id,            # Subject identifier
    within = condition           # Within-subjects factor
  )

# View results
rm_anova

The output includes several components. The ANOVA table shows the F-statistic, degrees of freedom, p-value, and effect size (generalized eta-squared, ges). You’ll also see Mauchly’s test for sphericity and, if sphericity is violated, the corrected results.

Here’s how to interpret the key values:

# Extract the main ANOVA table
get_anova_table(rm_anova)

The ges (generalized eta-squared) indicates effect size: values around 0.01 are small, 0.06 are medium, and 0.14 or higher are large. A significant p-value tells you that at least one condition differs from another, but not which specific conditions differ—that requires post-hoc tests.

Handling Sphericity Violations

Sphericity is the assumption that variances of differences between all condition pairs are equal. Violations inflate Type I error rates, making you more likely to find false positives.
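Concretely, the quantities being compared are the variances of the pairwise difference scores. You can compute them by hand from the wide-format data created earlier to see what Mauchly's test is evaluating:

```r
library(tidyverse)

# Variance of the difference scores for each pair of conditions.
# Sphericity holds when these three variances are roughly equal.
data_wide %>%
  summarise(
    var_low_medium  = var(low_load - medium_load),
    var_low_high    = var(low_load - high_load),
    var_medium_high = var(medium_load - high_load)
  )
```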

Mauchly’s test appears automatically in the anova_test() output. A significant result (p < 0.05) indicates sphericity violation. When this happens, apply corrections:

# Get sphericity-corrected results
rm_anova$`Mauchly's Test for Sphericity`
rm_anova$`Sphericity Corrections`

Two corrections are standard. Greenhouse-Geisser (GG) is more conservative and appropriate when the epsilon value is below 0.75. Huynh-Feldt (HF) is less conservative and preferred when epsilon exceeds 0.75. The epsilon value quantifies how severely sphericity is violated (1.0 means perfect sphericity).

# Extract corrected p-values programmatically
corrections <- rm_anova$`Sphericity Corrections`

# Use GG correction if epsilon < 0.75, otherwise HF
epsilon <- corrections$GGe
corrected_p <- if(epsilon < 0.75) {
  corrections$`p[GG]`
} else {
  corrections$`p[HF]`
}

cat("Epsilon:", round(epsilon, 3), "\n")
cat("Corrected p-value:", round(corrected_p, 4), "\n")

Many researchers default to Greenhouse-Geisser regardless of epsilon value—this is defensible since it’s the more conservative approach.
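rstatix makes that default easy to adopt: get_anova_table() takes a correction argument, so you can request Greenhouse-Geisser unconditionally instead of extracting corrected p-values by hand as above:

```r
library(rstatix)

# "auto" (the default) applies a correction only when Mauchly's test
# is significant; "GG" applies Greenhouse-Geisser unconditionally
get_anova_table(rm_anova, correction = "GG")
```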

Post-Hoc Pairwise Comparisons

A significant omnibus ANOVA tells you conditions differ somewhere, but you need post-hoc tests to identify specific differences. Use paired t-tests with p-value adjustment:

# Pairwise comparisons with Bonferroni correction
posthoc <- data_long %>%
  pairwise_t_test(
    reaction_time ~ condition,
    paired = TRUE,
    p.adjust.method = "bonferroni"
  )

posthoc

The paired = TRUE argument is critical—omitting it runs independent samples t-tests, which is statistically inappropriate for repeated measures data and will give you wrong results.

Bonferroni correction is the most conservative adjustment, multiplying p-values by the number of comparisons. Holm’s method offers more power while still controlling family-wise error:

# Holm correction (more powerful than Bonferroni)
posthoc_holm <- data_long %>%
  pairwise_t_test(
    reaction_time ~ condition,
    paired = TRUE,
    p.adjust.method = "holm"
  )

posthoc_holm

For three conditions, you get three pairwise comparisons. The adjusted p-values tell you which specific pairs differ significantly after accounting for multiple testing.
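Both adjustment methods come from base R's p.adjust(), so you can see the difference directly. A toy example with three raw p-values (illustrative numbers, not from the dataset above):

```r
raw_p <- c(0.001, 0.020, 0.040)

# Bonferroni multiplies every p-value by the number of comparisons (capped at 1)
p.adjust(raw_p, method = "bonferroni")  # 0.003 0.060 0.120

# Holm multiplies the smallest by 3, the next by 2, the largest by 1,
# then enforces monotonicity - uniformly at least as powerful as Bonferroni
p.adjust(raw_p, method = "holm")        # 0.003 0.040 0.040
```

With Holm, the second and third comparisons here stay below .05, whereas Bonferroni would declare them non-significant; that extra power is why Holm is often the better default.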

Reporting Results

APA style requires reporting the F-statistic, degrees of freedom, p-value, and effect size. Here’s a template for writing up your results:

# Create a reporting function
report_rm_anova <- function(anova_result) {
  tbl <- get_anova_table(anova_result)
  
  # Use %g for the degrees of freedom: after a sphericity correction
  # they can be fractional (e.g., 1.56, 29.71), and sprintf's %d
  # errors on non-integer values
  sprintf(
    "A repeated measures ANOVA revealed a %s effect of condition on reaction time, F(%g, %g) = %.2f, p %s, ηG² = %.3f.",
    ifelse(tbl$p < 0.05, "significant", "non-significant"),
    tbl$DFn,
    tbl$DFd,
    tbl$F,
    ifelse(tbl$p < 0.001, "< .001", sprintf("= %.3f", tbl$p)),
    tbl$ges
  )
}

report_rm_anova(rm_anova)

For publication-ready tables, combine your ANOVA and post-hoc results:

# Create summary statistics table
summary_table <- data_long %>%
  group_by(condition) %>%
  get_summary_stats(reaction_time, type = "mean_sd") %>%
  select(condition, n, mean, sd)

# Format for publication
summary_table %>%
  mutate(
    mean = round(mean, 2),
    sd = round(sd, 2),
    `M (SD)` = sprintf("%.2f (%.2f)", mean, sd)
  ) %>%
  select(Condition = condition, N = n, `M (SD)`)

When sphericity was violated, explicitly state which correction you applied: “Mauchly’s test indicated that sphericity was violated (W = 0.72, p = .023); therefore, Greenhouse-Geisser corrected values are reported (ε = 0.78).”

Include post-hoc results by stating which pairs differed: “Post-hoc pairwise comparisons with Bonferroni correction revealed significant differences between low and high load conditions (p < .001) and between medium and high load conditions (p = .012), but not between low and medium load conditions (p = .089).”

The combination of omnibus test, effect size, assumption checks, and targeted post-hoc comparisons gives readers a complete picture of your repeated measures analysis.
