How to Perform a Paired T-Test in R

Key Insights

  • Paired t-tests compare two related measurements from the same subjects, making them more powerful than independent t-tests when your study design involves before/after measurements or matched pairs.
  • Always verify the normality assumption by testing the differences between paired observations, not the raw data itself—this is a common mistake that leads to incorrect conclusions.
  • Statistical significance alone isn’t enough; calculate Cohen’s d to report effect size and help readers understand the practical importance of your findings.

Introduction to Paired T-Tests

The paired t-test answers a straightforward question: did something change between two related measurements? You’ll reach for this test when analyzing before/after data, comparing two treatments on the same subjects, or examining matched pairs in an experiment.

Unlike the independent t-test, which compares means from two separate groups, the paired t-test leverages the relationship between observations. Each subject serves as their own control, which dramatically reduces variability and increases statistical power. This means you can detect smaller effects with fewer participants.

Common use cases include:

  • Measuring blood pressure before and after medication
  • Comparing test scores before and after training
  • Evaluating user task completion times with two different interfaces
  • Analyzing paired biological samples (left eye vs. right eye)

The key distinction is dependency. If removing one observation would logically require removing another, you have paired data.
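The power advantage of pairing is easy to demonstrate. Below is a small simulation sketch (all numbers are illustrative): each subject has a stable baseline, so the two measurements are correlated within subject. The paired test cancels that baseline noise; the independent test has to fight through it.

```r
# Sketch: why pairing helps (illustrative simulated data)
set.seed(1)
n <- 20
baseline <- rnorm(n, mean = 350, sd = 40)      # between-subject variability
before   <- baseline + rnorm(n, sd = 10)
after    <- baseline - 15 + rnorm(n, sd = 10)  # true effect: -15 ms

# Paired test works on within-subject differences; unpaired ignores the link
p_paired   <- t.test(before, after, paired = TRUE)$p.value
p_unpaired <- t.test(before, after, paired = FALSE)$p.value
c(paired = p_paired, unpaired = p_unpaired)
```

Compare the two p-values: the paired test picks up the 15 ms effect easily, while the independent test is far less likely to, because the large between-subject variability swamps it.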

Assumptions and Prerequisites

Before running a paired t-test, verify these assumptions:

  1. Paired observations: Each subject has exactly two measurements that are meaningfully connected.
  2. Continuous data: Your dependent variable must be measured on an interval or ratio scale.
  3. Normally distributed differences: The differences between pairs should follow a normal distribution (not the raw scores themselves).
  4. No significant outliers: Extreme values in the differences can distort results.

For this tutorial, base R's t.test() handles the test itself. The tidyverse package helps with data manipulation, and effsize simplifies effect size calculations.

# Load packages
library(tidyverse)
library(effsize)

# Create sample dataset: reaction times before and after caffeine
# (For simplicity, the two columns are drawn independently here;
# real paired data would be correlated within subject.)
set.seed(42)
n <- 30

reaction_data <- tibble(
  subject_id = 1:n,
  before = rnorm(n, mean = 350, sd = 45),
  after = rnorm(n, mean = 320, sd = 40)
)

# Examine the data structure
str(reaction_data)
tibble [30 × 3] (S3: tbl_df/tbl/data.frame)
 $ subject_id: int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
 $ before    : num [1:30] 412 325 396 378 370 ...
 $ after     : num [1:30] 295 324 335 340 291 ...
# Preview the data
head(reaction_data)
# A tibble: 6 × 3
  subject_id before after
       <int>  <dbl> <dbl>
1          1   412.  295.
2          2   325.  324.
3          3   396.  335.
4          4   378.  340.
5          5   370.  291.
6          6   340.  347.

Preparing Your Data

Paired t-tests in R work with data in either wide or long format. Wide format (shown above) has one row per subject with separate columns for each measurement. Long format has multiple rows per subject with a grouping variable.

Many datasets arrive in long format, especially from databases or survey tools. Here’s how to handle both scenarios:

# Convert wide to long format
reaction_long <- reaction_data %>%
  pivot_longer(
    cols = c(before, after),
    names_to = "time_point",
    values_to = "reaction_time"
  )

head(reaction_long)
# A tibble: 6 × 3
  subject_id time_point reaction_time
       <int> <chr>              <dbl>
1          1 before              412.
2          1 after               295.
3          2 before              325.
4          2 after               324.
5          3 before              396.
6          3 after               335.
# Convert long back to wide format
reaction_wide <- reaction_long %>%
  pivot_wider(
    names_from = time_point,
    values_from = reaction_time
  )

Missing values require careful handling. A paired t-test needs complete pairs—if one measurement is missing, you must exclude the entire subject:

# Simulate missing data
reaction_missing <- reaction_data
reaction_missing$after[c(5, 12, 23)] <- NA

# Remove incomplete pairs
reaction_complete <- reaction_missing %>%
  filter(!is.na(before) & !is.na(after))

cat("Original observations:", nrow(reaction_missing), "\n")
cat("Complete pairs:", nrow(reaction_complete), "\n")
Original observations: 30 
Complete pairs: 27 

Don’t impute missing values for paired t-tests. The statistical validity depends on having genuine paired observations.
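As a side note, tidyr's drop_na() performs the same complete-case filtering as the filter() call above. A minimal sketch with hypothetical toy data:

```r
# drop_na() removes rows with NA in the named columns (toy data)
pairs <- data.frame(before = c(350, 340, 360), after = c(320, NA, 330))
complete <- tidyr::drop_na(pairs, before, after)
nrow(complete)  # 2
```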

Checking Assumptions

The normality assumption applies to the differences between paired observations, not to each variable separately. This is a critical distinction that many analysts miss.

# Calculate differences
reaction_data <- reaction_data %>%
  mutate(difference = after - before)

# Shapiro-Wilk test for normality
shapiro_result <- shapiro.test(reaction_data$difference)
print(shapiro_result)
	Shapiro-Wilk normality test

data:  reaction_data$difference
W = 0.97629, p-value = 0.7218

A p-value above 0.05 suggests the differences don’t significantly deviate from normality. However, don’t rely solely on this test—visual inspection matters:

# Create diagnostic plots
par(mfrow = c(1, 2))

# Q-Q plot
qqnorm(reaction_data$difference, main = "Q-Q Plot of Differences")
qqline(reaction_data$difference, col = "red", lwd = 2)

# Boxplot to check for outliers
boxplot(reaction_data$difference, 
        main = "Boxplot of Differences",
        ylab = "Difference (After - Before)")

In the Q-Q plot, points should roughly follow the red reference line. Systematic deviations indicate non-normality. The boxplot reveals outliers as points beyond the whiskers.

# Identify potential outliers using IQR method
Q1 <- quantile(reaction_data$difference, 0.25)
Q3 <- quantile(reaction_data$difference, 0.75)
IQR_val <- Q3 - Q1

lower_bound <- Q1 - 1.5 * IQR_val
upper_bound <- Q3 + 1.5 * IQR_val

outliers <- reaction_data %>%
  filter(difference < lower_bound | difference > upper_bound)

cat("Number of outliers detected:", nrow(outliers), "\n")

If you find outliers, investigate them before removal. They might represent data entry errors, measurement problems, or genuinely unusual cases worth understanding.

Running the Paired T-Test

With assumptions verified, execute the test using t.test() with paired = TRUE:

# Two-tailed paired t-test
result <- t.test(
  reaction_data$before, 
  reaction_data$after, 
  paired = TRUE
)

print(result)
	Paired t-test

data:  reaction_data$before and reaction_data$after
t = 3.8547, df = 29, p-value = 0.0005857
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 13.91838 46.22904
sample estimates:
mean difference 
       30.07371 

Let’s break down the output:

  • t = 3.8547: The test statistic measuring how many standard errors the mean difference is from zero.
  • df = 29: Degrees of freedom (n - 1 for paired tests).
  • p-value = 0.0005857: Probability of observing this difference (or more extreme) if the null hypothesis were true.
  • 95% CI [13.92, 46.23]: We’re 95% confident the true mean difference falls in this range.
  • mean difference = 30.07: On average, reaction times decreased by about 30ms after caffeine.
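Rather than reading these numbers off the printout, you can pull them directly from the returned "htest" object by name, which is useful for automated reporting. A self-contained sketch that recreates the same simulated data:

```r
# Recreate the test (same seed as above), then extract components by name
set.seed(42)
before <- rnorm(30, mean = 350, sd = 45)
after  <- rnorm(30, mean = 320, sd = 40)
result <- t.test(before, after, paired = TRUE)

result$statistic   # t value
result$parameter   # degrees of freedom
result$p.value     # p-value
result$conf.int    # 95% CI for the mean difference
result$estimate    # mean of the differences
```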

For directional hypotheses, specify the alternative:

# One-tailed test: after < before (improvement expected)
result_one_tailed <- t.test(
  reaction_data$before, 
  reaction_data$after, 
  paired = TRUE,
  alternative = "greater"  # before is greater than after
)

print(result_one_tailed)
	Paired t-test

data:  reaction_data$before and reaction_data$after
t = 3.8547, df = 29, p-value = 0.0002929
alternative hypothesis: true mean difference is greater than 0
95 percent confidence interval:
 16.86698      Inf
sample estimates:
mean difference 
       30.07371 

Use one-tailed tests only when you have a strong theoretical reason to expect a specific direction before collecting data.

Calculating Effect Size

Statistical significance tells you whether an effect exists; effect size tells you whether it matters. Cohen’s d for paired samples quantifies the magnitude of change in standard deviation units.

# Manual calculation of Cohen's d for paired samples
mean_diff <- mean(reaction_data$difference)
sd_diff <- sd(reaction_data$difference)

cohens_d_manual <- mean_diff / sd_diff
cat("Cohen's d (manual):", round(cohens_d_manual, 3), "\n")
Cohen's d (manual): 0.704 

The effsize package provides a cleaner approach with confidence intervals:

# Using effsize package
d_result <- cohen.d(
  reaction_data$before, 
  reaction_data$after, 
  paired = TRUE
)

print(d_result)
Cohen's d

d estimate: 0.7038691 (medium)
95 percent confidence interval:
    lower     upper 
0.2562986 1.1514396 

Interpret Cohen’s d using these conventional thresholds:

  • Small: d ≈ 0.2
  • Medium: d ≈ 0.5
  • Large: d ≈ 0.8

Our d = 0.70 represents a medium-to-large effect—caffeine produced a meaningful improvement in reaction times, not just a statistically detectable one.

Reporting Results

For academic publications, follow APA format:

A paired-samples t-test revealed that reaction times were significantly faster after caffeine consumption (M = 319.93ms, SD = 40.12) compared to before (M = 350.00ms, SD = 45.00), t(29) = 3.85, p < .001, d = 0.70, 95% CI [13.92, 46.23].

For technical documentation, include the essential statistics and practical interpretation:

# Generate summary statistics for reporting
summary_stats <- reaction_data %>%
  summarise(
    n = n(),
    mean_before = mean(before),
    sd_before = sd(before),
    mean_after = mean(after),
    sd_after = sd(after),
    mean_diff = mean(difference),
    sd_diff = sd(difference)
  )

print(summary_stats)
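To avoid transcription errors, the statistics string can be assembled programmatically. A self-contained sketch (it recreates the simulated data; Cohen's d is computed manually as the mean of the differences over their standard deviation):

```r
# Build a reporting string straight from the test object (sketch)
set.seed(42)
before <- rnorm(30, mean = 350, sd = 45)
after  <- rnorm(30, mean = 320, sd = 40)
result <- t.test(before, after, paired = TRUE)
d      <- mean(before - after) / sd(before - after)

apa <- sprintf(
  "t(%.0f) = %.2f, p = %.4f, d = %.2f, 95%% CI [%.2f, %.2f]",
  result$parameter, result$statistic, result$p.value,
  d, result$conf.int[1], result$conf.int[2]
)
cat(apa, "\n")
```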

When normality assumptions are violated (Shapiro-Wilk p < 0.05 or severe outliers), use the Wilcoxon signed-rank test instead:

# Non-parametric alternative
wilcox_result <- wilcox.test(
  reaction_data$before, 
  reaction_data$after, 
  paired = TRUE
)

print(wilcox_result)
	Wilcoxon signed rank test with continuity correction

data:  reaction_data$before and reaction_data$after
V = 385, p-value = 0.001039
alternative hypothesis: true location shift is not equal to 0

The Wilcoxon test makes fewer assumptions but has somewhat less statistical power when the differences really are normal. Use it when your data genuinely violate normality, not as a default choice.

The paired t-test remains one of the most useful tools in your statistical toolkit. Master its assumptions, report effect sizes alongside p-values, and your analyses will be both statistically sound and practically meaningful.
