How to Perform a Two-Sample T-Test in R
Key Insights
- The two-sample t-test compares means between independent groups, but Welch’s t-test (the R default) is almost always preferable to Student’s t-test because it doesn’t assume equal variances.
- Checking assumptions matters less than you think—the t-test is robust to normality violations with samples over 30, but you should still visualize your data to catch outliers and skewness.
- Always report effect size alongside p-values; statistical significance tells you whether an effect exists, not whether it matters.
Introduction to Two-Sample T-Tests
The two-sample t-test answers a straightforward question: do two independent groups have different population means? You’ll reach for this test when comparing treatment versus control groups, before-and-after measurements on different subjects, or any scenario where you’re measuring the same variable across two distinct populations.
The test rests on three assumptions. First, observations must be independent—knowing one value shouldn’t tell you anything about another. Second, the data in each group should be approximately normally distributed. Third, the classic Student’s t-test assumes equal variances between groups, though Welch’s modification removes this requirement.
Here’s the practical reality: independence is non-negotiable and comes from your study design. Normality matters less than textbooks suggest, especially with larger samples. Variance equality is why R defaults to Welch’s t-test, and you should rarely change that default.
Preparing Your Data
R accepts data for t-tests in two formats. You’ll encounter both in practice, so know how to work with each.
Separate vectors work well for quick analyses:
# Simulated data: reaction times (ms) for two drug conditions
drug_a <- c(245, 258, 232, 267, 241, 253, 239, 261, 248, 255)
drug_b <- c(228, 235, 219, 241, 225, 232, 221, 238, 229, 233)
Grouped data frames are better for real-world workflows and integrate with tidyverse tools:
# Same data in long format
reaction_data <- data.frame(
  drug = rep(c("A", "B"), each = 10),
  time = c(245, 258, 232, 267, 241, 253, 239, 261, 248, 255,
           228, 235, 219, 241, 225, 232, 221, 238, 229, 233)
)
For real data, load it appropriately and ensure your grouping variable is a factor:
# Loading real data
df <- read.csv("experiment_results.csv")
# Ensure grouping variable is a factor
df$treatment <- as.factor(df$treatment)
# Check structure
str(df)
# Quick summary by group
tapply(df$outcome, df$treatment, summary)
Always inspect your data before testing. Look for missing values, outliers, and obvious data entry errors. The t-test won’t warn you about garbage input.
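A few lines of base R will surface most of these problems before you test anything. A minimal inspection sketch using the simulated vectors from above (with real data, substitute your own columns):

```r
# Simulated reaction times from above
drug_a <- c(245, 258, 232, 267, 241, 253, 239, 261, 248, 255)
drug_b <- c(228, 235, 219, 241, 225, 232, 221, 238, 229, 233)

sum(is.na(drug_a)) + sum(is.na(drug_b))       # missing values (expect 0 here)
range(drug_a); range(drug_b)                  # impossible values stand out
summary(drug_a); summary(drug_b)              # compare centers and spreads
boxplot(drug_a, drug_b, names = c("A", "B"))  # quick visual outlier check
```

None of this replaces judgment, but a typo like 2450 instead of 245 is obvious in the range and boxplot long before it silently distorts a t-test.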
Checking Assumptions
Normality
The Shapiro-Wilk test formally checks normality, but visual inspection via Q-Q plots is more informative:
# Shapiro-Wilk test for each group
shapiro.test(drug_a)
shapiro.test(drug_b)
# For grouped data
by(reaction_data$time, reaction_data$drug, shapiro.test)
# Q-Q plots (more useful than the test)
par(mfrow = c(1, 2))
qqnorm(drug_a, main = "Drug A")
qqline(drug_a)
qqnorm(drug_b, main = "Drug B")
qqline(drug_b)
A p-value above 0.05 from Shapiro-Wilk doesn’t prove normality—it just fails to reject it. With small samples, the test has low power and misses non-normality. With large samples, it flags trivial deviations that don’t affect the t-test. Trust your eyes on the Q-Q plot: points should roughly follow the diagonal line.
Variance Equality
The F-test and Levene’s test check whether variances differ between groups:
# F-test (sensitive to non-normality)
var.test(drug_a, drug_b)
# Levene's test (more robust)
# Requires the car package
library(car)
leveneTest(time ~ drug, data = reaction_data)
Here’s my opinionated take: skip these tests. If variances are equal, Welch’s t-test performs nearly as well as Student’s. If variances are unequal, Welch’s handles it correctly while Student’s gives wrong answers. There’s no practical benefit to testing variance equality first.
Running the T-Test with t.test()
R’s t.test() function handles both data formats. The default settings are sensible for most situations:
# Using separate vectors
t.test(drug_a, drug_b)
# Using formula notation with grouped data
t.test(time ~ drug, data = reaction_data)
Both produce identical results. The formula notation (outcome ~ group) is cleaner and self-documenting.
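One detail of the formula interface worth knowing: the first factor level is treated as group 1, so the sign of the estimated difference and confidence interval depends on level order. A sketch using base R's relevel() to flip which group comes first:

```r
reaction_data <- data.frame(
  drug = factor(rep(c("A", "B"), each = 10)),
  time = c(245, 258, 232, 267, 241, 253, 239, 261, 248, 255,
           228, 235, 219, 241, 225, 232, 221, 238, 229, 233)
)

# With "A" as the first factor level, estimates are reported as A then B
res_ab <- t.test(time ~ drug, data = reaction_data)

# Releveling makes "B" the reference; the p-value is unchanged,
# but the sign of the difference and confidence interval flips
reaction_data$drug <- relevel(reaction_data$drug, ref = "B")
res_ba <- t.test(time ~ drug, data = reaction_data)

res_ab$estimate
res_ba$estimate
```

The test itself is symmetric, so this only affects how you read the output, not the conclusion.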
Key Parameters
var.equal: Set to TRUE for Student’s t-test, FALSE (default) for Welch’s. Keep the default unless you have a specific reason.
# Welch's t-test (default, recommended)
t.test(time ~ drug, data = reaction_data, var.equal = FALSE)
# Student's t-test (only if you're certain variances are equal)
t.test(time ~ drug, data = reaction_data, var.equal = TRUE)
alternative: Specifies one-tailed or two-tailed tests. Options are "two.sided" (default), "less", or "greater".
# Two-tailed: Is there any difference?
t.test(drug_a, drug_b, alternative = "two.sided")
# One-tailed: Is Drug A greater than Drug B?
t.test(drug_a, drug_b, alternative = "greater")
Use one-tailed tests sparingly. You should decide the direction before seeing your data, and you need strong theoretical justification. When in doubt, use two-tailed.
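One relationship worth knowing: because the t-distribution is symmetric, when the observed difference falls in the predicted direction the one-tailed p-value is exactly half the two-tailed one. A quick check with the simulated vectors:

```r
drug_a <- c(245, 258, 232, 267, 241, 253, 239, 261, 248, 255)
drug_b <- c(228, 235, 219, 241, 225, 232, 221, 238, 229, 233)

p_two <- t.test(drug_a, drug_b, alternative = "two.sided")$p.value
p_one <- t.test(drug_a, drug_b, alternative = "greater")$p.value

# The observed mean of drug_a exceeds drug_b, so p_one equals p_two / 2
all.equal(p_one, p_two / 2)  # TRUE
```

This halving is exactly why post-hoc switching to a one-tailed test inflates false positives: it quietly doubles your effective alpha.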
conf.level: Confidence level for the interval estimate. Default is 0.95.
# 99% confidence interval
t.test(drug_a, drug_b, conf.level = 0.99)
Interpreting the Output
Running the test produces several pieces of information:
result <- t.test(time ~ drug, data = reaction_data)
print(result)
Output:
Welch Two Sample t-test
data: time by drug
t = 4.8215, df = 15.512, p-value = 0.0002037
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
11.071924 28.528076
sample estimates:
mean in group A mean in group B
249.9 230.1
t-statistic (4.82): Measures how many standard errors the observed difference is from zero. Larger absolute values indicate stronger evidence against the null hypothesis.
Degrees of freedom (15.51): Welch’s t-test produces non-integer df because it adjusts for unequal variances. Student’s t-test would give df = 18 (n₁ + n₂ - 2).
p-value (0.0002): Probability of observing a difference this extreme if the null hypothesis (no difference) were true. This is well below 0.05, so we reject the null.
Confidence interval (11.07 to 28.53): We’re 95% confident the true mean difference falls in this range. Because the interval excludes zero, it agrees with the significant p-value.
Sample means (249.9 vs 230.1): Drug A shows higher reaction times by about 19.8 ms on average.
Extract specific components for reporting or further analysis:
# Extract components
result$statistic # t-value
result$parameter # degrees of freedom
result$p.value # p-value
result$conf.int # confidence interval
result$estimate # group means
# Calculate effect size (Cohen's d)
# (averaging the two variances like this assumes equal group sizes)
cohens_d <- (mean(drug_a) - mean(drug_b)) /
  sqrt((var(drug_a) + var(drug_b)) / 2)
print(paste("Cohen's d:", round(cohens_d, 3)))
A Cohen’s d around 0.2 is small, 0.5 is medium, and 0.8 is large. Always report effect size—a significant p-value with a tiny effect size might not be practically meaningful.
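The quick calculation above averages the two variances, which only matches the standard definition when the groups are the same size. A small helper (cohens_d_pooled is my own name, not a base R function) using the pooled standard deviation generalizes to unequal n:

```r
# Cohen's d with a pooled standard deviation; handles unequal group sizes
cohens_d_pooled <- function(x, y) {
  nx <- length(x); ny <- length(y)
  sd_pooled <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / sd_pooled
}

drug_a <- c(245, 258, 232, 267, 241, 253, 239, 261, 248, 255)
drug_b <- c(228, 235, 219, 241, 225, 232, 221, 238, 229, 233)
round(cohens_d_pooled(drug_a, drug_b), 2)  # 2.16: very large, as expected for simulated data
```

With equal group sizes, as here, this reduces to the simple formula above.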
Visualizing the Comparison
Statistical output needs visual support. Boxplots effectively show group differences:
library(ggplot2)
ggplot(reaction_data, aes(x = drug, y = time, fill = drug)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.1, alpha = 0.5) +
  labs(
    title = "Reaction Time by Drug Condition",
    x = "Drug",
    y = "Reaction Time (ms)"
  ) +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_fill_brewer(palette = "Set2")
Add significance annotations with the ggsignif package:
library(ggsignif)
ggplot(reaction_data, aes(x = drug, y = time, fill = drug)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.1, alpha = 0.5) +
  geom_signif(
    comparisons = list(c("A", "B")),
    map_signif_level = TRUE,
    textsize = 5
  ) +
  labs(
    title = "Reaction Time by Drug Condition",
    x = "Drug",
    y = "Reaction Time (ms)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
The visualization reveals what summary statistics hide: distribution shape, outliers, and overlap between groups.
Common Pitfalls and Alternatives
When Assumptions Fail
If your data severely violates normality (especially with small samples), the Mann-Whitney U test (also called Wilcoxon rank-sum test) is your non-parametric alternative:
# Mann-Whitney U / Wilcoxon rank-sum test
wilcox.test(time ~ drug, data = reaction_data)
# Or with separate vectors
wilcox.test(drug_a, drug_b)
This test works on ranks rather than raw means: it asks whether values in one group tend to be larger than values in the other, which reads as a comparison of medians only under the assumption that the two distributions have the same shape. It makes no normality assumption, but the trade-off is lower statistical power when normality actually holds.
Don’t Confuse Paired and Independent Tests
The paired t-test is for dependent observations—same subjects measured twice, or matched pairs. If you have before-after measurements on the same people, you need paired = TRUE:
# WRONG: Treating paired data as independent
t.test(before, after)
# CORRECT: Paired t-test
t.test(before, after, paired = TRUE)
Using an independent t-test on paired data throws away information and reduces power. Using a paired test on independent data violates assumptions and produces invalid results.
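To see why the choice matters, here is a sketch with hypothetical simulated before/after scores: each subject improves by roughly five points, but subjects differ a lot from one another.

```r
set.seed(42)  # hypothetical simulated scores, for illustration only
before <- rnorm(15, mean = 120, sd = 15)         # large between-subject spread
after  <- before - rnorm(15, mean = 5, sd = 3)   # each subject drops ~5 points

# The independent test drowns the effect in between-subject variability
p_independent <- t.test(before, after)$p.value
# The paired test analyzes within-subject differences and recovers it
p_paired <- t.test(before, after, paired = TRUE)$p.value

c(independent = p_independent, paired = p_paired)
```

The paired p-value comes out far smaller because pairing removes the between-subject variance from the comparison.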
Other Mistakes to Avoid
Multiple comparisons: If you’re comparing more than two groups, don’t run multiple t-tests. Use ANOVA with post-hoc tests to control family-wise error rate.
Ignoring effect size: A significant p-value with n=10,000 might reflect a trivial difference. Always compute and report effect size.
P-hacking: Running the test, then removing “outliers” until you get significance is scientific fraud. Define your analysis plan before looking at results.
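The multiple-comparisons point above can be sketched with base R's aov() and TukeyHSD(), using a hypothetical third drug group:

```r
set.seed(1)  # hypothetical three-group data, for illustration only
three_groups <- data.frame(
  drug = factor(rep(c("A", "B", "C"), each = 10)),
  time = c(rnorm(10, 250, 12), rnorm(10, 230, 12), rnorm(10, 240, 12))
)

# One overall F-test instead of three separate t-tests
fit <- aov(time ~ drug, data = three_groups)
summary(fit)

# Tukey's HSD then gives all pairwise comparisons
# with the family-wise error rate controlled
TukeyHSD(fit)
```

Running three raw t-tests at alpha = 0.05 would push the chance of at least one false positive toward 14%; the ANOVA-then-Tukey workflow keeps it at 5%.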
The two-sample t-test is fundamental to statistical analysis in R. Master the mechanics here, but remember that choosing the right test matters less than designing good studies and collecting quality data.