How to Perform a Paired T-Test in Python

Key Insights

  • A paired t-test compares means from the same subjects under two conditions, making it more powerful than independent t-tests when you have naturally paired data like before/after measurements.
  • Always verify the normality assumption by testing the differences between paired observations, not the raw data itself—use the Shapiro-Wilk test with a significance threshold of 0.05.
  • Statistical significance (p-value) tells you whether an effect exists, but Cohen’s d tells you whether the effect matters—always report both.

Introduction to Paired T-Tests

The paired t-test is your go-to statistical tool when you need to compare two related measurements from the same subjects. Unlike an independent t-test that compares means between two separate groups, a paired t-test accounts for the inherent correlation between observations from the same individual or matched pair.

You should reach for a paired t-test when:

  • Measuring the same subjects before and after an intervention (blood pressure before/after medication)
  • Comparing two different treatments applied to the same individuals (reaction time with caffeine vs. placebo)
  • Analyzing matched pairs in experimental designs (twins, matched controls)

The key insight is that paired designs reduce variability by controlling for individual differences. If you’re measuring blood pressure changes, between-subject variability (some people naturally have higher blood pressure) gets eliminated because each person serves as their own control.

Using an independent t-test on paired data wastes statistical power. You’re ignoring valuable information about the correlation structure in your data.
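To see the power difference concretely, here is a small simulation (synthetic data, not the article’s dataset) that runs both tests on the same paired sample. Large between-subject variability hides a small, consistent within-subject effect from the independent test but not from the paired one:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 20

# Each subject has their own baseline level (large between-subject spread)
subject_level = rng.normal(loc=130, scale=15, size=n)

# Small, consistent within-subject effect of -3 with little measurement noise
before = subject_level + rng.normal(scale=1.5, size=n)
after = subject_level - 3 + rng.normal(scale=1.5, size=n)

t_paired, p_paired = stats.ttest_rel(after, before)
t_indep, p_indep = stats.ttest_ind(after, before)

print(f"Paired t-test p-value:      {p_paired:.4f}")
print(f"Independent t-test p-value: {p_indep:.4f}")
```

The paired test detects the effect easily because differencing removes the between-subject spread; the independent test, which treats that spread as noise, typically does not.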

Assumptions and Prerequisites

Before running a paired t-test, verify these assumptions:

  1. Paired observations: Each observation in one group has exactly one corresponding observation in the other group.
  2. Continuous dependent variable: Your measurements should be on an interval or ratio scale.
  3. Normal distribution of differences: The differences between paired observations should be approximately normally distributed.

Notice that the normality requirement applies to the differences, not the original measurements. This is a common mistake. Your raw data can be skewed, but if the differences follow a normal distribution, you’re good to go.

Here’s how to check normality using the Shapiro-Wilk test:

import numpy as np
from scipy import stats

# Sample data: blood pressure before and after treatment
before = np.array([120, 135, 128, 142, 138, 125, 130, 145, 133, 140])
after = np.array([118, 130, 125, 138, 132, 122, 128, 140, 130, 135])

# Calculate differences
differences = after - before

# Shapiro-Wilk test for normality
stat, p_value = stats.shapiro(differences)

print(f"Shapiro-Wilk statistic: {stat:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value > 0.05:
    print("Differences appear normally distributed (fail to reject H0)")
else:
    print("Differences may not be normally distributed (reject H0)")

If the Shapiro-Wilk p-value falls below 0.05, consider the Wilcoxon signed-rank test as a non-parametric alternative. With sample sizes above 30, the Central Limit Theorem provides some protection against normality violations, but don’t use this as an excuse to skip the check.

Setting Up Your Environment

You’ll need NumPy, pandas, and SciPy for the analysis itself, plus Matplotlib and Seaborn for the visualizations later on:

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seed for reproducibility
np.random.seed(42)

Let’s create a realistic dataset. Imagine a clinical trial measuring systolic blood pressure for 25 patients before and after a 12-week exercise intervention:

# Generate realistic blood pressure data
n_patients = 25

# Baseline blood pressure (slightly elevated, typical for intervention study)
baseline_bp = np.random.normal(loc=138, scale=12, size=n_patients)

# Treatment effect: average reduction of 5 mmHg with individual variation
treatment_effect = np.random.normal(loc=-5, scale=3, size=n_patients)
followup_bp = baseline_bp + treatment_effect

# Create a DataFrame for easier manipulation
df = pd.DataFrame({
    'patient_id': range(1, n_patients + 1),
    'baseline_bp': np.round(baseline_bp, 1),
    'followup_bp': np.round(followup_bp, 1)
})

df['difference'] = df['followup_bp'] - df['baseline_bp']

print(df.head(10))
print(f"\nMean baseline: {df['baseline_bp'].mean():.1f} mmHg")
print(f"Mean followup: {df['followup_bp'].mean():.1f} mmHg")
print(f"Mean difference: {df['difference'].mean():.1f} mmHg")

This setup gives you a clean DataFrame with patient IDs, both measurements, and the calculated differences—everything you need for analysis and visualization.

Performing the Paired T-Test with SciPy

The scipy.stats.ttest_rel() function handles paired t-tests. The “rel” stands for “related,” indicating dependent samples:

# Perform paired t-test
t_statistic, p_value = stats.ttest_rel(df['followup_bp'], df['baseline_bp'])

print(f"T-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")

# Determine significance at alpha = 0.05
alpha = 0.05
if p_value < alpha:
    print(f"\nResult: Statistically significant difference (p < {alpha})")
else:
    print(f"\nResult: No statistically significant difference (p >= {alpha})")

The t-statistic represents the ratio of the mean difference to the standard error of that difference. A larger absolute t-value indicates stronger evidence against the null hypothesis (that the true mean difference is zero).
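You can verify that relationship directly: dividing the mean difference by its standard error reproduces the t-statistic from ttest_rel. A quick check on small illustrative data:

```python
import numpy as np
from scipy import stats

before = np.array([120.0, 135.0, 128.0, 142.0, 138.0, 125.0])
after = np.array([118.0, 130.0, 125.0, 138.0, 132.0, 122.0])

diff = after - before
t_manual = diff.mean() / stats.sem(diff)  # stats.sem uses ddof=1 by default
t_scipy, p = stats.ttest_rel(after, before)

print(f"Manual t: {t_manual:.4f}, SciPy t: {t_scipy:.4f}")
```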

The order of arguments matters for interpretation. With ttest_rel(followup, baseline), a negative t-statistic indicates that follow-up values are lower than baseline—exactly what you’d expect if the treatment reduces blood pressure.

For a two-tailed test (the default), you’re testing whether any difference exists, regardless of direction. If you have a directional hypothesis (e.g., treatment reduces blood pressure), you can halve the two-tailed p-value, but only when the observed effect points in the hypothesized direction:

# One-tailed test (if hypothesis is that treatment reduces BP)
one_tailed_p = p_value / 2

if t_statistic < 0 and one_tailed_p < alpha:
    print("Significant reduction in blood pressure (one-tailed test)")
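Rather than halving by hand, SciPy 1.6 and later also accept an alternative parameter that computes the one-sided p-value for you. A sketch with illustrative data:

```python
import numpy as np
from scipy import stats

before = np.array([120.0, 135.0, 128.0, 142.0, 138.0, 125.0, 130.0, 145.0])
after = np.array([118.0, 130.0, 125.0, 138.0, 132.0, 122.0, 128.0, 140.0])

# H1: follow-up values are lower than baseline (treatment reduces BP)
t_stat, p_one_tailed = stats.ttest_rel(after, before, alternative='less')

# Matches half the two-tailed p-value here, since t < 0
t_two, p_two_tailed = stats.ttest_rel(after, before)
print(f"One-tailed p: {p_one_tailed:.4f}, two-tailed p / 2: {p_two_tailed / 2:.4f}")
```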

Interpreting Results and Effect Size

A p-value below 0.05 tells you the observed difference is unlikely under the null hypothesis. It says nothing about whether the difference is meaningful in practice.

Cohen’s d for paired samples quantifies effect size by expressing the mean difference in standard deviation units:

def cohens_d_paired(group1, group2):
    """
    Calculate Cohen's d for paired samples.
    
    Parameters:
    -----------
    group1, group2 : array-like
        Paired observations
    
    Returns:
    --------
    float : Cohen's d effect size
    """
    differences = np.array(group1) - np.array(group2)
    return np.mean(differences) / np.std(differences, ddof=1)

# Calculate effect size
effect_size = cohens_d_paired(df['followup_bp'], df['baseline_bp'])

print(f"Cohen's d: {effect_size:.3f}")

# Interpret effect size
if abs(effect_size) < 0.2:
    interpretation = "negligible"
elif abs(effect_size) < 0.5:
    interpretation = "small"
elif abs(effect_size) < 0.8:
    interpretation = "medium"
else:
    interpretation = "large"

print(f"Effect size interpretation: {interpretation}")

Cohen’s conventional thresholds are 0.2 (small), 0.5 (medium), and 0.8 (large). These are rough guidelines, not rigid rules. A “small” effect in clinical research might be highly meaningful if the intervention is cheap and scalable.

You can also calculate confidence intervals for the mean difference:

def paired_confidence_interval(group1, group2, confidence=0.95):
    """Calculate CI for mean difference in paired samples."""
    differences = np.array(group1) - np.array(group2)
    n = len(differences)
    mean_diff = np.mean(differences)
    se = stats.sem(differences)
    
    # t critical value
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n-1)
    
    margin = t_crit * se
    return mean_diff - margin, mean_diff + margin

ci_low, ci_high = paired_confidence_interval(df['followup_bp'], df['baseline_bp'])
print(f"95% CI for mean difference: [{ci_low:.2f}, {ci_high:.2f}] mmHg")

If this confidence interval excludes zero, you have a statistically significant result—this is mathematically equivalent to the p-value being below 0.05.
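If you’re on SciPy 1.10 or newer, the result object returned by ttest_rel exposes a confidence_interval() method, so you may not need the hand-rolled function at all. A sketch on illustrative data:

```python
import numpy as np
from scipy import stats

before = np.array([120.0, 135.0, 128.0, 142.0, 138.0, 125.0, 130.0, 145.0])
after = np.array([118.0, 130.0, 125.0, 138.0, 132.0, 122.0, 128.0, 140.0])

result = stats.ttest_rel(after, before)
ci = result.confidence_interval(confidence_level=0.95)

print(f"95% CI for mean difference: [{ci.low:.2f}, {ci.high:.2f}] mmHg")
```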

Visualizing Paired Data

Visualizations communicate paired relationships more effectively than summary statistics alone. Here are two essential plots:

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Plot 1: Paired observations with connecting lines
ax1 = axes[0]
for i in range(len(df)):
    ax1.plot([0, 1], [df['baseline_bp'].iloc[i], df['followup_bp'].iloc[i]], 
             'o-', color='steelblue', alpha=0.5, linewidth=1)

ax1.set_xticks([0, 1])
ax1.set_xticklabels(['Baseline', 'Follow-up'])
ax1.set_ylabel('Systolic Blood Pressure (mmHg)')
ax1.set_title('Individual Patient Trajectories')

# Add mean lines
ax1.axhline(df['baseline_bp'].mean(), color='red', linestyle='--', 
            linewidth=2, label=f"Mean baseline: {df['baseline_bp'].mean():.1f}")
ax1.axhline(df['followup_bp'].mean(), color='darkred', linestyle='--', 
            linewidth=2, label=f"Mean followup: {df['followup_bp'].mean():.1f}")
ax1.legend()

# Plot 2: Distribution of differences
ax2 = axes[1]
sns.histplot(df['difference'], kde=True, ax=ax2, color='steelblue')
ax2.axvline(0, color='black', linestyle='-', linewidth=2, label='No change')
ax2.axvline(df['difference'].mean(), color='red', linestyle='--', 
            linewidth=2, label=f"Mean: {df['difference'].mean():.1f}")
ax2.set_xlabel('Change in Blood Pressure (mmHg)')
ax2.set_ylabel('Count')
ax2.set_title('Distribution of Individual Changes')
ax2.legend()

plt.tight_layout()
plt.savefig('paired_ttest_visualization.png', dpi=150, bbox_inches='tight')
plt.show()

The left plot shows individual trajectories, revealing whether the treatment effect is consistent across patients or driven by a few extreme responders. The right plot displays the distribution of differences, making it easy to assess normality and identify outliers.

Conclusion and Best Practices

Here’s the complete workflow for paired t-tests in Python:

  1. Verify pairing: Confirm observations are genuinely paired and have a 1:1 correspondence.
  2. Check normality: Apply Shapiro-Wilk to the differences, not raw data.
  3. Run the test: Use scipy.stats.ttest_rel() with correct argument order.
  4. Calculate effect size: Report Cohen’s d alongside p-values.
  5. Visualize: Create paired trajectory plots and difference distributions.

Common pitfalls to avoid:

  • Using independent t-tests on paired data: You lose statistical power and may miss real effects.
  • Ignoring effect size: A significant p-value with negligible Cohen’s d means nothing practically.
  • Forgetting to check normality of differences: Test the differences, not the original measurements.
  • Small sample sizes without normality: With n < 30 and non-normal differences, switch to Wilcoxon signed-rank.

When your data violates the normality assumption, the Wilcoxon signed-rank test (scipy.stats.wilcoxon()) provides a robust non-parametric alternative. It tests whether the median difference equals zero rather than the mean, making it resistant to outliers and skewed distributions.
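A minimal sketch of that fallback, on illustrative paired data (not the article’s dataset):

```python
import numpy as np
from scipy import stats

before = np.array([120.0, 135.0, 128.0, 142.0, 138.0, 125.0, 130.0, 145.0])
after = np.array([118.0, 131.0, 125.0, 137.0, 132.0, 126.0, 123.0, 136.0])

# Wilcoxon signed-rank test on the paired differences (after - before)
w_stat, p_value = stats.wilcoxon(after, before)

print(f"Wilcoxon statistic: {w_stat}")
print(f"P-value: {p_value:.4f}")
```

The argument order mirrors ttest_rel, and the interpretation of the p-value against your alpha is the same; only the null hypothesis changes from mean difference to the distribution of differences being symmetric about zero.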
