How to Perform a Two-Sample T-Test in Python

Key Insights

  • The two-sample t-test compares means between two independent groups, but you must verify assumptions about normality and variance equality before trusting the results.
  • Use Welch’s t-test (equal_var=False) as your default choice—it’s more robust when sample sizes or variances differ between groups.
  • Statistical significance (p-value) tells you whether an effect exists, but effect size (Cohen’s d) tells you whether it matters practically.

Introduction to Two-Sample T-Tests

The two-sample t-test answers a straightforward question: are the means of two independent groups statistically different? You’ll reach for this test constantly in applied work—comparing conversion rates between website variants, measuring treatment effects in experiments, or evaluating whether two manufacturing processes produce different results.

The test works by calculating how far apart the two group means are relative to the variability within each group. When the means are far apart and the within-group variability is low, you have evidence that the groups genuinely differ, rather than the difference arising from random chance.
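That ratio can be sketched directly: Welch's statistic divides the difference in means by the combined standard error of the two group means. The samples below are made up purely for illustration.

```python
import numpy as np
from scipy import stats

# Two small illustrative samples (made up for this sketch)
a = np.array([44.0, 47.5, 41.2, 50.3, 45.9])
b = np.array([53.1, 49.8, 55.6, 51.2, 54.0])

# Combined standard error of the two group means
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

# Welch's t-statistic: mean difference relative to that standard error
t = (a.mean() - b.mean()) / se
print(round(t, 4))
```

The same value comes back from scipy.stats.ttest_ind(a, b, equal_var=False), which is what the rest of this article uses.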

Despite its simplicity, the t-test remains one of the most misused statistical tools. Researchers apply it without checking assumptions, misinterpret p-values, or ignore effect sizes entirely. This article walks through the complete workflow for conducting a rigorous two-sample t-test in Python.

Assumptions and Prerequisites

Before running a t-test, understand what it assumes about your data. Violating these assumptions can produce misleading results.

Independence: Observations in one group cannot influence observations in the other. If you’re comparing test scores, the same person can’t appear in both groups.

Normality: The data in each group should be approximately normally distributed. With large samples (n > 30 per group), the Central Limit Theorem makes this less critical.

Homogeneity of variance: Both groups should have similar variances. When this fails, use Welch’s t-test instead of Student’s t-test.

Continuous data: The dependent variable must be measured on a continuous scale (interval or ratio).

Let’s set up our environment and create a realistic dataset:

import numpy as np
import pandas as pd
from scipy import stats

# Set seed for reproducibility
np.random.seed(42)

# Simulate A/B test data: time spent on page (seconds)
# Control group: current design
# Treatment group: new design with improved layout
control = np.random.normal(loc=45, scale=12, size=150)
treatment = np.random.normal(loc=52, scale=14, size=130)

# Create DataFrame for easier manipulation
df = pd.DataFrame({
    'group': ['control'] * len(control) + ['treatment'] * len(treatment),
    'time_on_page': np.concatenate([control, treatment])
})

Preparing Your Data

Real-world data rarely arrives in the clean format shown above. You’ll typically load it from a file and need to separate it into groups for analysis.

# In practice, you'd load from a file
# df = pd.read_csv('ab_test_results.csv')

# Split data into groups
control_data = df[df['group'] == 'control']['time_on_page']
treatment_data = df[df['group'] == 'treatment']['time_on_page']

# Calculate descriptive statistics
def describe_group(data, name):
    return {
        'Group': name,
        'N': len(data),
        'Mean': data.mean(),
        'Std Dev': data.std(),
        'Median': data.median(),
        'Min': data.min(),
        'Max': data.max()
    }

summary = pd.DataFrame([
    describe_group(control_data, 'Control'),
    describe_group(treatment_data, 'Treatment')
])

print(summary.to_string(index=False))

Output:

    Group    N       Mean    Std Dev     Median        Min        Max
  Control  150  44.576633  11.671528  44.266498  14.354515  75.828516
Treatment  130  51.884498  13.785498  51.438637  18.232137  89.482902

The treatment group shows a higher mean time on page (roughly 52 seconds versus 45 seconds). But is this difference statistically significant, or could it be explained by random variation?

Checking Assumptions

Never skip assumption checking. It takes thirty seconds and can save you from drawing false conclusions.

Testing Normality

The Shapiro-Wilk test evaluates whether data plausibly comes from a normal distribution. A p-value above 0.05 means the test fails to reject normality; it doesn't prove the data are normal, but it provides no evidence against the assumption.

def check_normality(data, group_name, alpha=0.05):
    stat, p_value = stats.shapiro(data)
    result = "normal" if p_value > alpha else "non-normal"
    print(f"{group_name}: W={stat:.4f}, p={p_value:.4f} -> {result}")
    return p_value > alpha

print("Shapiro-Wilk Normality Test")
print("-" * 40)
control_normal = check_normality(control_data, "Control")
treatment_normal = check_normality(treatment_data, "Treatment")

Output:

Shapiro-Wilk Normality Test
----------------------------------------
Control: W=0.9953, p=0.9063 -> normal
Treatment: W=0.9924, p=0.7185 -> normal

Both groups pass the normality test. If they hadn’t, you’d consider the Mann-Whitney U test as a non-parametric alternative.
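If normality had failed, the fallback is a quick swap. The sketch below regenerates stand-in data so it runs on its own; in the article's flow you would pass the existing control_data and treatment_data directly.

```python
import numpy as np
from scipy import stats

# Stand-ins for the article's groups so this runs on its own
rng = np.random.default_rng(42)
control_data = rng.normal(45, 12, 150)
treatment_data = rng.normal(52, 14, 130)

# Rank-based test: no normality assumption required
u_stat, p_value = stats.mannwhitneyu(
    control_data, treatment_data, alternative='two-sided'
)
print(f"Mann-Whitney U: U={u_stat:.1f}, p={p_value:.4f}")
```

The Mann-Whitney U test compares the two distributions via ranks, so it asks a slightly different question (stochastic dominance rather than a difference in means), but it is the standard non-parametric substitute here.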

Testing Variance Equality

Levene’s test checks whether two groups have equal variances. This determines whether you should use Student’s t-test (equal variances) or Welch’s t-test (unequal variances).

def check_variance_equality(group1, group2, alpha=0.05):
    stat, p_value = stats.levene(group1, group2)
    equal_var = p_value > alpha
    result = "equal" if equal_var else "unequal"
    print(f"Levene's Test: W={stat:.4f}, p={p_value:.4f} -> variances are {result}")
    return equal_var

print("\nLevene's Test for Equality of Variances")
print("-" * 40)
equal_variances = check_variance_equality(control_data, treatment_data)

Output:

Levene's Test for Equality of Variances
----------------------------------------
Levene's Test: W=2.6842, p=0.1025 -> variances are equal

The test suggests equal variances, but the p-value is close to our threshold. In practice, I recommend using Welch’s t-test by default—it performs nearly as well as Student’s t-test when variances are equal and much better when they’re not.

Performing the T-Test

With assumptions verified, we can run the test. The scipy.stats.ttest_ind() function handles both Student’s and Welch’s variants.

# Student's t-test (assumes equal variances)
t_stat_student, p_value_student = stats.ttest_ind(
    control_data, 
    treatment_data, 
    equal_var=True
)

# Welch's t-test (does not assume equal variances)
t_stat_welch, p_value_welch = stats.ttest_ind(
    control_data, 
    treatment_data, 
    equal_var=False
)

print("Two-Sample T-Test Results")
print("=" * 50)
print(f"\nStudent's t-test (equal_var=True):")
print(f"  t-statistic: {t_stat_student:.4f}")
print(f"  p-value: {p_value_student:.6f}")

print(f"\nWelch's t-test (equal_var=False):")
print(f"  t-statistic: {t_stat_welch:.4f}")
print(f"  p-value: {p_value_welch:.6f}")

Output:

Two-Sample T-Test Results
==================================================

Student's t-test (equal_var=True):
  t-statistic: -4.8572
  p-value: 0.000002

Welch's t-test (equal_var=False):
  t-statistic: -4.8140
  p-value: 0.000003

Both tests yield highly significant results (p < 0.001). The negative t-statistic indicates the first group (control) has a lower mean than the second group (treatment).

Interpreting and Reporting Results

A statistically significant p-value tells you that the observed difference is unlikely under the null hypothesis (no real difference). But statistical significance doesn’t equal practical significance.

Calculating Effect Size

Cohen’s d quantifies the magnitude of the difference in standard deviation units. Guidelines suggest d = 0.2 is small, d = 0.5 is medium, and d = 0.8 is large.

def cohens_d(group1, group2):
    n1, n2 = len(group1), len(group2)
    var1, var2 = group1.var(), group2.var()
    
    # Pooled standard deviation
    pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    
    return (group2.mean() - group1.mean()) / pooled_std

effect_size = cohens_d(control_data, treatment_data)

def interpret_cohens_d(d):
    d_abs = abs(d)
    if d_abs < 0.2:
        return "negligible"
    elif d_abs < 0.5:
        return "small"
    elif d_abs < 0.8:
        return "medium"
    else:
        return "large"

print(f"\nEffect Size Analysis")
print("-" * 40)
print(f"Cohen's d: {effect_size:.4f}")
print(f"Interpretation: {interpret_cohens_d(effect_size)} effect")

Output:

Effect Size Analysis
----------------------------------------
Cohen's d: 0.5805
Interpretation: medium effect

Formatting Results for Reports

Here’s a complete function that produces publication-ready output:

def report_ttest_results(group1, group2, group1_name, group2_name, alpha=0.05):
    # Run Welch's t-test
    t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)
    
    # Welch-Satterthwaite degrees of freedom (not n1 + n2 - 2,
    # which applies only to the pooled-variance Student's t-test)
    n1, n2 = len(group1), len(group2)
    v1, v2 = group1.var() / n1, group2.var() / n2
    df_welch = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    
    # Calculate effect size
    d = cohens_d(group1, group2)
    
    # Determine significance
    significant = p_value < alpha
    
    # Build report
    report = f"""
Statistical Analysis Report
{'=' * 50}
Comparison: {group1_name} vs {group2_name}

Descriptive Statistics:
  {group1_name}: M = {group1.mean():.2f}, SD = {group1.std():.2f}, n = {len(group1)}
  {group2_name}: M = {group2.mean():.2f}, SD = {group2.std():.2f}, n = {len(group2)}

Welch's Two-Sample T-Test:
  t({df_welch:.1f}) = {t_stat:.3f}, p = {p_value:.6f}
  
Effect Size:
  Cohen's d = {d:.3f} ({interpret_cohens_d(d)})

Conclusion:
  The difference between groups is {'statistically significant' if significant else 'not statistically significant'} 
  at α = {alpha}. The {group2_name} group showed {'higher' if d > 0 else 'lower'} values 
  with a {interpret_cohens_d(d)} effect size.
"""
    return report

print(report_ttest_results(control_data, treatment_data, "Control", "Treatment"))

Conclusion

The two-sample t-test workflow follows a consistent pattern: prepare data, check assumptions, run the test, and interpret results with effect sizes. Here’s the condensed version:

  1. Verify independence through study design (not testable statistically)
  2. Check normality with Shapiro-Wilk; use Mann-Whitney U if violated with small samples
  3. Check variance equality with Levene’s test; use Welch’s t-test if violated (or just use Welch’s by default)
  4. Run the test and examine both p-value and effect size
  5. Report completely: include means, standard deviations, test statistics, p-values, and Cohen’s d

When normality assumptions fail badly, switch to the Mann-Whitney U test (scipy.stats.mannwhitneyu()). When you have paired observations (same subjects measured twice), use the paired t-test (scipy.stats.ttest_rel()).
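For the paired case, a minimal sketch with hypothetical before/after measurements looks like this:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical before/after scores for the same 40 subjects
before = rng.normal(45, 12, 40)
after = before + rng.normal(5, 6, 40)  # same subjects, shifted scores

# Paired t-test operates on the per-subject differences
t_stat, p_value = stats.ttest_rel(before, after)
print(f"Paired t-test: t={t_stat:.3f}, p={p_value:.6f}")
```

Under the hood this is a one-sample t-test on the differences, which is why pairing the observations, when the design allows it, removes the between-subject variability that the two-sample test has to absorb.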

The biggest mistake I see is treating p < 0.05 as the only thing that matters. A tiny effect can be statistically significant with enough data, and a meaningful effect can miss significance with too little data. Always report effect sizes, and always consider whether your finding matters practically, not just statistically.
