How to Perform Welch's T-Test in Python
Key Insights
- Welch’s t-test should be your default choice for comparing two group means: it doesn’t assume equal variances and performs well even when variances are equal.
- Use `scipy.stats.ttest_ind(group1, group2, equal_var=False)` for the standard implementation, but always check normality assumptions first.
- When your data violates normality assumptions, switch to the Mann-Whitney U test instead of forcing a parametric approach.
Introduction to Welch’s T-Test
Welch’s t-test compares the means of two independent groups when you can’t assume they have equal variances. This makes it more robust than the classic Student’s t-test, which requires the homogeneity of variance assumption that rarely holds in practice.
Here’s the practical reality: you should almost always use Welch’s t-test instead of Student’s t-test. When variances are actually equal, Welch’s test performs nearly identically to Student’s. When they’re not equal, Student’s t-test gives misleading results while Welch’s remains accurate. There’s no downside to using the more flexible option.
Common use cases include:
- A/B testing: Comparing conversion rates or engagement metrics between control and treatment groups
- Clinical trials: Analyzing treatment effects where patient responses vary widely
- Quality control: Comparing measurements from different manufacturing processes
- Academic research: Any scenario where you’re comparing two independent samples
Mathematical Foundation
Welch’s t-test modifies the standard t-test by adjusting both the test statistic and degrees of freedom to account for unequal variances.
The test statistic is calculated as:
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
The degrees of freedom use the Welch-Satterthwaite equation:
$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{s_1^4}{n_1^2(n_1-1)} + \frac{s_2^4}{n_2^2(n_2-1)}}$$
The assumptions for Welch’s t-test are:
- Independence: Observations within and between groups are independent
- Normality: Data in each group should be approximately normally distributed
- Continuous data: The dependent variable should be measured on a continuous scale
Notice what’s missing: equal variances. That’s the key advantage.
```python
import numpy as np
from scipy import stats

def welch_ttest_manual(group1, group2):
    """
    Manual implementation of Welch's t-test for educational purposes.
    """
    n1, n2 = len(group1), len(group2)
    mean1, mean2 = np.mean(group1), np.mean(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)

    # Standard error of the difference
    se_diff = np.sqrt(var1/n1 + var2/n2)

    # T-statistic
    t_stat = (mean1 - mean2) / se_diff

    # Welch-Satterthwaite degrees of freedom
    numerator = (var1/n1 + var2/n2)**2
    denominator = (var1**2 / (n1**2 * (n1-1))) + (var2**2 / (n2**2 * (n2-1)))
    df = numerator / denominator

    # Two-tailed p-value
    p_value = 2 * stats.t.sf(abs(t_stat), df)

    return {
        't_statistic': t_stat,
        'degrees_of_freedom': df,
        'p_value': p_value,
        'mean_difference': mean1 - mean2
    }

# Test with sample data
np.random.seed(42)
group_a = np.random.normal(100, 15, 30)  # Mean=100, SD=15
group_b = np.random.normal(110, 25, 25)  # Mean=110, SD=25 (different variance!)

result = welch_ttest_manual(group_a, group_b)
print(f"T-statistic: {result['t_statistic']:.4f}")
print(f"Degrees of freedom: {result['degrees_of_freedom']:.2f}")
print(f"P-value: {result['p_value']:.4f}")
```
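As a sanity check, a manual Welch computation should agree with SciPy's built-in test to floating-point precision. This sketch condenses the formulas above into a small helper and compares it against `stats.ttest_ind` with `equal_var=False` on the same simulated data:

```python
import numpy as np
from scipy import stats

def welch_ttest_manual(group1, group2):
    """Condensed Welch's t-test: returns (t-statistic, two-sided p-value)."""
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    se_diff = np.sqrt(var1/n1 + var2/n2)
    t_stat = (np.mean(group1) - np.mean(group2)) / se_diff
    # Welch-Satterthwaite degrees of freedom
    df = (var1/n1 + var2/n2)**2 / (
        var1**2 / (n1**2 * (n1-1)) + var2**2 / (n2**2 * (n2-1)))
    return t_stat, 2 * stats.t.sf(abs(t_stat), df)

np.random.seed(42)
group_a = np.random.normal(100, 15, 30)
group_b = np.random.normal(110, 25, 25)

t_manual, p_manual = welch_ttest_manual(group_a, group_b)
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b, equal_var=False)

# The two implementations should match to floating-point precision
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```

Agreement here confirms that `equal_var=False` is applying exactly the Welch-Satterthwaite adjustment derived in the previous section.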
Performing Welch’s T-Test with SciPy
In practice, you’ll use SciPy’s implementation rather than rolling your own. The key parameter is `equal_var=False`, which switches from Student’s t-test to Welch’s.
```python
import numpy as np
from scipy import stats

# Generate two groups with different variances
np.random.seed(42)
control_group = np.random.normal(loc=50, scale=10, size=40)
treatment_group = np.random.normal(loc=55, scale=18, size=35)

# Perform Welch's t-test
t_stat, p_value = stats.ttest_ind(
    control_group,
    treatment_group,
    equal_var=False  # This is the critical parameter
)

print("Welch's t-test results:")
print(f"  T-statistic: {t_stat:.4f}")
print(f"  P-value: {p_value:.4f}")
# ddof=1 gives the sample standard deviation (NumPy defaults to ddof=0)
print(f"  Control mean: {control_group.mean():.2f} (SD: {control_group.std(ddof=1):.2f})")
print(f"  Treatment mean: {treatment_group.mean():.2f} (SD: {treatment_group.std(ddof=1):.2f})")

# Interpretation
alpha = 0.05
if p_value < alpha:
    print(f"\nResult: Statistically significant difference (p < {alpha})")
else:
    print(f"\nResult: No statistically significant difference (p >= {alpha})")
```
Interpreting the output:
- T-statistic: Measures how many standard errors the group means are apart. Larger absolute values indicate bigger differences.
- P-value: The probability of observing this difference (or more extreme) if the null hypothesis were true. Below your significance threshold (typically 0.05), you reject the null.
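If your hypothesis is directional (for example, you only care whether the treatment mean exceeds the control mean), recent SciPy versions (1.6+) accept an `alternative` parameter on `ttest_ind`. This sketch assumes that parameter is available in your SciPy install:

```python
import numpy as np
from scipy import stats

np.random.seed(0)
control = np.random.normal(50, 10, 40)
treatment = np.random.normal(55, 18, 35)

# Two-sided test (the default, alternative='two-sided')
t_two, p_two = stats.ttest_ind(control, treatment, equal_var=False)

# One-sided test: is the control mean *less* than the treatment mean?
t_one, p_one = stats.ttest_ind(control, treatment, equal_var=False,
                               alternative='less')

# When the observed t-statistic points in the hypothesized direction,
# the one-sided p-value is half the two-sided p-value
print(f"two-sided p={p_two:.4f}, one-sided p={p_one:.4f}")
```

Only pre-register a one-sided test when the direction is fixed in advance; switching direction after seeing the data inflates your false-positive rate.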
Practical Example with Real Data
Let’s work through a realistic scenario: analyzing whether a new website design improves user engagement time.
```python
import pandas as pd
import numpy as np
from scipy import stats

# Simulate realistic A/B test data
np.random.seed(123)
data = pd.DataFrame({
    'user_id': range(1, 201),
    'group': ['control'] * 100 + ['treatment'] * 100,
    'time_on_page': np.concatenate([
        np.random.exponential(45, 100) + 30,  # Control: mean ~75 seconds
        np.random.exponential(55, 100) + 35   # Treatment: mean ~90 seconds
    ])
})

# Examine the data
print("Dataset Overview:")
print(data.groupby('group')['time_on_page'].agg(['count', 'mean', 'std', 'median']))
print()

# Split into groups
control = data[data['group'] == 'control']['time_on_page']
treatment = data[data['group'] == 'treatment']['time_on_page']

# Check variance ratio (rule of thumb: concern if ratio > 2)
variance_ratio = treatment.var() / control.var()
print(f"Variance ratio (treatment/control): {variance_ratio:.2f}")

# Perform Welch's t-test
t_stat, p_value = stats.ttest_ind(control, treatment, equal_var=False)
print("\nWelch's T-Test Results:")
print(f"  T-statistic: {t_stat:.4f}")
print(f"  P-value: {p_value:.4f}")

# Calculate effect size (Cohen's d, using the average of the two variances)
pooled_std = np.sqrt((control.std()**2 + treatment.std()**2) / 2)
cohens_d = (treatment.mean() - control.mean()) / pooled_std
print(f"  Cohen's d: {cohens_d:.3f}")

# Confidence interval for the difference (Welch-Satterthwaite df)
from scipy.stats import t as t_dist

mean_diff = treatment.mean() - control.mean()
se_diff = np.sqrt(control.var()/len(control) + treatment.var()/len(treatment))
df = ((control.var()/len(control) + treatment.var()/len(treatment))**2 /
      ((control.var()**2 / (len(control)**2 * (len(control)-1))) +
       (treatment.var()**2 / (len(treatment)**2 * (len(treatment)-1)))))
ci_margin = t_dist.ppf(0.975, df) * se_diff
print(f"  95% CI for difference: [{mean_diff - ci_margin:.2f}, {mean_diff + ci_margin:.2f}]")
```
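Cohen's d tends to overestimate the true effect size in small samples. Hedges' g applies a small-sample correction; the sketch below uses the standard approximate correction factor 1 − 3/(4N − 9), with pooled SD weighted by degrees of freedom (the sample data here is illustrative):

```python
import numpy as np

def hedges_g(group1, group2):
    """Cohen's d with Hedges' small-sample correction.

    Uses a pooled standard deviation weighted by degrees of freedom
    and the approximate correction factor 1 - 3 / (4*(n1 + n2) - 9).
    """
    n1, n2 = len(group1), len(group2)
    var1 = np.var(group1, ddof=1)
    var2 = np.var(group2, ddof=1)
    pooled_sd = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    d = (np.mean(group2) - np.mean(group1)) / pooled_sd
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return d * correction

# Illustrative data with means and spreads similar to the A/B test above
np.random.seed(123)
a = np.random.normal(75, 40, 100)
b = np.random.normal(90, 50, 100)
g = hedges_g(a, b)
print(f"Hedges' g: {g:.3f}")
```

Because the correction factor is always below 1, g is always slightly smaller in magnitude than d; the difference matters most when total N is small.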
Checking Assumptions and Alternatives
Before trusting your t-test results, verify the normality assumption. Small violations are acceptable due to the Central Limit Theorem, but severe non-normality requires non-parametric alternatives.
```python
import numpy as np
from scipy import stats

def check_ttest_assumptions(group1, group2, group1_name="Group 1", group2_name="Group 2"):
    """
    Check assumptions for Welch's t-test and recommend alternatives if needed.
    """
    results = {'can_use_welch': True, 'warnings': [], 'recommendations': []}

    # Check sample sizes
    n1, n2 = len(group1), len(group2)
    print(f"Sample sizes: {group1_name}={n1}, {group2_name}={n2}")
    if n1 < 5 or n2 < 5:
        results['warnings'].append("Very small sample size - results unreliable")
        results['can_use_welch'] = False

    # Normality test (Shapiro-Wilk)
    # Only reliable for n <= 5000; for larger samples, use visual inspection
    print("\nNormality Tests (Shapiro-Wilk):")
    for name, group in [(group1_name, group1), (group2_name, group2)]:
        if len(group) <= 5000:
            stat, p = stats.shapiro(group)
            normality_status = "Normal" if p > 0.05 else "Non-normal"
            print(f"  {name}: W={stat:.4f}, p={p:.4f} ({normality_status})")
            if p < 0.05 and len(group) < 30:
                results['warnings'].append(f"{name} appears non-normal with small sample")
        else:
            print(f"  {name}: Sample too large for Shapiro-Wilk, use visual inspection")

    # Check for outliers using the IQR method
    print("\nOutlier Detection (IQR method):")
    for name, group in [(group1_name, group1), (group2_name, group2)]:
        q1, q3 = np.percentile(group, [25, 75])
        iqr = q3 - q1
        outliers = np.sum((group < q1 - 1.5*iqr) | (group > q3 + 1.5*iqr))
        outlier_pct = 100 * outliers / len(group)
        print(f"  {name}: {outliers} outliers ({outlier_pct:.1f}%)")
        if outlier_pct > 10:
            results['warnings'].append(f"{name} has many outliers")

    # Provide recommendations
    print("\n" + "="*50)
    if results['warnings']:
        print("WARNINGS:")
        for w in results['warnings']:
            print(f"  ⚠ {w}")
        results['recommendations'].append("Consider Mann-Whitney U test")
        print("\nRECOMMENDATION: Consider Mann-Whitney U test")

        # Run Mann-Whitney as an alternative
        u_stat, u_pvalue = stats.mannwhitneyu(group1, group2, alternative='two-sided')
        print(f"\nMann-Whitney U results: U={u_stat:.1f}, p={u_pvalue:.4f}")
    else:
        print("✓ Assumptions appear satisfied. Welch's t-test is appropriate.")

    return results

# Example usage
np.random.seed(42)
normal_group = np.random.normal(100, 15, 50)
skewed_group = np.random.exponential(20, 45) + 80  # Skewed distribution

check_ttest_assumptions(normal_group, skewed_group, "Normal", "Skewed")
```
Visualizing Results
Effective visualization communicates your findings better than numbers alone. Here’s how to create publication-ready plots with statistical annotations.
```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Prepare data
np.random.seed(42)
control = np.random.normal(100, 12, 45)
treatment = np.random.normal(108, 15, 50)

# Run the test
t_stat, p_value = stats.ttest_ind(control, treatment, equal_var=False)

# Create figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Plot 1: Box plot with individual points
ax1 = axes[0]
data_for_plot = [control, treatment]
positions = [1, 2]
bp = ax1.boxplot(data_for_plot, positions=positions, widths=0.6,
                 patch_artist=True)
bp['boxes'][0].set_facecolor('#3498db')
bp['boxes'][1].set_facecolor('#e74c3c')

# Add individual points with jitter
for data, pos in zip(data_for_plot, positions):
    jitter = np.random.normal(0, 0.04, len(data))
    ax1.scatter(pos + jitter, data, alpha=0.5, s=20,
                color='#2c3e50', zorder=3)

# Add significance annotation
y_max = max(control.max(), treatment.max())
y_annotation = y_max + 5
ax1.plot([1, 1, 2, 2], [y_annotation, y_annotation+2, y_annotation+2, y_annotation],
         'k-', linewidth=1.5)

sig_text = f"p = {p_value:.4f}" if p_value >= 0.001 else "p < 0.001"
if p_value < 0.05:
    sig_text += " *"
if p_value < 0.01:
    sig_text += "*"
if p_value < 0.001:
    sig_text += "*"
ax1.text(1.5, y_annotation + 4, sig_text, ha='center', fontsize=11)

ax1.set_xticks([1, 2])
ax1.set_xticklabels(['Control', 'Treatment'])
ax1.set_ylabel('Value')
ax1.set_title('Group Comparison with Significance')

# Plot 2: Distribution comparison
ax2 = axes[1]
sns.kdeplot(control, ax=ax2, label=f'Control (μ={control.mean():.1f})',
            color='#3498db', fill=True, alpha=0.3)
sns.kdeplot(treatment, ax=ax2, label=f'Treatment (μ={treatment.mean():.1f})',
            color='#e74c3c', fill=True, alpha=0.3)

# Add vertical lines for means
ax2.axvline(control.mean(), color='#3498db', linestyle='--', linewidth=2)
ax2.axvline(treatment.mean(), color='#e74c3c', linestyle='--', linewidth=2)
ax2.set_xlabel('Value')
ax2.set_ylabel('Density')
ax2.set_title('Distribution Comparison')
ax2.legend()

plt.tight_layout()
plt.savefig('welch_ttest_visualization.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nStatistical Summary:")
print(f"  Control: n={len(control)}, mean={control.mean():.2f}, SD={control.std(ddof=1):.2f}")
print(f"  Treatment: n={len(treatment)}, mean={treatment.mean():.2f}, SD={treatment.std(ddof=1):.2f}")
print(f"  Welch's t = {t_stat:.3f}, p = {p_value:.4f}")
```
Conclusion and Best Practices
Welch’s t-test should be your default for comparing two independent groups. Here’s a decision framework:
Use Welch’s t-test when:
- Comparing means of two independent groups
- Data is approximately normal (or sample sizes are large enough for CLT)
- You want a robust test that handles unequal variances
Switch to Mann-Whitney U when:
- Normality assumption is clearly violated
- You have ordinal data
- Outliers are present and can’t be addressed
Common pitfalls to avoid:
- Using Student’s t-test by default: Set `equal_var=False` unless you have a specific reason not to.
- Ignoring effect size: Statistical significance doesn’t mean practical significance. Always report Cohen’s d.
- Multiple comparisons: If testing many groups, use ANOVA with post-hoc corrections instead of multiple t-tests.
- Small samples with non-normality: The CLT won’t save you with n < 30 and skewed data.
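One way to guard against that last pitfall is to estimate power before collecting data. This Monte Carlo sketch repeatedly simulates the planned experiment under an assumed effect and counts how often Welch's test detects it; the means, SDs, and per-arm sample sizes below are hypothetical placeholders you would replace with your own design values:

```python
import numpy as np
from scipy import stats

def simulated_power(mean1, mean2, sd1, sd2, n1, n2,
                    alpha=0.05, n_sims=2000, seed=0):
    """Estimate the power of Welch's t-test by simulation."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        g1 = rng.normal(mean1, sd1, n1)
        g2 = rng.normal(mean2, sd2, n2)
        _, p = stats.ttest_ind(g1, g2, equal_var=False)
        if p < alpha:
            hits += 1
    return hits / n_sims

# Hypothetical design: 10-point difference, unequal SDs, 40 per arm
power = simulated_power(100, 110, 15, 25, 40, 40)
print(f"Estimated power: {power:.2f}")
```

If the estimate falls well below the conventional 0.80 target, increase the sample size (or accept that only larger effects will be detectable) before running the study.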
The bottom line: Welch’s t-test is more flexible and nearly as powerful as Student’s t-test. Make it your standard tool for two-sample comparisons.