How to Use scipy.stats.ttest_ind in Python
The independent two-sample t-test answers a straightforward question: do these two groups have different means? You're comparing two separate, unrelated groups—not the same subjects measured twice.
Key Insights
- The equal_var=False parameter (Welch's t-test) should be your default choice—it's more robust and performs well even when variances are actually equal.
- A statistically significant p-value tells you nothing about practical importance; always calculate effect size (Cohen's d) alongside your t-test.
- The alternative parameter lets you run one-tailed tests, which have more statistical power when you have a directional hypothesis, but use them only when justified before seeing your data.
Introduction to Independent Two-Sample T-Tests
This test shows up constantly in practical work. A/B testing in product development, comparing treatment versus control groups in experiments, analyzing whether two manufacturing processes produce different results, or checking if customer segments differ in their spending behavior. Any time you have two independent groups and a continuous outcome, the two-sample t-test is often your first tool.
SciPy’s scipy.stats.ttest_ind function handles this elegantly, but its parameters and output require careful interpretation. Let’s break it down.
Function Syntax and Parameters
Here’s the function signature with the parameters you’ll actually use:
```python
from scipy import stats

result = stats.ttest_ind(
    a,                        # First sample (array-like)
    b,                        # Second sample (array-like)
    equal_var=True,           # True = Student's t-test, False = Welch's t-test
    alternative='two-sided',  # 'two-sided', 'less', or 'greater'
    nan_policy='propagate'    # 'propagate', 'raise', or 'omit'
)
```
The return value is a result object that unpacks like a tuple, containing statistic (the t-value) and pvalue. In recent SciPy versions (1.10 and later for ttest_ind), the result also exposes a confidence_interval method for the difference between the group means.
Parameter breakdown:
- a, b: Your two samples. These can be lists, NumPy arrays, or any array-like structure. They don't need equal lengths.
- equal_var: Controls whether to assume equal population variances. This is the difference between Student's t-test (True) and Welch's t-test (False).
- alternative: Determines the alternative hypothesis. Use 'two-sided' when you're testing for any difference, 'greater' when testing if a > b, and 'less' when testing if a < b.
- nan_policy: How to handle NaN values. 'propagate' returns NaN, 'raise' throws an error, and 'omit' removes NaN values before calculation.
Basic Usage Example
Let’s compare test scores between two classrooms to see if there’s a statistically significant difference:
```python
import numpy as np
from scipy import stats

# Test scores from two different classrooms
classroom_a = np.array([78, 82, 85, 79, 88, 91, 76, 84, 87, 83])
classroom_b = np.array([72, 75, 78, 71, 80, 74, 69, 77, 73, 76])

# Run the t-test
result = stats.ttest_ind(classroom_a, classroom_b)

print(f"T-statistic: {result.statistic:.4f}")
print(f"P-value: {result.pvalue:.4f}")

# Output:
# T-statistic: 4.0620
# P-value: 0.0008
```
Interpreting the output:
The t-statistic of 4.06 tells you the difference between group means in terms of standard error units. A positive value means group a has a higher mean than group b.
The p-value of 0.0008 is the probability of observing a difference this extreme (or more extreme) if the null hypothesis were true—that is, if both classrooms actually had the same underlying mean. With p < 0.05 (or whatever threshold you’ve set), you reject the null hypothesis and conclude the means differ significantly.
```python
# Get the actual means for context
print(f"Classroom A mean: {classroom_a.mean():.2f}")
print(f"Classroom B mean: {classroom_b.mean():.2f}")
print(f"Difference: {classroom_a.mean() - classroom_b.mean():.2f} points")

# Output:
# Classroom A mean: 83.30
# Classroom B mean: 74.50
# Difference: 8.80 points
```
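To make the "standard error units" interpretation concrete, here's a sketch that reproduces Student's t-statistic by hand on synthetic data (the formula is the point, not SciPy internals):

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for two independent groups
rng = np.random.default_rng(1)
scores_a = rng.normal(loc=83, scale=5, size=10)
scores_b = rng.normal(loc=75, scale=5, size=10)

n1, n2 = len(scores_a), len(scores_b)
# Pooled variance: Student's t-test assumes equal population variances
sp2 = ((n1 - 1) * scores_a.var(ddof=1) + (n2 - 1) * scores_b.var(ddof=1)) / (n1 + n2 - 2)
# Standard error of the difference between the two sample means
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
t_manual = (scores_a.mean() - scores_b.mean()) / se

t_scipy = stats.ttest_ind(scores_a, scores_b, equal_var=True).statistic
print(np.isclose(t_manual, t_scipy))  # True
```

The t-statistic is literally the mean difference divided by its standard error, which is why a larger |t| means a difference that is harder to explain by sampling noise.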
Welch’s T-Test vs. Student’s T-Test
The equal_var parameter is more important than it might seem. Student’s t-test assumes both populations have equal variances. When this assumption is violated, it can produce misleading results, especially with unequal sample sizes.
Welch’s t-test doesn’t make this assumption. It adjusts the degrees of freedom based on the sample variances, making it more robust.
```python
import numpy as np
from scipy import stats

# Two groups with very different variances
group_high_variance = np.array([50, 55, 80, 45, 90, 52, 88, 48])
group_low_variance = np.array([70, 72, 68, 71, 69, 73, 70, 71])

print(f"Group 1 variance: {group_high_variance.var(ddof=1):.2f}")
print(f"Group 2 variance: {group_low_variance.var(ddof=1):.2f}")

# Student's t-test (assumes equal variances)
student_result = stats.ttest_ind(
    group_high_variance,
    group_low_variance,
    equal_var=True
)

# Welch's t-test (doesn't assume equal variances)
welch_result = stats.ttest_ind(
    group_high_variance,
    group_low_variance,
    equal_var=False
)

print(f"\nStudent's t-test: t={student_result.statistic:.4f}, p={student_result.pvalue:.4f}")
print(f"Welch's t-test: t={welch_result.statistic:.4f}, p={welch_result.pvalue:.4f}")

# Output:
# Group 1 variance: 340.21
# Group 2 variance: 2.79
#
# Student's t-test: t=-0.9355, p=0.3658
# Welch's t-test: t=-0.9355, p=0.3783
```
The t-statistics are identical here because the sample sizes are equal, so the pooled and unpooled standard errors coincide; the p-values differ because Welch's test uses adjusted degrees of freedom. In this case the difference is small, but with more extreme variance differences or unequal sample sizes the gap widens, and the t-statistics diverge as well.
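The adjustment Welch's test makes can be computed directly: the Welch–Satterthwaite formula estimates the degrees of freedom from the sample variances. A sketch using the same two groups:

```python
import numpy as np

def welch_df(a, b):
    """Welch-Satterthwaite approximation of the degrees of freedom."""
    v1, v2 = np.var(a, ddof=1), np.var(b, ddof=1)
    n1, n2 = len(a), len(b)
    numerator = (v1 / n1 + v2 / n2) ** 2
    denominator = (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    return numerator / denominator

group_high_variance = np.array([50, 55, 80, 45, 90, 52, 88, 48])
group_low_variance = np.array([70, 72, 68, 71, 69, 73, 70, 71])

# Student's t-test would use n1 + n2 - 2 = 14 degrees of freedom
print(f"Welch df: {welch_df(group_high_variance, group_low_variance):.2f}")  # Welch df: 7.10
```

Fewer effective degrees of freedom means a wider t-distribution, which is how Welch's test pays for not assuming equal variances.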
My recommendation: Use equal_var=False by default. Welch’s test performs nearly identically to Student’s test when variances are equal, but protects you when they’re not. There’s no good reason to assume equal variances unless you have strong prior evidence.
One-Tailed vs. Two-Tailed Tests
The alternative parameter controls your hypothesis direction. A two-tailed test asks “are these different?” while one-tailed tests ask “is A greater than B?” or “is A less than B?”
```python
import numpy as np
from scipy import stats

# A/B test: comparing conversion rates (as percentages per session)
control_group = np.array([2.1, 1.8, 2.3, 1.9, 2.0, 2.2, 1.7, 2.4, 2.1, 1.8])
treatment_group = np.array([2.5, 2.8, 2.3, 2.9, 2.6, 2.7, 2.4, 3.0, 2.5, 2.8])

# Two-tailed: Is there any difference?
two_tailed = stats.ttest_ind(treatment_group, control_group, alternative='two-sided')

# One-tailed: Is treatment specifically GREATER than control?
one_tailed = stats.ttest_ind(treatment_group, control_group, alternative='greater')

print(f"Two-tailed test: t={two_tailed.statistic:.4f}, p={two_tailed.pvalue:.4f}")
print(f"One-tailed test: t={one_tailed.statistic:.4f}, p={one_tailed.pvalue:.4f}")

# Output:
# Two-tailed test: t=4.6528, p=0.0002
# One-tailed test: t=4.6528, p=0.0001
```
The one-tailed p-value is exactly half the two-tailed p-value when the effect is in the predicted direction. This gives you more statistical power—but only use it when you have a directional hypothesis specified before looking at the data. Using a one-tailed test after seeing that one group is higher is p-hacking.
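You can verify the halving relationship yourself. This sketch uses synthetic data; the conditional handles both signs of the t-statistic, since the simple halving only applies when the effect runs in the predicted direction:

```python
import numpy as np
from scipy import stats

# Synthetic groups with a built-in mean difference
rng = np.random.default_rng(0)
a = rng.normal(loc=1.0, scale=1.0, size=50)
b = rng.normal(loc=0.5, scale=1.0, size=50)

two_sided = stats.ttest_ind(a, b, alternative='two-sided')
greater = stats.ttest_ind(a, b, alternative='greater')

t = two_sided.statistic
# If t > 0, the one-sided p is half the two-sided p;
# if t < 0, it's 1 minus half the two-sided p (the t-distribution is symmetric).
expected_one_sided = two_sided.pvalue / 2 if t > 0 else 1 - two_sided.pvalue / 2
print(np.isclose(greater.pvalue, expected_one_sided))  # True
```

Note the second branch: if the effect goes against your prediction, the one-sided test gives you a p-value near 1, not near 0.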
Assumptions and Validation
The t-test makes several assumptions. Understanding when they matter helps you interpret results correctly.
Independence: Observations must be independent within and between groups. This is a study design issue—no statistical test can verify it.
Normality: The sampling distribution of the mean should be approximately normal. With large samples (n > 30 per group), the Central Limit Theorem handles this for you. With small samples, check normality:
```python
import numpy as np
from scipy import stats

# Small sample data
sample_a = np.array([23, 25, 28, 24, 26, 27, 25, 29])
sample_b = np.array([19, 21, 18, 22, 20, 23, 21, 20])

# Shapiro-Wilk test for normality (null hypothesis: data is normal)
shapiro_a = stats.shapiro(sample_a)
shapiro_b = stats.shapiro(sample_b)

print(f"Sample A normality test: W={shapiro_a.statistic:.4f}, p={shapiro_a.pvalue:.4f}")
print(f"Sample B normality test: W={shapiro_b.statistic:.4f}, p={shapiro_b.pvalue:.4f}")

# If p > 0.05, we don't reject normality (data is consistent with normal distribution)
if shapiro_a.pvalue > 0.05 and shapiro_b.pvalue > 0.05:
    print("\nBoth samples are approximately normal. T-test is appropriate.")
    result = stats.ttest_ind(sample_a, sample_b, equal_var=False)
    print(f"T-test result: t={result.statistic:.4f}, p={result.pvalue:.4f}")
else:
    print("\nNormality assumption violated. Consider Mann-Whitney U test.")

# Output:
# Sample A normality test: W=0.9593, p=0.8035
# Sample B normality test: W=0.9408, p=0.6169
#
# Both samples are approximately normal. T-test is appropriate.
# T-test result: t=5.0990, p=0.0002
```
Practical Tips and Common Pitfalls
Handling NaN Values
Real data has missing values. Handle them explicitly:
```python
import numpy as np
from scipy import stats

# Data with missing values
data_a = np.array([10, 12, np.nan, 14, 11, 13, np.nan, 15])
data_b = np.array([8, 9, 10, np.nan, 11, 9, 10, 8])

# Default behavior: propagates NaN (result is NaN)
result_propagate = stats.ttest_ind(data_a, data_b, nan_policy='propagate')
print(f"With propagate: p={result_propagate.pvalue}")

# Better: omit NaN values
result_omit = stats.ttest_ind(data_a, data_b, nan_policy='omit')
print(f"With omit: p={result_omit.pvalue:.4f}")

# Output:
# With propagate: p=nan
# With omit: p=0.0043
```
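Equivalently, you can drop the NaNs yourself before calling the test; this sketch does the same thing as nan_policy='omit' and makes the cleaning step explicit:

```python
import numpy as np
from scipy import stats

data_a = np.array([10, 12, np.nan, 14, 11, 13, np.nan, 15])
data_b = np.array([8, 9, 10, np.nan, 11, 9, 10, 8])

# Boolean masks keep only the non-NaN values in each sample
clean_a = data_a[~np.isnan(data_a)]
clean_b = data_b[~np.isnan(data_b)]

result = stats.ttest_ind(clean_a, clean_b)
print(f"p={result.pvalue:.4f}")  # matches nan_policy='omit'
```

Cleaning up front also lets you report how many observations were dropped, which is worth including in any write-up.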
Always Calculate Effect Size
P-values tell you whether an effect exists, not whether it matters. Cohen’s d quantifies the magnitude:
```python
import numpy as np
from scipy import stats

def cohens_d(group1, group2):
    """Calculate Cohen's d for independent samples."""
    n1, n2 = len(group1), len(group2)
    var1, var2 = group1.var(ddof=1), group2.var(ddof=1)
    # Pooled standard deviation
    pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (group1.mean() - group2.mean()) / pooled_std

# Complete analysis workflow
treatment = np.array([105, 110, 108, 112, 107, 115, 109, 111, 106, 113])
control = np.array([100, 102, 98, 104, 101, 99, 103, 97, 102, 100])

# Run t-test
result = stats.ttest_ind(treatment, control, equal_var=False)

# Calculate effect size
d = cohens_d(treatment, control)

print(f"Treatment mean: {treatment.mean():.2f}")
print(f"Control mean: {control.mean():.2f}")
print(f"T-statistic: {result.statistic:.4f}")
print(f"P-value: {result.pvalue:.6f}")
print(f"Cohen's d: {d:.4f}")
print(f"\nInterpretation: {'Small' if abs(d) < 0.5 else 'Medium' if abs(d) < 0.8 else 'Large'} effect size")

# Output:
# Treatment mean: 109.60
# Control mean: 100.60
# T-statistic: 7.5000
# P-value: 0.000001
# Cohen's d: 3.3541
#
# Interpretation: Large effect size
```
When to Use Alternatives
If your normality check fails and you have small samples, consider the Mann-Whitney U test:
```python
import numpy as np
from scipy import stats

# Non-normal data
skewed_a = np.array([1, 2, 2, 3, 3, 3, 15, 20])  # Right-skewed
skewed_b = np.array([1, 1, 2, 2, 2, 3, 3, 4])

# Mann-Whitney U (non-parametric alternative)
mann_whitney = stats.mannwhitneyu(skewed_a, skewed_b, alternative='two-sided')

print(f"Mann-Whitney U: statistic={mann_whitney.statistic:.4f}, p={mann_whitney.pvalue:.4f}")
```
The t-test is robust to moderate non-normality with larger samples, but when in doubt with small, skewed samples, non-parametric tests are safer.