How to Use scipy.stats for Hypothesis Testing in Python
Key Insights
- scipy.stats provides a comprehensive suite of statistical tests that cover most hypothesis testing scenarios, from t-tests to chi-square tests to ANOVA, all with consistent interfaces and return types.
- Choosing the right test depends on your data characteristics: parametric tests (t-tests, ANOVA) assume normally distributed data, while non-parametric alternatives (Mann-Whitney, Kruskal-Wallis) handle violations of these assumptions.
- A statistically significant p-value doesn’t mean a practically significant result—always pair hypothesis tests with effect size calculations and domain knowledge before drawing conclusions.
Introduction to Hypothesis Testing with scipy.stats
Hypothesis testing is the backbone of statistical inference. You have data, you have a question, and you need a rigorous way to answer it. The scipy.stats module is Python’s most mature and battle-tested library for this purpose.
The core workflow is straightforward: you formulate a null hypothesis (typically “no effect” or “no difference”), collect data, run a statistical test, and examine whether the evidence is strong enough to reject that null hypothesis. The p-value tells you the probability of observing your data (or something more extreme) if the null hypothesis were true. When this probability drops below your significance level (commonly 0.05), you reject the null.
scipy.stats handles the mathematical heavy lifting. Your job is to choose the right test, prepare your data correctly, and interpret results with appropriate skepticism.
Setting Up Your Environment
First, get your environment ready. If you don’t have scipy installed, fix that now:
pip install scipy numpy
Here’s the standard import pattern and some sample data generation we’ll use throughout this article:
import numpy as np
from scipy import stats
# Set seed for reproducibility
np.random.seed(42)
# Sample data: two groups with slightly different means
group_a = np.random.normal(loc=100, scale=15, size=50)
group_b = np.random.normal(loc=108, scale=15, size=50)
# Paired data: before and after treatment
before_treatment = np.random.normal(loc=75, scale=10, size=30)
after_treatment = before_treatment + np.random.normal(loc=5, scale=3, size=30)
# Non-normal data for non-parametric tests
skewed_data = np.random.exponential(scale=2, size=100)
I’m using numpy’s random generators to create synthetic datasets with known properties. In real work, you’d load your actual data from CSV files, databases, or APIs.
One-Sample and Two-Sample T-Tests
The t-test is your workhorse for comparing means. Use the one-sample version when comparing a sample to a known population value, and the two-sample version when comparing two independent groups.
One-Sample T-Test
Suppose you’re testing whether a manufacturing process produces widgets with a target weight of 100 grams:
# Sample of widget weights
widget_weights = np.random.normal(loc=101.5, scale=3, size=40)
# Test against target of 100 grams
result = stats.ttest_1samp(widget_weights, popmean=100)
print(f"T-statistic: {result.statistic:.4f}")
print(f"P-value: {result.pvalue:.4f}")
# For a two-tailed test at alpha=0.05
alpha = 0.05
if result.pvalue < alpha:
    print("Reject null hypothesis: mean differs from 100")
else:
    print("Fail to reject null hypothesis")
The function returns a named tuple with the test statistic and p-value. This is scipy.stats’ consistent pattern across most tests.
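Because the result behaves like a named tuple, you can unpack it directly alongside attribute access. As a small sketch (the confidence_interval() method on t-test results is only available in SciPy 1.10 and later, hence the guard):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
weights = rng.normal(loc=101.5, scale=3, size=40)

result = stats.ttest_1samp(weights, popmean=100)

# Tuple-style unpacking works alongside attribute access
t_stat, p_val = result
print(t_stat == result.statistic)  # True

# SciPy 1.10+: a confidence interval for the sample mean
try:
    ci = result.confidence_interval(confidence_level=0.95)
    print(f"95% CI for the mean: ({ci.low:.2f}, {ci.high:.2f})")
except AttributeError:
    pass  # older SciPy versions lack this method
```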
Two-Sample Independent T-Test
When comparing two independent groups, use ttest_ind():
# Compare group_a and group_b
result = stats.ttest_ind(group_a, group_b)
print(f"T-statistic: {result.statistic:.4f}")
print(f"P-value: {result.pvalue:.4f}")
# If variances are unequal, use Welch's t-test
result_welch = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"Welch's T-statistic: {result_welch.statistic:.4f}")
print(f"Welch's P-value: {result_welch.pvalue:.4f}")
Always consider whether the equal-variance assumption holds. When in doubt, use equal_var=False for Welch's t-test; it's more robust, and there's little downside when the variances actually are equal.
Paired T-Tests and Non-Parametric Alternatives
Paired T-Test
When observations are naturally paired (same subjects measured twice, matched pairs), use ttest_rel():
# Before/after treatment comparison
result = stats.ttest_rel(before_treatment, after_treatment)
print(f"T-statistic: {result.statistic:.4f}")
print(f"P-value: {result.pvalue:.4f}")
# Calculate mean difference
mean_diff = np.mean(after_treatment - before_treatment)
print(f"Mean improvement: {mean_diff:.2f}")
The paired test accounts for individual variation by analyzing differences rather than raw values. This typically gives you more statistical power when the pairing is meaningful.
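You can see that equivalence directly: a paired t-test is just a one-sample t-test on the differences against a population mean of zero. A quick sketch with synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
before = rng.normal(loc=75, scale=10, size=30)
after = before + rng.normal(loc=5, scale=3, size=30)

paired = stats.ttest_rel(before, after)
one_sample = stats.ttest_1samp(before - after, popmean=0)

# The two formulations give identical statistics and p-values
print(np.isclose(paired.statistic, one_sample.statistic))  # True
print(np.isclose(paired.pvalue, one_sample.pvalue))        # True
```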
Non-Parametric Alternatives
When your data violates normality assumptions (check with stats.shapiro() or visual inspection), switch to non-parametric tests:
# Test for normality first
stat, p = stats.shapiro(skewed_data)
print(f"Shapiro-Wilk test p-value: {p:.4f}")
# Mann-Whitney U test (non-parametric alternative to independent t-test)
group_x = np.random.exponential(scale=2, size=50)
group_y = np.random.exponential(scale=2.5, size=50)
result = stats.mannwhitneyu(group_x, group_y, alternative='two-sided')
print(f"Mann-Whitney U statistic: {result.statistic:.4f}")
print(f"P-value: {result.pvalue:.4f}")
# Wilcoxon signed-rank test (non-parametric alternative to paired t-test)
diff = after_treatment - before_treatment
result = stats.wilcoxon(diff)
print(f"Wilcoxon statistic: {result.statistic:.4f}")
print(f"P-value: {result.pvalue:.4f}")
Non-parametric tests make fewer assumptions but typically have less statistical power. Use them when parametric assumptions clearly fail, not as a default choice.
Chi-Square Tests for Categorical Data
When working with categorical variables, chi-square tests are your primary tool.
Test of Independence
Use chi2_contingency() to test whether two categorical variables are independent:
# Contingency table: treatment vs outcome
# Rows: Treatment A, Treatment B
# Columns: Success, Failure
observed = np.array([
    [45, 15],  # Treatment A: 45 successes, 15 failures
    [30, 30]   # Treatment B: 30 successes, 30 failures
])
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value: {p:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"Expected frequencies:\n{expected}")
The function returns four values: the chi-square statistic, p-value, degrees of freedom, and the expected frequencies under the null hypothesis of independence.
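Because the chi-square statistic grows with sample size, it pays to report an effect size alongside it. Cramér's V is a common choice; here is a manual sketch (recent SciPy versions also offer stats.contingency.association for this):

```python
import numpy as np
from scipy import stats

observed = np.array([[45, 15],
                     [30, 30]])

chi2, p, dof, expected = stats.chi2_contingency(observed)

# Cramér's V: effect size on a 0-1 scale, independent of sample size
n = observed.sum()
min_dim = min(observed.shape) - 1
cramers_v = np.sqrt(chi2 / (n * min_dim))
print(f"Cramér's V: {cramers_v:.4f}")
```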
Goodness-of-Fit Test
Test whether observed frequencies match expected proportions with chisquare():
# Observed dice rolls
observed_rolls = np.array([18, 22, 16, 20, 14, 20]) # 110 total rolls
# Expected: equal probability for fair die
expected_rolls = np.array([110/6] * 6)
result = stats.chisquare(observed_rolls, f_exp=expected_rolls)
print(f"Chi-square statistic: {result.statistic:.4f}")
print(f"P-value: {result.pvalue:.4f}")
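Worth knowing: if you omit f_exp, chisquare() assumes equal expected frequencies across categories, so for a fair-die test the explicit array above is optional:

```python
import numpy as np
from scipy import stats

observed_rolls = np.array([18, 22, 16, 20, 14, 20])  # 110 total rolls

with_exp = stats.chisquare(observed_rolls, f_exp=np.array([110 / 6] * 6))
default = stats.chisquare(observed_rolls)  # uniform expected by default

print(np.isclose(with_exp.pvalue, default.pvalue))  # True
```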
ANOVA and Beyond
When comparing means across three or more groups, ANOVA is the appropriate test.
One-Way ANOVA
# Three treatment groups
treatment_1 = np.random.normal(loc=50, scale=8, size=25)
treatment_2 = np.random.normal(loc=55, scale=8, size=25)
treatment_3 = np.random.normal(loc=52, scale=8, size=25)
result = stats.f_oneway(treatment_1, treatment_2, treatment_3)
print(f"F-statistic: {result.statistic:.4f}")
print(f"P-value: {result.pvalue:.4f}")
ANOVA tells you whether at least one group differs from the others, but not which one. For post-hoc pairwise comparisons, you’ll need additional tests (consider scipy.stats.tukey_hsd() in newer scipy versions or the statsmodels library).
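As a minimal post-hoc sketch (illustrative only, not a substitute for Tukey's HSD), you can run pairwise Welch t-tests and apply a Bonferroni correction by hand:

```python
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(42)
groups = {
    "treatment_1": rng.normal(loc=50, scale=8, size=25),
    "treatment_2": rng.normal(loc=55, scale=8, size=25),
    "treatment_3": rng.normal(loc=52, scale=8, size=25),
}

pairs = list(combinations(groups, 2))
alpha = 0.05 / len(pairs)  # Bonferroni-adjusted threshold

for name_a, name_b in pairs:
    res = stats.ttest_ind(groups[name_a], groups[name_b], equal_var=False)
    verdict = "significant" if res.pvalue < alpha else "not significant"
    print(f"{name_a} vs {name_b}: p={res.pvalue:.4f} ({verdict})")
```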
Kruskal-Wallis Test
The non-parametric alternative to one-way ANOVA:
# Non-normal data across groups
group_1 = np.random.exponential(scale=2, size=30)
group_2 = np.random.exponential(scale=2.5, size=30)
group_3 = np.random.exponential(scale=3, size=30)
result = stats.kruskal(group_1, group_2, group_3)
print(f"H-statistic: {result.statistic:.4f}")
print(f"P-value: {result.pvalue:.4f}")
Interpreting Results and Best Practices
Raw scipy.stats output is just the beginning. Here’s how to extract and present results professionally:
def format_test_result(test_name, result, alpha=0.05):
    """Format hypothesis test results for reporting."""
    significant = result.pvalue < alpha
    output = {
        'test': test_name,
        'statistic': round(result.statistic, 4),
        'p_value': round(result.pvalue, 4),
        'alpha': alpha,
        'significant': significant,
        'conclusion': 'Reject H0' if significant else 'Fail to reject H0'
    }
    return output

# Example usage
result = stats.ttest_ind(group_a, group_b)
formatted = format_test_result('Independent t-test', result)
for key, value in formatted.items():
    print(f"{key}: {value}")
# Calculate effect size (Cohen's d) for practical significance
def cohens_d(group1, group2):
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (np.mean(group1) - np.mean(group2)) / pooled_std
effect = cohens_d(group_a, group_b)
print(f"Cohen's d: {effect:.4f}")
Critical Best Practices
Don’t worship p-values. A p-value of 0.049 isn’t meaningfully different from 0.051. Report exact p-values and let readers judge significance in context.
Account for multiple testing. Running 20 tests at α=0.05 virtually guarantees a false positive. Use Bonferroni correction (alpha / number_of_tests) or better yet, the Benjamini-Hochberg procedure for controlling false discovery rate.
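The Benjamini-Hochberg procedure is simple enough to sketch in a few lines of numpy (shown for illustration; statsmodels' multipletests is the battle-tested implementation):

```python
import numpy as np

def benjamini_hochberg(p_values, fdr=0.05):
    """Return a boolean mask of hypotheses rejected at the given FDR."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    # Find the largest k with p_(k) <= (k/m) * fdr, then reject 1..k
    thresholds = (np.arange(1, m + 1) / m) * fdr
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        reject[order[:k + 1]] = True
    return reject

p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(p_vals, fdr=0.05))
```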
Check assumptions before testing. Use stats.shapiro() for normality, stats.levene() for homogeneity of variance. Violations don’t always invalidate tests, but severe violations require non-parametric alternatives.
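The variance check follows the same pattern as the normality check. A quick sketch with two synthetic groups:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=100, scale=15, size=50)
group_b = rng.normal(loc=108, scale=15, size=50)

# Levene's test: H0 is that the groups have equal variances
stat, p = stats.levene(group_a, group_b)
print(f"Levene's test p-value: {p:.4f}")
# A small p suggests unequal variances; prefer Welch's t-test in that case
```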
Effect size matters more than significance. With large samples, trivial differences become “statistically significant.” Always calculate and report effect sizes alongside p-values.
Know when to stop. scipy.stats covers common scenarios well. For complex designs (repeated measures ANOVA, mixed models, survival analysis), graduate to statsmodels or specialized R packages. When stakes are high, consult a statistician.