How to Calculate P-Values in Python

Key Insights

  • P-values quantify the probability of observing your data (or more extreme data) if the null hypothesis were true—they don’t tell you the probability that your hypothesis is correct.
  • Python’s scipy.stats library provides straightforward functions for t-tests, chi-square tests, and correlation analysis, each returning p-values directly.
  • Always pair p-values with effect sizes and confidence intervals; statistical significance alone doesn’t imply practical importance.

Introduction to P-Values

A p-value answers a specific question: if there were truly no effect or no difference, how likely would we be to observe data at least as extreme as what we collected? This probability helps researchers decide whether to reject a null hypothesis.

The logic works backwards from what most people expect. You’re not calculating the probability that your hypothesis is true. Instead, you’re calculating how surprising your data would be under the assumption that nothing interesting is happening. A small p-value (typically below 0.05) suggests the data is surprising enough to doubt that assumption.

Python has become the go-to language for statistical computing because of its accessible syntax and mature ecosystem. Libraries like SciPy and statsmodels provide battle-tested implementations of statistical tests, letting you focus on your analysis rather than the underlying mathematics.

Setting Up Your Environment

You’ll need three core libraries for most p-value calculations. SciPy handles the fundamental statistical tests, NumPy provides array operations, and statsmodels offers more sophisticated modeling capabilities with detailed output.

pip install scipy numpy statsmodels pandas

Here’s the standard import block you’ll use throughout your statistical work:

import numpy as np
from scipy import stats
import statsmodels.api as sm
import pandas as pd

# Set random seed for reproducibility
np.random.seed(42)

I recommend always setting a random seed when working with simulated data or bootstrapping methods. It makes your analysis reproducible and debugging much easier.
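Before running real tests, it helps to see what a p-value means operationally. The sketch below (a simulation of my own construction, not from any particular reference) draws thousands of samples where the null hypothesis is true by design and counts how often a t-test reports p < 0.05. The answer should hover around 5%, because that is literally what the threshold means:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulate 10,000 experiments where the null is TRUE by construction:
# every sample comes from a normal distribution with mean exactly 100
n_sims, n = 10_000, 30
false_positives = 0
for _ in range(n_sims):
    sample = rng.normal(loc=100, scale=15, size=n)
    _, p = stats.ttest_1samp(sample, popmean=100)
    if p < 0.05:
        false_positives += 1

print(f"Fraction of p < 0.05 under a true null: {false_positives / n_sims:.3f}")
```

The fraction lands near 0.05: with no effect present, the 0.05 threshold produces a "significant" result in about one experiment in twenty.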

P-Values from T-Tests

T-tests compare means between groups or against a known value. SciPy provides three variants that cover most use cases.

One-Sample T-Test

Use this when comparing a sample mean against a hypothesized population mean:

# Sample data: daily website load times (seconds)
load_times = np.array([2.3, 2.8, 2.1, 3.2, 2.5, 2.9, 2.4, 3.1, 2.7, 2.6])

# Test if mean differs from target of 2.5 seconds
t_statistic, p_value = stats.ttest_1samp(load_times, popmean=2.5)

print(f"T-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
# Output:
# T-statistic: 1.4446
# P-value: 0.1825

The p-value of about 0.18 suggests no significant difference from the target—we’d expect to see this much deviation about 18% of the time by chance alone.
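It's worth computing this once by hand to see what SciPy is doing. The t-statistic is just the distance between the sample mean and the hypothesized mean, measured in standard errors, and the two-sided p-value comes from the t-distribution's survival function:

```python
import numpy as np
from scipy import stats

load_times = np.array([2.3, 2.8, 2.1, 3.2, 2.5, 2.9, 2.4, 3.1, 2.7, 2.6])
n = len(load_times)

# t = (sample mean - hypothesized mean) / standard error
se = load_times.std(ddof=1) / np.sqrt(n)
t_manual = (load_times.mean() - 2.5) / se

# Two-sided p-value: probability mass in both tails beyond |t|
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Matches scipy's one-liner
t_scipy, p_scipy = stats.ttest_1samp(load_times, popmean=2.5)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
# → True True
```

Once you've seen this decomposition, `ttest_1samp` stops being a black box.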

Two-Sample Independent T-Test

This is the workhorse for A/B testing and comparing two independent groups:

# Conversion rates from two landing page variants
page_a = np.array([4.2, 3.8, 5.1, 4.5, 3.9, 4.8, 4.1, 5.0, 4.3, 4.6])
page_b = np.array([5.1, 4.9, 5.8, 5.2, 5.5, 4.7, 5.3, 5.6, 5.0, 5.4])

# Compare the two groups
t_stat, p_value = stats.ttest_ind(page_a, page_b)

print(f"Page A mean: {page_a.mean():.2f}")
print(f"Page B mean: {page_b.mean():.2f}")
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")
# Output:
# Page A mean: 4.43
# Page B mean: 5.25
# T-statistic: -4.6275
# P-value: 0.0002

With a p-value of 0.0002, we have strong evidence that Page B outperforms Page A; a difference this large would be very unlikely to arise from random variation alone.
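One caveat: `ttest_ind` assumes the two groups have equal variances by default. When you can't justify that assumption, which is often, pass `equal_var=False` to run Welch's t-test instead; it costs almost nothing in power and is the safer default. Reusing the same arrays:

```python
import numpy as np
from scipy import stats

page_a = np.array([4.2, 3.8, 5.1, 4.5, 3.9, 4.8, 4.1, 5.0, 4.3, 4.6])
page_b = np.array([5.1, 4.9, 5.8, 5.2, 5.5, 4.7, 5.3, 5.6, 5.0, 5.4])

# Welch's t-test: drops the equal-variance assumption
t_welch, p_welch = stats.ttest_ind(page_a, page_b, equal_var=False)
print(f"Welch t-statistic: {t_welch:.4f}")
print(f"Welch p-value: {p_welch:.4f}")
```

With equal group sizes the t-statistic is unchanged; only the degrees of freedom (and therefore the p-value) shift slightly. With unequal sizes and variances, the two versions can disagree substantially.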

Paired T-Test

Use this when measurements come from the same subjects under two conditions:

# Response times before and after optimization
before = np.array([120, 135, 142, 128, 155, 138, 145, 130, 148, 125])
after = np.array([115, 128, 130, 122, 140, 132, 138, 125, 135, 118])

t_stat, p_value = stats.ttest_rel(before, after)

print(f"Mean improvement: {(before - after).mean():.1f} ms")
print(f"P-value: {p_value:.4f}")
# Output:
# Mean improvement: 8.3 ms
# P-value: 0.0000

The paired test accounts for individual differences, making it more powerful when you have matched data.
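Under the hood, a paired t-test is nothing more than a one-sample t-test on the per-subject differences, which is a useful sanity check and makes the "accounts for individual differences" claim concrete:

```python
import numpy as np
from scipy import stats

before = np.array([120, 135, 142, 128, 155, 138, 145, 130, 148, 125])
after = np.array([115, 128, 130, 122, 140, 132, 138, 125, 135, 118])

# A paired test and a one-sample test on the differences
# against zero produce identical results
_, p_paired = stats.ttest_rel(before, after)
_, p_diff = stats.ttest_1samp(before - after, popmean=0)

print(np.isclose(p_paired, p_diff))  # → True
```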

P-Values from Chi-Square Tests

Chi-square tests work with categorical data, testing whether observed frequencies differ from expected frequencies. The most common application is testing independence in contingency tables.

# Observed frequencies: rows = device type, columns = conversion (no/yes)
# Mobile users: 450 didn't convert, 50 converted
# Desktop users: 380 didn't convert, 120 converted
observed = np.array([
    [450, 50],   # Mobile
    [380, 120]   # Desktop
])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print("Observed frequencies:")
print(observed)
print(f"\nExpected frequencies (if independent):")
print(expected.round(1))
print(f"\nChi-square statistic: {chi2:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"P-value: {p_value:.6f}")
# Output:
# Chi-square statistic: 33.7420
# Degrees of freedom: 1
# P-value: 0.000000

The tiny p-value tells us device type and conversion are not independent—desktop users convert at a significantly higher rate. This might prompt investigation into mobile UX issues.
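One detail worth knowing: for 2×2 tables, `chi2_contingency` applies Yates' continuity correction by default, which makes the test slightly conservative. Pass `correction=False` if you want the uncorrected statistic:

```python
import numpy as np
from scipy import stats

observed = np.array([[450, 50],
                     [380, 120]])

# Default: Yates' continuity correction (applied only to 2x2 tables)
chi2_corrected, p_corrected, _, _ = stats.chi2_contingency(observed)
# Uncorrected Pearson chi-square
chi2_raw, p_raw, _, _ = stats.chi2_contingency(observed, correction=False)

print(f"With Yates correction:    chi2 = {chi2_corrected:.4f}")
print(f"Without Yates correction: chi2 = {chi2_raw:.4f}")
```

The uncorrected statistic is always at least as large, so the corrected p-value is the more conservative of the two. For tables bigger than 2×2 the argument has no effect.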

For larger tables, the interpretation remains the same:

# Traffic source vs. subscription tier
observed = np.array([
    [200, 150, 50],   # Organic
    [180, 120, 100],  # Paid
    [220, 80, 30]     # Referral
])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"P-value: {p_value:.4f}")
# P-value: 0.0000

P-Values from Correlation Analysis

When testing whether two continuous variables are related, Pearson’s correlation coefficient comes with a built-in p-value:

# Hours studied vs. exam scores
hours = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
scores = np.array([65, 70, 68, 75, 80, 78, 85, 88, 92, 95])

correlation, p_value = stats.pearsonr(hours, scores)

print(f"Pearson correlation: {correlation:.4f}")
print(f"P-value: {p_value:.6f}")
# Output:
# Pearson correlation: 0.9836
# P-value: 0.000000

The correlation of 0.98 is strong, and the tiny p-value confirms this isn’t a fluke—there’s a genuine relationship between study hours and scores.
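A p-value alone doesn't tell you how precisely the correlation is estimated. A standard way to get a confidence interval is the Fisher z-transformation, sketched by hand below (recent SciPy versions can also produce one via the `pearsonr` result object's `confidence_interval()` method):

```python
import numpy as np
from scipy import stats

hours = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
scores = np.array([65, 70, 68, 75, 80, 78, 85, 88, 92, 95])

r, _ = stats.pearsonr(hours, scores)
n = len(hours)

# Fisher z-transform: arctanh(r) is approximately normal
# with standard error 1 / sqrt(n - 3)
z = np.arctanh(r)
se = 1 / np.sqrt(n - 3)
lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)

print(f"r = {r:.4f}, 95% CI: ({lo:.4f}, {hi:.4f})")
```

With only ten observations the interval is wide even for a strong correlation, which is exactly the kind of context a bare p-value hides.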

For non-linear relationships or ordinal data, use Spearman’s rank correlation:

# Rankings that might not be linearly related
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([1, 3, 2, 5, 4, 8, 6, 9, 7, 10])

spearman_corr, p_value = stats.spearmanr(x, y)
print(f"Spearman correlation: {spearman_corr:.4f}")
print(f"P-value: {p_value:.4f}")

P-Values in Regression Models

Statsmodels provides comprehensive regression output including p-values for each coefficient:

# Predicting house prices
np.random.seed(42)
n = 100

square_feet = np.random.uniform(1000, 3000, n)
bedrooms = np.random.randint(2, 6, n)
age = np.random.uniform(0, 50, n)

# True relationship with some noise
price = 50000 + 100 * square_feet + 15000 * bedrooms - 1000 * age + np.random.normal(0, 20000, n)

# Create DataFrame
df = pd.DataFrame({
    'price': price,
    'sqft': square_feet,
    'bedrooms': bedrooms,
    'age': age
})

# Fit OLS model
X = df[['sqft', 'bedrooms', 'age']]
X = sm.add_constant(X)  # Add intercept
y = df['price']

model = sm.OLS(y, X).fit()
print(model.summary())

The summary output includes a table with p-values for each coefficient:

# Extract just the p-values programmatically
print("\nCoefficient P-values:")
for var, pval in model.pvalues.items():
    significance = "***" if pval < 0.001 else "**" if pval < 0.01 else "*" if pval < 0.05 else ""
    print(f"  {var}: {pval:.4f} {significance}")

This lets you identify which predictors have statistically significant relationships with the outcome.

Interpreting and Reporting Results

Here’s where many analysts go wrong. A p-value below 0.05 doesn’t mean your finding is important, and a p-value above 0.05 doesn’t mean there’s no effect.

Common Pitfalls

P-hacking occurs when you run many tests and report only the significant ones. If you test 20 true-null hypotheses at α = 0.05, you expect about one false positive by chance. Apply corrections like Bonferroni when running multiple comparisons:

from statsmodels.stats.multitest import multipletests

# Original p-values from multiple tests
p_values = [0.01, 0.04, 0.03, 0.08, 0.02]

# Apply Bonferroni correction
rejected, corrected_p, _, _ = multipletests(p_values, method='bonferroni')

print("Original vs. Corrected P-values:")
for orig, corr in zip(p_values, corrected_p):
    print(f"  {orig:.3f} -> {corr:.3f}")
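Bonferroni is simple but conservative: it controls the chance of even one false positive, at the cost of missing real effects. When you can tolerate a known fraction of false discoveries, the Benjamini–Hochberg procedure (available in the same function as `method='fdr_bh'`) is a common, less punishing alternative:

```python
from statsmodels.stats.multitest import multipletests

p_values = [0.01, 0.04, 0.03, 0.08, 0.02]

# Benjamini-Hochberg controls the false discovery rate
# rather than the family-wise error rate
rejected, corrected_p, _, _ = multipletests(p_values, method='fdr_bh')

for orig, corr in zip(p_values, corrected_p):
    print(f"  {orig:.3f} -> {corr:.3f}")
```

On this example, BH leaves four of the five tests significant, whereas Bonferroni (multiplying every p-value by 5) leaves only the borderline cases.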

Confusing significance with importance is equally problematic. With large samples, tiny effects become statistically significant. Always report effect sizes:

# Cohen's d for effect size
def cohens_d(group1, group2):
    n1, n2 = len(group1), len(group2)
    var1, var2 = group1.var(ddof=1), group2.var(ddof=1)  # sample variances (ddof=1)
    pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (group1.mean() - group2.mean()) / pooled_std

effect_size = cohens_d(page_a, page_b)
print(f"Cohen's d: {effect_size:.2f}")
# Interpretation: 0.2 = small, 0.5 = medium, 0.8 = large
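To round out the picture, a confidence interval for the difference in means expresses the effect in the outcome's own units. A sketch using the pooled standard error (matching the equal-variance t-test from the A/B example above):

```python
import numpy as np
from scipy import stats

page_a = np.array([4.2, 3.8, 5.1, 4.5, 3.9, 4.8, 4.1, 5.0, 4.3, 4.6])
page_b = np.array([5.1, 4.9, 5.8, 5.2, 5.5, 4.7, 5.3, 5.6, 5.0, 5.4])

n1, n2 = len(page_a), len(page_b)
diff = page_a.mean() - page_b.mean()

# Pooled standard error, the same quantity the equal-variance t-test uses
sp2 = ((n1 - 1) * page_a.var(ddof=1) + (n2 - 1) * page_b.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))

# 95% interval from the t-distribution with n1 + n2 - 2 df
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
lo, hi = diff - t_crit * se, diff + t_crit * se
print(f"Difference: {diff:.2f}, 95% CI: ({lo:.2f}, {hi:.2f})")
```

The entire interval sits below zero, which conveys the same conclusion as the significant p-value plus a sense of how big the gap plausibly is.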

Reporting Best Practices

When writing up results, include the test statistic, degrees of freedom, p-value, and effect size:

“Desktop users showed significantly higher conversion rates than mobile users (χ² = 33.74, df = 1, p < 0.001). The odds ratio of 2.84 indicates desktop users were nearly three times more likely to convert.”

Don’t just say “p < 0.05”—report the actual value. And never describe results as “trending toward significance.” Either they meet your predetermined threshold or they don’t.

P-values are tools, not verdicts. Use them alongside domain knowledge, effect sizes, and confidence intervals to make informed decisions about your data.
