How to Calculate Effect Size (Cohen's d) in Python
Key Insights
- P-values tell you whether an effect exists, but Cohen’s d tells you whether that effect actually matters—a statistically significant result can be practically meaningless without effect size context.
- Cohen’s d expresses the difference between two groups in standard deviation units, making it comparable across different studies and measurement scales.
- Always pair your hypothesis tests with effect size calculations and confidence intervals; libraries like pingouin make this trivially easy in Python.
Introduction to Effect Size
Statistical significance has a credibility problem. With a large enough sample, you can achieve a p-value below 0.05 for differences so small they’re meaningless in practice. This is where effect size saves you from publishing embarrassing conclusions.
Cohen’s d is the most widely used effect size measure for comparing two means. It answers a simple question: how many standard deviations apart are these two groups? A significant p-value tells you the difference probably isn’t due to chance; Cohen’s d tells you whether you should care.
Here’s the problem in action:
import numpy as np
from scipy import stats
np.random.seed(42)
# Tiny effect, huge sample
group_a = np.random.normal(100, 15, 10000)
group_b = np.random.normal(100.5, 15, 10000) # Only 0.5 point difference
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"Tiny effect: p-value = {p_value:.6f}") # Significant!
# Large effect, small sample
group_c = np.random.normal(100, 15, 30)
group_d = np.random.normal(115, 15, 30) # 15 point difference (1 SD)
t_stat, p_value = stats.ttest_ind(group_c, group_d)
print(f"Large effect: p-value = {p_value:.6f}") # Also significant
Both results are statistically significant, but the first represents a trivial 0.5-point difference while the second shows a full standard deviation gap. Without effect size, you can’t distinguish between them.
The Math Behind Cohen’s d
Cohen’s d for two independent samples is calculated as:
$$d = \frac{M_1 - M_2}{s_{pooled}}$$
Where the pooled standard deviation is:
$$s_{pooled} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$
This formula weights each group’s variance by its degrees of freedom, giving larger samples more influence on the pooled estimate.
Jacob Cohen himself provided rough benchmarks for interpreting effect sizes:
- Small: d = 0.2 (distributions overlap about 85%)
- Medium: d = 0.5 (distributions overlap about 67%)
- Large: d = 0.8 (distributions overlap about 53%)
These are guidelines, not laws. A “small” effect size might be huge in your domain—a 0.2 SD improvement in cancer survival rates matters enormously.
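The overlap figures above correspond to Cohen's U1 statistic (percent non-overlap). Assuming two normal distributions with equal variances, you can verify them directly; `overlap_from_d` is a helper name introduced here for illustration:

```python
from scipy.stats import norm

def overlap_from_d(d: float) -> float:
    """Percent overlap (1 - Cohen's U1) between two unit-variance
    normal distributions whose means differ by d."""
    p = norm.cdf(abs(d) / 2)
    u1 = (2 * p - 1) / p  # Cohen's U1: proportion of non-overlap
    return 1 - u1

# Matches the ~85% / 67% / 53% benchmarks above
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: overlap ~ {overlap_from_d(d):.0%}")
```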
Let’s calculate this manually:
import numpy as np
# Sample data
group_1 = np.array([23, 25, 28, 29, 31, 33, 35, 37, 40, 42])
group_2 = np.array([18, 20, 22, 24, 26, 27, 29, 30, 32, 34])
# Step 1: Calculate means
mean_1 = np.mean(group_1)
mean_2 = np.mean(group_2)
print(f"Mean 1: {mean_1:.2f}, Mean 2: {mean_2:.2f}")
# Step 2: Calculate standard deviations (using n-1, i.e., ddof=1)
std_1 = np.std(group_1, ddof=1)
std_2 = np.std(group_2, ddof=1)
print(f"Std 1: {std_1:.2f}, Std 2: {std_2:.2f}")
# Step 3: Calculate pooled standard deviation
n1, n2 = len(group_1), len(group_2)
pooled_std = np.sqrt(((n1 - 1) * std_1**2 + (n2 - 1) * std_2**2) / (n1 + n2 - 2))
print(f"Pooled Std: {pooled_std:.2f}")
# Step 4: Calculate Cohen's d
cohens_d = (mean_1 - mean_2) / pooled_std
print(f"Cohen's d: {cohens_d:.3f}")
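As a sanity check, Cohen's d for independent samples relates directly to the t statistic via $d = t \sqrt{1/n_1 + 1/n_2}$. A quick sketch using the same two groups:

```python
import numpy as np
from scipy import stats

group_1 = np.array([23, 25, 28, 29, 31, 33, 35, 37, 40, 42])
group_2 = np.array([18, 20, 22, 24, 26, 27, 29, 30, 32, 34])
n1, n2 = len(group_1), len(group_2)

# d computed directly from the pooled standard deviation, as above
pooled_std = np.sqrt(((n1 - 1) * np.var(group_1, ddof=1) +
                      (n2 - 1) * np.var(group_2, ddof=1)) / (n1 + n2 - 2))
d_direct = (np.mean(group_1) - np.mean(group_2)) / pooled_std

# d recovered from the t statistic: d = t * sqrt(1/n1 + 1/n2)
t_stat, _ = stats.ttest_ind(group_1, group_2)
d_from_t = t_stat * np.sqrt(1 / n1 + 1 / n2)

# The two values agree to floating-point precision
print(f"Direct: {d_direct:.3f}, from t: {d_from_t:.3f}")
```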
Implementing Cohen’s d from Scratch
Here are production-ready functions for calculating Cohen’s d:
import numpy as np
from typing import Literal
def cohens_d_independent(group1: np.ndarray, group2: np.ndarray) -> float:
    """
    Calculate Cohen's d for two independent samples.

    Uses pooled standard deviation, appropriate when population
    variances are assumed equal.
    """
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (np.mean(group1) - np.mean(group2)) / pooled_std

def cohens_d_paired(before: np.ndarray, after: np.ndarray) -> float:
    """
    Calculate Cohen's d for paired samples (repeated measures).

    Uses the standard deviation of the difference scores.
    """
    if len(before) != len(after):
        raise ValueError("Paired samples must have equal length")
    diff = after - before
    return np.mean(diff) / np.std(diff, ddof=1)

def cohens_d(
    x: np.ndarray,
    y: np.ndarray,
    paired: bool = False,
    correction: Literal["none", "hedges"] = "none"
) -> float:
    """
    Unified Cohen's d calculation with optional Hedges' g correction.

    Parameters
    ----------
    x, y : array-like
        The two samples to compare
    paired : bool
        Whether samples are paired/repeated measures
    correction : str
        'none' for Cohen's d, 'hedges' for Hedges' g (small-sample correction)
    """
    x, y = np.asarray(x), np.asarray(y)
    if paired:
        d = cohens_d_paired(x, y)
        df = len(x) - 1
    else:
        d = cohens_d_independent(x, y)
        df = len(x) + len(y) - 2
    if correction == "hedges":
        # Hedges' correction factor, approximately 1 - 3 / (4 * df - 1)
        d *= 1 - 3 / (4 * df - 1)
    return d
# Test the functions
np.random.seed(42)
control = np.random.normal(50, 10, 25)
treatment = np.random.normal(58, 10, 25)
print(f"Cohen's d (independent): {cohens_d(control, treatment):.3f}")
print(f"Hedges' g (independent): {cohens_d(control, treatment, correction='hedges'):.3f}")
# Paired example
before = np.random.normal(100, 15, 20)
after = before + np.random.normal(8, 5, 20) # Improvement with noise
print(f"Cohen's d (paired): {cohens_d(before, after, paired=True):.3f}")
Using Statistical Libraries
Don’t reinvent the wheel in production code. Here are three library approaches:
import numpy as np
from scipy import stats
# Install if needed: pip install pingouin statsmodels
import pingouin as pg
from statsmodels.stats.power import TTestIndPower
np.random.seed(42)
group_a = np.random.normal(100, 15, 50)
group_b = np.random.normal(110, 15, 50)
# Method 1: Manual calculation (our function)
d_manual = cohens_d(group_a, group_b)
# Method 2: Pingouin (recommended for most use cases)
d_pingouin = pg.compute_effsize(group_a, group_b, eftype='cohen')
# Method 3: Pingouin, computing d from a t statistic (point estimate only)
effsize_ci = pg.compute_effsize_from_t(
stats.ttest_ind(group_a, group_b).statistic,
nx=len(group_a),
ny=len(group_b),
eftype='cohen'
)
# Method 4: statsmodels doesn't compute d directly, but uses it as input
# to power analysis -- useful if you're already doing power calculations
power_analysis = TTestIndPower()
n_required = power_analysis.solve_power(effect_size=d_manual, alpha=0.05, power=0.8)
print(f"n per group for 80% power at d = {d_manual:.2f}: {n_required:.0f}")
print(f"Manual calculation: d = {d_manual:.4f}")
print(f"Pingouin: d = {d_pingouin:.4f}")
print(f"Pingouin from t: d = {effsize_ci:.4f}")
# Pingouin also gives you everything in one call
full_results = pg.ttest(group_a, group_b, correction=False)
print(f"\nFull t-test with effect size:")
print(full_results[['T', 'p-val', 'cohen-d', 'CI95%']].to_string())
Pingouin is my recommendation for most workflows. It calculates effect sizes automatically with t-tests and provides confidence intervals without extra work.
Interpreting and Visualizing Results
Numbers need context. A Cohen’s d of 0.65 means little without visualization:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
def plot_effect_size(group1, group2, ax=None):
    """Visualize two distributions with Cohen's d annotation."""
    if ax is None:
        fig, ax = plt.subplots(figsize=(10, 6))
    # Calculate Cohen's d
    d = cohens_d(group1, group2)
    # Create smooth density curves
    x_range = np.linspace(
        min(group1.min(), group2.min()) - 10,
        max(group1.max(), group2.max()) + 10,
        200
    )
    kde1 = stats.gaussian_kde(group1)
    kde2 = stats.gaussian_kde(group2)
    # Plot distributions
    ax.fill_between(x_range, kde1(x_range), alpha=0.5, label=f'Group 1 (M={np.mean(group1):.1f})')
    ax.fill_between(x_range, kde2(x_range), alpha=0.5, label=f'Group 2 (M={np.mean(group2):.1f})')
    # Add vertical lines at means
    ax.axvline(np.mean(group1), color='C0', linestyle='--', alpha=0.7)
    ax.axvline(np.mean(group2), color='C1', linestyle='--', alpha=0.7)
    # Annotate effect size using Cohen's benchmarks
    interpretation = ("negligible" if abs(d) < 0.2 else "small" if abs(d) < 0.5
                      else "medium" if abs(d) < 0.8 else "large")
    ax.set_title(f"Cohen's d = {d:.2f} ({interpretation} effect)", fontsize=14)
    ax.set_xlabel('Value')
    ax.set_ylabel('Density')
    ax.legend()
    return ax
# Create comparison plots
np.random.seed(42)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Small effect
g1_small = np.random.normal(100, 15, 200)
g2_small = np.random.normal(103, 15, 200)
plot_effect_size(g1_small, g2_small, axes[0])
# Medium effect
g1_med = np.random.normal(100, 15, 200)
g2_med = np.random.normal(107.5, 15, 200)
plot_effect_size(g1_med, g2_med, axes[1])
# Large effect
g1_large = np.random.normal(100, 15, 200)
g2_large = np.random.normal(112, 15, 200)
plot_effect_size(g1_large, g2_large, axes[2])
plt.tight_layout()
plt.savefig('effect_size_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
Common Pitfalls and Best Practices
Use Hedges’ g for small samples. Cohen’s d overestimates effect size when n < 20 per group. Hedges’ g applies a correction factor that becomes negligible with larger samples.
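To see why the correction only matters at small n, you can tabulate the approximate correction factor J ≈ 1 − 3/(4·df − 1), with df = n₁ + n₂ − 2 for independent samples (a sketch; J multiplies d, shrinking it toward zero):

```python
# Hedges' correction factor J is only noticeably below 1.0
# for small degrees of freedom
for n_per_group in (5, 10, 20, 50, 100):
    df = 2 * n_per_group - 2
    j = 1 - 3 / (4 * df - 1)
    print(f"n = {n_per_group:>3} per group: J = {j:.4f}")
```

By n = 50 per group the correction changes d by well under 1%, which is why it is usually ignored for large samples.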
Report confidence intervals. A point estimate of d = 0.5 means nothing without uncertainty bounds. Here’s how to bootstrap them:
import numpy as np
from typing import Optional, Tuple

def cohens_d_ci(
    group1: np.ndarray,
    group2: np.ndarray,
    confidence: float = 0.95,
    n_bootstrap: int = 10000,
    random_state: Optional[int] = None
) -> Tuple[float, float, float]:
    """
    Calculate Cohen's d with bootstrapped confidence interval.

    Returns
    -------
    tuple : (point_estimate, ci_lower, ci_upper)
    """
    rng = np.random.default_rng(random_state)
    point_estimate = cohens_d(group1, group2)
    bootstrap_d = np.zeros(n_bootstrap)
    n1, n2 = len(group1), len(group2)
    for i in range(n_bootstrap):
        # Resample each group with replacement
        sample1 = rng.choice(group1, size=n1, replace=True)
        sample2 = rng.choice(group2, size=n2, replace=True)
        bootstrap_d[i] = cohens_d(sample1, sample2)
    # Calculate percentile confidence interval
    alpha = 1 - confidence
    ci_lower = np.percentile(bootstrap_d, 100 * alpha / 2)
    ci_upper = np.percentile(bootstrap_d, 100 * (1 - alpha / 2))
    return point_estimate, ci_lower, ci_upper
# Example usage
np.random.seed(42)
control = np.random.normal(50, 10, 30)
treatment = np.random.normal(56, 10, 30)
d, ci_low, ci_high = cohens_d_ci(control, treatment, random_state=42)
print(f"Cohen's d = {d:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
# Compare with Hedges' g for this small sample
g = cohens_d(control, treatment, correction='hedges')
print(f"Hedges' g = {g:.3f} (small-sample corrected)")
Don’t cherry-pick effect size measures. Decide before analysis whether you’ll use Cohen’s d, Hedges’ g, or Glass’s delta. Switching after seeing results is p-hacking’s cousin.
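For contrast, here is a sketch of all three measures on the same hypothetical data; Glass's delta divides by the control group's SD alone, which is useful when the treatment also changes the variance:

```python
import numpy as np

rng = np.random.default_rng(0)
control = rng.normal(50, 10, 30)
treatment = rng.normal(56, 14, 30)  # hypothetical: treatment inflates variance

mean_diff = np.mean(treatment) - np.mean(control)
n1, n2 = len(control), len(treatment)

# Cohen's d: pooled standard deviation
pooled = np.sqrt(((n1 - 1) * np.var(control, ddof=1) +
                  (n2 - 1) * np.var(treatment, ddof=1)) / (n1 + n2 - 2))
cohen = mean_diff / pooled

# Hedges' g: same, scaled by the small-sample correction factor
hedges = cohen * (1 - 3 / (4 * (n1 + n2 - 2) - 1))

# Glass's delta: control-group SD only
glass = mean_diff / np.std(control, ddof=1)

print(f"Cohen's d = {cohen:.3f}, Hedges' g = {hedges:.3f}, Glass's delta = {glass:.3f}")
```

The three values diverge exactly when it matters: unequal variances or small n. Picking whichever looks best after the fact is the problem the rule above warns against.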
Consider domain context. Cohen’s benchmarks came from behavioral sciences. In medicine, a “small” effect might save thousands of lives. In education, a “large” effect might be impractical to achieve at scale.
Conclusion
Effect size calculation belongs in every statistical analysis. P-values answer “is there an effect?” while Cohen’s d answers “how big is it?”—and the second question usually matters more.
For most Python workflows, use pingouin.compute_effsize() or include effect sizes automatically via pingouin.ttest(). For small samples (n < 20 per group), apply Hedges’ g correction. Always report confidence intervals alongside point estimates.
The functions in this article give you everything needed to calculate, interpret, and visualize effect sizes. Stop publishing naked p-values.