How to Calculate Effect Size (Cohen's d) in Python

Key Insights

  • P-values tell you whether an effect exists, but Cohen’s d tells you whether that effect actually matters—a statistically significant result can be practically meaningless without effect size context.
  • Cohen’s d expresses the difference between two groups in standard deviation units, making it comparable across different studies and measurement scales.
  • Always pair your hypothesis tests with effect size calculations and confidence intervals; libraries like pingouin make this trivially easy in Python.

Introduction to Effect Size

Statistical significance has a credibility problem. With a large enough sample, you can achieve a p-value below 0.05 for differences so small they’re meaningless in practice. This is where effect size saves you from publishing embarrassing conclusions.

Cohen’s d is the most widely used effect size measure for comparing two means. It answers a simple question: how many standard deviations apart are these two groups? A significant p-value tells you the difference probably isn’t due to chance; Cohen’s d tells you whether you should care.

Here’s the problem in action:

import numpy as np
from scipy import stats

np.random.seed(42)

# Tiny effect, huge sample
group_a = np.random.normal(100, 15, 10000)
group_b = np.random.normal(100.5, 15, 10000)  # Only 0.5 point difference

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"Tiny effect: p-value = {p_value:.6f}")  # Significant!

# Large effect, small sample
group_c = np.random.normal(100, 15, 30)
group_d = np.random.normal(115, 15, 30)  # 15 point difference (1 SD)

t_stat, p_value = stats.ttest_ind(group_c, group_d)
print(f"Large effect: p-value = {p_value:.6f}")  # Also significant

Both results are statistically significant, but the first represents a trivial 0.5-point difference while the second shows a full standard deviation gap. Without effect size, you can’t distinguish between them.
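Computing the effect size for both scenarios makes the contrast explicit. The helper below inlines the pooled-standard-deviation formula derived in the next section (the name `pooled_cohens_d` is ours, used here only as a preview):

```python
import numpy as np

def pooled_cohens_d(a, b):
    """Cohen's d using the pooled standard deviation."""
    n1, n2 = len(a), len(b)
    v1, v2 = np.var(a, ddof=1), np.var(b, ddof=1)
    pooled = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (np.mean(a) - np.mean(b)) / pooled

np.random.seed(42)
group_a = np.random.normal(100, 15, 10000)
group_b = np.random.normal(100.5, 15, 10000)
group_c = np.random.normal(100, 15, 30)
group_d = np.random.normal(115, 15, 30)

print(f"Tiny effect:  d = {pooled_cohens_d(group_a, group_b):.3f}")  # negligible magnitude
print(f"Large effect: d = {pooled_cohens_d(group_c, group_d):.3f}")  # roughly one SD
```

The p-values were indistinguishable; the effect sizes are not.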

The Math Behind Cohen’s d

Cohen’s d for two independent samples is calculated as:

$$d = \frac{M_1 - M_2}{s_{pooled}}$$

Where the pooled standard deviation is:

$$s_{pooled} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

This formula weights each group’s variance by its degrees of freedom, giving larger samples more influence on the pooled estimate.

Jacob Cohen himself provided rough benchmarks for interpreting effect sizes:

  • Small: d = 0.2 (distributions overlap about 85%)
  • Medium: d = 0.5 (distributions overlap about 67%)
  • Large: d = 0.8 (distributions overlap about 53%)

These are guidelines, not laws. A “small” effect size might be huge in your domain—a 0.2 SD improvement in cancer survival rates matters enormously.
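The cut-offs are easy to encode as a small labeling helper; a minimal sketch (the name `interpret_cohens_d` and the "negligible" band below 0.2 are our own conventions, not Cohen's):

```python
def interpret_cohens_d(d: float) -> str:
    """Label |d| using Cohen's conventional benchmarks (guidelines only)."""
    magnitude = abs(d)
    if magnitude < 0.2:
        return "negligible"
    if magnitude < 0.5:
        return "small"
    if magnitude < 0.8:
        return "medium"
    return "large"

print(interpret_cohens_d(0.35))   # small
print(interpret_cohens_d(-1.10))  # large
```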

Let’s calculate this manually:

import numpy as np

# Sample data
group_1 = np.array([23, 25, 28, 29, 31, 33, 35, 37, 40, 42])
group_2 = np.array([18, 20, 22, 24, 26, 27, 29, 30, 32, 34])

# Step 1: Calculate means
mean_1 = np.mean(group_1)
mean_2 = np.mean(group_2)
print(f"Mean 1: {mean_1:.2f}, Mean 2: {mean_2:.2f}")

# Step 2: Calculate standard deviations (using n-1, i.e., ddof=1)
std_1 = np.std(group_1, ddof=1)
std_2 = np.std(group_2, ddof=1)
print(f"Std 1: {std_1:.2f}, Std 2: {std_2:.2f}")

# Step 3: Calculate pooled standard deviation
n1, n2 = len(group_1), len(group_2)
pooled_std = np.sqrt(((n1 - 1) * std_1**2 + (n2 - 1) * std_2**2) / (n1 + n2 - 2))
print(f"Pooled Std: {pooled_std:.2f}")

# Step 4: Calculate Cohen's d
cohens_d = (mean_1 - mean_2) / pooled_std
print(f"Cohen's d: {cohens_d:.3f}")

Implementing Cohen’s d from Scratch

Here are production-ready functions for calculating Cohen’s d:

import numpy as np
from typing import Literal

def cohens_d_independent(group1: np.ndarray, group2: np.ndarray) -> float:
    """
    Calculate Cohen's d for two independent samples.
    
    Uses pooled standard deviation, appropriate when population
    variances are assumed equal.
    """
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    
    pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    
    return (np.mean(group1) - np.mean(group2)) / pooled_std


def cohens_d_paired(before: np.ndarray, after: np.ndarray) -> float:
    """
    Calculate Cohen's d for paired samples (repeated measures).
    
    Uses the standard deviation of the difference scores.
    """
    if len(before) != len(after):
        raise ValueError("Paired samples must have equal length")
    
    diff = after - before
    return np.mean(diff) / np.std(diff, ddof=1)


def cohens_d(
    x: np.ndarray, 
    y: np.ndarray, 
    paired: bool = False,
    correction: Literal["none", "hedges"] = "none"
) -> float:
    """
    Unified Cohen's d calculation with optional Hedges' g correction.
    
    Parameters
    ----------
    x, y : array-like
        The two samples to compare
    paired : bool
        Whether samples are paired/repeated measures
    correction : str
        'none' for Cohen's d, 'hedges' for Hedges' g (small sample correction)
    """
    x, y = np.asarray(x), np.asarray(y)
    
    if paired:
        d = cohens_d_paired(x, y)
        df = len(x) - 1
    else:
        d = cohens_d_independent(x, y)
        df = len(x) + len(y) - 2
    
    if correction == "hedges":
        # Hedges' g shrinks d by J = 1 - 3 / (4*df - 1) to offset
        # the upward bias of Cohen's d in small samples
        d *= 1 - 3 / (4 * df - 1)
    
    return d


# Test the functions
np.random.seed(42)
control = np.random.normal(50, 10, 25)
treatment = np.random.normal(58, 10, 25)

print(f"Cohen's d (independent): {cohens_d(control, treatment):.3f}")
print(f"Hedges' g (independent): {cohens_d(control, treatment, correction='hedges'):.3f}")

# Paired example
before = np.random.normal(100, 15, 20)
after = before + np.random.normal(8, 5, 20)  # Improvement with noise

print(f"Cohen's d (paired): {cohens_d(before, after, paired=True):.3f}")

Using Statistical Libraries

Don’t reinvent the wheel in production code. Here are three library approaches:

import numpy as np
from scipy import stats

# Install if needed: pip install pingouin statsmodels
import pingouin as pg
from statsmodels.stats.power import TTestIndPower

np.random.seed(42)
group_a = np.random.normal(100, 15, 50)
group_b = np.random.normal(110, 15, 50)

# Method 1: Manual calculation (our function)
d_manual = cohens_d(group_a, group_b)

# Method 2: Pingouin (recommended for most use cases)
d_pingouin = pg.compute_effsize(group_a, group_b, eftype='cohen')

# Method 3: Pingouin, converting a t statistic to an effect size
# (note: this returns a point estimate, not a confidence interval)
d_from_t = pg.compute_effsize_from_t(
    stats.ttest_ind(group_a, group_b).statistic,
    nx=len(group_a),
    ny=len(group_b),
    eftype='cohen'
)

# Method 4: statsmodels doesn't compute Cohen's d directly, but its
# power tools consume it -- useful if you're already doing power analysis
power_analysis = TTestIndPower()
achieved_power = power_analysis.solve_power(
    effect_size=abs(d_manual), nobs1=len(group_a), alpha=0.05
)

print(f"Manual calculation:  d = {d_manual:.4f}")
print(f"Pingouin:            d = {d_pingouin:.4f}")
print(f"Pingouin from t:     d = {d_from_t:.4f}")
print(f"Achieved power at this d: {achieved_power:.3f}")

# Pingouin also gives you everything in one call
full_results = pg.ttest(group_a, group_b, correction=False)
print(f"\nFull t-test with effect size:")
print(full_results[['T', 'p-val', 'cohen-d', 'CI95%']].to_string())

Pingouin is my recommendation for most workflows. It calculates effect sizes automatically alongside t-tests and reports a 95% confidence interval (for the mean difference) without extra work.

Interpreting and Visualizing Results

Numbers need context. A Cohen’s d of 0.65 means little without visualization:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def plot_effect_size(group1, group2, ax=None):
    """Visualize two distributions with Cohen's d annotation."""
    if ax is None:
        fig, ax = plt.subplots(figsize=(10, 6))
    
    # Calculate Cohen's d
    d = cohens_d(group1, group2)
    
    # Create smooth density curves
    x_range = np.linspace(
        min(group1.min(), group2.min()) - 10,
        max(group1.max(), group2.max()) + 10,
        200
    )
    
    kde1 = stats.gaussian_kde(group1)
    kde2 = stats.gaussian_kde(group2)
    
    # Plot distributions
    ax.fill_between(x_range, kde1(x_range), alpha=0.5, label=f'Group 1 (M={np.mean(group1):.1f})')
    ax.fill_between(x_range, kde2(x_range), alpha=0.5, label=f'Group 2 (M={np.mean(group2):.1f})')
    
    # Add vertical lines at means
    ax.axvline(np.mean(group1), color='C0', linestyle='--', alpha=0.7)
    ax.axvline(np.mean(group2), color='C1', linestyle='--', alpha=0.7)
    
    # Annotate effect size
    interpretation = "small" if abs(d) < 0.5 else "medium" if abs(d) < 0.8 else "large"
    ax.set_title(f"Cohen's d = {d:.2f} ({interpretation} effect)", fontsize=14)
    ax.set_xlabel('Value')
    ax.set_ylabel('Density')
    ax.legend()
    
    return ax

# Create comparison plots
np.random.seed(42)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Small effect
g1_small = np.random.normal(100, 15, 200)
g2_small = np.random.normal(103, 15, 200)
plot_effect_size(g1_small, g2_small, axes[0])

# Medium effect
g1_med = np.random.normal(100, 15, 200)
g2_med = np.random.normal(107.5, 15, 200)
plot_effect_size(g1_med, g2_med, axes[1])

# Large effect
g1_large = np.random.normal(100, 15, 200)
g2_large = np.random.normal(112, 15, 200)
plot_effect_size(g1_large, g2_large, axes[2])

plt.tight_layout()
plt.savefig('effect_size_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

Common Pitfalls and Best Practices

Use Hedges’ g for small samples. Cohen’s d overestimates effect size when n < 20 per group. Hedges’ g applies a correction factor that becomes negligible with larger samples.
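To see how quickly the correction fades, we can tabulate the factor J = 1 − 3/(4·df − 1) that Hedges' g multiplies Cohen's d by, with df = n1 + n2 − 2, for a few balanced designs; a quick sketch:

```python
# Hedges' correction factor J = 1 - 3 / (4*df - 1), with
# df = n1 + n2 - 2 for two independent samples of size n each.
# J multiplies Cohen's d, so values near 1 mean a negligible correction.
for n in (5, 10, 20, 50, 200):
    df = 2 * n - 2
    j = 1 - 3 / (4 * df - 1)
    print(f"n = {n:>3} per group: J = {j:.4f}")
```

At n = 5 per group the estimate shrinks by about 10%; by n = 50 the correction is well under 1%.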

Report confidence intervals. A point estimate of d = 0.5 means nothing without uncertainty bounds. Here’s how to bootstrap them:

import numpy as np
from typing import Tuple

def cohens_d_ci(
    group1: np.ndarray, 
    group2: np.ndarray, 
    confidence: float = 0.95,
    n_bootstrap: int = 10000,
    random_state: int = None
) -> Tuple[float, float, float]:
    """
    Calculate Cohen's d with bootstrapped confidence interval.
    
    Returns
    -------
    tuple : (point_estimate, ci_lower, ci_upper)
    """
    rng = np.random.default_rng(random_state)
    
    point_estimate = cohens_d(group1, group2)
    
    bootstrap_d = np.zeros(n_bootstrap)
    n1, n2 = len(group1), len(group2)
    
    for i in range(n_bootstrap):
        # Resample with replacement
        sample1 = rng.choice(group1, size=n1, replace=True)
        sample2 = rng.choice(group2, size=n2, replace=True)
        bootstrap_d[i] = cohens_d(sample1, sample2)
    
    # Calculate percentile confidence interval
    alpha = 1 - confidence
    ci_lower = np.percentile(bootstrap_d, 100 * alpha / 2)
    ci_upper = np.percentile(bootstrap_d, 100 * (1 - alpha / 2))
    
    return point_estimate, ci_lower, ci_upper


# Example usage
np.random.seed(42)
control = np.random.normal(50, 10, 30)
treatment = np.random.normal(56, 10, 30)

d, ci_low, ci_high = cohens_d_ci(control, treatment, random_state=42)
print(f"Cohen's d = {d:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")

# Compare with Hedges' g for this small sample
g = cohens_d(control, treatment, correction='hedges')
print(f"Hedges' g = {g:.3f} (small-sample corrected)")

Don’t cherry-pick effect size measures. Decide before analysis whether you’ll use Cohen’s d, Hedges’ g, or Glass’s delta. Switching after seeing results is p-hacking’s cousin.
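Glass's delta, mentioned above, scales the mean difference by the control group's standard deviation alone, which is useful when the treatment changes the variance. A minimal sketch (the function name `glass_delta` and the toy data are ours):

```python
import numpy as np

def glass_delta(treatment: np.ndarray, control: np.ndarray) -> float:
    """Glass's delta: mean difference scaled by the control group's SD only."""
    return (np.mean(treatment) - np.mean(control)) / np.std(control, ddof=1)

control = np.array([48.0, 52.0, 50.0, 49.0, 51.0])
treatment = np.array([55.0, 60.0, 50.0, 65.0, 58.0])
print(f"Glass's delta = {glass_delta(treatment, control):.2f}")
```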

Consider domain context. Cohen’s benchmarks came from behavioral sciences. In medicine, a “small” effect might save thousands of lives. In education, a “large” effect might be impractical to achieve at scale.

Conclusion

Effect size calculation belongs in every statistical analysis. P-values answer “is there an effect?” while Cohen’s d answers “how big is it?”, and the second question usually matters more.

For most Python workflows, use pingouin.compute_effsize() or include effect sizes automatically via pingouin.ttest(). For small samples (n < 20 per group), apply Hedges’ g correction. Always report confidence intervals alongside point estimates.

The functions in this article give you everything needed to calculate, interpret, and visualize effect sizes. Stop publishing naked p-values.
