How to Calculate Effect Size (Cohen's d) in Python
Key Insights
- P-values tell you whether an effect exists, but Cohen’s d tells you whether that effect actually matters—a statistically significant result can be practically meaningless without effect size context.
- Cohen’s d expresses the difference between two groups in standard deviation units, making it comparable across different studies and measurement scales.
- Always pair your hypothesis tests with effect size calculations and confidence intervals; libraries like pingouin make this trivially easy in Python.
Introduction to Effect Size
Statistical significance has a credibility problem. With a large enough sample, you can achieve a p-value below 0.05 for differences so small they’re meaningless in practice. This is where effect size saves you from publishing embarrassing conclusions.
Cohen’s d is the most widely used effect size measure for comparing two means. It answers a simple question: how many standard deviations apart are these two groups? A significant p-value tells you the difference probably isn’t due to chance; Cohen’s d tells you whether you should care.
Here’s the problem in action:
import numpy as np
from scipy import stats
np.random.seed(42)
# Tiny effect, huge sample
group_a = np.random.normal(100, 15, 10000)
group_b = np.random.normal(100.5, 15, 10000) # Only 0.5 point difference
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"Tiny effect: p-value = {p_value:.6f}") # Significant!
# Large effect, small sample
group_c = np.random.normal(100, 15, 30)
group_d = np.random.normal(115, 15, 30) # 15 point difference (1 SD)
t_stat, p_value = stats.ttest_ind(group_c, group_d)
print(f"Large effect: p-value = {p_value:.6f}") # Also significant
Both results are statistically significant, but the first represents a trivial 0.5-point difference while the second shows a full standard deviation gap. Without effect size, you can’t distinguish between them.
The Math Behind Cohen’s d
Cohen’s d for two independent samples is calculated as:
$$d = \frac{M_1 - M_2}{s_{pooled}}$$
Where the pooled standard deviation is:
$$s_{pooled} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$
This formula weights each group’s variance by its degrees of freedom, giving larger samples more influence on the pooled estimate.
Jacob Cohen himself provided rough benchmarks for interpreting effect sizes:
- Small: d = 0.2 (distributions overlap about 85%)
- Medium: d = 0.5 (distributions overlap about 67%)
- Large: d = 0.8 (distributions overlap about 53%)
These are guidelines, not laws. A “small” effect size might be huge in your domain—a 0.2 SD improvement in cancer survival rates matters enormously.
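The overlap figures above correspond to Cohen's U1 statistic (percent non-overlap). Assuming two normal distributions with equal variances, you can verify them directly; `overlap_from_d` is a helper name introduced here for illustration:

```python
from scipy.stats import norm

def overlap_from_d(d: float) -> float:
    """Percent overlap (1 - Cohen's U1) between two unit-variance
    normal distributions whose means differ by d."""
    p = norm.cdf(abs(d) / 2)
    u1 = (2 * p - 1) / p  # Cohen's U1: proportion of non-overlap
    return 1 - u1

# Matches the ~85% / 67% / 53% benchmarks above
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: overlap ~ {overlap_from_d(d):.0%}")
```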
Let’s calculate this manually:
import numpy as np
# Sample data
group_1 = np.array([23, 25, 28, 29, 31, 33, 35, 37, 40, 42])
group_2 = np.array([18, 20, 22, 24, 26, 27, 29, 30, 32, 34])
# Step 1: Calculate means
mean_1 = np.mean(group_1)
mean_2 = np.mean(group_2)
print(f"Mean 1: {mean_1:.2f}, Mean 2: {mean_2:.2f}")
# Step 2: Calculate standard deviations (using n-1, i.e., ddof=1)
std_1 = np.std(group_1, ddof=1)
std_2 = np.std(group_2, ddof=1)
print(f"Std 1: {std_1:.2f}, Std 2: {std_2:.2f}")
# Step 3: Calculate pooled standard deviation
n1, n2 = len(group_1), len(group_2)
pooled_std = np.sqrt(((n1 - 1) * std_1**2 + (n2 - 1) * std_2**2) / (n1 + n2 - 2))
print(f"Pooled Std: {pooled_std:.2f}")
# Step 4: Calculate Cohen's d
cohens_d = (mean_1 - mean_2) / pooled_std
print(f"Cohen's d: {cohens_d:.3f}")
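As a sanity check, Cohen's d for independent samples relates directly to the t statistic via $d = t \sqrt{1/n_1 + 1/n_2}$. A quick sketch using the same two groups:

```python
import numpy as np
from scipy import stats

group_1 = np.array([23, 25, 28, 29, 31, 33, 35, 37, 40, 42])
group_2 = np.array([18, 20, 22, 24, 26, 27, 29, 30, 32, 34])
n1, n2 = len(group_1), len(group_2)

# d computed directly from the pooled standard deviation, as above
pooled_std = np.sqrt(((n1 - 1) * np.var(group_1, ddof=1) +
                      (n2 - 1) * np.var(group_2, ddof=1)) / (n1 + n2 - 2))
d_direct = (np.mean(group_1) - np.mean(group_2)) / pooled_std

# d recovered from the t statistic: d = t * sqrt(1/n1 + 1/n2)
t_stat, _ = stats.ttest_ind(group_1, group_2)
d_from_t = t_stat * np.sqrt(1 / n1 + 1 / n2)

# The two values agree to floating-point precision
print(f"Direct: {d_direct:.3f}, from t: {d_from_t:.3f}")
```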
Implementing Cohen’s d from Scratch
Here are production-ready functions for calculating Cohen’s d:
import numpy as np
from typing import Literal
def cohens_d_independent(group1: np.ndarray, group2: np.ndarray) -> float:
    """
    Calculate Cohen's d for two independent samples.

    Uses pooled standard deviation, appropriate when population
    variances are assumed equal.
    """
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (np.mean(group1) - np.mean(group2)) / pooled_std

def cohens_d_paired(before: np.ndarray, after: np.ndarray) -> float:
    """
    Calculate Cohen's d for paired samples (repeated measures).

    Uses the standard deviation of the difference scores.
    """
    if len(before) != len(after):
        raise ValueError("Paired samples must have equal length")
    diff = after - before
    return np.mean(diff) / np.std(diff, ddof=1)

def cohens_d(
    x: np.ndarray,
    y: np.ndarray,
    paired: bool = False,
    correction: Literal["none", "hedges"] = "none"
) -> float:
    """
    Unified Cohen's d calculation with optional Hedges' g correction.

    Parameters
    ----------
    x, y : array-like
        The two samples to compare
    paired : bool
        Whether samples are paired/repeated measures
    correction : str
        'none' for Cohen's d, 'hedges' for Hedges' g (small-sample correction)
    """
    x, y = np.asarray(x), np.asarray(y)
    if paired:
        d = cohens_d_paired(x, y)
        df = len(x) - 1
    else:
        d = cohens_d_independent(x, y)
        df = len(x) + len(y) - 2
    if correction == "hedges":
        # Hedges' correction factor, approximately 1 - 3 / (4 * df - 1)
        d *= 1 - 3 / (4 * df - 1)
    return d
# Test the functions
np.random.seed(42)
control = np.random.normal(50, 10, 25)
treatment = np.random.normal(58, 10, 25)
print(f"Cohen's d (independent): {cohens_d(control, treatment):.3f}")
print(f"Hedges' g (independent): {cohens_d(control, treatment, correction='hedges'):.3f}")
# Paired example
before = np.random.normal(100, 15, 20)
after = before + np.random.normal(8, 5, 20) # Improvement with noise
print(f"Cohen's d (paired): {cohens_d(before, after, paired=True):.3f}")
Using Statistical Libraries
Don’t reinvent the wheel in production code. Here are three library approaches:
import numpy as np
from scipy import stats
# Install if needed: pip install pingouin statsmodels
import pingouin as pg
from statsmodels.stats.power import TTestIndPower
np.random.seed(42)
group_a = np.random.normal(100, 15, 50)
group_b = np.random.normal(110, 15, 50)
# Method 1: Manual calculation (our function)
d_manual = cohens_d(group_a, group_b)
# Method 2: Pingouin (recommended for most use cases)
d_pingouin = pg.compute_effsize(group_a, group_b, eftype='cohen')
# Method 3: Pingouin, computing d from a t statistic (point estimate only)
effsize_ci = pg.compute_effsize_from_t(
stats.ttest_ind(group_a, group_b).statistic,
nx=len(group_a),
ny=len(group_b),
eftype='cohen'
)
# Method 4: statsmodels doesn't compute d directly, but uses it as input
# to power analysis -- useful if you're already doing power calculations
power_analysis = TTestIndPower()
n_required = power_analysis.solve_power(effect_size=d_manual, alpha=0.05, power=0.8)
print(f"n per group for 80% power at d = {d_manual:.2f}: {n_required:.0f}")
print(f"Manual calculation: d = {d_manual:.4f}")
print(f"Pingouin: d = {d_pingouin:.4f}")
print(f"Pingouin from t: d = {effsize_ci:.4f}")
# Pingouin also gives you everything in one call
full_results = pg.ttest(group_a, group_b, correction=False)
print(f"\nFull t-test with effect size:")
print(full_results[['T', 'p-val', 'cohen-d', 'CI95%']].to_string())
Pingouin is my recommendation for most workflows. It calculates effect sizes automatically with t-tests and provides confidence intervals without extra work.
Interpreting and Visualizing Results
Numbers need context. A Cohen’s d of 0.65 means little without visualization:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
def plot_effect_size(group1, group2, ax=None):
    """Visualize two distributions with Cohen's d annotation."""
    if ax is None:
        fig, ax = plt.subplots(figsize=(10, 6))
    # Calculate Cohen's d
    d = cohens_d(group1, group2)
    # Create smooth density curves
    x_range = np.linspace(
        min(group1.min(), group2.min()) - 10,
        max(group1.max(), group2.max()) + 10,
        200
    )
    kde1 = stats.gaussian_kde(group1)
    kde2 = stats.gaussian_kde(group2)
    # Plot distributions
    ax.fill_between(x_range, kde1(x_range), alpha=0.5, label=f'Group 1 (M={np.mean(group1):.1f})')
    ax.fill_between(x_range, kde2(x_range), alpha=0.5, label=f'Group 2 (M={np.mean(group2):.1f})')
    # Add vertical lines at means
    ax.axvline(np.mean(group1), color='C0', linestyle='--', alpha=0.7)
    ax.axvline(np.mean(group2), color='C1', linestyle='--', alpha=0.7)
    # Annotate effect size using Cohen's benchmarks
    interpretation = ("negligible" if abs(d) < 0.2 else "small" if abs(d) < 0.5
                      else "medium" if abs(d) < 0.8 else "large")
    ax.set_title(f"Cohen's d = {d:.2f} ({interpretation} effect)", fontsize=14)
    ax.set_xlabel('Value')
    ax.set_ylabel('Density')
    ax.legend()
    return ax
# Create comparison plots
np.random.seed(42)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Small effect
g1_small = np.random.normal(100, 15, 200)
g2_small = np.random.normal(103, 15, 200)
plot_effect_size(g1_small, g2_small, axes[0])
# Medium effect
g1_med = np.random.normal(100, 15, 200)
g2_med = np.random.normal(107.5, 15, 200)
plot_effect_size(g1_med, g2_med, axes[1])
# Large effect
g1_large = np.random.normal(100, 15, 200)
g2_large = np.random.normal(112, 15, 200)
plot_effect_size(g1_large, g2_large, axes[2])
plt.tight_layout()
plt.savefig('effect_size_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
Common Pitfalls and Best Practices
Use Hedges’ g for small samples. Cohen’s d overestimates effect size when n < 20 per group. Hedges’ g applies a correction factor that becomes negligible with larger samples.
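To see why the correction only matters at small n, you can tabulate the approximate correction factor J ≈ 1 − 3/(4·df − 1), with df = n₁ + n₂ − 2 for independent samples (a sketch; J multiplies d, shrinking it toward zero):

```python
# Hedges' correction factor J is only noticeably below 1.0
# for small degrees of freedom
for n_per_group in (5, 10, 20, 50, 100):
    df = 2 * n_per_group - 2
    j = 1 - 3 / (4 * df - 1)
    print(f"n = {n_per_group:>3} per group: J = {j:.4f}")
```

By n = 50 per group the correction changes d by well under 1%, which is why it is usually ignored for large samples.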
Report confidence intervals. A point estimate of d = 0.5 means nothing without uncertainty bounds. Here’s how to bootstrap them:
import numpy as np
from typing import Optional, Tuple

def cohens_d_ci(
    group1: np.ndarray,
    group2: np.ndarray,
    confidence: float = 0.95,
    n_bootstrap: int = 10000,
    random_state: Optional[int] = None
) -> Tuple[float, float, float]:
    """
    Calculate Cohen's d with bootstrapped confidence interval.

    Returns
    -------
    tuple : (point_estimate, ci_lower, ci_upper)
    """
    rng = np.random.default_rng(random_state)
    point_estimate = cohens_d(group1, group2)
    bootstrap_d = np.zeros(n_bootstrap)
    n1, n2 = len(group1), len(group2)
    for i in range(n_bootstrap):
        # Resample each group with replacement
        sample1 = rng.choice(group1, size=n1, replace=True)
        sample2 = rng.choice(group2, size=n2, replace=True)
        bootstrap_d[i] = cohens_d(sample1, sample2)
    # Calculate percentile confidence interval
    alpha = 1 - confidence
    ci_lower = np.percentile(bootstrap_d, 100 * alpha / 2)
    ci_upper = np.percentile(bootstrap_d, 100 * (1 - alpha / 2))
    return point_estimate, ci_lower, ci_upper
# Example usage
np.random.seed(42)
control = np.random.normal(50, 10, 30)
treatment = np.random.normal(56, 10, 30)
d, ci_low, ci_high = cohens_d_ci(control, treatment, random_state=42)
print(f"Cohen's d = {d:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
# Compare with Hedges' g for this small sample
g = cohens_d(control, treatment, correction='hedges')
print(f"Hedges' g = {g:.3f} (small-sample corrected)")
Don’t cherry-pick effect size measures. Decide before analysis whether you’ll use Cohen’s d, Hedges’ g, or Glass’s delta. Switching after seeing results is p-hacking’s cousin.
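For contrast, here is a sketch of all three measures on the same hypothetical data; Glass's delta divides by the control group's SD alone, which is useful when the treatment also changes the variance:

```python
import numpy as np

rng = np.random.default_rng(0)
control = rng.normal(50, 10, 30)
treatment = rng.normal(56, 14, 30)  # hypothetical: treatment inflates variance

mean_diff = np.mean(treatment) - np.mean(control)
n1, n2 = len(control), len(treatment)

# Cohen's d: pooled standard deviation
pooled = np.sqrt(((n1 - 1) * np.var(control, ddof=1) +
                  (n2 - 1) * np.var(treatment, ddof=1)) / (n1 + n2 - 2))
cohen = mean_diff / pooled

# Hedges' g: same, scaled by the small-sample correction factor
hedges = cohen * (1 - 3 / (4 * (n1 + n2 - 2) - 1))

# Glass's delta: control-group SD only
glass = mean_diff / np.std(control, ddof=1)

print(f"Cohen's d = {cohen:.3f}, Hedges' g = {hedges:.3f}, Glass's delta = {glass:.3f}")
```

The three values diverge exactly when it matters: unequal variances or small n. Picking whichever looks best after the fact is the problem the rule above warns against.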
Consider domain context. Cohen’s benchmarks came from behavioral sciences. In medicine, a “small” effect might save thousands of lives. In education, a “large” effect might be impractical to achieve at scale.
Conclusion
Effect size calculation belongs in every statistical analysis. P-values answer “is there an effect?” while Cohen’s d answers “how big is it?”—and the second question usually matters more.
For most Python workflows, use pingouin.compute_effsize() or include effect sizes automatically via pingouin.ttest(). For small samples (n < 20 per group), apply Hedges’ g correction. Always report confidence intervals alongside point estimates.
The functions in this article give you everything needed to calculate, interpret, and visualize effect sizes. Stop publishing naked p-values.