How to Calculate Effect Sizes Using Pingouin in Python
Key Insights
- Effect sizes quantify the magnitude of differences or relationships, providing essential context that p-values alone cannot offer—a statistically significant result can be practically meaningless without knowing how large the effect actually is.
- Pingouin provides a clean, pandas-friendly API for calculating effect sizes directly within your statistical tests, eliminating the need to manually compute Cohen’s d, eta-squared, or correlation coefficients.
- Choosing the correct effect size measure depends on your research design: use Cohen’s d for two-group comparisons, eta-squared or omega-squared for ANOVA, and Pearson’s r for correlations.
Introduction to Effect Sizes
Statistical significance tells you whether an effect exists. Effect sizes tell you whether anyone should care. A drug trial with 100,000 participants might achieve p < 0.001 for a treatment that reduces symptoms by 0.5%—statistically significant, practically useless.
Effect sizes quantify the magnitude of a phenomenon. The most common measures include:
- Cohen’s d: Standardized difference between two means, expressed in standard deviation units
- Eta-squared (η²): Proportion of variance explained in ANOVA designs
- Pearson’s r: Correlation coefficient, which doubles as an effect size for relationships
Pingouin is a Python statistics library built on top of pandas that makes calculating these measures straightforward. Unlike scipy.stats, which often requires manual effect size computation, Pingouin bakes effect sizes directly into its test outputs. It’s opinionated about best practices and produces publication-ready results.
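To see the contrast, here is a minimal illustration of what scipy.stats gives you for a two-sample comparison (made-up arrays, not data from this tutorial): the statistic and the p-value, with the effect size left as an exercise.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 30)
b = rng.normal(0.5, 1.0, 30)

# scipy returns the t-statistic and p-value only;
# any effect size would have to be computed by hand
t, p = stats.ttest_ind(a, b)
print(t, p)
```

Pingouin's equivalent, shown later in this article, returns the effect size in the same output table.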
Setting Up Your Environment
Install Pingouin via pip:
```bash
pip install pingouin
```
Here’s the standard import pattern and a sample dataset we’ll use throughout:
```python
import pingouin as pg
import pandas as pd
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Create sample dataset: exam scores across teaching methods
n_per_group = 30
data = pd.DataFrame({
    'student_id': range(n_per_group * 3),
    'method': ['Traditional'] * n_per_group +
              ['Flipped'] * n_per_group +
              ['Hybrid'] * n_per_group,
    'score': np.concatenate([
        np.random.normal(72, 10, n_per_group),  # Traditional
        np.random.normal(78, 12, n_per_group),  # Flipped
        np.random.normal(80, 11, n_per_group)   # Hybrid
    ]),
    'hours_studied': np.random.uniform(5, 25, n_per_group * 3)
})

# Add pre/post scores for paired comparisons
data['pre_score'] = np.random.normal(65, 8, n_per_group * 3)
data['post_score'] = data['pre_score'] + np.random.normal(10, 5, n_per_group * 3)

print(data.head())
```
Cohen’s d for Comparing Two Groups
Cohen’s d measures the standardized difference between two group means. Use it whenever you’re running a t-test or comparing two conditions. The formula is simple: the difference in means divided by the pooled standard deviation.
Interpretation guidelines (per Cohen’s conventions):
- Small: d = 0.2
- Medium: d = 0.5
- Large: d = 0.8
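Before reaching for a library, the formula can be checked by hand. A minimal sketch with made-up numbers (the two arrays are illustrative, not data from this tutorial):

```python
import numpy as np

group_a = np.array([72.0, 75.0, 70.0, 68.0, 74.0])
group_b = np.array([78.0, 80.0, 77.0, 82.0, 79.0])

mean_diff = group_a.mean() - group_b.mean()
n1, n2 = len(group_a), len(group_b)

# Pooled SD: sample variances (ddof=1) weighted by their degrees of freedom
pooled_sd = np.sqrt(((n1 - 1) * group_a.var(ddof=1) +
                     (n2 - 1) * group_b.var(ddof=1)) / (n1 + n2 - 2))

d = mean_diff / pooled_sd
print(f"Cohen's d: {d:.3f}")
```

The sign simply reflects which group was listed first; magnitude is what you interpret.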
Pingouin’s compute_effsize() function handles both independent and paired samples:
```python
# Extract two groups for comparison
traditional = data[data['method'] == 'Traditional']['score']
flipped = data[data['method'] == 'Flipped']['score']

# Cohen's d for independent samples
d_independent = pg.compute_effsize(traditional, flipped, eftype='cohen')
print(f"Cohen's d (independent): {d_independent:.3f}")

# You can also get Hedges' g, which corrects for small-sample bias
g_hedges = pg.compute_effsize(traditional, flipped, eftype='hedges')
print(f"Hedges' g: {g_hedges:.3f}")

# For paired samples (pre/post design)
pre_scores = data['pre_score'].values
post_scores = data['post_score'].values
d_paired = pg.compute_effsize(pre_scores, post_scores, paired=True, eftype='cohen')
print(f"Cohen's d (paired): {d_paired:.3f}")
```
The ttest() function includes effect size automatically:
```python
# T-test with effect size included in the output
ttest_result = pg.ttest(traditional, flipped)
print(ttest_result[['T', 'p-val', 'cohen-d', 'BF10']])
```
This returns Cohen’s d alongside the t-statistic, p-value, and Bayes factor—everything you need for a complete report.
Eta-Squared and Omega-Squared for ANOVA
When comparing more than two groups, Cohen’s d doesn’t apply directly. Instead, use eta-squared (η²) or omega-squared (ω²), which represent the proportion of total variance explained by the grouping variable.
Eta-squared is the ratio of between-group variance to total variance. It’s intuitive but positively biased, especially with small samples.
Omega-squared applies a correction for this bias and provides a better population estimate. Use omega-squared for reporting; use eta-squared for quick exploratory analysis.
Interpretation guidelines for η²:
- Small: η² = 0.01
- Medium: η² = 0.06
- Large: η² = 0.14
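The definition of eta-squared can be sketched directly from the variance decomposition. A hypothetical three-group example with made-up values (not the tutorial dataset):

```python
import numpy as np

# Eta-squared as SS_between / SS_total
groups = [np.array([1.0, 2.0, 3.0]),
          np.array([4.0, 5.0, 6.0]),
          np.array([7.0, 8.0, 9.0])]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

ss_total = ((all_values - grand_mean) ** 2).sum()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

eta_squared = ss_between / ss_total
print(f"Eta-squared: {eta_squared:.3f}")
```

Here the group means are far apart relative to the within-group spread, so nearly all variance is explained.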
```python
# One-way ANOVA with effect size
anova_result = pg.anova(data=data, dv='score', between='method', detailed=True)
print(anova_result)

# The output includes np2 (partial eta-squared);
# for a one-way ANOVA, partial eta-squared equals eta-squared.
# Pingouin does not report omega-squared directly,
# so calculate it manually from the ANOVA table.

# If equal variances are in doubt, Welch's ANOVA is also available:
welch_result = pg.welch_anova(data=data, dv='score', between='method')
print(welch_result)
```
For more control, calculate omega-squared from ANOVA components:
```python
# Manual omega-squared calculation from the detailed ANOVA table
ss_between = anova_result.loc[0, 'SS']
ss_within = anova_result.loc[1, 'SS']
df_between = anova_result.loc[0, 'DF']
ms_within = anova_result.loc[1, 'MS']

omega_squared = (ss_between - df_between * ms_within) / (ss_between + ss_within + ms_within)
print(f"Omega-squared: {omega_squared:.3f}")
```
Post-hoc tests also include effect sizes:
```python
# Pairwise comparisons with effect sizes
posthoc = pg.pairwise_tukey(data=data, dv='score', between='method')
print(posthoc[['A', 'B', 'diff', 'p-tukey', 'hedges']])
```
Correlation-Based Effect Sizes
Pearson’s r is both a correlation coefficient and an effect size. It ranges from -1 to 1 and represents the strength of linear association.
Interpretation:
- Small: r = 0.1
- Medium: r = 0.3
- Large: r = 0.5
```python
# Correlation with full statistics
corr_result = pg.corr(data['hours_studied'], data['score'])
print(corr_result)

# The output includes r, CI95%, p-value, and power;
# r itself is the effect size

# For multiple correlations at once
numeric_cols = ['score', 'hours_studied', 'pre_score', 'post_score']
corr_matrix = pg.pairwise_corr(data[numeric_cols], method='pearson')
print(corr_matrix[['X', 'Y', 'r', 'p-unc', 'power']])
```
For comparing a continuous variable across a binary grouping, point-biserial correlation applies:
```python
# Create binary variable
data['passed'] = (data['score'] >= 75).astype(int)

# Point-biserial correlation (Pingouin handles this automatically)
pb_corr = pg.corr(data['passed'], data['hours_studied'])
print(f"Point-biserial r: {pb_corr['r'].values[0]:.3f}")
```
Practical Workflow: Complete Analysis Example
Here’s a complete analysis pipeline that demonstrates proper effect size reporting:
```python
def analyze_teaching_methods(df):
    """Complete analysis of teaching method effectiveness."""
    results = {}

    # 1. Descriptive statistics
    descriptives = df.groupby('method')['score'].agg(['mean', 'std', 'count'])
    results['descriptives'] = descriptives
    print("=== Descriptive Statistics ===")
    print(descriptives.round(2))
    print()

    # 2. Test assumptions
    normality = pg.normality(df, dv='score', group='method')
    homoscedasticity = pg.homoscedasticity(df, dv='score', group='method')
    results['normality'] = normality
    results['homoscedasticity'] = homoscedasticity
    print("=== Assumption Tests ===")
    print(f"Normality (Shapiro-Wilk): all p > 0.05? {(normality['pval'] > 0.05).all()}")
    print(f"Homoscedasticity (Levene): p = {homoscedasticity['pval'].values[0]:.3f}")
    print()

    # 3. Main analysis: ANOVA with effect size
    anova = pg.anova(data=df, dv='score', between='method', detailed=True)
    results['anova'] = anova
    eta_squared = anova['np2'].values[0]
    f_value = anova['F'].values[0]
    p_value = anova['p-unc'].values[0]
    df_between = int(anova['DF'].values[0])
    df_within = int(anova['DF'].values[1])
    print("=== ANOVA Results ===")
    print(f"F({df_between}, {df_within}) = {f_value:.2f}, p = {p_value:.3f}, η² = {eta_squared:.3f}")
    print()

    # 4. Post-hoc comparisons with effect sizes
    posthoc = pg.pairwise_tukey(data=df, dv='score', between='method')
    results['posthoc'] = posthoc
    print("=== Post-hoc Comparisons (Tukey HSD) ===")
    for _, row in posthoc.iterrows():
        print(f"{row['A']} vs {row['B']}: diff = {row['diff']:.2f}, "
              f"p = {row['p-tukey']:.3f}, Hedges' g = {row['hedges']:.3f}")
    print()

    # 5. APA-formatted summary
    print("=== APA Format Summary ===")
    effect_interp = "large" if eta_squared >= 0.14 else "medium" if eta_squared >= 0.06 else "small"
    print(f"A one-way ANOVA revealed a significant effect of teaching method on exam scores, "
          f"F({df_between}, {df_within}) = {f_value:.2f}, p = {p_value:.3f}, η² = {eta_squared:.2f}, "
          f"indicating a {effect_interp} effect.")

    return results

# Run the complete analysis
analysis_results = analyze_teaching_methods(data)
```
Summary and Best Practices
Choosing the right effect size:
| Design | Test | Effect Size | Pingouin Function |
|---|---|---|---|
| Two independent groups | Independent t-test | Cohen’s d | pg.ttest(), pg.compute_effsize() |
| Two paired groups | Paired t-test | Cohen’s d (paired) | pg.ttest(..., paired=True) |
| 3+ groups | ANOVA | η², ω² | pg.anova() |
| Continuous relationship | Correlation | Pearson’s r | pg.corr() |
| Regression | Linear regression | R², f² | pg.linear_regression() |
Common pitfalls to avoid:
- Ignoring effect sizes entirely: A p-value without an effect size is incomplete reporting.
- Using eta-squared for small samples: Switch to omega-squared for less biased estimates.
- Forgetting to specify paired: For repeated measures, always set paired=True in compute_effsize().
- Over-interpreting Cohen’s benchmarks: Small/medium/large are context-dependent. A “small” effect in education might be huge in medicine.
Always report confidence intervals for effect sizes when possible. Pingouin’s compute_esci() can generate a parametric confidence interval from an observed effect size and the group sample sizes. Effect sizes make your research interpretable, replicable, and honest about practical significance.