How to Calculate Effect Sizes Using Pingouin in Python
Key Insights
- Effect sizes quantify the magnitude of differences or relationships, providing essential context that p-values alone cannot offer—a statistically significant result can be practically meaningless without knowing how large the effect actually is.
- Pingouin provides a clean, pandas-friendly API for calculating effect sizes directly within your statistical tests, eliminating the need to manually compute Cohen’s d, eta-squared, or correlation coefficients.
- Choosing the correct effect size measure depends on your research design: use Cohen’s d for two-group comparisons, eta-squared or omega-squared for ANOVA, and Pearson’s r for correlations.
Introduction to Effect Sizes
Statistical significance tells you whether an effect exists. Effect sizes tell you whether anyone should care. A drug trial with 100,000 participants might achieve p < 0.001 for a treatment that reduces symptoms by 0.5%—statistically significant, practically useless.
Effect sizes quantify the magnitude of a phenomenon. The most common measures include:
- Cohen’s d: Standardized difference between two means, expressed in standard deviation units
- Eta-squared (η²): Proportion of variance explained in ANOVA designs
- Pearson’s r: Correlation coefficient, which doubles as an effect size for relationships
Pingouin is a Python statistics library built on top of pandas that makes calculating these measures straightforward. Unlike scipy.stats, which often requires manual effect size computation, Pingouin bakes effect sizes directly into its test outputs. It’s opinionated about best practices and produces publication-ready results.
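To see the contrast, here is a minimal illustration of what scipy.stats gives you for a two-sample comparison (made-up arrays, not data from this tutorial): the statistic and the p-value, with the effect size left as an exercise.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 30)
b = rng.normal(0.5, 1.0, 30)

# scipy returns the t-statistic and p-value only;
# any effect size would have to be computed by hand
t, p = stats.ttest_ind(a, b)
print(t, p)
```

Pingouin's equivalent, shown later in this article, returns the effect size in the same output table.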
Setting Up Your Environment
Install Pingouin via pip:
```bash
pip install pingouin
```
Here’s the standard import pattern and a sample dataset we’ll use throughout:
```python
import pingouin as pg
import pandas as pd
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Create sample dataset: exam scores across teaching methods
n_per_group = 30
data = pd.DataFrame({
    'student_id': range(n_per_group * 3),
    'method': ['Traditional'] * n_per_group +
              ['Flipped'] * n_per_group +
              ['Hybrid'] * n_per_group,
    'score': np.concatenate([
        np.random.normal(72, 10, n_per_group),  # Traditional
        np.random.normal(78, 12, n_per_group),  # Flipped
        np.random.normal(80, 11, n_per_group)   # Hybrid
    ]),
    'hours_studied': np.random.uniform(5, 25, n_per_group * 3)
})

# Add pre/post scores for paired comparisons
data['pre_score'] = np.random.normal(65, 8, n_per_group * 3)
data['post_score'] = data['pre_score'] + np.random.normal(10, 5, n_per_group * 3)

print(data.head())
```
Cohen’s d for Comparing Two Groups
Cohen’s d measures the standardized difference between two group means. Use it whenever you’re running a t-test or comparing two conditions. The formula is simple: the difference in means divided by the pooled standard deviation.
Interpretation guidelines (per Cohen’s conventions):
- Small: d = 0.2
- Medium: d = 0.5
- Large: d = 0.8
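Before reaching for a library, the formula can be checked by hand. A minimal sketch with made-up numbers (the two arrays are illustrative, not data from this tutorial):

```python
import numpy as np

group_a = np.array([72.0, 75.0, 70.0, 68.0, 74.0])
group_b = np.array([78.0, 80.0, 77.0, 82.0, 79.0])

mean_diff = group_a.mean() - group_b.mean()
n1, n2 = len(group_a), len(group_b)

# Pooled SD: sample variances (ddof=1) weighted by their degrees of freedom
pooled_sd = np.sqrt(((n1 - 1) * group_a.var(ddof=1) +
                     (n2 - 1) * group_b.var(ddof=1)) / (n1 + n2 - 2))

d = mean_diff / pooled_sd
print(f"Cohen's d: {d:.3f}")
```

The sign simply reflects which group was listed first; magnitude is what you interpret.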
Pingouin’s compute_effsize() function handles both independent and paired samples:
```python
# Extract two groups for comparison
traditional = data[data['method'] == 'Traditional']['score']
flipped = data[data['method'] == 'Flipped']['score']

# Cohen's d for independent samples
d_independent = pg.compute_effsize(traditional, flipped, eftype='cohen')
print(f"Cohen's d (independent): {d_independent:.3f}")

# You can also get Hedges' g, which corrects for small-sample bias
g_hedges = pg.compute_effsize(traditional, flipped, eftype='hedges')
print(f"Hedges' g: {g_hedges:.3f}")

# For paired samples (pre/post design)
pre_scores = data['pre_score'].values
post_scores = data['post_score'].values
d_paired = pg.compute_effsize(pre_scores, post_scores, paired=True, eftype='cohen')
print(f"Cohen's d (paired): {d_paired:.3f}")
```
The ttest() function includes effect size automatically:
```python
# T-test with effect size included in the output
ttest_result = pg.ttest(traditional, flipped)
print(ttest_result[['T', 'p-val', 'cohen-d', 'BF10']])
```
This returns Cohen’s d alongside the t-statistic, p-value, and Bayes factor—everything you need for a complete report.
Eta-Squared and Omega-Squared for ANOVA
When comparing more than two groups, Cohen’s d doesn’t apply directly. Instead, use eta-squared (η²) or omega-squared (ω²), which represent the proportion of total variance explained by the grouping variable.
Eta-squared is the ratio of between-group variance to total variance. It’s intuitive but positively biased, especially with small samples.
Omega-squared applies a correction for this bias and provides a better population estimate. Use omega-squared for reporting; use eta-squared for quick exploratory analysis.
Interpretation guidelines for η²:
- Small: η² = 0.01
- Medium: η² = 0.06
- Large: η² = 0.14
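The definition of eta-squared can be sketched directly from the variance decomposition. A hypothetical three-group example with made-up values (not the tutorial dataset):

```python
import numpy as np

# Eta-squared as SS_between / SS_total
groups = [np.array([1.0, 2.0, 3.0]),
          np.array([4.0, 5.0, 6.0]),
          np.array([7.0, 8.0, 9.0])]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

ss_total = ((all_values - grand_mean) ** 2).sum()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

eta_squared = ss_between / ss_total
print(f"Eta-squared: {eta_squared:.3f}")
```

Here the group means are far apart relative to the within-group spread, so nearly all variance is explained.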
```python
# One-way ANOVA with effect size
anova_result = pg.anova(data=data, dv='score', between='method', detailed=True)
print(anova_result)

# The output includes np2 (partial eta-squared);
# for a one-way ANOVA, partial eta-squared equals eta-squared.
# Pingouin does not report omega-squared directly,
# so calculate it manually from the ANOVA table.

# If equal variances are in doubt, Welch's ANOVA is also available:
welch_result = pg.welch_anova(data=data, dv='score', between='method')
print(welch_result)
```
For more control, calculate omega-squared from ANOVA components:
```python
# Manual omega-squared calculation from the detailed ANOVA table
ss_between = anova_result.loc[0, 'SS']
ss_within = anova_result.loc[1, 'SS']
df_between = anova_result.loc[0, 'DF']
ms_within = anova_result.loc[1, 'MS']

omega_squared = (ss_between - df_between * ms_within) / (ss_between + ss_within + ms_within)
print(f"Omega-squared: {omega_squared:.3f}")
```
Post-hoc tests also include effect sizes:
```python
# Pairwise comparisons with effect sizes
posthoc = pg.pairwise_tukey(data=data, dv='score', between='method')
print(posthoc[['A', 'B', 'diff', 'p-tukey', 'hedges']])
```
Correlation-Based Effect Sizes
Pearson’s r is both a correlation coefficient and an effect size. It ranges from -1 to 1 and represents the strength of linear association.
Interpretation:
- Small: r = 0.1
- Medium: r = 0.3
- Large: r = 0.5
```python
# Correlation with full statistics
corr_result = pg.corr(data['hours_studied'], data['score'])
print(corr_result)

# The output includes r, CI95%, p-value, and power;
# r itself is the effect size

# For multiple correlations at once
numeric_cols = ['score', 'hours_studied', 'pre_score', 'post_score']
corr_matrix = pg.pairwise_corr(data[numeric_cols], method='pearson')
print(corr_matrix[['X', 'Y', 'r', 'p-unc', 'power']])
```
For comparing a continuous variable across a binary grouping, point-biserial correlation applies:
```python
# Create binary variable
data['passed'] = (data['score'] >= 75).astype(int)

# Point-biserial correlation (Pingouin handles this automatically)
pb_corr = pg.corr(data['passed'], data['hours_studied'])
print(f"Point-biserial r: {pb_corr['r'].values[0]:.3f}")
```
Practical Workflow: Complete Analysis Example
Here’s a complete analysis pipeline that demonstrates proper effect size reporting:
```python
def analyze_teaching_methods(df):
    """Complete analysis of teaching method effectiveness."""
    results = {}

    # 1. Descriptive statistics
    descriptives = df.groupby('method')['score'].agg(['mean', 'std', 'count'])
    results['descriptives'] = descriptives
    print("=== Descriptive Statistics ===")
    print(descriptives.round(2))
    print()

    # 2. Test assumptions
    normality = pg.normality(df, dv='score', group='method')
    homoscedasticity = pg.homoscedasticity(df, dv='score', group='method')
    results['normality'] = normality
    results['homoscedasticity'] = homoscedasticity
    print("=== Assumption Tests ===")
    print(f"Normality (Shapiro-Wilk): all p > 0.05? {(normality['pval'] > 0.05).all()}")
    print(f"Homoscedasticity (Levene): p = {homoscedasticity['pval'].values[0]:.3f}")
    print()

    # 3. Main analysis: ANOVA with effect size
    anova = pg.anova(data=df, dv='score', between='method', detailed=True)
    results['anova'] = anova
    eta_squared = anova['np2'].values[0]
    f_value = anova['F'].values[0]
    p_value = anova['p-unc'].values[0]
    df_between = int(anova['DF'].values[0])
    df_within = int(anova['DF'].values[1])
    print("=== ANOVA Results ===")
    print(f"F({df_between}, {df_within}) = {f_value:.2f}, p = {p_value:.3f}, η² = {eta_squared:.3f}")
    print()

    # 4. Post-hoc comparisons with effect sizes
    posthoc = pg.pairwise_tukey(data=df, dv='score', between='method')
    results['posthoc'] = posthoc
    print("=== Post-hoc Comparisons (Tukey HSD) ===")
    for _, row in posthoc.iterrows():
        print(f"{row['A']} vs {row['B']}: diff = {row['diff']:.2f}, "
              f"p = {row['p-tukey']:.3f}, Hedges' g = {row['hedges']:.3f}")
    print()

    # 5. APA-formatted summary
    print("=== APA Format Summary ===")
    effect_interp = "large" if eta_squared >= 0.14 else "medium" if eta_squared >= 0.06 else "small"
    print(f"A one-way ANOVA revealed a significant effect of teaching method on exam scores, "
          f"F({df_between}, {df_within}) = {f_value:.2f}, p = {p_value:.3f}, η² = {eta_squared:.2f}, "
          f"indicating a {effect_interp} effect.")

    return results

# Run the complete analysis
analysis_results = analyze_teaching_methods(data)
```
Summary and Best Practices
Choosing the right effect size:
| Design | Test | Effect Size | Pingouin Function |
|---|---|---|---|
| Two independent groups | Independent t-test | Cohen’s d | pg.ttest(), pg.compute_effsize() |
| Two paired groups | Paired t-test | Cohen’s d (paired) | pg.ttest(..., paired=True) |
| 3+ groups | ANOVA | η², ω² | pg.anova() |
| Continuous relationship | Correlation | Pearson’s r | pg.corr() |
| Regression | Linear regression | R², f² | pg.linear_regression() |
Common pitfalls to avoid:
- Ignoring effect sizes entirely: A p-value without an effect size is incomplete reporting.
- Using eta-squared for small samples: Switch to omega-squared for less biased estimates.
- Forgetting to specify paired: For repeated measures, always set paired=True in compute_effsize().
- Over-interpreting Cohen’s benchmarks: Small/medium/large are context-dependent. A “small” effect in education might be huge in medicine.
Always report confidence intervals for effect sizes when possible. Pingouin’s compute_esci() can generate a parametric confidence interval from an observed effect size and the group sample sizes. Effect sizes make your research interpretable, replicable, and honest about practical significance.