How to Perform a T-Test Using Pingouin in Python
Key Insights
- Pingouin returns comprehensive t-test results in a single DataFrame, including effect sizes (Cohen’s d), confidence intervals, and Bayes factors—eliminating the need to calculate these separately.
- The library uses Welch’s t-test by default for independent samples, which is more robust when group variances differ and is now recommended as the standard approach.
- Pingouin’s built-in assumption checking with pg.normality() and seamless fallback to non-parametric alternatives make it a complete statistical testing toolkit.
Introduction
T-tests remain one of the most frequently used statistical tests in data science, yet Python’s standard tools make them unnecessarily tedious. SciPy’s ttest_ind() returns only a t-statistic and p-value, forcing you to manually calculate effect sizes, confidence intervals, and check assumptions separately. Statsmodels offers more, but with a steeper learning curve.
Pingouin changes this equation. Developed by Raphael Vallat, it’s a statistics library designed for researchers who want publication-ready results without the boilerplate. A single function call returns everything you need: t-statistic, p-value, degrees of freedom, confidence intervals, Cohen’s d effect size, statistical power, and even a Bayes factor. The output is a clean pandas DataFrame that’s ready for reporting.
This article walks through performing all three types of t-tests with Pingouin, interpreting the rich output, and checking assumptions properly.
Installation and Setup
Install Pingouin via pip:
pip install pingouin
Now let’s set up our environment and create sample datasets we’ll use throughout:
import pingouin as pg
import pandas as pd
import numpy as np
# Set random seed for reproducibility
np.random.seed(42)
# Sample dataset: Exam scores from two teaching methods
data = pd.DataFrame({
    'student_id': range(1, 41),
    'method': ['traditional'] * 20 + ['interactive'] * 20,
    'score': np.concatenate([
        np.random.normal(72, 10, 20),  # Traditional method
        np.random.normal(78, 12, 20)   # Interactive method
    ]),
    'pre_score': np.random.normal(65, 8, 40),
    'post_score': np.random.normal(75, 9, 40)
})
# Separate arrays for convenience
traditional_scores = data[data['method'] == 'traditional']['score']
interactive_scores = data[data['method'] == 'interactive']['score']
This gives us a DataFrame with exam scores from two teaching methods, plus pre/post scores for paired comparisons.
One-Sample T-Test
A one-sample t-test determines whether a sample mean differs significantly from a known or hypothesized population value. Use this when you have one group and want to compare it against a benchmark.
Suppose we want to test whether our traditional teaching method produces scores different from the national average of 70:
# One-sample t-test: Does traditional method differ from national average of 70?
result = pg.ttest(traditional_scores, 70)
print(result)
Output:
T dof tail p-val CI95% cohen-d BF10 power
T-test 1.23 19 two-sided 0.234521 [67.12, 77.89] 0.275 0.456 0.213
The sample mean (approximately 72.5) doesn’t significantly differ from 70 (p = 0.23). Cohen’s d of 0.275 indicates a small effect size, and the Bayes factor of 0.456 (roughly 2-to-1 odds in favor of the null) provides only weak evidence for the null hypothesis.
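Under the hood, the one-sample t-statistic is just the distance between the sample mean and the hypothesized value, scaled by the standard error. A standard-library sketch of that formula (the helper name is ours, not Pingouin’s):

```python
from statistics import mean, stdev

def one_sample_t(sample, popmean):
    """t = (sample mean - hypothesized mean) / standard error of the mean."""
    n = len(sample)
    se = stdev(sample) / n ** 0.5  # standard error
    return (mean(sample) - popmean) / se

# A sample centered exactly on the hypothesized value gives t = 0
print(one_sample_t([68, 70, 72], 70))  # -> 0.0
```

Pingouin then converts this statistic into the p-value, confidence interval, and the rest of the output row.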
You can also run a one-tailed test if you have a directional hypothesis (the argument is alternative in Pingouin 0.4+; older versions used tail):
# One-tailed: Is the mean greater than 70?
result_greater = pg.ttest(traditional_scores, 70, alternative='greater')
print(f"One-tailed p-value: {result_greater['p-val'].values[0]:.4f}")
Independent Samples T-Test (Two-Sample)
The independent samples t-test compares means between two unrelated groups. This is what you need when different subjects are in each group.
Let’s compare scores between our two teaching methods:
# Independent samples t-test
result = pg.ttest(interactive_scores, traditional_scores)
print(result)
Output:
T dof tail p-val CI95% cohen-d BF10 power
T-test 2.14 36.8 two-sided 0.039012 [0.31, 11.24] 0.677 2.341 0.542
The interactive method shows significantly higher scores (p = 0.039). Cohen’s d of 0.677 represents a medium-to-large effect size. The Bayes factor of 2.34, however, falls below the conventional threshold of 3 and provides only anecdotal evidence for the alternative hypothesis.
Notice the degrees of freedom (36.8) isn’t a whole number. That’s because Welch’s t-test was applied here: it doesn’t assume equal variances and adjusts the degrees of freedom accordingly. Pingouin’s default, correction='auto', applies the Welch correction automatically when appropriate (and you can force it with correction=True). This is a sensible default—Welch’s test is more robust and performs well even when variances are equal.
If you specifically need the classic Student’s t-test (equal variances assumed), set correction=False:
# Student's t-test (assumes equal variances)
result_student = pg.ttest(interactive_scores, traditional_scores, correction=False)
print(f"Student's t-test dof: {result_student['dof'].values[0]}") # Will be 38
Stick with the default Welch’s test unless you have a specific reason not to.
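The fractional degrees of freedom come from the Welch–Satterthwaite approximation. A minimal sketch of that formula (helper name ours) shows why: it collapses to the classic n1 + n2 − 2 when variances and sample sizes match, and shrinks as they diverge.

```python
def welch_dof(var1, n1, var2, n2):
    """Welch-Satterthwaite degrees of freedom for two independent samples."""
    a, b = var1 / n1, var2 / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

# Equal variances and sizes: matches Student's dof of n1 + n2 - 2 = 38
print(welch_dof(100, 20, 100, 20))  # -> 38.0

# Unequal variances (e.g. SDs of 10 vs 12): dof drops below 38
print(round(welch_dof(100, 20, 144, 20), 1))  # -> 36.8
```

This is the same adjustment behind the 36.8 in the output above.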
Paired Samples T-Test
A paired t-test compares means from the same subjects measured under two conditions. Classic use cases include before/after studies, matched pairs, or repeated measurements.
Let’s test whether scores improved from pre-test to post-test:
# Paired samples t-test
pre_scores = data['pre_score']
post_scores = data['post_score']
result = pg.ttest(post_scores, pre_scores, paired=True)
print(result)
Output:
T dof tail p-val CI95% cohen-d BF10 power
T-test 5.87 39 two-sided 0.000001 [6.52, 13.41] 0.928 28451.2 0.999
The improvement is highly significant (p < 0.001) with a large effect size (Cohen’s d = 0.93). The Bayes factor is enormous, providing decisive evidence for improvement.
The paired test is more powerful here because it controls for individual differences—each student serves as their own control.
# Calculate the actual mean difference
mean_diff = (post_scores - pre_scores).mean()
print(f"Mean improvement: {mean_diff:.2f} points")
Interpreting Pingouin Output
Pingouin’s output DataFrame packs substantial information. Let’s break down each column:
result = pg.ttest(interactive_scores, traditional_scores)
print(result.columns.tolist())
# ['T', 'dof', 'tail', 'p-val', 'CI95%', 'cohen-d', 'BF10', 'power']
T: The t-statistic. Larger absolute values indicate greater difference between means relative to variability.
dof: Degrees of freedom. For Welch’s test, this is adjusted based on sample sizes and variances.
tail: Test direction—'two-sided', 'greater', or 'less'. (In Pingouin 0.4+ this column is named alternative.)
p-val: The p-value. Below your alpha threshold (typically 0.05), you reject the null hypothesis.
CI95%: 95% confidence interval for the mean difference. If it doesn’t contain zero, the difference is significant at α = 0.05.
cohen-d: Effect size measure. Guidelines: 0.2 = small, 0.5 = medium, 0.8 = large. Report this alongside p-values—statistical significance doesn’t equal practical importance.
BF10: Bayes factor quantifying evidence for the alternative hypothesis over the null. Values > 3 suggest moderate evidence, > 10 strong evidence, > 100 decisive evidence.
power: Statistical power (1 - β), the probability of detecting an effect if one truly exists.
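For independent samples, Cohen’s d is the mean difference divided by a pooled standard deviation. A quick sketch of the pooled-SD version (helper name ours; Pingouin’s exact variant may differ slightly, e.g. for paired data):

```python
from statistics import mean, variance

def cohens_d(x, y):
    """Cohen's d: mean difference over the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / pooled_var ** 0.5

# Groups exactly one pooled SD apart -> d = 1.0
print(cohens_d([2, 3, 4], [1, 2, 3]))  # -> 1.0
```

Because d is expressed in standard-deviation units, it stays comparable across studies with different measurement scales.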
Extract specific values for programmatic use:
# Access individual results
t_stat = result['T'].values[0]
p_value = result['p-val'].values[0]
effect_size = result['cohen-d'].values[0]
ci_lower, ci_upper = result['CI95%'].values[0]
# Format for reporting
print(f"t({result['dof'].values[0]:.1f}) = {t_stat:.2f}, p = {p_value:.3f}, d = {effect_size:.2f}")
# Output: t(36.8) = 2.14, p = 0.039, d = 0.68
Assumptions and Best Practices
T-tests assume your data is approximately normally distributed. Pingouin makes checking this trivial:
# Check normality for both groups
normality_trad = pg.normality(traditional_scores)
normality_inter = pg.normality(interactive_scores)
print("Traditional method normality:")
print(normality_trad)
print("\nInteractive method normality:")
print(normality_inter)
Output:
Traditional method normality:
W pval normal
score 0.971234 0.782341 True
Interactive method normality:
W pval normal
score 0.958123 0.512789 True
Pingouin uses the Shapiro-Wilk test and helpfully includes a boolean normal column. Both groups pass (p > 0.05), so the t-test assumption is met.
When normality is violated, switch to non-parametric alternatives:
# Mann-Whitney U test (non-parametric alternative to independent t-test)
mwu_result = pg.mwu(interactive_scores, traditional_scores)
print(mwu_result)
# Wilcoxon signed-rank test (non-parametric alternative to paired t-test)
wilcoxon_result = pg.wilcoxon(post_scores, pre_scores)
print(wilcoxon_result)
These tests compare ranks rather than means, making them robust to non-normality and outliers.
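The rank-based logic is visible in the U statistic itself: it counts, over all cross-group pairs, how often a value from one group exceeds a value from the other, with ties counting half. A standard-library sketch (helper name ours; pg.mwu additionally computes the p-value and effect sizes):

```python
def mann_whitney_u(x, y):
    """U statistic: number of (x, y) pairs where x wins; ties count 0.5."""
    return sum(
        1.0 if xi > yi else 0.5 if xi == yi else 0.0
        for xi in x for yi in y
    )

# Complete separation: all 9 pairs favor the first group
print(mann_whitney_u([4, 5, 6], [1, 2, 3]))  # -> 9.0
print(mann_whitney_u([1, 2, 3], [4, 5, 6]))  # -> 0.0
```

Because only the ordering of values matters, a single extreme outlier can shift U by at most its own pair counts, which is why these tests tolerate outliers so well.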
A practical workflow:
def robust_comparison(group1, group2, paired=False, alpha=0.05):
    """Perform t-test with automatic fallback to non-parametric if needed."""
    # Check normality of both groups at the given alpha
    norm1 = pg.normality(group1, alpha=alpha)['normal'].values[0]
    norm2 = pg.normality(group2, alpha=alpha)['normal'].values[0]
    if norm1 and norm2:
        result = pg.ttest(group1, group2, paired=paired)
        test_used = "Paired t-test" if paired else "Welch's t-test"
    else:
        if paired:
            result = pg.wilcoxon(group1, group2)
            test_used = "Wilcoxon signed-rank"
        else:
            result = pg.mwu(group1, group2)
            test_used = "Mann-Whitney U"
    print(f"Test used: {test_used}")
    return result

# Example usage
result = robust_comparison(interactive_scores, traditional_scores)
# Example usage
result = robust_comparison(interactive_scores, traditional_scores)
Remember: with sample sizes above 30, t-tests are robust to moderate normality violations due to the central limit theorem. Don’t automatically switch to non-parametric tests for minor deviations—they have less statistical power.
Pingouin strikes the right balance between simplicity and completeness. It gives you everything needed for rigorous statistical reporting in a single, readable function call. That’s how statistical libraries should work.