How to Perform Correlation Analysis Using Pingouin in Python
Key Insights
- Pingouin’s pg.corr() returns a DataFrame with coefficient, p-value, confidence intervals, Bayes Factor, and statistical power in a single call—information that would require multiple scipy functions to assemble.
- Use pg.pairwise_corr() with filtering parameters to efficiently analyze correlation matrices while controlling for multiple comparisons and focusing on statistically significant relationships.
- Partial correlations via pg.partial_corr() let you isolate the true relationship between two variables by controlling for confounders—essential for avoiding spurious conclusions in observational data.
Introduction to Correlation Analysis and Pingouin
Correlation analysis quantifies the strength and direction of relationships between variables. It’s foundational to exploratory data analysis, feature selection, and hypothesis testing. Yet Python’s standard tools make this unnecessarily tedious.
With scipy, you call pearsonr() and get back a tuple of coefficient and p-value. Want confidence intervals? That’s a separate calculation. Statistical power? Another function. Bayes Factor? Good luck.
Pingouin solves this by returning everything you need in a single, well-formatted DataFrame. It’s built on top of scipy and numpy but provides a cleaner API designed for statistical analysis rather than numerical computing.
# Installation
# pip install pingouin
import pingouin as pg
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Set random seed for reproducibility
np.random.seed(42)
Pingouin isn’t trying to replace scipy—it’s providing a statistical interface on top of it. If you’re doing serious data analysis and want interpretable output without writing boilerplate, Pingouin is the right choice.
Pearson Correlation with Pingouin
Pearson correlation measures linear relationships between continuous variables. It assumes both variables are normally distributed and the relationship is linear. Use it when you’re working with interval or ratio data and expect a straight-line relationship.
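Before relying on Pearson, it is worth testing the normality assumption directly. A minimal sketch using scipy's Shapiro-Wilk test on simulated data (Pingouin's pg.normality wraps the same test in DataFrame form):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(50, 10, 100)
y = 2.5 * x + rng.normal(0, 15, 100)

# Shapiro-Wilk: the null hypothesis is that the sample is normally distributed
w_x, p_x = stats.shapiro(x)
w_y, p_y = stats.shapiro(y)
print(f"x: W = {w_x:.3f}, p = {p_x:.3f}")
print(f"y: W = {w_y:.3f}, p = {p_y:.3f}")

# A small p-value is evidence against normality; in that case prefer a
# rank-based method (Spearman or Kendall) over Pearson
```

If either p-value is small, the rank-based methods covered below are the safer choice.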
# Generate sample data
n = 100
x = np.random.normal(50, 10, n)
y = 2.5 * x + np.random.normal(0, 15, n) # Linear relationship with noise
# Pearson correlation with Pingouin
result = pg.corr(x, y, method='pearson')
print(result)
Output:
           n         r         CI95%     p-val       BF10  power
pearson  100  0.857621  [0.79, 0.91]  1.57e-29  1.163e+26    1.0
Let’s break down each column:
- n: Sample size (100 observations)
- r: Correlation coefficient (0.86 indicates strong positive correlation)
- CI95%: 95% confidence interval for the coefficient
- p-val: Probability of observing this correlation under the null hypothesis
- BF10: Bayes Factor quantifying evidence for the alternative hypothesis (values > 100 indicate extreme evidence)
- power: Statistical power of the test (1.0 means virtually certain to detect this effect)
This single function call gives you everything needed for a complete statistical report. With scipy, you’d need to manually calculate confidence intervals using Fisher’s z-transformation and power using separate libraries.
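For comparison, here is roughly what assembling that confidence interval by hand looks like with scipy and the Fisher z-transformation (a sketch on freshly simulated data, so the exact numbers will differ from the output above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(50, 10, 100)
y = 2.5 * x + rng.normal(0, 15, 100)

r, p = stats.pearsonr(x, y)

# Fisher z-transformation: z = arctanh(r) is approximately normal
# with standard error 1 / sqrt(n - 3)
n = len(x)
z = np.arctanh(r)
se = 1.0 / np.sqrt(n - 3)
z_crit = stats.norm.ppf(0.975)  # two-sided 95% interval

# Back-transform the interval endpoints with tanh
ci_low = np.tanh(z - z_crit * se)
ci_high = np.tanh(z + z_crit * se)
print(f"r = {r:.3f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}], p = {p:.2e}")
```

That is three separate steps, and statistical power would need yet another library. Pingouin folds all of it into one call.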
Spearman and Kendall Rank Correlations
Pearson assumes linearity and normality. When those assumptions fail—ordinal data, non-linear monotonic relationships, or outlier-heavy distributions—switch to rank-based methods.
Spearman correlation converts values to ranks before computing correlation. Kendall’s tau counts concordant and discordant pairs. Both are robust to outliers and non-linearity.
# Create non-linear monotonic relationship
x = np.random.uniform(1, 100, 100)
y = np.log(x) + np.random.normal(0, 0.5, 100) # Logarithmic relationship
# Compare all three methods
pearson = pg.corr(x, y, method='pearson')
spearman = pg.corr(x, y, method='spearman')
kendall = pg.corr(x, y, method='kendall')
print("Pearson:")
print(pearson[['r', 'p-val']].to_string())
print("\nSpearman:")
print(spearman[['r', 'p-val']].to_string())
print("\nKendall:")
print(kendall[['r', 'p-val']].to_string())
For monotonic but non-linear relationships, Spearman and Kendall typically show stronger correlations than Pearson. Kendall’s tau is more robust with small samples and handles ties better, but Spearman is more commonly reported in literature.
My recommendation: default to Spearman when you have any doubt about linearity. It’s robust and interpretable. Use Kendall for ordinal data with many ties.
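A quick way to internalize the rank-based idea: Spearman's coefficient is simply Pearson computed on the rank-transformed data. A short sketch verifying the equivalence with scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(1, 100, 100)
y = np.log(x) + rng.normal(0, 0.5, 100)  # monotonic, non-linear

# Spearman is Pearson applied to ranks
rx, ry = stats.rankdata(x), stats.rankdata(y)
manual = stats.pearsonr(rx, ry)[0]
direct = stats.spearmanr(x, y)[0]
print(f"Pearson on ranks: {manual:.6f}")
print(f"Spearman direct:  {direct:.6f}")  # identical up to floating-point error
```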
Correlation Matrices with pg.pairwise_corr()
Real analysis involves multiple variables. pg.pairwise_corr() computes correlations between all pairs in a DataFrame and returns results in long format—perfect for filtering and sorting.
# Create a multi-variable dataset
df = pd.DataFrame({
    'age': np.random.normal(45, 12, 200),
    'income': np.random.normal(65000, 20000, 200),
    'education_years': np.random.normal(14, 3, 200),
    'health_score': np.random.normal(70, 15, 200)
})
# Add some realistic correlations
df['income'] = df['income'] + df['education_years'] * 3000
df['health_score'] = df['health_score'] - df['age'] * 0.3
# Compute pairwise correlations
pairwise = pg.pairwise_corr(df, method='pearson')
print(pairwise[['X', 'Y', 'r', 'p-unc', 'CI95%']].to_string())
The p-unc column holds uncorrected p-values. With several pairs tested at once, some can cross 0.05 by chance alone, so treat a raw significance filter as exploratory:
# Filter for significant correlations only
significant = pairwise[pairwise['p-unc'] < 0.05]
# Sort by absolute correlation strength
significant_sorted = significant.reindex(
    significant['r'].abs().sort_values(ascending=False).index
)
print(significant_sorted[['X', 'Y', 'r', 'p-unc']])
You can also specify which columns to correlate:
# Restrict the analysis to a specific pair of columns
targeted = pg.pairwise_corr(
    df,
    columns=['age', 'income'],  # flat list: only the age-income combination
    method='spearman'
)
To correlate one variable against all the others instead, pass a single-element list, e.g. columns=['age'].
Partial and Semi-Partial Correlations
Here’s where Pingouin really shines. Partial correlation measures the relationship between two variables while controlling for one or more confounders. This is critical for observational data where spurious correlations abound.
# Classic example: ice cream sales and drowning deaths
# Both correlate with temperature (the confounder)
n = 150
temperature = np.random.normal(75, 15, n)
ice_cream_sales = 50 + 2 * temperature + np.random.normal(0, 20, n)
drowning_deaths = 5 + 0.3 * temperature + np.random.normal(0, 5, n)
confounded_df = pd.DataFrame({
    'ice_cream': ice_cream_sales,
    'drowning': drowning_deaths,
    'temperature': temperature
})
# Naive correlation (spurious)
naive = pg.corr(confounded_df['ice_cream'], confounded_df['drowning'])
print("Naive correlation:")
print(f"r = {naive['r'].values[0]:.3f}, p = {naive['p-val'].values[0]:.4f}")
# Partial correlation controlling for temperature
partial = pg.partial_corr(
    data=confounded_df,
    x='ice_cream',
    y='drowning',
    covar='temperature'
)
print("\nPartial correlation (controlling for temperature):")
print(f"r = {partial['r'].values[0]:.3f}, p = {partial['p-val'].values[0]:.4f}")
The naive correlation shows a significant relationship. The partial correlation, controlling for temperature, reveals the true picture: ice cream and drowning aren’t directly related.
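The mechanics behind this are worth seeing once: a partial correlation is the plain Pearson correlation between the residuals left after regressing each variable on the covariate. A numpy-only sketch on freshly simulated data (different seed, so the values differ slightly from the output above):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 150
temperature = rng.normal(75, 15, n)
ice_cream = 50 + 2 * temperature + rng.normal(0, 20, n)
drowning = 5 + 0.3 * temperature + rng.normal(0, 5, n)

def residuals(v, covar):
    # Residuals of an ordinary least-squares fit of v on covar (plus intercept)
    X = np.column_stack([np.ones_like(covar), covar])
    beta, *_ = np.linalg.lstsq(X, v, rcond=None)
    return v - X @ beta

# Partial correlation = Pearson correlation of the two residual series
r_partial = np.corrcoef(residuals(ice_cream, temperature),
                        residuals(drowning, temperature))[0, 1]
r_naive = np.corrcoef(ice_cream, drowning)[0, 1]
print(f"naive r = {r_naive:.3f}, partial r = {r_partial:.3f}")
```

Once temperature's contribution is regressed out of both series, essentially nothing is left to correlate.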
For multiple covariates, pass a list:
partial_multi = pg.partial_corr(
    data=df,
    x='income',
    y='health_score',
    covar=['age', 'education_years']
)
Handling Assumptions and Robustness
Outliers destroy Pearson correlations. A single extreme point can flip the sign of your coefficient. Pingouin offers robust alternatives.
# Create data with outliers
x_clean = np.random.normal(50, 10, 97)
y_clean = 0.8 * x_clean + np.random.normal(0, 8, 97)
# Add outliers
x_outliers = np.concatenate([x_clean, [100, 105, 110]])
y_outliers = np.concatenate([y_clean, [10, 5, 15]])
# Compare methods
standard = pg.corr(x_outliers, y_outliers, method='pearson')
spearman_robust = pg.corr(x_outliers, y_outliers, method='spearman')
skipped = pg.corr(x_outliers, y_outliers, method='skipped')
shepherd = pg.corr(x_outliers, y_outliers, method='shepherd')
print(f"Pearson: r = {standard['r'].values[0]:.3f}")
print(f"Spearman: r = {spearman_robust['r'].values[0]:.3f}")
print(f"Skipped: r = {skipped['r'].values[0]:.3f}")
print(f"Shepherd: r = {shepherd['r'].values[0]:.3f}")
Skipped correlation uses a multivariate outlier detection method and excludes outliers before computing Spearman correlation. Shepherd’s Pi combines Spearman with a bootstrapped Mahalanobis distance outlier removal.
When to use each:
- Pearson: Clean data, linear relationships, normally distributed
- Spearman: Non-linear monotonic, ordinal data, moderate outliers
- Skipped/Shepherd: Suspected outliers, need robust estimates
Practical Example: End-to-End Analysis
Let’s analyze a realistic health metrics dataset from start to finish.
# Generate realistic health data
np.random.seed(123)
n_subjects = 250
age = np.random.normal(50, 15, n_subjects).clip(20, 80)
bmi = 22 + 0.1 * age + np.random.normal(0, 4, n_subjects)
exercise_hours = np.maximum(0, 10 - 0.08 * age + np.random.normal(0, 3, n_subjects))
blood_pressure = 90 + 0.5 * age + 1.5 * bmi - 0.8 * exercise_hours + np.random.normal(0, 10, n_subjects)
cholesterol = 150 + 0.8 * age + 2 * bmi - exercise_hours + np.random.normal(0, 25, n_subjects)
health_df = pd.DataFrame({
    'age': age,
    'bmi': bmi,
    'exercise_hours': exercise_hours,
    'blood_pressure': blood_pressure,
    'cholesterol': cholesterol
})
# Step 1: Pairwise correlations
print("=" * 50)
print("PAIRWISE CORRELATIONS")
print("=" * 50)
pairwise_results = pg.pairwise_corr(health_df, method='spearman')
significant_pairs = pairwise_results[pairwise_results['p-unc'] < 0.01]
print(significant_pairs[['X', 'Y', 'r', 'p-unc']].to_string(index=False))
# Step 2: Partial correlation - BP and cholesterol controlling for age/BMI
print("\n" + "=" * 50)
print("PARTIAL CORRELATION: BP vs Cholesterol")
print("(controlling for age and BMI)")
print("=" * 50)
partial_result = pg.partial_corr(
    data=health_df,
    x='blood_pressure',
    y='cholesterol',
    covar=['age', 'bmi'],
    method='spearman'
)
print(partial_result[['r', 'CI95%', 'p-val']].to_string())
# Step 3: Visualize correlation matrix
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Heatmap of correlations
corr_matrix = health_df.corr(method='spearman')
sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', center=0,
            vmin=-1, vmax=1, ax=axes[0], fmt='.2f')
axes[0].set_title('Spearman Correlation Matrix')
# Scatter plot of key relationship
axes[1].scatter(health_df['exercise_hours'], health_df['blood_pressure'],
                alpha=0.5, edgecolors='black', linewidth=0.5)
axes[1].set_xlabel('Exercise Hours per Week')
axes[1].set_ylabel('Blood Pressure (mmHg)')
# Add correlation annotation
corr_result = pg.corr(health_df['exercise_hours'], health_df['blood_pressure'])
axes[1].set_title(f"Exercise vs Blood Pressure\nr={corr_result['r'].values[0]:.2f}, "
                  f"p={corr_result['p-val'].values[0]:.2e}")
plt.tight_layout()
plt.savefig('correlation_analysis.png', dpi=150)
plt.show()
This workflow demonstrates the practical power of Pingouin: compute pairwise correlations, identify significant relationships, control for confounders with partial correlations, and visualize results—all with clean, readable code.
The key takeaway: Pingouin transforms correlation analysis from a tedious multi-step process into a streamlined workflow. Use pg.corr() for single pairs, pg.pairwise_corr() for matrices, and pg.partial_corr() when confounders lurk in your data. Your statistical reports will be more complete, and your code will be more maintainable.