How to Calculate Pearson Correlation in Python
Key Insights
- Pearson correlation measures the linear relationship between two continuous variables, returning a value from -1 (perfect negative) to +1 (perfect positive), with 0 indicating no linear relationship.
- Python offers three main approaches: NumPy's np.corrcoef() for quick calculations, SciPy's pearsonr() for statistical significance testing, and Pandas' .corr() for DataFrame-wide analysis.
- Always check for outliers and verify linearity before trusting correlation values: a single extreme point can dramatically skew your results.
Introduction to Pearson Correlation
Pearson correlation coefficient is the workhorse of statistical relationship analysis. It quantifies how strongly two continuous variables move together in a linear fashion. If you’ve ever needed to answer “does X increase when Y increases?” you’ve needed Pearson correlation.
The mathematical formula looks intimidating but captures a simple idea:
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \times \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
In plain terms: we’re measuring how much X and Y deviate from their means together, normalized by how much they deviate individually.
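The formula translates directly into code. Here is a minimal from-scratch sketch with NumPy (the function name pearson_r is illustrative):

```python
import numpy as np

def pearson_r(x, y):
    """Compute Pearson's r directly from the definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    # Sum of joint deviations, normalized by individual deviations
    return (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

# Perfectly linear data gives r = 1
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```

In practice you would use one of the library functions covered below, but writing it out once makes the normalization step concrete.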
The coefficient ranges from -1 to +1:
- +1: Perfect positive correlation (as X increases, Y increases proportionally)
- 0: No linear relationship
- -1: Perfect negative correlation (as X increases, Y decreases proportionally)
Use Pearson correlation when you have two continuous variables and suspect a linear relationship. It’s not appropriate for ordinal data, non-linear relationships, or when your data violates normality assumptions severely.
Prerequisites and Setup
You’ll need three libraries that form the foundation of scientific Python:
import numpy as np
from scipy import stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Set random seed for reproducibility
np.random.seed(42)
Let’s create a realistic dataset to work with throughout this article. We’ll simulate data representing the relationship between study hours and exam scores:
# Generate correlated sample data
n_samples = 100
# Study hours (independent variable)
study_hours = np.random.uniform(1, 10, n_samples)
# Exam scores with some noise (dependent variable)
# True relationship: score ≈ 50 + 5 * hours + noise
exam_scores = 50 + 5 * study_hours + np.random.normal(0, 5, n_samples)
# Clip scores to realistic range
exam_scores = np.clip(exam_scores, 0, 100)
# Create a DataFrame for later use
df = pd.DataFrame({
    'study_hours': study_hours,
    'exam_scores': exam_scores,
    'sleep_hours': np.random.uniform(4, 9, n_samples),
    'practice_tests': np.random.poisson(3, n_samples)
})
print(df.head())
This gives us a dataset with a known positive correlation between study hours and exam scores, plus some additional variables for multi-column analysis.
Method 1: Using NumPy
NumPy’s np.corrcoef() is the fastest way to calculate Pearson correlation. It returns a correlation matrix, which is useful when comparing multiple variables but requires an extra step for pairwise analysis.
# Calculate correlation between two arrays
correlation_matrix = np.corrcoef(study_hours, exam_scores)
print("Correlation Matrix:")
print(correlation_matrix)
# Extract the correlation coefficient
r_value = correlation_matrix[0, 1]
print(f"\nPearson r: {r_value:.4f}")
Output:
Correlation Matrix:
[[1.         0.91842356]
 [0.91842356 1.        ]]
Pearson r: 0.9184
The matrix shows correlations between all pairs. The diagonal is always 1 (a variable correlates perfectly with itself), and the off-diagonal elements give you the correlation between different variables.
For multiple variables at once:
# Correlation matrix for multiple arrays
variables = np.array([study_hours, exam_scores, df['sleep_hours'].values])
multi_corr = np.corrcoef(variables)
print("Multi-variable correlation matrix:")
print(np.round(multi_corr, 3))
When to use NumPy: You want speed and simplicity, don’t need p-values, and are working with NumPy arrays directly.
Method 2: Using SciPy
SciPy’s pearsonr() function is my go-to for serious statistical analysis. Unlike NumPy, it returns both the correlation coefficient and a p-value for hypothesis testing.
from scipy.stats import pearsonr
# Calculate correlation with p-value
r_value, p_value = pearsonr(study_hours, exam_scores)
print(f"Pearson r: {r_value:.4f}")
print(f"P-value: {p_value:.2e}")
# Interpret the results
alpha = 0.05
if p_value < alpha:
    print(f"Correlation is statistically significant (p < {alpha})")
else:
    print(f"Correlation is NOT statistically significant (p >= {alpha})")
Output:
Pearson r: 0.9184
P-value: 1.23e-42
Correlation is statistically significant (p < 0.05)
The p-value tests the null hypothesis that the true correlation is zero. A small p-value (typically < 0.05) suggests the observed correlation is unlikely to occur by chance.
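Under that null hypothesis, the statistic t = r·√(n−2)/√(1−r²) follows a t-distribution with n−2 degrees of freedom, so the p-value can be reproduced by hand. A sketch with synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)

r, p = stats.pearsonr(x, y)

# Recompute the two-sided p-value from the t statistic
n = len(x)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
p_manual = 2 * stats.t.sf(abs(t), df=n - 2)

print(f"scipy p: {p:.6g}, manual p: {p_manual:.6g}")
```

The two values agree, which is a useful sanity check when you need to explain where the p-value comes from.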
You can also get confidence intervals for your correlation estimate:
from scipy.stats import pearsonr
# SciPy 1.9+ supports confidence intervals
result = pearsonr(study_hours, exam_scores)
ci = result.confidence_interval(confidence_level=0.95)
print(f"Pearson r: {result.statistic:.4f}")
print(f"95% CI: [{ci.low:.4f}, {ci.high:.4f}]")
When to use SciPy: You need statistical significance testing, confidence intervals, or are publishing results that require p-values.
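Behind the scenes, that confidence interval comes from the Fisher z-transformation: arctanh(r) is approximately normal with standard error 1/√(n−3). Assuming SciPy 1.9+ as above, a sketch that reproduces the interval by hand:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 0.8 * x + 0.6 * rng.normal(size=100)

res = stats.pearsonr(x, y)
ci = res.confidence_interval(confidence_level=0.95)

# Fisher z-transform: arctanh(r) ~ Normal(arctanh(rho), 1/(n-3))
n = len(x)
z = np.arctanh(res.statistic)
se = 1 / np.sqrt(n - 3)
z_crit = stats.norm.ppf(0.975)  # two-sided 95%
lo, hi = np.tanh(z - z_crit * se), np.tanh(z + z_crit * se)

print(f"scipy CI:  [{ci.low:.4f}, {ci.high:.4f}]")
print(f"manual CI: [{lo:.4f}, {hi:.4f}]")
```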
Method 3: Using Pandas
Pandas shines when you’re working with DataFrames and want to analyze correlations across multiple columns simultaneously.
# Correlation between two columns
r = df['study_hours'].corr(df['exam_scores'])
print(f"Correlation between study_hours and exam_scores: {r:.4f}")
# Full correlation matrix for the DataFrame
correlation_matrix = df.corr()
print("\nFull correlation matrix:")
print(correlation_matrix.round(3))
Output:
Correlation between study_hours and exam_scores: 0.9184
Full correlation matrix:
                study_hours  exam_scores  sleep_hours  practice_tests
study_hours           1.000        0.918       -0.023           0.087
exam_scores           0.918        1.000       -0.041           0.102
sleep_hours          -0.023       -0.041        1.000          -0.156
practice_tests        0.087        0.102       -0.156           1.000
Pandas also lets you specify the correlation method, though Pearson is the default:
# Explicitly specify Pearson (default)
pearson_corr = df.corr(method='pearson')
# For comparison, Spearman (rank-based, handles non-linear monotonic relationships)
spearman_corr = df.corr(method='spearman')
When to use Pandas: You’re doing exploratory data analysis on DataFrames, need quick correlation matrices, or want to chain with other Pandas operations.
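One practical advantage worth knowing: Pandas drops missing values pairwise, while np.corrcoef propagates NaN. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, np.nan],
    'b': [2.0, 4.0, 6.0, np.nan, 10.0],
})

# Pandas uses only the complete (a, b) pairs
print(df['a'].corr(df['b']))

# NumPy propagates NaN instead
print(np.corrcoef(df['a'], df['b'])[0, 1])  # nan
```

With real data containing gaps, this difference is a common source of confusion when the two libraries disagree.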
Visualizing Correlation
Numbers tell part of the story, but visualization reveals patterns that statistics miss.
Scatter Plot with Trend Line
fig, ax = plt.subplots(figsize=(10, 6))
# Scatter plot
ax.scatter(study_hours, exam_scores, alpha=0.6, edgecolors='black', linewidth=0.5)
# Add trend line
z = np.polyfit(study_hours, exam_scores, 1)
p = np.poly1d(z)
x_line = np.linspace(study_hours.min(), study_hours.max(), 100)
ax.plot(x_line, p(x_line), "r--", linewidth=2, label=f'Trend line (r={r_value:.3f})')
ax.set_xlabel('Study Hours', fontsize=12)
ax.set_ylabel('Exam Score', fontsize=12)
ax.set_title('Study Hours vs Exam Scores', fontsize=14)
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('scatter_correlation.png', dpi=150)
plt.show()
Correlation Heatmap
For multi-variable analysis, heatmaps make patterns immediately visible:
fig, ax = plt.subplots(figsize=(8, 6))
# Create heatmap
sns.heatmap(
    df.corr(),
    annot=True,
    fmt='.2f',
    cmap='coolwarm',
    center=0,
    square=True,
    linewidths=0.5,
    ax=ax,
    vmin=-1,
    vmax=1
)
ax.set_title('Correlation Heatmap', fontsize=14)
plt.tight_layout()
plt.savefig('correlation_heatmap.png', dpi=150)
plt.show()
The coolwarm colormap centers on white (zero correlation), with blue for negative and red for positive correlations.
Common Pitfalls and Best Practices
Outliers Distort Everything
A single outlier can dramatically change your correlation coefficient. Always visualize your data first.
# Original data
r_original, _ = pearsonr(study_hours, exam_scores)
# Add a single outlier
study_hours_outlier = np.append(study_hours, 2) # Low study hours
exam_scores_outlier = np.append(exam_scores, 100) # But perfect score
r_with_outlier, _ = pearsonr(study_hours_outlier, exam_scores_outlier)
print(f"Original correlation: {r_original:.4f}")
print(f"With one outlier: {r_with_outlier:.4f}")
print(f"Difference: {abs(r_original - r_with_outlier):.4f}")
# Visualize the impact
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(study_hours, exam_scores, alpha=0.6)
axes[0].set_title(f'Original Data (r = {r_original:.3f})')
axes[0].set_xlabel('Study Hours')
axes[0].set_ylabel('Exam Score')
axes[1].scatter(study_hours_outlier, exam_scores_outlier, alpha=0.6)
axes[1].scatter([2], [100], color='red', s=100, zorder=5, label='Outlier')
axes[1].set_title(f'With Outlier (r = {r_with_outlier:.3f})')
axes[1].set_xlabel('Study Hours')
axes[1].set_ylabel('Exam Score')
axes[1].legend()
plt.tight_layout()
plt.show()
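If outliers are unavoidable, Spearman's rank correlation is a more robust alternative, since it operates on ranks rather than raw values. A sketch with synthetic data in the spirit of the example above:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(42)
x = rng.uniform(1, 10, 100)
y = 50 + 5 * x + rng.normal(0, 5, 100)

# Inject one extreme point: few study hours, absurdly high score
x_out = np.append(x, 2)
y_out = np.append(y, 300)

r_p, _ = pearsonr(x_out, y_out)
r_s, _ = spearmanr(x_out, y_out)
print(f"Pearson with outlier:  {r_p:.3f}")
print(f"Spearman with outlier: {r_s:.3f}")
```

A single extreme point can cut Pearson's r roughly in half here, while Spearman barely moves, because one point can only shift the ranks so much.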
Check Linearity Assumptions
Pearson correlation measures linear relationships. A perfect quadratic relationship will show a correlation near zero:
# Non-linear relationship
x = np.linspace(-10, 10, 100)
y = x ** 2 # Perfect quadratic relationship
r_nonlinear, _ = pearsonr(x, y)
print(f"Correlation for y = x²: {r_nonlinear:.4f}") # Near zero!
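For relationships that are non-linear but still monotonic, Spearman's rank correlation captures what Pearson misses. A quick sketch with y = eˣ:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.linspace(0, 10, 100)
y = np.exp(x)  # strongly non-linear, but perfectly monotonic

r_pearson, _ = pearsonr(x, y)
r_spearman, _ = spearmanr(x, y)
print(f"Pearson:  {r_pearson:.3f}")   # well below 1
print(f"Spearman: {r_spearman:.3f}")  # exactly 1.0
```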
Correlation Is Not Causation
This bears repeating: a high correlation between X and Y doesn’t mean X causes Y. Both could be caused by a third variable, or the relationship could be coincidental. Use domain knowledge and experimental design to establish causation.
Sample Size Matters
Small samples produce unreliable correlations. With n=10, you need r > 0.63 for significance at α=0.05. With n=100, r > 0.20 is sufficient. Always report sample sizes alongside correlation values.
# Check required sample size for detecting a given correlation
from scipy.stats import pearsonr
# Simulate detecting r=0.3 with different sample sizes
for n in [10, 30, 50, 100, 200]:
    # Generate data with true r ≈ 0.3
    x = np.random.normal(0, 1, n)
    y = 0.3 * x + np.sqrt(1 - 0.3**2) * np.random.normal(0, 1, n)
    r, p = pearsonr(x, y)
    sig = "significant" if p < 0.05 else "not significant"
    print(f"n={n:3d}: r={r:.3f}, p={p:.4f} ({sig})")
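The critical thresholds quoted above (r > 0.63 at n=10, r > 0.20 at n=100) can be derived by inverting the t-test: r_crit = t_crit / √(t_crit² + n − 2). A sketch:

```python
import numpy as np
from scipy import stats

def critical_r(n, alpha=0.05):
    """Smallest |r| significant at a two-sided alpha for sample size n."""
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return t_crit / np.sqrt(t_crit**2 + n - 2)

for n in [10, 30, 100]:
    print(f"n={n:3d}: critical r = {critical_r(n):.3f}")
```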
Conclusion
Python gives you multiple paths to Pearson correlation, each suited to different workflows. Use NumPy for speed, SciPy for statistical rigor, and Pandas for DataFrame-centric analysis. Whichever method you choose, always visualize your data, check for outliers, and verify that a linear model makes sense for your variables. Correlation is a powerful tool, but only when applied thoughtfully.