How to Use scipy.stats.pearsonr in Python
Key Insights
- scipy.stats.pearsonr returns both the correlation coefficient and a p-value, making it the go-to choice when you need statistical significance testing rather than just the correlation value.
- The function requires clean, numeric data with no NaN values and at least two data points; failing to handle these edge cases is the most common source of errors.
- While NumPy and pandas offer faster correlation calculations, pearsonr remains essential for hypothesis-testing workflows where you need to determine whether a correlation is statistically meaningful.
Introduction to Pearson Correlation
The Pearson correlation coefficient measures the linear relationship between two continuous variables. It produces a value between -1 and 1, where -1 indicates a perfect negative linear relationship, 0 indicates no linear relationship, and 1 indicates a perfect positive linear relationship.
Use Pearson correlation when:
- Both variables are continuous and approximately normally distributed
- You’re looking for linear relationships (not curved or complex patterns)
- Your data doesn’t contain significant outliers
Don’t use it when:
- Your data is ordinal or ranked (use Spearman’s correlation instead)
- The relationship is non-linear (consider Spearman or other methods)
- You have heavy outliers that could skew results
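To see why non-linear relationships call for Spearman instead, here is a quick sketch comparing the two on data that is perfectly monotonic but curved. The dataset is invented for illustration:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# A perfectly monotonic but non-linear relationship (y = x**3)
x = np.arange(1, 11, dtype=float)
y = x ** 3

r_pearson, _ = pearsonr(x, y)
r_spearman, _ = spearmanr(x, y)

print(f"Pearson:  {r_pearson:.4f}")   # noticeably below 1: misses the curvature
print(f"Spearman: {r_spearman:.4f}")  # 1.0: the ranks agree exactly
```

Spearman works on ranks, so any monotonic relationship scores a perfect 1, while Pearson only rewards straight-line relationships.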
The scipy.stats.pearsonr function is the standard tool for computing this coefficient in Python because it provides both the correlation value and a p-value for significance testing.
Function Syntax and Parameters
The function signature is straightforward:
from scipy.stats import pearsonr
correlation, p_value = pearsonr(x, y)
Parameters:
- x: Array-like, first set of observations
- y: Array-like, second set of observations (must be same length as x)
Returns:
- statistic: The Pearson correlation coefficient (float between -1 and 1)
- pvalue: Two-tailed p-value for testing the null hypothesis of no correlation
Here’s the most basic usage:
from scipy.stats import pearsonr
# Two simple arrays representing observations
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
# Calculate Pearson correlation
result = pearsonr(x, y)
print(f"Correlation coefficient: {result.statistic}")
print(f"P-value: {result.pvalue}")
Output:
Correlation coefficient: 0.8320502943378437
P-value: 0.08050112948805689
Note that pearsonr returns a PearsonRResult object in recent SciPy versions, which you can unpack directly or access via attributes. Both approaches work:
# Unpacking directly
corr, p_val = pearsonr(x, y)
# Or using attributes
result = pearsonr(x, y)
corr = result.statistic
p_val = result.pvalue
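In SciPy 1.9 and later, the result object also exposes a confidence_interval() method, which gives you a range estimate for the correlation rather than a single point value. A short sketch with made-up data:

```python
from scipy.stats import pearsonr

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 4, 5, 4, 5, 7, 8, 9]

result = pearsonr(x, y)
# 95% confidence interval for the population correlation
ci = result.confidence_interval(confidence_level=0.95)
print(f"r = {result.statistic:.4f}")
print(f"95% CI: [{ci.low:.4f}, {ci.high:.4f}]")
```

A wide interval (common with small samples) is a useful reminder that the point estimate alone can be misleading.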
Interpreting the Results
Understanding what the numbers mean is more important than calculating them. Here’s a practical interpretation guide:
Correlation Coefficient Interpretation:
- 0.7 to 1.0 (or -0.7 to -1.0): Strong relationship
- 0.4 to 0.7 (or -0.4 to -0.7): Moderate relationship
- 0.2 to 0.4 (or -0.2 to -0.4): Weak relationship
- 0.0 to 0.2 (or 0.0 to -0.2): Very weak or no relationship
P-value Interpretation:
- p < 0.05: The correlation is statistically significant (commonly used threshold)
- p >= 0.05: Cannot reject the null hypothesis that there’s no correlation
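If you want to see where the p-value comes from, it can be reproduced by hand: under the null hypothesis of no correlation, the statistic t = r·sqrt(n-2)/sqrt(1-r²) follows a t-distribution with n-2 degrees of freedom, and the two-tailed p-value is twice the upper tail probability. A sketch with made-up data:

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2, 4, 5, 4, 5, 7, 8, 9])

r, p_scipy = stats.pearsonr(x, y)

# Under H0 (no correlation), t = r*sqrt(n-2)/sqrt(1-r^2)
# follows a t-distribution with n-2 degrees of freedom
n = len(x)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
p_manual = 2 * stats.t.sf(abs(t), df=n - 2)

print(f"scipy p-value:  {p_scipy:.6f}")
print(f"manual p-value: {p_manual:.6f}")
```

The two values should agree to many decimal places, since SciPy's exact formulation is mathematically equivalent to this t-test.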
Here’s a function that provides human-readable interpretation:
from scipy.stats import pearsonr
def interpret_correlation(x, y, alpha=0.05):
    """Calculate and interpret Pearson correlation."""
    corr, p_value = pearsonr(x, y)
    # Determine strength
    abs_corr = abs(corr)
    if abs_corr >= 0.7:
        strength = "strong"
    elif abs_corr >= 0.4:
        strength = "moderate"
    elif abs_corr >= 0.2:
        strength = "weak"
    else:
        strength = "very weak or no"
    # Determine direction
    direction = "positive" if corr > 0 else "negative"
    # Determine significance
    significant = "statistically significant" if p_value < alpha else "not statistically significant"
    print(f"Correlation coefficient: {corr:.4f}")
    print(f"P-value: {p_value:.4f}")
    print("\nInterpretation:")
    print(f"There is a {strength} {direction} linear relationship (r = {corr:.4f}).")
    print(f"This correlation is {significant} at α = {alpha}.")
    return corr, p_value
# Example usage
hours_studied = [1, 2, 3, 4, 5, 6, 7, 8]
exam_scores = [52, 58, 65, 70, 74, 80, 85, 90]
interpret_correlation(hours_studied, exam_scores)
Output:
Correlation coefficient: 0.9894
P-value: 0.0000
Interpretation:
There is a strong positive linear relationship (r = 0.9894).
This correlation is statistically significant at α = 0.05.
Practical Example with Real Data
Let’s work through a realistic example analyzing the relationship between advertising spend and sales revenue:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
# Create sample business data
np.random.seed(42)
n_months = 50
# Advertising spend (in thousands)
ad_spend = np.random.uniform(10, 100, n_months)
# Sales revenue with some noise (correlated with ad spend)
sales = 2.5 * ad_spend + np.random.normal(0, 20, n_months) + 50
# Create DataFrame
df = pd.DataFrame({
    'ad_spend_k': ad_spend,
    'sales_k': sales
})
# Calculate correlation
corr, p_value = pearsonr(df['ad_spend_k'], df['sales_k'])
# Create visualization
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(df['ad_spend_k'], df['sales_k'], alpha=0.6, edgecolors='black', linewidth=0.5)
# Add trend line
z = np.polyfit(df['ad_spend_k'], df['sales_k'], 1)
p = np.poly1d(z)
x_line = np.linspace(df['ad_spend_k'].min(), df['ad_spend_k'].max(), 100)
ax.plot(x_line, p(x_line), "r--", alpha=0.8, label='Trend line')
# Add correlation annotation
textstr = f'r = {corr:.3f}\np = {p_value:.4f}'
props = dict(boxstyle='round', facecolor='wheat', alpha=0.5)
ax.text(0.05, 0.95, textstr, transform=ax.transAxes, fontsize=12,
        verticalalignment='top', bbox=props)
ax.set_xlabel('Advertising Spend ($K)', fontsize=12)
ax.set_ylabel('Sales Revenue ($K)', fontsize=12)
ax.set_title('Advertising Spend vs. Sales Revenue', fontsize=14)
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('correlation_plot.png', dpi=150)
plt.show()
print(f"\nStatistical Summary:")
print(f"Correlation: {corr:.4f}")
print(f"P-value: {p_value:.2e}")
print(f"Sample size: {len(df)}")
This example demonstrates the typical workflow: load data, calculate correlation, and visualize results with proper annotation.
Handling Edge Cases and Common Errors
Real-world data is messy. Here are the common issues you’ll encounter and how to handle them:
import numpy as np
import warnings
from scipy.stats import pearsonr, ConstantInputWarning
# Issue 1: Constant input arrays
print("Issue 1: Constant input")
x_constant = [5, 5, 5, 5, 5]
y_normal = [1, 2, 3, 4, 5]
with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    result = pearsonr(x_constant, y_normal)
    if w:
        print(f"Warning: {w[0].category.__name__}")
print(f"Result: r={result.statistic}, p={result.pvalue}")
# Returns NaN because correlation is undefined for constant data
print("\n" + "="*50 + "\n")
# Issue 2: NaN values in data
print("Issue 2: NaN values")
x_with_nan = [1, 2, np.nan, 4, 5]
y_with_nan = [2, 4, 6, np.nan, 10]
# Depending on your SciPy version, NaN inputs either propagate to a
# NaN result or raise an error, so handle both cases
try:
    result = pearsonr(x_with_nan, y_with_nan)
    print(f"Result: r={result.statistic}")  # typically nan
except ValueError as e:
    print(f"Error: {e}")
# Solution: Clean the data first
def clean_for_correlation(x, y):
    """Remove pairs where either value is NaN."""
    x = np.array(x)
    y = np.array(y)
    mask = ~(np.isnan(x) | np.isnan(y))
    return x[mask], y[mask]
x_clean, y_clean = clean_for_correlation(x_with_nan, y_with_nan)
print(f"Cleaned data: x={x_clean}, y={y_clean}")
corr, p = pearsonr(x_clean, y_clean)
print(f"Correlation after cleaning: {corr:.4f}")
print("\n" + "="*50 + "\n")
# Issue 3: Insufficient data points
print("Issue 3: Too few data points")
x_short = [1, 2]
y_short = [3, 4]
# This works but p-value may not be meaningful
corr, p = pearsonr(x_short, y_short)
print(f"With 2 points: r={corr:.4f}, p={p:.4f}")
print("Warning: With only 2 points, correlation will always be ±1")
print("\n" + "="*50 + "\n")
# Issue 4: Mismatched array lengths
print("Issue 4: Different array lengths")
x_long = [1, 2, 3, 4, 5]
y_short = [1, 2, 3]
try:
    pearsonr(x_long, y_short)
except ValueError as e:
    print(f"Error: {e}")
Best practice: Always validate your data before calling pearsonr:
def safe_pearsonr(x, y, min_samples=3):
    """Safely calculate Pearson correlation with validation."""
    x = np.array(x, dtype=float)
    y = np.array(y, dtype=float)
    # Remove NaN pairs
    mask = ~(np.isnan(x) | np.isnan(y))
    x, y = x[mask], y[mask]
    if len(x) < min_samples:
        raise ValueError(f"Need at least {min_samples} valid data points")
    if np.std(x) == 0 or np.std(y) == 0:
        raise ValueError("Cannot calculate correlation for constant data")
    return pearsonr(x, y)
Comparing with Alternative Methods
When should you use pearsonr versus other options? Here’s a direct comparison:
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
# Sample data
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2.1, 4.2, 5.8, 8.1, 9.9, 12.1, 14.0, 15.8, 18.2, 20.0])
# Method 1: scipy.stats.pearsonr
corr_scipy, p_value = pearsonr(x, y)
print(f"scipy.stats.pearsonr:")
print(f" Correlation: {corr_scipy:.6f}")
print(f" P-value: {p_value:.6f}")
# Method 2: numpy.corrcoef
corr_matrix = np.corrcoef(x, y)
corr_numpy = corr_matrix[0, 1]
print(f"\nnumpy.corrcoef:")
print(f" Correlation: {corr_numpy:.6f}")
print(f" P-value: Not available")
# Method 3: pandas.DataFrame.corr()
df = pd.DataFrame({'x': x, 'y': y})
corr_pandas = df['x'].corr(df['y'])
print(f"\npandas.Series.corr():")
print(f" Correlation: {corr_pandas:.6f}")
print(f" P-value: Not available")
# Performance comparison for large datasets
print("\n" + "="*50)
print("When to use each method:")
print("="*50)
print("""
scipy.stats.pearsonr:
✓ Need p-value for hypothesis testing
✓ Statistical analysis workflows
✗ Slower for correlation matrices
numpy.corrcoef:
✓ Fast for computing correlation matrices
✓ Multiple variables at once
✗ No p-value
pandas.DataFrame.corr():
✓ Working with DataFrames
✓ Handles NaN automatically (pairwise deletion)
✓ Easy correlation matrix for many columns
✗ No p-value
""")
The bottom line: Use pearsonr when you need the p-value. Use NumPy or pandas when you just need the correlation coefficient and want better performance or convenience.
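If performance matters for your workload, it is worth measuring rather than assuming. Here is a minimal timing sketch using the standard library's timeit; the exact numbers will vary by machine and SciPy/NumPy version:

```python
import timeit

import numpy as np
from scipy.stats import pearsonr

# Synthetic data: 100k correlated observations
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)

# Time each approach over 20 repetitions
t_scipy = timeit.timeit(lambda: pearsonr(x, y), number=20)
t_numpy = timeit.timeit(lambda: np.corrcoef(x, y)[0, 1], number=20)

print(f"pearsonr:    {t_scipy:.4f} s")
print(f"np.corrcoef: {t_numpy:.4f} s")
```

Whichever wins on your hardware, the difference rarely matters for a single pair of arrays; it adds up when you compute many correlations in a loop, where np.corrcoef on a matrix of columns is usually the better fit.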
Conclusion
scipy.stats.pearsonr is the right tool when you need both the correlation coefficient and statistical significance testing. Remember these key practices:
- Always clean your data first—remove NaN values and ensure arrays have equal length
- Check for constant inputs—correlation is undefined when one variable doesn’t vary
- Interpret both values—the correlation coefficient tells you strength and direction, the p-value tells you if it’s statistically meaningful
- Use the right tool—if you don’t need the p-value, NumPy or pandas may be simpler
- Visualize your data—a scatter plot reveals patterns that correlation alone might miss
Correlation is a starting point, not an endpoint. A statistically significant correlation doesn’t imply causation, and it only captures linear relationships. Always combine statistical measures with visual inspection and domain knowledge.