How to Calculate Correlation with NumPy
Key Insights
- NumPy’s np.corrcoef() returns a correlation matrix, not a single value. Even for two arrays, you’ll get a 2x2 matrix where the off-diagonal elements contain the correlation coefficient you want.
- Correlation matrices are symmetric, so you only need to examine the upper or lower triangle when analyzing relationships between multiple variables.
- NumPy calculates Pearson correlation by default; for Spearman (rank-based) correlation or handling missing values gracefully, you’ll need to reach for SciPy instead.
Introduction
Correlation measures the strength and direction of a linear relationship between two variables. It’s one of the most fundamental tools in data analysis, and you’ll reach for it constantly: during exploratory data analysis to understand your dataset, in feature selection to identify redundant variables, and when building intuition about which variables might predict your target.
NumPy provides fast, vectorized correlation calculations that work seamlessly with the rest of the scientific Python stack. Whether you’re analyzing two variables or computing pairwise correlations across hundreds of features, np.corrcoef() handles the heavy lifting efficiently.
This article covers practical correlation calculations with NumPy—from basic two-variable analysis to full correlation matrices, including the gotchas that trip up newcomers.
Understanding Correlation Coefficients
Before diving into code, let’s establish what correlation actually tells us.
Pearson correlation measures linear relationships. It assumes both variables are continuous and roughly normally distributed. The coefficient ranges from -1 to +1:
- +1: Perfect positive linear relationship (as X increases, Y increases proportionally)
- 0: No linear relationship
- -1: Perfect negative linear relationship (as X increases, Y decreases proportionally)
In practice, here’s how I interpret correlation strength:
| Absolute Value | Interpretation |
|---|---|
| 0.0 - 0.3 | Weak |
| 0.3 - 0.7 | Moderate |
| 0.7 - 1.0 | Strong |
Spearman correlation measures monotonic relationships using rank ordering rather than raw values. It’s more robust to outliers and works with ordinal data. NumPy doesn’t provide Spearman directly—you’ll need SciPy for that.
For most numerical analysis, Pearson correlation (what NumPy calculates) is your starting point.
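To build intuition for the difference: Spearman correlation is simply Pearson correlation computed on ranks. Here is a minimal NumPy-only sketch; the helper name spearman_numpy is my own, and it assumes no tied values (use scipy.stats.spearmanr in real work, which handles ties and also returns a p-value):

```python
import numpy as np

def spearman_numpy(x, y):
    """Spearman correlation as Pearson correlation of ranks.

    Illustrative sketch: assumes no tied values.
    Prefer scipy.stats.spearmanr in practice.
    """
    # argsort of argsort converts values to 0-based ranks
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return np.corrcoef(rank_x, rank_y)[0, 1]

# A monotonic but non-linear relationship: the ranks match exactly,
# so Spearman sees a perfect relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.exp(x)
print(spearman_numpy(x, y))  # 1.0
```

Because exp is strictly increasing, the rank arrays are identical, and Pearson correlation of identical arrays is exactly 1.0.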
Basic Correlation with np.corrcoef()
The core function is np.corrcoef(). Let’s start with a straightforward example: measuring the relationship between daily temperature and ice cream sales.
```python
import numpy as np

# Daily temperature (Fahrenheit) and ice cream sales (units)
temperature = np.array([65, 70, 75, 80, 85, 90, 72, 78, 82, 88])
ice_cream_sales = np.array([120, 145, 180, 210, 250, 290, 155, 195, 225, 275])

# Calculate correlation
correlation_matrix = np.corrcoef(temperature, ice_cream_sales)
print(correlation_matrix)
```
Output:
```
[[1.         0.99634463]
 [0.99634463 1.        ]]
```
Here’s what trips up newcomers: np.corrcoef() always returns a matrix, even for two variables. The result is a 2x2 matrix where:
- [0, 0] and [1, 1] are the correlations of each variable with itself (always 1.0)
- [0, 1] and [1, 0] contain the correlation between the two variables (identical values)
To extract just the correlation coefficient:
```python
correlation = np.corrcoef(temperature, ice_cream_sales)[0, 1]
print(f"Correlation: {correlation:.4f}")
# Output: Correlation: 0.9963
```
A correlation of 0.996 indicates an extremely strong positive relationship—as temperature rises, ice cream sales rise almost perfectly in proportion. This makes intuitive sense.
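Under the hood, the Pearson coefficient is just the covariance of the two variables normalized by the product of their standard deviations. A quick sanity check, reusing the arrays from above (the ddof values must match so the normalizations cancel):

```python
import numpy as np

temperature = np.array([65, 70, 75, 80, 85, 90, 72, 78, 82, 88])
ice_cream_sales = np.array([120, 145, 180, 210, 250, 290, 155, 195, 225, 275])

# Pearson r = cov(X, Y) / (std(X) * std(Y))
cov_xy = np.cov(temperature, ice_cream_sales, ddof=1)[0, 1]
r_manual = cov_xy / (np.std(temperature, ddof=1) * np.std(ice_cream_sales, ddof=1))

r_numpy = np.corrcoef(temperature, ice_cream_sales)[0, 1]
print(round(r_manual, 4), round(r_numpy, 4))  # both 0.9963
```

The two values agree to floating-point precision, which is a useful check when you need to compute correlations inside a larger vectorized pipeline.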
Important: By default, np.corrcoef() treats each input array as a separate variable. If you pass a single 2D array, it treats each row as a variable (not each column). This catches people off guard:
```python
# Each ROW is treated as a variable
data = np.array([
    [65, 70, 75, 80, 85],      # Variable 1 (temperature)
    [120, 145, 180, 210, 250]  # Variable 2 (sales)
])

correlation_matrix = np.corrcoef(data)
print(correlation_matrix[0, 1])
# Output: 0.9973...
```
If your data has variables in columns (the common pandas convention), transpose it first or use the rowvar=False parameter.
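To make that concrete, here is the same five-observation dataset stored column-wise, with both workarounds shown; rowvar=False is a documented np.corrcoef() parameter:

```python
import numpy as np

# Observations as rows, variables as columns (the pandas convention)
data_cols = np.array([
    [65, 120],
    [70, 145],
    [75, 180],
    [80, 210],
    [85, 250],
])

# Option 1: transpose so rows become variables
r_transposed = np.corrcoef(data_cols.T)[0, 1]

# Option 2: tell corrcoef that columns are the variables
r_rowvar = np.corrcoef(data_cols, rowvar=False)[0, 1]

print(np.isclose(r_transposed, r_rowvar))  # True
```

Both give the same coefficient; forgetting either step silently computes correlations between observations instead of variables.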
Correlation Matrices for Multiple Variables
Real-world analysis typically involves many variables. Computing pairwise correlations reveals which features move together—essential for understanding multicollinearity in regression or identifying redundant features.
```python
import numpy as np

# Simulated dataset: 100 observations, 5 variables
np.random.seed(42)
n_samples = 100

# Create variables with known relationships
var1 = np.random.randn(n_samples)
var2 = var1 * 0.8 + np.random.randn(n_samples) * 0.5   # Correlated with var1
var3 = np.random.randn(n_samples)                      # Independent
var4 = -var1 * 0.6 + np.random.randn(n_samples) * 0.7  # Negatively correlated with var1
var5 = var2 * 0.9 + np.random.randn(n_samples) * 0.3   # Correlated with var2

# Stack into 2D array (variables as rows)
data = np.vstack([var1, var2, var3, var4, var5])

# Compute correlation matrix
corr_matrix = np.corrcoef(data)

# Pretty print with labels
variables = ['var1', 'var2', 'var3', 'var4', 'var5']
print("Correlation Matrix:")
print("-" * 50)
print(f"{'':>8}", end="")
for v in variables:
    print(f"{v:>8}", end="")
print()
for i, row in enumerate(corr_matrix):
    print(f"{variables[i]:>8}", end="")
    for val in row:
        print(f"{val:>8.3f}", end="")
    print()
```
Output:
```
Correlation Matrix:
--------------------------------------------------
            var1    var2    var3    var4    var5
    var1   1.000   0.834  -0.039  -0.603   0.763
    var2   0.834   1.000  -0.014  -0.513   0.922
    var3  -0.039  -0.014   1.000   0.089  -0.032
    var4  -0.603  -0.513   0.089   1.000  -0.481
    var5   0.763   0.922  -0.032  -0.481   1.000
```
The matrix is symmetric—corr_matrix[i, j] equals corr_matrix[j, i]. The diagonal is always 1.0 (each variable correlates perfectly with itself).
From this output, we can immediately see:
- var1 and var2 are strongly positively correlated (0.834)
- var2 and var5 are very strongly correlated (0.922)—potential multicollinearity
- var3 is essentially independent of everything else
- var4 has a moderate negative correlation with var1 (-0.603)
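Because the matrix is symmetric, a multicollinearity screen only needs to visit each pair once. A sketch using np.triu_indices over the first three simulated variables from above (the 0.8 threshold is an arbitrary choice for illustration):

```python
import numpy as np

# Recreate the first three simulated variables from above
np.random.seed(42)
n_samples = 100
var1 = np.random.randn(n_samples)
var2 = var1 * 0.8 + np.random.randn(n_samples) * 0.5
var3 = np.random.randn(n_samples)

corr = np.corrcoef(np.vstack([var1, var2, var3]))
names = ['var1', 'var2', 'var3']

# Strict upper triangle: each pair exactly once, diagonal excluded (k=1)
rows, cols = np.triu_indices(len(names), k=1)
for i, j in zip(rows, cols):
    if abs(corr[i, j]) > 0.8:  # arbitrary screening threshold
        print(f"{names[i]} vs {names[j]}: {corr[i, j]:.3f}")
```

With this seed, only the var1/var2 pair clears the threshold, matching the 0.834 entry in the matrix above.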
Practical Considerations
Handling Missing Values
NumPy’s np.corrcoef() doesn’t handle np.nan gracefully—any NaN in your data produces NaN in the output:
```python
temperature = np.array([65, 70, np.nan, 80, 85, 90])
sales = np.array([120, 145, 180, 210, 250, 290])

result = np.corrcoef(temperature, sales)[0, 1]
print(result)  # Output: nan
```
You have two options. First, filter out NaN values before calculation:
```python
temperature = np.array([65, 70, np.nan, 80, 85, 90])
sales = np.array([120, 145, 180, 210, 250, 290])

# Create mask for valid (non-NaN) values
mask = ~np.isnan(temperature) & ~np.isnan(sales)

# Filter both arrays using the mask
temp_clean = temperature[mask]
sales_clean = sales[mask]

correlation = np.corrcoef(temp_clean, sales_clean)[0, 1]
print(f"Correlation (excluding NaN): {correlation:.4f}")
# Output: Correlation (excluding NaN): 0.9971
```
Second, use SciPy’s pearsonr which provides more flexibility, or pandas which handles NaN automatically with its corr() method.
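For example, pandas' Series.corr() excludes index positions where either value is missing, so no manual masking is needed (a sketch; requires pandas installed):

```python
import numpy as np
import pandas as pd

temperature = pd.Series([65, 70, np.nan, 80, 85, 90])
sales = pd.Series([120, 145, 180, 210, 250, 290])

# Series.corr performs pairwise NaN deletion automatically
r = temperature.corr(sales)
print(f"Correlation (pandas): {r:.4f}")
```

This gives the same coefficient as filtering the NaN pair by hand with a NumPy mask.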
Common Pitfalls
Correlation does not imply causation. A strong correlation between ice cream sales and drowning deaths doesn’t mean ice cream causes drowning—both are driven by summer weather.
Non-linear relationships fool Pearson correlation. A perfect quadratic relationship (Y = X²) can show near-zero Pearson correlation:
```python
x = np.linspace(-10, 10, 100)
y = x ** 2

print(f"Correlation: {np.corrcoef(x, y)[0, 1]:.4f}")
# Output: Correlation: 0.0000 (approximately)
```
The relationship is perfectly deterministic, but Pearson correlation misses it entirely because it’s not linear. Always visualize your data.
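Spearman correlation helps with monotonic non-linear relationships that Pearson understates, though it won't rescue the quadratic above, which is not monotonic. A quick comparison (requires SciPy):

```python
import numpy as np
from scipy.stats import spearmanr

x = np.linspace(0.1, 10, 100)
y = np.exp(x)  # strictly increasing, but far from linear

pearson = np.corrcoef(x, y)[0, 1]
spearman, _ = spearmanr(x, y)  # also returns a p-value

print(f"Pearson:  {pearson:.4f}")   # well below 1
print(f"Spearman: {spearman:.4f}")  # 1.0: the ranks agree perfectly
```

Pearson is dragged down by the explosive tail of the exponential, while Spearman only cares that larger x always means larger y.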
Visualizing Correlation Results
Correlation matrices become unwieldy with many variables. Heatmaps make patterns immediately visible:
```python
import numpy as np
import matplotlib.pyplot as plt

# Generate sample data
np.random.seed(42)
n_samples = 100
var1 = np.random.randn(n_samples)
var2 = var1 * 0.8 + np.random.randn(n_samples) * 0.5
var3 = np.random.randn(n_samples)
var4 = -var1 * 0.6 + np.random.randn(n_samples) * 0.7
var5 = var2 * 0.9 + np.random.randn(n_samples) * 0.3

data = np.vstack([var1, var2, var3, var4, var5])
corr_matrix = np.corrcoef(data)
variables = ['var1', 'var2', 'var3', 'var4', 'var5']

# Create heatmap
fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(corr_matrix, cmap='RdBu_r', vmin=-1, vmax=1)

# Add colorbar
cbar = ax.figure.colorbar(im, ax=ax)
cbar.ax.set_ylabel('Correlation', rotation=-90, va='bottom')

# Configure ticks and labels
ax.set_xticks(np.arange(len(variables)))
ax.set_yticks(np.arange(len(variables)))
ax.set_xticklabels(variables)
ax.set_yticklabels(variables)

# Add correlation values as text
for i in range(len(variables)):
    for j in range(len(variables)):
        ax.text(j, i, f'{corr_matrix[i, j]:.2f}',
                ha='center', va='center', color='black', fontsize=10)

ax.set_title('Correlation Matrix Heatmap')
plt.tight_layout()
plt.savefig('correlation_heatmap.png', dpi=150)
plt.show()
```
The RdBu_r colormap works well: red indicates positive correlations, blue indicates negative, and white represents no correlation. Setting vmin=-1 and vmax=1 ensures consistent color scaling across different datasets.
For production work, seaborn’s heatmap() function provides a more polished result with less code, but the matplotlib approach gives you full control.
Conclusion
NumPy’s np.corrcoef() is your go-to function for Pearson correlation calculations. Remember these key points:
- It always returns a matrix; extract [0, 1] for two-variable correlation
- Rows are treated as variables by default; use rowvar=False if your variables are in columns
- NaN values propagate; filter them out before calculation
For Spearman correlation (rank-based, robust to outliers) or Kendall’s tau, use scipy.stats.spearmanr() and scipy.stats.kendalltau(). These also return p-values, which np.corrcoef() doesn’t provide.
When working with pandas DataFrames, df.corr() is often more convenient—it handles NaN automatically and keeps your column labels. But for raw NumPy arrays and maximum performance, np.corrcoef() is the right tool.