NumPy - Correlation Coefficient (np.corrcoef)
Key Insights
- np.corrcoef() computes Pearson correlation coefficients between variables, returning a symmetric matrix where diagonal values equal 1.0 and off-diagonal values show correlations ranging from -1 to 1
- The function accepts both 1D and 2D arrays, treating each row as a variable by default, with the rowvar parameter controlling that interpretation
- Understanding correlation matrices is critical for feature selection, multicollinearity detection, and exploratory data analysis in machine learning pipelines
Understanding Correlation Coefficients
The Pearson correlation coefficient measures linear relationships between variables. NumPy’s np.corrcoef() calculates these coefficients efficiently, producing a correlation matrix that reveals how variables move together. A coefficient of 1.0 indicates perfect positive correlation, -1.0 indicates perfect negative correlation, and 0 indicates no linear relationship.
import numpy as np
# Two variables with positive correlation
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
correlation_matrix = np.corrcoef(x, y)
print(correlation_matrix)
Output:
[[1. 1.]
 [1. 1.]]
The result is a 2×2 matrix where correlation_matrix[0, 1] and correlation_matrix[1, 0] represent the correlation between x and y. Both diagonal elements equal 1.0 because any variable perfectly correlates with itself.
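When you only need the scalar value rather than the whole matrix, index the off-diagonal entry directly:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

# The [0, 1] entry is the correlation between x and y
r = np.corrcoef(x, y)[0, 1]
print(r)  # ~1.0 for this perfectly linear pair
```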
Working with Multiple Variables
When analyzing datasets with multiple features, np.corrcoef() accepts 2D arrays where each row represents a variable. This becomes particularly useful when examining relationships across numerous dimensions simultaneously.
# Three variables
data = np.array([
    [1, 2, 3, 4, 5],    # Variable 1
    [2, 4, 6, 8, 10],   # Variable 2 (positively correlated with Var 1)
    [5, 4, 3, 2, 1]     # Variable 3 (negatively correlated with Var 1)
])
corr_matrix = np.corrcoef(data)
print(corr_matrix)
print(f"\nCorrelation between Var1 and Var2: {corr_matrix[0, 1]:.4f}")
print(f"Correlation between Var1 and Var3: {corr_matrix[0, 2]:.4f}")
print(f"Correlation between Var2 and Var3: {corr_matrix[1, 2]:.4f}")
Output:
[[ 1.  1. -1.]
 [ 1.  1. -1.]
 [-1. -1.  1.]]
Correlation between Var1 and Var2: 1.0000
Correlation between Var1 and Var3: -1.0000
Correlation between Var2 and Var3: -1.0000
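Reading pairs out by hand does not scale to larger matrices. Because the matrix is symmetric, np.triu_indices extracts every unique off-diagonal pair in one step; a small sketch using the same three-variable data:

```python
import numpy as np

data = np.array([
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 10],
    [5, 4, 3, 2, 1],
])
corr_matrix = np.corrcoef(data)

# Indices of the strict upper triangle: each variable pair appears exactly once
rows, cols = np.triu_indices(corr_matrix.shape[0], k=1)
for i, j, r in zip(rows, cols, corr_matrix[rows, cols]):
    print(f"Var{i + 1} vs Var{j + 1}: {r:.4f}")
```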
The rowvar Parameter
By default, np.corrcoef() treats rows as variables. When working with datasets where columns represent features (the standard layout in machine learning), set rowvar=False so that columns are treated as the variables instead.
# Dataset: rows are observations, columns are features
observations = np.array([
    [1, 2, 5],
    [2, 4, 4],
    [3, 6, 3],
    [4, 8, 2],
    [5, 10, 1]
])
# Incorrect: treats observations as variables
wrong_corr = np.corrcoef(observations)
print("Wrong shape:", wrong_corr.shape) # (5, 5) - correlating observations
# Correct: treats features as variables
correct_corr = np.corrcoef(observations, rowvar=False)
print("Correct shape:", correct_corr.shape) # (3, 3) - correlating features
print("\nFeature correlation matrix:")
print(correct_corr)
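Setting rowvar=False is equivalent to transposing the array yourself so that rows become variables; both forms yield the same feature-by-feature matrix:

```python
import numpy as np

observations = np.array([
    [1, 2, 5],
    [2, 4, 4],
    [3, 6, 3],
    [4, 8, 2],
    [5, 10, 1],
])

a = np.corrcoef(observations, rowvar=False)
b = np.corrcoef(observations.T)  # transpose so rows become variables
print(np.allclose(a, b))  # True
```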
Detecting Multicollinearity in Features
Multicollinearity occurs when features are highly correlated with each other, which destabilizes coefficient estimates in regression models. Scanning for coefficient magnitudes above 0.8 or 0.9 helps flag problematic feature pairs.
# Simulate feature set with multicollinearity
np.random.seed(42)
n_samples = 100
feature1 = np.random.randn(n_samples)
feature2 = feature1 + np.random.randn(n_samples) * 0.1 # Highly correlated
feature3 = np.random.randn(n_samples) # Independent
feature4 = feature1 * 2 + 3 # Perfectly correlated
features = np.column_stack([feature1, feature2, feature3, feature4])
corr = np.corrcoef(features, rowvar=False)
# Find highly correlated pairs (excluding diagonal)
threshold = 0.8
high_corr_pairs = []
for i in range(len(corr)):
    for j in range(i + 1, len(corr)):
        if abs(corr[i, j]) > threshold:
            high_corr_pairs.append((i, j, corr[i, j]))
print("Highly correlated feature pairs (|r| > 0.8):")
for i, j, coef in high_corr_pairs:
    print(f"Feature {i} and Feature {j}: {coef:.4f}")
Output:
Highly correlated feature pairs (|r| > 0.8):
Feature 0 and Feature 1: 0.9953
Feature 0 and Feature 3: 1.0000
Feature 1 and Feature 3: 0.9953
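Once problem pairs are identified, a common remedy is to drop one feature from each pair. A minimal greedy sketch (keeps the lower-indexed feature of each pair; the threshold and data match the example above):

```python
import numpy as np

np.random.seed(42)
n_samples = 100
feature1 = np.random.randn(n_samples)
feature2 = feature1 + np.random.randn(n_samples) * 0.1  # Highly correlated
feature3 = np.random.randn(n_samples)                   # Independent
feature4 = feature1 * 2 + 3                             # Perfectly correlated
features = np.column_stack([feature1, feature2, feature3, feature4])

corr = np.corrcoef(features, rowvar=False)
threshold = 0.8

# Greedily mark the higher-indexed feature of each correlated pair for removal
to_drop = set()
n = corr.shape[0]
for i in range(n):
    for j in range(i + 1, n):
        if i not in to_drop and abs(corr[i, j]) > threshold:
            to_drop.add(j)

kept = [k for k in range(n) if k not in to_drop]
reduced = features[:, kept]
print("Kept features:", kept)           # [0, 2] for this data
print("Reduced shape:", reduced.shape)  # (100, 2)
```

Only features 0 and 2 survive, since features 1 and 3 duplicate the information in feature 0.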
Handling Missing Data
NumPy’s np.corrcoef() doesn’t handle NaN values by default. Use masking or preprocessing to manage missing data before computing correlations.
# Data with missing values
data_with_nan = np.array([
    [1.0, 2.0, np.nan],
    [2.0, 4.0, 4.0],
    [3.0, 6.0, 3.0],
    [4.0, np.nan, 2.0],
    [5.0, 10.0, 1.0]
])
# Approach 1: Remove rows with any NaN
clean_data = data_with_nan[~np.isnan(data_with_nan).any(axis=1)]
corr_clean = np.corrcoef(clean_data, rowvar=False)
print("Correlation after removing NaN rows:")
print(corr_clean)
# Approach 2: Column-wise removal for pairwise correlations
def pairwise_correlation(data):
    n_features = data.shape[1]
    corr_matrix = np.zeros((n_features, n_features))
    for i in range(n_features):
        for j in range(n_features):
            # Get valid pairs for this combination
            valid_mask = ~(np.isnan(data[:, i]) | np.isnan(data[:, j]))
            if np.sum(valid_mask) > 1:
                corr_matrix[i, j] = np.corrcoef(
                    data[valid_mask, i],
                    data[valid_mask, j]
                )[0, 1]
            else:
                corr_matrix[i, j] = np.nan
    return corr_matrix
pairwise_corr = pairwise_correlation(data_with_nan)
print("\nPairwise correlation (uses all available data):")
print(pairwise_corr)
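NumPy's masked-array module offers a built-in alternative: np.ma.corrcoef skips masked entries when forming the covariance, giving a pairwise-style result without a hand-written loop (its handling of partially masked columns may differ slightly from the function above):

```python
import numpy as np

data_with_nan = np.array([
    [1.0, 2.0, np.nan],
    [2.0, 4.0, 4.0],
    [3.0, 6.0, 3.0],
    [4.0, np.nan, 2.0],
    [5.0, 10.0, 1.0],
])

# Mask the NaN entries, then correlate the masked columns
masked = np.ma.masked_invalid(data_with_nan)
corr_masked = np.ma.corrcoef(masked, rowvar=False)
print(corr_masked)
```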
Correlation vs Covariance
While related, correlation and covariance serve different purposes. Covariance measures how variables change together but depends on variable scales. Correlation normalizes this to a -1 to 1 range, making it scale-independent.
# Compare correlation and covariance
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
# Scaled version of x
x_scaled = x * 100
correlation_xy = np.corrcoef(x, y)[0, 1]
correlation_scaled = np.corrcoef(x_scaled, y)[0, 1]
covariance_xy = np.cov(x, y)[0, 1]
covariance_scaled = np.cov(x_scaled, y)[0, 1]
print(f"Correlation (original): {correlation_xy:.4f}")
print(f"Correlation (scaled): {correlation_scaled:.4f}")
print(f"Covariance (original): {covariance_xy:.4f}")
print(f"Covariance (scaled): {covariance_scaled:.4f}")
Output:
Correlation (original): 1.0000
Correlation (scaled): 1.0000
Covariance (original): 5.0000
Covariance (scaled): 500.0000
Correlation remains unchanged under scaling, making it preferable for comparing relationships between variables with different units or magnitudes.
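The two quantities are linked by r = cov(x, y) / (σ_x σ_y), which you can verify directly from the covariance matrix:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

cov = np.cov(x, y)  # 2x2 covariance matrix
# Normalize the cross-covariance by the product of both standard deviations
r_manual = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)  # both ~1.0 here
```

Scaling x multiplies cov[0, 1] and sqrt(cov[0, 0]) by the same factor, which is exactly why the ratio, and hence the correlation, is unchanged.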
Visualizing Correlation Matrices
Understanding large correlation matrices requires visualization. Heatmaps provide intuitive representations of correlation structures.
import matplotlib.pyplot as plt
# Generate synthetic dataset
np.random.seed(123)
n_samples = 200
n_features = 6
# Create features with varying correlations
data = np.random.randn(n_samples, n_features)
data[:, 1] = data[:, 0] + np.random.randn(n_samples) * 0.3 # Correlated with feature 0
data[:, 4] = -data[:, 0] + np.random.randn(n_samples) * 0.5 # Negatively correlated
corr = np.corrcoef(data, rowvar=False)
# Create heatmap
fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
# Add colorbar
cbar = plt.colorbar(im, ax=ax)
cbar.set_label('Correlation Coefficient', rotation=270, labelpad=20)
# Set ticks and labels
ax.set_xticks(np.arange(n_features))
ax.set_yticks(np.arange(n_features))
ax.set_xticklabels([f'F{i}' for i in range(n_features)])
ax.set_yticklabels([f'F{i}' for i in range(n_features)])
# Annotate cells with correlation values
for i in range(n_features):
    for j in range(n_features):
        ax.text(j, i, f'{corr[i, j]:.2f}',
                ha="center", va="center", color="black", fontsize=10)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.savefig('correlation_heatmap.png', dpi=150)
Performance Considerations
For large datasets, computing full correlation matrices becomes computationally expensive. Consider computing only necessary correlations or using optimized libraries.
import time
# Compare performance with dataset size
sizes = [100, 1000, 5000]
for n in sizes:
    data = np.random.randn(n, 50)
    start = time.perf_counter()  # perf_counter is preferred for timing
    corr = np.corrcoef(data, rowvar=False)
    elapsed = time.perf_counter() - start
    print(f"n={n}: {elapsed:.4f} seconds")
For extremely large datasets, consider computing correlations for feature subsets or using parallel processing libraries that leverage NumPy’s underlying BLAS operations.
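One such approach: standardize each column, then a single matrix product yields the entire correlation matrix. The matmul dispatches to BLAS, so this sketch is typically as fast as np.corrcoef and makes the underlying computation explicit (equivalent for data where every column has nonzero variance):

```python
import numpy as np

np.random.seed(0)
data = np.random.randn(1000, 50)

# Standardize columns: zero mean, unit sample standard deviation (ddof=1)
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# Correlation matrix via one BLAS-backed matrix multiply
corr_fast = (z.T @ z) / (data.shape[0] - 1)

print(np.allclose(corr_fast, np.corrcoef(data, rowvar=False)))  # True
```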