NumPy - Correlation Coefficient (np.corrcoef)
Key Insights
- np.corrcoef() computes Pearson correlation coefficients between variables, returning a symmetric matrix where diagonal values equal 1.0 and off-diagonal values show correlations ranging from -1 to 1
- The function accepts both 1D and 2D arrays, treating each row as a variable by default, with the rowvar parameter controlling that interpretation
- Understanding correlation matrices is critical for feature selection, multicollinearity detection, and exploratory data analysis in machine learning pipelines
Understanding Correlation Coefficients
The Pearson correlation coefficient measures linear relationships between variables. NumPy’s np.corrcoef() calculates these coefficients efficiently, producing a correlation matrix that reveals how variables move together. A coefficient of 1.0 indicates perfect positive correlation, -1.0 indicates perfect negative correlation, and 0 indicates no linear relationship.
import numpy as np
# Two variables with positive correlation
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
correlation_matrix = np.corrcoef(x, y)
print(correlation_matrix)
Output:
[[1. 1.]
 [1. 1.]]
The result is a 2×2 matrix where correlation_matrix[0, 1] and correlation_matrix[1, 0] represent the correlation between x and y. Both diagonal elements equal 1.0 because any variable perfectly correlates with itself.
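When you only need the scalar value rather than the whole matrix, index the off-diagonal entry directly:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

# The [0, 1] entry is the correlation between x and y
r = np.corrcoef(x, y)[0, 1]
print(r)  # ~1.0 for this perfectly linear pair
```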
Working with Multiple Variables
When analyzing datasets with multiple features, np.corrcoef() accepts 2D arrays where each row represents a variable. This becomes particularly useful when examining relationships across numerous dimensions simultaneously.
# Three variables
data = np.array([
    [1, 2, 3, 4, 5],    # Variable 1
    [2, 4, 6, 8, 10],   # Variable 2 (positively correlated with Var 1)
    [5, 4, 3, 2, 1]     # Variable 3 (negatively correlated with Var 1)
])
corr_matrix = np.corrcoef(data)
print(corr_matrix)
print(f"\nCorrelation between Var1 and Var2: {corr_matrix[0, 1]:.4f}")
print(f"Correlation between Var1 and Var3: {corr_matrix[0, 2]:.4f}")
print(f"Correlation between Var2 and Var3: {corr_matrix[1, 2]:.4f}")
Output:
[[ 1.  1. -1.]
 [ 1.  1. -1.]
 [-1. -1.  1.]]
Correlation between Var1 and Var2: 1.0000
Correlation between Var1 and Var3: -1.0000
Correlation between Var2 and Var3: -1.0000
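Reading pairs out by hand does not scale to larger matrices. Because the matrix is symmetric, np.triu_indices extracts every unique off-diagonal pair in one step; a small sketch using the same three-variable data:

```python
import numpy as np

data = np.array([
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 10],
    [5, 4, 3, 2, 1],
])
corr_matrix = np.corrcoef(data)

# Indices of the strict upper triangle: each variable pair appears exactly once
rows, cols = np.triu_indices(corr_matrix.shape[0], k=1)
for i, j, r in zip(rows, cols, corr_matrix[rows, cols]):
    print(f"Var{i + 1} vs Var{j + 1}: {r:.4f}")
```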
The rowvar Parameter
By default, np.corrcoef() treats rows as variables. When working with datasets where columns represent features (the standard layout in machine learning), set rowvar=False so that columns are treated as the variables instead.
# Dataset: rows are observations, columns are features
observations = np.array([
    [1, 2, 5],
    [2, 4, 4],
    [3, 6, 3],
    [4, 8, 2],
    [5, 10, 1]
])
# Incorrect: treats observations as variables
wrong_corr = np.corrcoef(observations)
print("Wrong shape:", wrong_corr.shape) # (5, 5) - correlating observations
# Correct: treats features as variables
correct_corr = np.corrcoef(observations, rowvar=False)
print("Correct shape:", correct_corr.shape) # (3, 3) - correlating features
print("\nFeature correlation matrix:")
print(correct_corr)
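Setting rowvar=False is equivalent to transposing the array yourself so that rows become variables; both forms yield the same feature-by-feature matrix:

```python
import numpy as np

observations = np.array([
    [1, 2, 5],
    [2, 4, 4],
    [3, 6, 3],
    [4, 8, 2],
    [5, 10, 1],
])

a = np.corrcoef(observations, rowvar=False)
b = np.corrcoef(observations.T)  # transpose so rows become variables
print(np.allclose(a, b))  # True
```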
Detecting Multicollinearity in Features
Multicollinearity occurs when features are highly correlated with each other, which destabilizes coefficient estimates in regression models. Scanning for coefficient magnitudes above 0.8 or 0.9 helps flag problematic feature pairs.
# Simulate feature set with multicollinearity
np.random.seed(42)
n_samples = 100
feature1 = np.random.randn(n_samples)
feature2 = feature1 + np.random.randn(n_samples) * 0.1 # Highly correlated
feature3 = np.random.randn(n_samples) # Independent
feature4 = feature1 * 2 + 3 # Perfectly correlated
features = np.column_stack([feature1, feature2, feature3, feature4])
corr = np.corrcoef(features, rowvar=False)
# Find highly correlated pairs (excluding diagonal)
threshold = 0.8
high_corr_pairs = []
for i in range(len(corr)):
    for j in range(i + 1, len(corr)):
        if abs(corr[i, j]) > threshold:
            high_corr_pairs.append((i, j, corr[i, j]))
print("Highly correlated feature pairs (|r| > 0.8):")
for i, j, coef in high_corr_pairs:
    print(f"Feature {i} and Feature {j}: {coef:.4f}")
Output:
Highly correlated feature pairs (|r| > 0.8):
Feature 0 and Feature 1: 0.9953
Feature 0 and Feature 3: 1.0000
Feature 1 and Feature 3: 0.9953
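Once problem pairs are identified, a common remedy is to drop one feature from each pair. A minimal greedy sketch (keeps the lower-indexed feature of each pair; the threshold and data match the example above):

```python
import numpy as np

np.random.seed(42)
n_samples = 100
feature1 = np.random.randn(n_samples)
feature2 = feature1 + np.random.randn(n_samples) * 0.1  # Highly correlated
feature3 = np.random.randn(n_samples)                   # Independent
feature4 = feature1 * 2 + 3                             # Perfectly correlated
features = np.column_stack([feature1, feature2, feature3, feature4])

corr = np.corrcoef(features, rowvar=False)
threshold = 0.8

# Greedily mark the higher-indexed feature of each correlated pair for removal
to_drop = set()
n = corr.shape[0]
for i in range(n):
    for j in range(i + 1, n):
        if i not in to_drop and abs(corr[i, j]) > threshold:
            to_drop.add(j)

kept = [k for k in range(n) if k not in to_drop]
reduced = features[:, kept]
print("Kept features:", kept)           # [0, 2] for this data
print("Reduced shape:", reduced.shape)  # (100, 2)
```

Only features 0 and 2 survive, since features 1 and 3 duplicate the information in feature 0.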
Handling Missing Data
NumPy’s np.corrcoef() doesn’t handle NaN values by default. Use masking or preprocessing to manage missing data before computing correlations.
# Data with missing values
data_with_nan = np.array([
    [1.0, 2.0, np.nan],
    [2.0, 4.0, 4.0],
    [3.0, 6.0, 3.0],
    [4.0, np.nan, 2.0],
    [5.0, 10.0, 1.0]
])
# Approach 1: Remove rows with any NaN
clean_data = data_with_nan[~np.isnan(data_with_nan).any(axis=1)]
corr_clean = np.corrcoef(clean_data, rowvar=False)
print("Correlation after removing NaN rows:")
print(corr_clean)
# Approach 2: Column-wise removal for pairwise correlations
def pairwise_correlation(data):
    n_features = data.shape[1]
    corr_matrix = np.zeros((n_features, n_features))
    for i in range(n_features):
        for j in range(n_features):
            # Get valid pairs for this combination
            valid_mask = ~(np.isnan(data[:, i]) | np.isnan(data[:, j]))
            if np.sum(valid_mask) > 1:
                corr_matrix[i, j] = np.corrcoef(
                    data[valid_mask, i],
                    data[valid_mask, j]
                )[0, 1]
            else:
                corr_matrix[i, j] = np.nan
    return corr_matrix
pairwise_corr = pairwise_correlation(data_with_nan)
print("\nPairwise correlation (uses all available data):")
print(pairwise_corr)
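NumPy's masked-array module offers a built-in alternative: np.ma.corrcoef skips masked entries when forming the covariance, giving a pairwise-style result without a hand-written loop (its handling of partially masked columns may differ slightly from the function above):

```python
import numpy as np

data_with_nan = np.array([
    [1.0, 2.0, np.nan],
    [2.0, 4.0, 4.0],
    [3.0, 6.0, 3.0],
    [4.0, np.nan, 2.0],
    [5.0, 10.0, 1.0],
])

# Mask the NaN entries, then correlate the masked columns
masked = np.ma.masked_invalid(data_with_nan)
corr_masked = np.ma.corrcoef(masked, rowvar=False)
print(corr_masked)
```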
Correlation vs Covariance
While related, correlation and covariance serve different purposes. Covariance measures how variables change together but depends on variable scales. Correlation normalizes this to a -1 to 1 range, making it scale-independent.
# Compare correlation and covariance
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
# Scaled version of x
x_scaled = x * 100
correlation_xy = np.corrcoef(x, y)[0, 1]
correlation_scaled = np.corrcoef(x_scaled, y)[0, 1]
covariance_xy = np.cov(x, y)[0, 1]
covariance_scaled = np.cov(x_scaled, y)[0, 1]
print(f"Correlation (original): {correlation_xy:.4f}")
print(f"Correlation (scaled): {correlation_scaled:.4f}")
print(f"Covariance (original): {covariance_xy:.4f}")
print(f"Covariance (scaled): {covariance_scaled:.4f}")
Output:
Correlation (original): 1.0000
Correlation (scaled): 1.0000
Covariance (original): 5.0000
Covariance (scaled): 500.0000
Correlation remains unchanged under scaling, making it preferable for comparing relationships between variables with different units or magnitudes.
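The two quantities are linked by r = cov(x, y) / (σ_x σ_y), which you can verify directly from the covariance matrix:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

cov = np.cov(x, y)  # 2x2 covariance matrix
# Normalize the cross-covariance by the product of both standard deviations
r_manual = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)  # both ~1.0 here
```

Scaling x multiplies cov[0, 1] and sqrt(cov[0, 0]) by the same factor, which is exactly why the ratio, and hence the correlation, is unchanged.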
Visualizing Correlation Matrices
Understanding large correlation matrices requires visualization. Heatmaps provide intuitive representations of correlation structures.
import matplotlib.pyplot as plt
# Generate synthetic dataset
np.random.seed(123)
n_samples = 200
n_features = 6
# Create features with varying correlations
data = np.random.randn(n_samples, n_features)
data[:, 1] = data[:, 0] + np.random.randn(n_samples) * 0.3 # Correlated with feature 0
data[:, 4] = -data[:, 0] + np.random.randn(n_samples) * 0.5 # Negatively correlated
corr = np.corrcoef(data, rowvar=False)
# Create heatmap
fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
# Add colorbar
cbar = plt.colorbar(im, ax=ax)
cbar.set_label('Correlation Coefficient', rotation=270, labelpad=20)
# Set ticks and labels
ax.set_xticks(np.arange(n_features))
ax.set_yticks(np.arange(n_features))
ax.set_xticklabels([f'F{i}' for i in range(n_features)])
ax.set_yticklabels([f'F{i}' for i in range(n_features)])
# Annotate cells with correlation values
for i in range(n_features):
    for j in range(n_features):
        ax.text(j, i, f'{corr[i, j]:.2f}',
                ha="center", va="center", color="black", fontsize=10)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.savefig('correlation_heatmap.png', dpi=150)
Performance Considerations
For large datasets, computing full correlation matrices becomes computationally expensive. Consider computing only necessary correlations or using optimized libraries.
import time
# Compare performance with dataset size
sizes = [100, 1000, 5000]
for n in sizes:
    data = np.random.randn(n, 50)
    start = time.perf_counter()  # perf_counter is preferred for timing
    corr = np.corrcoef(data, rowvar=False)
    elapsed = time.perf_counter() - start
    print(f"n={n}: {elapsed:.4f} seconds")
For extremely large datasets, consider computing correlations for feature subsets or using parallel processing libraries that leverage NumPy’s underlying BLAS operations.
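One such approach: standardize each column, then a single matrix product yields the entire correlation matrix. The matmul dispatches to BLAS, so this sketch is typically as fast as np.corrcoef and makes the underlying computation explicit (equivalent for data where every column has nonzero variance):

```python
import numpy as np

np.random.seed(0)
data = np.random.randn(1000, 50)

# Standardize columns: zero mean, unit sample standard deviation (ddof=1)
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# Correlation matrix via one BLAS-backed matrix multiply
corr_fast = (z.T @ z) / (data.shape[0] - 1)

print(np.allclose(corr_fast, np.corrcoef(data, rowvar=False)))  # True
```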