How to Calculate Covariance with NumPy
Key Insights
- NumPy's np.cov() function returns a covariance matrix where diagonal elements represent variances and off-diagonal elements represent covariances between variable pairs.
- The rowvar parameter is critical: set rowvar=False when your data has variables as columns (the typical pandas-style format), otherwise NumPy assumes each row is a variable.
- Use ddof=1 (the default) for sample covariance and ddof=0 for population covariance; getting this wrong will bias your statistical analysis.
Introduction to Covariance
Covariance measures how two variables change together. When one variable increases, does the other tend to increase as well? Decrease? Or show no consistent pattern? Covariance quantifies this relationship with a single number.
A positive covariance means the variables move in the same direction. A negative covariance means they move in opposite directions. A covariance near zero suggests little linear relationship between the variables.
You’ll encounter covariance in portfolio risk analysis, feature selection for machine learning, principal component analysis, and anywhere you need to understand relationships between numerical variables. NumPy makes these calculations straightforward with np.cov(), but the function has quirks that trip up even experienced developers.
Understanding the Covariance Matrix
When you calculate covariance between multiple variables, the result is a covariance matrix—a square, symmetric matrix where each element represents the covariance between two variables.
For a dataset with variables X, Y, and Z, the covariance matrix looks like this:
        X          Y          Z
X   Var(X)     Cov(X,Y)   Cov(X,Z)
Y   Cov(Y,X)   Var(Y)     Cov(Y,Z)
Z   Cov(Z,X)   Cov(Z,Y)   Var(Z)
The diagonal contains variances (covariance of a variable with itself). The off-diagonal elements are the covariances between different variable pairs. The matrix is symmetric because Cov(X,Y) equals Cov(Y,X).
Interpreting the values requires context. A covariance of 500 between stock returns might be small, while a covariance of 500 between height and weight would be enormous. This scale-dependence is why correlation (which normalizes covariance to a -1 to 1 range) is often more useful for interpretation.
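You can see the scale-dependence directly: re-expressing a variable in different units rescales the covariance but leaves the correlation untouched. A quick sketch with simulated height/weight data (illustrative values, not from any section below):

```python
import numpy as np

# Simulated data, for illustration only: the relationship is identical,
# but the units differ, so covariance differs while correlation does not.
rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, 1000)
weight_kg = 0.8 * height_cm + rng.normal(-60, 8, 1000)
height_m = height_cm / 100  # same data, expressed in meters

cov_cm = np.cov(height_cm, weight_kg)[0, 1]
cov_m = np.cov(height_m, weight_kg)[0, 1]
corr_cm = np.corrcoef(height_cm, weight_kg)[0, 1]
corr_m = np.corrcoef(height_m, weight_kg)[0, 1]

print(np.isclose(cov_cm, 100 * cov_m))  # True: covariance scales with the units
print(np.isclose(corr_cm, corr_m))      # True: correlation is unit-free
```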
Using numpy.cov() — Basic Syntax
The np.cov() function calculates the covariance matrix from your data. Here’s the basic signature:
numpy.cov(m, y=None, rowvar=True, bias=False, ddof=None, fweights=None, aweights=None)
The parameters that matter most:
- m: Your input array
- rowvar: If True (default), each row represents a variable. If False, each column is a variable.
- ddof: Delta degrees of freedom. The default (ddof=None) is equivalent to ddof=1, giving sample covariance.
- bias: An older way to control the normalization (bias=True is equivalent to ddof=0). Prefer setting ddof directly.
Let’s start with a simple example:
import numpy as np
# Two variables: hours studied and exam scores
hours_studied = np.array([2, 4, 6, 8, 10])
exam_scores = np.array([65, 70, 75, 85, 90])
# Calculate covariance matrix
cov_matrix = np.cov(hours_studied, exam_scores)
print(cov_matrix)
Output:
[[ 10.   32.5]
 [ 32.5 107.5]]
The covariance between hours studied and exam scores is 32.5 (positive), indicating that more study hours correlate with higher scores. The diagonal values (10 and 107.5) are the variances of each variable.
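You can verify the off-diagonal value against the definition of sample covariance, the sum of the products of deviations divided by n - 1:

```python
import numpy as np

# Verifying np.cov against the sample-covariance definition.
hours_studied = np.array([2, 4, 6, 8, 10])
exam_scores = np.array([65, 70, 75, 85, 90])

n = len(hours_studied)
manual_cov = np.sum(
    (hours_studied - hours_studied.mean()) * (exam_scores - exam_scores.mean())
) / (n - 1)

print(manual_cov)                                # 32.5
print(np.cov(hours_studied, exam_scores)[0, 1])  # 32.5
```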
Practical Examples with Real Data
Stock Returns Analysis
Financial analysts use covariance to understand how assets move together. Here’s a realistic example with stock return data:
import numpy as np
# Simulated daily returns for three stocks (as percentages)
np.random.seed(42)
# Tech stock - high volatility
tech_returns = np.random.normal(0.1, 2.5, 252)
# Utility stock - low volatility, slight negative correlation with tech
utility_returns = np.random.normal(0.05, 1.0, 252) - 0.1 * tech_returns
# Another tech stock - correlated with first tech stock
tech2_returns = 0.7 * tech_returns + np.random.normal(0, 1.5, 252)
# Stack as columns (each column is a variable)
returns = np.column_stack([tech_returns, utility_returns, tech2_returns])
# Calculate covariance matrix
# rowvar=False because variables are columns
cov_matrix = np.cov(returns, rowvar=False)
print("Covariance Matrix:")
print(np.round(cov_matrix, 4))
# Extract specific covariances
print(f"\nVariance of Tech Stock: {cov_matrix[0, 0]:.4f}")
print(f"Covariance Tech vs Utility: {cov_matrix[0, 1]:.4f}")
print(f"Covariance Tech vs Tech2: {cov_matrix[0, 2]:.4f}")
Output:
Covariance Matrix:
[[ 5.8523 -0.6989 4.1663]
[-0.6989 1.0513 -0.4645]
[ 4.1663 -0.4645 5.1553]]
Variance of Tech Stock: 5.8523
Covariance Tech vs Utility: -0.6989
Covariance Tech vs Tech2: 4.1663
The negative covariance between tech and utility stocks suggests diversification benefits. The high positive covariance between the two tech stocks indicates they move together—less diversification value.
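This is where the raw scale of covariance matters: the covariance matrix plugs directly into the portfolio-variance formula w.T @ cov @ w. A sketch with a hypothetical 60/40 allocation (the weights and the two simulated return series are illustrative):

```python
import numpy as np

# Hypothetical two-asset portfolio: variance from the covariance matrix.
rng = np.random.default_rng(42)
tech = rng.normal(0.1, 2.5, 252)
utility = rng.normal(0.05, 1.0, 252) - 0.1 * tech
returns = np.column_stack([tech, utility])

cov = np.cov(returns, rowvar=False)
w = np.array([0.6, 0.4])  # illustrative 60/40 allocation

portfolio_var = w @ cov @ w
# Cross-check: the variance of the weighted return series is the same quantity
direct_var = np.var(returns @ w, ddof=1)

print(np.isclose(portfolio_var, direct_var))  # True
```

Because the utility stock covaries negatively with tech, the portfolio variance comes out below the weighted sum of the individual variances, which is the diversification benefit in numbers.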
Multi-Feature Dataset Analysis
When working with datasets containing multiple features, you often need the full covariance matrix:
import numpy as np
# Simulated dataset: height (cm), weight (kg), age (years), income ($1000s)
np.random.seed(123)
n_samples = 500
height = np.random.normal(170, 10, n_samples)
weight = 0.8 * height + np.random.normal(-60, 8, n_samples) # Correlated with height
age = np.random.uniform(25, 65, n_samples)
income = 0.5 * age + np.random.normal(30, 15, n_samples) # Correlated with age
# Combine into a single array (samples x features)
data = np.column_stack([height, weight, age, income])
# Calculate covariance matrix
cov_matrix = np.cov(data, rowvar=False)
# Create a readable display
feature_names = ['Height', 'Weight', 'Age', 'Income']
print("Covariance Matrix:")
print(f"{'':>10}", end='')
for name in feature_names:
    print(f"{name:>10}", end='')
print()

for i, name in enumerate(feature_names):
    print(f"{name:>10}", end='')
    for j in range(len(feature_names)):
        print(f"{cov_matrix[i, j]:>10.2f}", end='')
    print()
Output:
Covariance Matrix:
              Height    Weight       Age    Income
    Height     97.04     77.89      2.38     -4.29
    Weight     77.89    125.77      4.95     -1.43
       Age      2.38      4.95    134.45     68.53
    Income     -4.29     -1.43     68.53    241.78
Height and weight show strong positive covariance (77.89). Age and income also covary positively (68.53). Height and income show slight negative covariance, which is likely noise rather than a real relationship.
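When you only need a pair of features out of a larger matrix, you can slice a sub-block with np.ix_ instead of recomputing; the block equals the covariance matrix of just those columns. A sketch with stand-in random data (indices 0 and 3 mirror Height and Income above):

```python
import numpy as np

# Extracting a 2x2 block (features 0 and 3) from a 4x4 covariance matrix.
# Stand-in random data, for illustration only.
rng = np.random.default_rng(123)
data = rng.normal(size=(500, 4))

cov_matrix = np.cov(data, rowvar=False)
sub = cov_matrix[np.ix_([0, 3], [0, 3])]

# Same result as computing covariance on just those two columns
pair_cov = np.cov(data[:, [0, 3]], rowvar=False)
print(np.allclose(sub, pair_cov))  # True
```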
Common Parameters and Options
The rowvar Parameter
This parameter causes the most confusion. NumPy’s default (rowvar=True) assumes each row is a separate variable—the opposite of how most people organize data.
import numpy as np
# Data with 3 samples and 2 features (typical format)
data = np.array([
    [1, 10],
    [2, 20],
    [3, 30]
])
# Wrong: treats each row as a variable (gives 3x3 matrix)
wrong_cov = np.cov(data)
print("With rowvar=True (wrong for this data):")
print(wrong_cov.shape) # (3, 3)
# Correct: treats each column as a variable (gives 2x2 matrix)
correct_cov = np.cov(data, rowvar=False)
print("\nWith rowvar=False (correct):")
print(correct_cov.shape) # (2, 2)
print(correct_cov)
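An equivalent fix, if you prefer the default, is to transpose the array so each row becomes a variable; both forms give the identical matrix:

```python
import numpy as np

# Two equivalent ways to treat columns as variables.
data = np.array([
    [1, 10],
    [2, 20],
    [3, 30]
])

cov_a = np.cov(data, rowvar=False)
cov_b = np.cov(data.T)  # transpose, then rely on the rowvar=True default

print(np.allclose(cov_a, cov_b))  # True
```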
Sample vs. Population Covariance
The ddof parameter controls the degrees of freedom adjustment. This matters for statistical validity:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
# Sample covariance (ddof=1, the default)
# Divides by (n-1) = 4
sample_cov = np.cov(x, y, ddof=1)
print("Sample covariance (ddof=1):")
print(sample_cov)
# Population covariance (ddof=0)
# Divides by n = 5
population_cov = np.cov(x, y, ddof=0)
print("\nPopulation covariance (ddof=0):")
print(population_cov)
# The difference
print(f"\nSample Cov(x,y): {sample_cov[0,1]:.4f}")
print(f"Population Cov(x,y): {population_cov[0,1]:.4f}")
print(f"Ratio: {sample_cov[0,1] / population_cov[0,1]:.4f}") # Should be n/(n-1) = 1.25
Use ddof=1 (sample covariance) when your data is a sample from a larger population—which is almost always the case. Use ddof=0 only when you have the entire population.
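To make the divisor explicit, here is a minimal hand-rolled version (the helper cov_xy is just for illustration, not a NumPy function):

```python
import numpy as np

# Minimal sketch: the only thing ddof changes is the divisor, n - ddof.
def cov_xy(x, y, ddof=1):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - ddof)

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

print(cov_xy(x, y, ddof=1), np.cov(x, y, ddof=1)[0, 1])  # 1.5 1.5
print(cov_xy(x, y, ddof=0), np.cov(x, y, ddof=0)[0, 1])  # 1.2 1.2
```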
Covariance vs. Correlation: When to Use Each
Covariance tells you the direction of the relationship but not the strength in standardized terms. Correlation normalizes covariance by the standard deviations, giving a value between -1 and 1.
import numpy as np
# Generate correlated data
np.random.seed(42)
x = np.random.normal(100, 15, 100) # Mean 100, std 15
y = 0.8 * x + np.random.normal(0, 10, 100) # Correlated with x
# Covariance matrix
cov_matrix = np.cov(x, y)
print("Covariance Matrix:")
print(np.round(cov_matrix, 2))
# Correlation matrix
corr_matrix = np.corrcoef(x, y)
print("\nCorrelation Matrix:")
print(np.round(corr_matrix, 4))
# Manual conversion: correlation = covariance / (std_x * std_y)
std_x = np.sqrt(cov_matrix[0, 0])
std_y = np.sqrt(cov_matrix[1, 1])
manual_correlation = cov_matrix[0, 1] / (std_x * std_y)
print(f"\nManual correlation calculation: {manual_correlation:.4f}")
print(f"np.corrcoef result: {corr_matrix[0, 1]:.4f}")
Output:
Covariance Matrix:
[[236.79 185.14]
[185.14 250.34]]
Correlation Matrix:
[[1. 0.7601]
[0.7601 1. ]]
Manual correlation calculation: 0.7601
np.corrcoef result: 0.7601
Use covariance when you need the actual scale of the relationship (like in portfolio variance calculations). Use correlation when you want to compare relationship strengths across different variable pairs or communicate findings to others.
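The same normalization extends to a whole matrix: divide each element by the outer product of the standard deviations taken from the diagonal. A sketch with stand-in random data:

```python
import numpy as np

# Convert a full covariance matrix to a correlation matrix:
# corr[i, j] = cov[i, j] / (std_i * std_j)
rng = np.random.default_rng(7)
data = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))  # correlated columns

cov = np.cov(data, rowvar=False)
std = np.sqrt(np.diag(cov))
corr = cov / np.outer(std, std)

print(np.allclose(corr, np.corrcoef(data, rowvar=False)))  # True
```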
Conclusion
NumPy’s np.cov() function handles covariance calculations efficiently, but you need to understand its parameters to use it correctly. Remember these key points:
- Set rowvar=False when your variables are columns (the common case)
- Use ddof=1 for sample covariance, ddof=0 for population covariance
- The diagonal of the covariance matrix contains variances
- Convert to correlation with np.corrcoef() when you need standardized comparisons
For related statistical operations, explore np.var() for variance, np.std() for standard deviation, and np.corrcoef() for correlation. When working with missing data, consider pandas’ covariance methods which handle NaN values more gracefully than NumPy’s default behavior.