How to Calculate Covariance
Key Insights
- Covariance measures how two variables change together, with positive values indicating they move in the same direction and negative values showing inverse relationships
- The formula differs slightly between population and sample covariance—use sample covariance (n-1 denominator) when working with data subsets
- Covariance is scale-dependent and unbounded, making correlation often more useful for comparing relationship strength across different datasets
Introduction to Covariance
Covariance quantifies the directional relationship between two variables. When one variable increases, does the other tend to increase (positive covariance), decrease (negative covariance), or show no consistent pattern (near-zero covariance)?
You’ll encounter covariance in portfolio optimization, where it measures how asset returns move together, and in machine learning feature engineering, where understanding variable relationships helps with feature selection and dimensionality reduction. It’s also fundamental to principal component analysis (PCA) and many statistical models.
The critical distinction from correlation: covariance is unbounded and scale-dependent. A covariance of 50 could represent a strong or weak relationship depending on your data’s units. Correlation normalizes this to a -1 to 1 range, making it easier to interpret. However, covariance is essential for certain calculations and provides the foundation for understanding correlation itself.
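The link between the two is direct: dividing the covariance by the product of the two standard deviations yields the correlation coefficient. A quick sketch with illustrative made-up data:

```python
import numpy as np

x = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 4.0, 3.0, 7.0, 9.0])

cov_xy = np.cov(x, y)[0, 1]  # sample covariance

# Normalizing by the standard deviations gives correlation
corr_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
corr_builtin = np.corrcoef(x, y)[0, 1]

print(f"Manual:  {corr_manual:.6f}")
print(f"Builtin: {corr_builtin:.6f}")
```

The n vs. n-1 choice cancels in this ratio, which is why correlation is the same whether you start from sample or population covariance.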
The Mathematical Formula
Population covariance assumes you have data for an entire population:
Cov(X,Y) = Σ[(Xi - μX)(Yi - μY)] / N
Sample covariance adjusts for working with a subset of data:
Cov(X,Y) = Σ[(Xi - X̄)(Yi - Ȳ)] / (n-1)
The components:
- Xi, Yi: Individual data points from variables X and Y
- μX, μY or X̄, Ȳ: Population means or sample means
- N or n: Total number of observations
- (n-1): Bessel’s correction, which provides an unbiased estimator
The formula multiplies each pair of deviations from their respective means, then averages these products. Positive products (both variables above or below their means) accumulate to positive covariance. Mixed products (one above, one below) produce negative covariance.
Here’s a from-scratch implementation:
def calculate_covariance(x, y, sample=True):
    """
    Calculate covariance between two variables.

    Args:
        x: List or array of values for variable X
        y: List or array of values for variable Y
        sample: If True, use sample covariance (n-1), else population (n)

    Returns:
        Covariance value
    """
    if len(x) != len(y):
        raise ValueError("Variables must have equal length")

    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sum the products of paired deviations from the means
    covariance_sum = sum((x[i] - mean_x) * (y[i] - mean_y) for i in range(n))

    # Divide by n-1 for sample covariance, n for population
    divisor = n - 1 if sample else n
    return covariance_sum / divisor
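A quick sanity check with two small made-up series, one pair moving together and one moving oppositely (the function is repeated here so the snippet runs standalone):

```python
def calculate_covariance(x, y, sample=True):
    # Same logic as the implementation above, condensed
    if len(x) != len(y):
        raise ValueError("Variables must have equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return s / (n - 1 if sample else n)

together = calculate_covariance([1, 2, 3, 4], [10, 20, 30, 40])
opposite = calculate_covariance([1, 2, 3, 4], [40, 30, 20, 10])
print(together)   # positive: both series rise together
print(opposite)   # negative: one rises while the other falls
```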
Manual Calculation Walkthrough
Let’s work through a concrete example with study hours and test scores:
study_hours = [2, 3, 4, 5, 6, 7, 8]
test_scores = [65, 70, 75, 82, 88, 90, 95]
print("Step 1: Calculate means")
mean_hours = sum(study_hours) / len(study_hours)
mean_scores = sum(test_scores) / len(test_scores)
print(f"Mean study hours: {mean_hours:.2f}")
print(f"Mean test scores: {mean_scores:.2f}")
print("\nStep 2: Calculate deviations from mean")
deviations_hours = [x - mean_hours for x in study_hours]
deviations_scores = [y - mean_scores for y in test_scores]
print("Hours deviations:", [f"{d:.2f}" for d in deviations_hours])
print("Scores deviations:", [f"{d:.2f}" for d in deviations_scores])
print("\nStep 3: Multiply corresponding deviations")
products = [deviations_hours[i] * deviations_scores[i]
            for i in range(len(study_hours))]
print("Products:", [f"{p:.2f}" for p in products])
print("\nStep 4: Sum products and divide by n-1")
sum_products = sum(products)
n = len(study_hours)
covariance = sum_products / (n - 1)
print(f"Sum of products: {sum_products:.2f}")
print(f"Sample covariance: {covariance:.2f}")
Output:
Step 1: Calculate means
Mean study hours: 5.00
Mean test scores: 80.71
Step 2: Calculate deviations from mean
Hours deviations: ['-3.00', '-2.00', '-1.00', '0.00', '1.00', '2.00', '3.00']
Scores deviations: ['-15.71', '-10.71', '-5.71', '1.29', '7.29', '9.29', '14.29']
Step 3: Multiply corresponding deviations
Products: ['47.14', '21.43', '5.71', '0.00', '7.29', '18.57', '42.86']
Step 4: Sum products and divide by n-1
Sum of products: 143.00
Sample covariance: 23.83
The positive covariance of 23.83 confirms that study hours and test scores move together—more study hours associate with higher scores.
Using Built-in Libraries
Don’t implement covariance manually in production. Use NumPy or Pandas:
import numpy as np
import pandas as pd
study_hours = [2, 3, 4, 5, 6, 7, 8]
test_scores = [65, 70, 75, 82, 88, 90, 95]
# NumPy returns a covariance matrix
cov_matrix = np.cov(study_hours, test_scores)
print("NumPy covariance matrix:")
print(cov_matrix)
print(f"\nCovariance between variables: {cov_matrix[0, 1]:.2f}")
# Pandas method on DataFrame
df = pd.DataFrame({
    'study_hours': study_hours,
    'test_scores': test_scores
})
print("\nPandas covariance matrix:")
print(df.cov())
print(f"\nCovariance: {df['study_hours'].cov(df['test_scores']):.2f}")
# Verify against manual calculation
manual_cov = calculate_covariance(study_hours, test_scores)
print(f"\nManual calculation: {manual_cov:.2f}")
The covariance matrix is symmetric—position [0,1] equals [1,0], both representing the covariance between variables. The diagonal contains variances (covariance of a variable with itself).
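Both properties are easy to verify on the same data (repeated here so the snippet is self-contained):

```python
import numpy as np

study_hours = [2, 3, 4, 5, 6, 7, 8]
test_scores = [65, 70, 75, 82, 88, 90, 95]

m = np.cov(study_hours, test_scores)

# Symmetry: the off-diagonal entries are one and the same covariance
assert m[0, 1] == m[1, 0]

# Diagonal: the variance of each variable (ddof=1 matches the sample formula)
assert np.isclose(m[0, 0], np.var(study_hours, ddof=1))
assert np.isclose(m[1, 1], np.var(test_scores, ddof=1))

print("symmetry and diagonal checks pass")
```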
Practical Applications
Here’s a portfolio risk analysis example using real stock data:
import pandas as pd
import numpy as np
# Simulating daily returns for two stocks
np.random.seed(42)
days = 252 # Trading days in a year
# Stock A: tech stock with higher volatility
stock_a_returns = np.random.normal(0.001, 0.02, days)
# Stock B: correlated with A but lower volatility
stock_b_returns = stock_a_returns * 0.6 + np.random.normal(0.0008, 0.01, days)
df = pd.DataFrame({
    'Stock_A': stock_a_returns,
    'Stock_B': stock_b_returns
})
# Calculate covariance
cov_matrix = df.cov()
print("Covariance Matrix:")
print(cov_matrix)
covariance_ab = cov_matrix.loc['Stock_A', 'Stock_B']
print(f"\nCovariance between Stock A and B: {covariance_ab:.6f}")
# Portfolio variance calculation uses covariance
weights = np.array([0.5, 0.5]) # Equal weight portfolio
portfolio_variance = np.dot(weights, np.dot(cov_matrix, weights))
portfolio_std = np.sqrt(portfolio_variance)
print(f"Portfolio standard deviation: {portfolio_std:.4f}")
print(f"Annualized portfolio volatility: {portfolio_std * np.sqrt(252):.2%}")
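This is where covariance earns its keep in finance: because the two stocks are not perfectly correlated, the portfolio's risk comes out below the weighted average of the individual risks. A quick check on the same simulated returns:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
days = 252
stock_a = np.random.normal(0.001, 0.02, days)
stock_b = stock_a * 0.6 + np.random.normal(0.0008, 0.01, days)
df = pd.DataFrame({'Stock_A': stock_a, 'Stock_B': stock_b})

cov = df.cov().to_numpy()
w = np.array([0.5, 0.5])

portfolio_std = np.sqrt(w @ cov @ w)
weighted_avg_std = w @ np.sqrt(np.diag(cov))  # ignores diversification

print(f"Portfolio std:    {portfolio_std:.4f}")
print(f"Weighted avg std: {weighted_avg_std:.4f}")
# The gap between the two is the diversification benefit,
# and it grows as the covariance term shrinks.
```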
In machine learning, you might check feature covariance before training:
# Example: Feature analysis
from sklearn.datasets import load_diabetes
data = load_diabetes()
df = pd.DataFrame(data.data, columns=data.feature_names)
# Calculate covariance matrix for all features
cov_matrix = df.cov()
# Find strongly covarying feature pairs. Note: load_diabetes features are
# pre-scaled (each column's sum of squares is 1), so covariances are tiny
# and a threshold like 0.5 would match nothing; pick a scale-appropriate value.
threshold = 0.001
high_cov_pairs = []
for i in range(len(cov_matrix.columns)):
    for j in range(i + 1, len(cov_matrix.columns)):
        cov_val = cov_matrix.iloc[i, j]
        if abs(cov_val) > threshold:
            high_cov_pairs.append((
                cov_matrix.columns[i],
                cov_matrix.columns[j],
                cov_val
            ))
print("Feature pairs with high covariance:")
for feat1, feat2, cov in high_cov_pairs:
    print(f"{feat1} & {feat2}: {cov:.4f}")
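Because raw covariance values depend on each feature's scale, threshold scans like this are easier to reason about on the correlation matrix, which is scale-free and bounded by ±1. The same scan, restated on correlations (a sketch using the same dataset):

```python
import pandas as pd
from sklearn.datasets import load_diabetes

data = load_diabetes()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Correlation is bounded in [-1, 1], so 0.5 is meaningful for any feature scale
corr = df.corr()
pairs = [
    (corr.columns[i], corr.columns[j], corr.iloc[i, j])
    for i in range(len(corr.columns))
    for j in range(i + 1, len(corr.columns))
    if abs(corr.iloc[i, j]) > 0.5
]
for f1, f2, r in pairs:
    print(f"{f1} & {f2}: {r:.3f}")
```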
Common Pitfalls and Limitations
The biggest issue with covariance is scale dependency. Consider this example:
# Same relationship, different scales
height_cm = [150, 160, 170, 180, 190]
weight_kg = [50, 60, 70, 80, 90]
height_m = [h/100 for h in height_cm] # Convert to meters
cov_cm = np.cov(height_cm, weight_kg)[0, 1]
cov_m = np.cov(height_m, weight_kg)[0, 1]
print(f"Covariance (height in cm): {cov_cm:.2f}")
print(f"Covariance (height in m): {cov_m:.2f}")
print(f"Ratio: {cov_cm / cov_m:.0f}x difference")
# Correlation remains constant
corr_cm = np.corrcoef(height_cm, weight_kg)[0, 1]
corr_m = np.corrcoef(height_m, weight_kg)[0, 1]
print(f"\nCorrelation (height in cm): {corr_cm:.4f}")
print(f"Correlation (height in m): {corr_m:.4f}")
The covariance changes 100-fold with unit conversion, but correlation stays identical. This makes comparing covariances across different datasets nearly meaningless.
Additional limitations:
Sample size matters: with small samples, covariance estimates are unreliable. The standard error shrinks roughly in proportion to 1/√n, so you need substantial data for precision.
Outliers have outsized impact: A single extreme value can dominate the covariance calculation since deviations are multiplied.
Non-linear relationships: Covariance only captures linear relationships. Variables with strong non-linear associations might show near-zero covariance.
Zero doesn’t mean independence: Zero covariance indicates no linear relationship, but variables could still be dependent in non-linear ways.
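A classic illustration of that last point: take y = x² over a range symmetric about zero. y is completely determined by x, yet the covariance is zero because the positive and negative linear contributions cancel:

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2  # perfectly dependent on x, but not linearly

cov_xy = np.cov(x, y)[0, 1]
print(f"Covariance: {cov_xy:.10f}")  # 0.0 despite full dependence
```

A scatter plot would reveal the parabola instantly, which is why visualization belongs in the workflow.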
Use correlation coefficients for standardized comparison, and always visualize your data with scatter plots before relying solely on covariance values. In most practical scenarios, you’ll calculate covariance as an intermediate step toward other metrics rather than as your final analytical output.