How to Calculate Covariance with NumPy

Key Insights

  • NumPy’s np.cov() function returns a covariance matrix where diagonal elements represent variances and off-diagonal elements represent covariances between variable pairs.
  • The rowvar parameter is critical: set rowvar=False when your data has variables as columns (the typical pandas-style format), otherwise NumPy assumes each row is a variable.
  • Use ddof=1 (the default) for sample covariance and ddof=0 for population covariance—getting this wrong will bias your statistical analysis.

Introduction to Covariance

Covariance measures how two variables change together. When one variable increases, does the other tend to increase as well? Decrease? Or show no consistent pattern? Covariance quantifies this relationship with a single number.

A positive covariance means the variables move in the same direction. A negative covariance means they move in opposite directions. A covariance near zero suggests little linear relationship between the variables.
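
A quick illustration with made-up numbers (a perfectly linear toy example, chosen so the signs are unambiguous):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
same_direction = 2 * x       # increases when x increases
opposite_direction = -2 * x  # decreases when x increases

# Off-diagonal element [0, 1] is the covariance between the two inputs
print(np.cov(x, same_direction)[0, 1])      # 5.0  (positive)
print(np.cov(x, opposite_direction)[0, 1])  # -5.0 (negative)
```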

You’ll encounter covariance in portfolio risk analysis, feature selection for machine learning, principal component analysis, and anywhere you need to understand relationships between numerical variables. NumPy makes these calculations straightforward with np.cov(), but the function has quirks that trip up even experienced developers.

Understanding the Covariance Matrix

When you calculate covariance between multiple variables, the result is a covariance matrix—a square, symmetric matrix where each element represents the covariance between two variables.

For a dataset with variables X, Y, and Z, the covariance matrix looks like this:

        X       Y       Z
X   Var(X)  Cov(X,Y) Cov(X,Z)
Y   Cov(Y,X) Var(Y)  Cov(Y,Z)
Z   Cov(Z,X) Cov(Z,Y) Var(Z)

The diagonal contains variances (covariance of a variable with itself). The off-diagonal elements are the covariances between different variable pairs. The matrix is symmetric because Cov(X,Y) equals Cov(Y,X).

Interpreting the values requires context. A covariance of 500 between stock returns might be small, while a covariance of 500 between height and weight would be enormous. This scale-dependence is why correlation (which normalizes covariance to a -1 to 1 range) is often more useful for interpretation.
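
To see the scale-dependence concretely, here is a sketch with synthetic height/weight data (the coefficients are made up for illustration): rescaling one variable rescales the covariance by the same factor, while the correlation is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, 200)
weight_kg = 0.5 * height_cm + rng.normal(0, 5, 200)

height_m = height_cm / 100  # same measurements, different units

cov_cm = np.cov(height_cm, weight_kg)[0, 1]
cov_m = np.cov(height_m, weight_kg)[0, 1]        # 100x smaller
corr_cm = np.corrcoef(height_cm, weight_kg)[0, 1]
corr_m = np.corrcoef(height_m, weight_kg)[0, 1]  # identical

print(cov_cm / cov_m)   # ≈ 100 — covariance tracks the units
print(corr_cm, corr_m)  # same value — correlation ignores the units
```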

Using numpy.cov() — Basic Syntax

The np.cov() function calculates the covariance matrix from your data. Here’s the basic signature:

numpy.cov(m, y=None, rowvar=True, bias=False, ddof=None, fweights=None, aweights=None)

The parameters that matter most:

  • m: Your input array
  • rowvar: If True (default), each row represents a variable. If False, each column is a variable.
  • ddof: Delta degrees of freedom. The default (ddof=None, with bias=False) divides by N − 1, giving sample covariance; ddof=0 divides by N.
  • bias: An older switch for the same normalization (bias=True divides by N). When ddof is specified, it overrides bias, so prefer setting ddof directly.

Let’s start with a simple example:

import numpy as np

# Two variables: hours studied and exam scores
hours_studied = np.array([2, 4, 6, 8, 10])
exam_scores = np.array([65, 70, 75, 85, 90])

# Calculate covariance matrix
cov_matrix = np.cov(hours_studied, exam_scores)
print(cov_matrix)

Output:

[[ 10.   32.5]
 [ 32.5 107.5]]

The covariance between hours studied and exam scores is 32.5 (positive), indicating that more study hours correlate with higher scores. The diagonal values (10 and 107.5) are the variances of each variable.
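
You can verify this result by computing the sample covariance by hand, using the formula Cov(x, y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1):

```python
import numpy as np

hours = np.array([2, 4, 6, 8, 10])
scores = np.array([65, 70, 75, 85, 90])

# Sample covariance: sum of deviation products divided by (n - 1)
n = len(hours)
manual_cov = np.sum((hours - hours.mean()) * (scores - scores.mean())) / (n - 1)

print(manual_cov)                   # 32.5
print(np.cov(hours, scores)[0, 1])  # 32.5 — matches np.cov
```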

Practical Examples with Real Data

Stock Returns Analysis

Financial analysts use covariance to understand how assets move together. Here’s a realistic example with stock return data:

import numpy as np

# Simulated daily returns for three stocks (as percentages)
np.random.seed(42)

# Tech stock - high volatility
tech_returns = np.random.normal(0.1, 2.5, 252)

# Utility stock - low volatility, slight negative correlation with tech
utility_returns = np.random.normal(0.05, 1.0, 252) - 0.1 * tech_returns

# Another tech stock - correlated with first tech stock
tech2_returns = 0.7 * tech_returns + np.random.normal(0, 1.5, 252)

# Stack as columns (each column is a variable)
returns = np.column_stack([tech_returns, utility_returns, tech2_returns])

# Calculate covariance matrix
# rowvar=False because variables are columns
cov_matrix = np.cov(returns, rowvar=False)

print("Covariance Matrix:")
print(np.round(cov_matrix, 4))

# Extract specific covariances
print(f"\nVariance of Tech Stock: {cov_matrix[0, 0]:.4f}")
print(f"Covariance Tech vs Utility: {cov_matrix[0, 1]:.4f}")
print(f"Covariance Tech vs Tech2: {cov_matrix[0, 2]:.4f}")

Output:

Covariance Matrix:
[[ 5.8523 -0.6989  4.1663]
 [-0.6989  1.0513 -0.4645]
 [ 4.1663 -0.4645  5.1553]]

Variance of Tech Stock: 5.8523
Covariance Tech vs Utility: -0.6989
Covariance Tech vs Tech2: 4.1663

The negative covariance between tech and utility stocks suggests diversification benefits. The high positive covariance between the two tech stocks indicates they move together—less diversification value.
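
This is where the covariance matrix earns its keep in portfolio analysis: the variance of a weighted portfolio is wᵀΣw. A minimal sketch, using a hypothetical covariance matrix and a hypothetical allocation (the numbers below are illustrative, not derived from the data above):

```python
import numpy as np

# Hypothetical covariance matrix of three assets' daily returns
cov = np.array([
    [5.85, -0.70, 4.17],
    [-0.70, 1.05, -0.46],
    [4.17, -0.46, 5.16],
])

weights = np.array([0.4, 0.4, 0.2])  # hypothetical portfolio allocation

# Portfolio variance: w^T . Sigma . w
portfolio_var = weights @ cov @ weights
portfolio_std = np.sqrt(portfolio_var)

print(round(portfolio_var, 4))  # 1.68
```

The negative covariance terms pull the total variance down, which is the quantitative face of the diversification benefit.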

Multi-Feature Dataset Analysis

When working with datasets containing multiple features, you often need the full covariance matrix:

import numpy as np

# Simulated dataset: height (cm), weight (kg), age (years), income ($1000s)
np.random.seed(123)
n_samples = 500

height = np.random.normal(170, 10, n_samples)
weight = 0.8 * height + np.random.normal(-60, 8, n_samples)  # Correlated with height
age = np.random.uniform(25, 65, n_samples)
income = 0.5 * age + np.random.normal(30, 15, n_samples)  # Correlated with age

# Combine into a single array (samples x features)
data = np.column_stack([height, weight, age, income])

# Calculate covariance matrix
cov_matrix = np.cov(data, rowvar=False)

# Create a readable display
feature_names = ['Height', 'Weight', 'Age', 'Income']
print("Covariance Matrix:")
print(f"{'':>10}", end='')
for name in feature_names:
    print(f"{name:>10}", end='')
print()

for i, name in enumerate(feature_names):
    print(f"{name:>10}", end='')
    for j in range(len(feature_names)):
        print(f"{cov_matrix[i, j]:>10.2f}", end='')
    print()

Output:

Covariance Matrix:
              Height    Weight       Age    Income
    Height     97.04     77.89      2.38     -4.29
    Weight     77.89    125.77      4.95     -1.43
       Age      2.38      4.95    134.45     68.53
    Income     -4.29     -1.43     68.53    241.78

Height and weight show strong positive covariance (77.89). Age and income also covary positively (68.53). Height and income show slight negative covariance, which is likely noise rather than a real relationship.

Common Parameters and Options

The rowvar Parameter

This parameter causes the most confusion. NumPy’s default (rowvar=True) assumes each row is a separate variable—the opposite of how most people organize data.

import numpy as np

# Data with 3 samples and 2 features (typical format)
data = np.array([
    [1, 10],
    [2, 20],
    [3, 30]
])

# Wrong: treats each row as a variable (gives 3x3 matrix)
wrong_cov = np.cov(data)
print("With rowvar=True (wrong for this data):")
print(wrong_cov.shape)  # (3, 3)

# Correct: treats each column as a variable (gives 2x2 matrix)
correct_cov = np.cov(data, rowvar=False)
print("\nWith rowvar=False (correct):")
print(correct_cov.shape)  # (2, 2)
print(correct_cov)
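
If you find rowvar=False easy to forget, an equivalent habit is transposing the array yourself so that variables become rows, matching NumPy's default expectation:

```python
import numpy as np

data = np.array([
    [1, 10],
    [2, 20],
    [3, 30],
])

# np.cov(data.T) is equivalent to np.cov(data, rowvar=False)
cov_a = np.cov(data, rowvar=False)
cov_b = np.cov(data.T)

print(np.allclose(cov_a, cov_b))  # True
```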

Sample vs. Population Covariance

The ddof parameter controls the degrees of freedom adjustment. This matters for statistical validity:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Sample covariance (ddof=1, the default)
# Divides by (n-1) = 4
sample_cov = np.cov(x, y, ddof=1)
print("Sample covariance (ddof=1):")
print(sample_cov)

# Population covariance (ddof=0)
# Divides by n = 5
population_cov = np.cov(x, y, ddof=0)
print("\nPopulation covariance (ddof=0):")
print(population_cov)

# The difference
print(f"\nSample Cov(x,y): {sample_cov[0,1]:.4f}")
print(f"Population Cov(x,y): {population_cov[0,1]:.4f}")
print(f"Ratio: {sample_cov[0,1] / population_cov[0,1]:.4f}")  # Should be n/(n-1) = 1.25

Use ddof=1 (sample covariance) when your data is a sample from a larger population—which is almost always the case. Use ddof=0 only when you have the entire population.

Covariance vs. Correlation: When to Use Each

Covariance tells you the direction of the relationship but not the strength in standardized terms. Correlation normalizes covariance by the standard deviations, giving a value between -1 and 1.

import numpy as np

# Generate correlated data
np.random.seed(42)
x = np.random.normal(100, 15, 100)  # Mean 100, std 15
y = 0.8 * x + np.random.normal(0, 10, 100)  # Correlated with x

# Covariance matrix
cov_matrix = np.cov(x, y)
print("Covariance Matrix:")
print(np.round(cov_matrix, 2))

# Correlation matrix
corr_matrix = np.corrcoef(x, y)
print("\nCorrelation Matrix:")
print(np.round(corr_matrix, 4))

# Manual conversion: correlation = covariance / (std_x * std_y)
std_x = np.sqrt(cov_matrix[0, 0])
std_y = np.sqrt(cov_matrix[1, 1])
manual_correlation = cov_matrix[0, 1] / (std_x * std_y)
print(f"\nManual correlation calculation: {manual_correlation:.4f}")
print(f"np.corrcoef result: {corr_matrix[0, 1]:.4f}")

Output:

Covariance Matrix:
[[236.79 185.14]
 [185.14 250.34]]

Correlation Matrix:
[[1.     0.7601]
 [0.7601 1.    ]]

Manual correlation calculation: 0.7601
np.corrcoef result: 0.7601

Use covariance when you need the actual scale of the relationship (like in portfolio variance calculations). Use correlation when you want to compare relationship strengths across different variable pairs or communicate findings to others.
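
The manual conversion above works pair by pair; for a full matrix you can divide by the outer product of the standard deviations in one step. A sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(100, 3))  # 100 samples, 3 variables

cov = np.cov(data, rowvar=False)

# correlation[i, j] = cov[i, j] / (std_i * std_j)
stds = np.sqrt(np.diag(cov))
corr = cov / np.outer(stds, stds)

print(np.allclose(corr, np.corrcoef(data, rowvar=False)))  # True
```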

Conclusion

NumPy’s np.cov() function handles covariance calculations efficiently, but you need to understand its parameters to use it correctly. Remember these key points:

  • Set rowvar=False when your variables are columns (the common case)
  • Use ddof=1 for sample covariance, ddof=0 for population covariance
  • The diagonal of the covariance matrix contains variances
  • Convert to correlation with np.corrcoef() when you need standardized comparisons

For related statistical operations, explore np.var() for variance, np.std() for standard deviation, and np.corrcoef() for correlation. When working with missing data, consider pandas’ covariance methods which handle NaN values more gracefully than NumPy’s default behavior.
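
As a quick demonstration of why missing data needs care: np.cov propagates NaN into every entry that involves the affected variable. A NumPy-only workaround is to drop incomplete rows before computing (pandas does this pairwise for you):

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

print(np.cov(x, y))  # every entry involving x is nan

# NumPy-only workaround: keep only rows where both values are present
mask = ~(np.isnan(x) | np.isnan(y))
print(np.cov(x[mask], y[mask]))  # finite values
```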
