How to Calculate Variance of a Random Variable

Key Insights

  • Variance measures how spread out a distribution is by calculating the average squared deviation from the mean, with the formula σ² = E[(X - μ)²] applying universally across discrete and continuous random variables.
  • The computational formula σ² = E[X²] - (E[X])² is mathematically equivalent but often more efficient for calculations, though it can suffer from numerical instability with large values.
  • Use np.var(data, ddof=1) for sample variance in NumPy—the ddof parameter applies Bessel’s correction to provide an unbiased estimate when working with samples rather than entire populations.

Introduction to Variance

Variance quantifies how much a random variable’s values deviate from its expected value. While the mean tells you the center of a distribution, variance tells you how spread out the values are around that center. A low variance means values cluster tightly around the mean; high variance means they’re scattered widely.

The fundamental formula is:

σ² = E[(X - μ)²]

This reads as “the expected value of the squared deviation from the mean.” We square the deviations because positive and negative deviations would otherwise cancel out. The squaring also makes variance sensitive to outliers—large deviations contribute disproportionately to the total variance.

import numpy as np
import matplotlib.pyplot as plt

# Generate two datasets with different variances
np.random.seed(42)
low_variance = np.random.normal(loc=50, scale=5, size=1000)
high_variance = np.random.normal(loc=50, scale=20, size=1000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.hist(low_variance, bins=30, alpha=0.7, edgecolor='black')
ax1.axvline(low_variance.mean(), color='red', linestyle='--', linewidth=2)
ax1.set_title(f'Low Variance (σ² = {low_variance.var():.2f})')
ax1.set_xlabel('Value')
ax1.set_ylabel('Frequency')

ax2.hist(high_variance, bins=30, alpha=0.7, edgecolor='black')
ax2.axvline(high_variance.mean(), color='red', linestyle='--', linewidth=2)
ax2.set_title(f'High Variance (σ² = {high_variance.var():.2f})')
ax2.set_xlabel('Value')
ax2.set_ylabel('Frequency')

plt.tight_layout()
plt.show()

Both distributions center around 50, but the high-variance distribution spreads much wider.

Discrete Random Variables

For discrete random variables, variance is calculated by summing the squared deviations weighted by their probabilities:

σ² = Σ(xᵢ - μ)²P(xᵢ)

The calculation process:

  1. Calculate the mean: μ = ΣxᵢP(xᵢ)
  2. For each value, compute (xᵢ - μ)²
  3. Multiply each squared deviation by its probability P(xᵢ)
  4. Sum all weighted squared deviations

The alternative computational formula is often easier to work with:

σ² = E[X²] - (E[X])²

This formula says: calculate the expected value of X squared, then subtract the square of the expected value of X. It’s algebraically identical but requires fewer intermediate calculations.

import numpy as np

# Example 1: Fair six-sided die
values = np.array([1, 2, 3, 4, 5, 6])
probabilities = np.array([1/6] * 6)

# Method 1: Definition formula
mean = np.sum(values * probabilities)
variance_def = np.sum((values - mean)**2 * probabilities)

# Method 2: Computational formula
expected_x_squared = np.sum(values**2 * probabilities)
variance_comp = expected_x_squared - mean**2

print(f"Die roll variance (definition): {variance_def:.4f}")
print(f"Die roll variance (computational): {variance_comp:.4f}")

# Example 2: Custom probability distribution
custom_values = np.array([0, 1, 2, 5, 10])
custom_probs = np.array([0.1, 0.2, 0.4, 0.2, 0.1])

mean_custom = np.sum(custom_values * custom_probs)
variance_custom = np.sum((custom_values - mean_custom)**2 * custom_probs)

print(f"\nCustom distribution:")
print(f"Mean: {mean_custom:.4f}")
print(f"Variance: {variance_custom:.4f}")
print(f"Standard deviation: {np.sqrt(variance_custom):.4f}")

Continuous Random Variables

For continuous random variables, we replace summation with integration:

σ² = ∫(x - μ)²f(x)dx

where f(x) is the probability density function. The integral runs over the entire support of the distribution.
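The definition can be checked directly by numerical integration. A minimal sketch using scipy.integrate.quad with a uniform density on [0, 10] (the choice of distribution here is just for illustration):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Uniform density on [0, 10]
dist = stats.uniform(loc=0, scale=10)

# mu = integral of x * f(x) dx over the support
mu, _ = quad(lambda x: x * dist.pdf(x), 0, 10)

# sigma^2 = integral of (x - mu)^2 * f(x) dx over the support
var, _ = quad(lambda x: (x - mu)**2 * dist.pdf(x), 0, 10)

print(f"Integrated mean: {mu:.4f}")      # 5.0000
print(f"Integrated variance: {var:.4f}") # 8.3333 = (10 - 0)^2 / 12
```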

Common distributions have well-known variance formulas:

  • Normal distribution N(μ, σ²): Variance = σ²
  • Exponential distribution Exp(λ): Variance = 1/λ²
  • Uniform distribution U(a, b): Variance = (b - a)²/12
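The uniform entry is a quick exercise in the computational formula σ² = E[X²] - (E[X])²:

```latex
E[X]   = \int_a^b \frac{x}{b-a}\,dx   = \frac{a+b}{2}, \qquad
E[X^2] = \int_a^b \frac{x^2}{b-a}\,dx = \frac{a^2+ab+b^2}{3}

\sigma^2 = E[X^2] - (E[X])^2
         = \frac{a^2+ab+b^2}{3} - \frac{(a+b)^2}{4}
         = \frac{4(a^2+ab+b^2) - 3(a+b)^2}{12}
         = \frac{(b-a)^2}{12}
```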

You rarely need to compute these integrals by hand. Use statistical libraries:

from scipy import stats
import numpy as np

# Normal distribution with mean=0, std=2 (variance=4)
normal_dist = stats.norm(loc=0, scale=2)
print(f"Normal(0, 2²) variance: {normal_dist.var():.4f}")

# Exponential distribution with rate parameter λ=0.5
exp_dist = stats.expon(scale=1/0.5)  # scale = 1/λ
print(f"Exponential(λ=0.5) variance: {exp_dist.var():.4f}")
print(f"Theoretical variance: {(1/0.5)**2:.4f}")

# Uniform distribution on [0, 10]
uniform_dist = stats.uniform(loc=0, scale=10)
print(f"Uniform(0, 10) variance: {uniform_dist.var():.4f}")
print(f"Theoretical variance: {(10-0)**2/12:.4f}")

# Verify with sampling
samples = normal_dist.rvs(size=100000)
print(f"\nSample variance from Normal(0, 2²): {np.var(samples):.4f}")

Properties of Variance

Understanding variance properties helps you manipulate and combine random variables efficiently.

Variance of linear transformations: Var(aX + b) = a²Var(X)

Adding a constant b shifts the distribution but doesn’t change spread. Multiplying by a constant a scales the variance by a² (and the standard deviation by |a|).

Variance of sums (independent variables): Var(X + Y) = Var(X) + Var(Y)

This only holds when X and Y are independent. For dependent variables, you need to account for covariance.

Non-negativity: Variance is always ≥ 0 because it’s a sum (or integral) of squared terms. Variance equals zero only when the random variable is constant.

import numpy as np

np.random.seed(42)
X = np.random.normal(loc=10, scale=3, size=10000)

# Property 1: Var(aX + b) = a²Var(X)
a, b = 5, 100
transformed = a * X + b

print(f"Var(X): {np.var(X):.4f}")
print(f"Var({a}X + {b}): {np.var(transformed):.4f}")
print(f"{a}² × Var(X): {a**2 * np.var(X):.4f}")

# Property 2: Var(X + Y) = Var(X) + Var(Y) for independent X, Y
Y = np.random.normal(loc=5, scale=2, size=10000)
sum_xy = X + Y

print(f"\nVar(X): {np.var(X):.4f}")
print(f"Var(Y): {np.var(Y):.4f}")
print(f"Var(X + Y): {np.var(sum_xy):.4f}")
print(f"Var(X) + Var(Y): {np.var(X) + np.var(Y):.4f}")

# Property 3: Constant has zero variance
constant = np.full(10000, 42)
print(f"\nVar(constant): {np.var(constant):.4f}")
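For dependent variables, the general identity picks up a covariance term: Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y). A minimal sketch, using an arbitrary construction (Y built from X plus noise) to force dependence:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=0, scale=3, size=100_000)
# Construct Y correlated with X (illustrative choice, not special)
Y = 0.5 * X + rng.normal(loc=0, scale=1, size=100_000)

# Use ddof=0 throughout so the sample identity holds exactly
cov_xy = np.cov(X, Y, ddof=0)[0, 1]
lhs = np.var(X + Y)
rhs = np.var(X) + np.var(Y) + 2 * cov_xy

print(f"Var(X + Y):                   {lhs:.4f}")
print(f"Var(X) + Var(Y) + 2Cov(X,Y):  {rhs:.4f}")
```

With matching ddof on both sides, the two numbers agree to floating-point precision; dropping the covariance term would understate Var(X + Y) here because Cov(X, Y) > 0.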

Practical Implementation

NumPy provides np.var() for variance calculation, but you need to understand the ddof (delta degrees of freedom) parameter.

  • Population variance (ddof=0): Use when your data is the entire population.
  • Sample variance (ddof=1): Use when your data is a sample and you want to estimate the population variance. This applies Bessel’s correction, dividing by (n - 1) instead of n to produce an unbiased estimator.

import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Manual calculation - population variance
n = len(data)
mean = np.mean(data)
variance_manual = np.sum((data - mean)**2) / n

print(f"Manual population variance: {variance_manual:.4f}")
print(f"NumPy population variance (ddof=0): {np.var(data, ddof=0):.4f}")
print(f"NumPy sample variance (ddof=1): {np.var(data, ddof=1):.4f}")

# Verify manual sample variance
variance_sample_manual = np.sum((data - mean)**2) / (n - 1)
print(f"Manual sample variance: {variance_sample_manual:.4f}")

# Timing: built-in np.var vs a manual two-pass implementation
large_data = np.random.randn(1000000)

import time
start = time.perf_counter()
var_numpy = np.var(large_data)
numpy_time = time.perf_counter() - start

start = time.perf_counter()
var_manual = np.sum((large_data - np.mean(large_data))**2) / len(large_data)
manual_time = time.perf_counter() - start

print(f"\nNumPy time: {numpy_time:.6f}s")
print(f"Manual time: {manual_time:.6f}s")

Common Pitfalls and Best Practices

Numerical instability: The computational formula σ² = E[X²] - (E[X])² can suffer from catastrophic cancellation when dealing with large values. If E[X²] and (E[X])² are both large and nearly equal, subtracting them loses precision.

Variance vs. standard deviation: Use standard deviation (√variance) when you need a measure in the same units as your data. Use variance for theoretical work and when combining independent random variables.

Missing data: Decide whether to drop missing values or impute them. NumPy’s np.nanvar() ignores NaN values.

import numpy as np

# Demonstrate numerical instability
large_values = np.array([1e10, 1e10 + 1, 1e10 + 2, 1e10 + 3])

# Unstable computational formula
mean = np.mean(large_values)
expected_x_squared = np.mean(large_values**2)
variance_unstable = expected_x_squared - mean**2

# Stable definition formula (two-pass)
variance_stable = np.mean((large_values - mean)**2)

print(f"Unstable variance: {variance_unstable:.10f}")
print(f"Stable variance: {variance_stable:.10f}")
print(f"NumPy variance: {np.var(large_values):.10f}")

# Handling missing data
data_with_nan = np.array([1.0, 2.0, np.nan, 4.0, 5.0])

# This produces NaN
print(f"\nnp.var with NaN: {np.var(data_with_nan):.4f}")

# Use nanvar to ignore NaN
print(f"np.nanvar: {np.nanvar(data_with_nan):.4f}")

# Or drop NaN explicitly
clean_data = data_with_nan[~np.isnan(data_with_nan)]
print(f"Variance after dropping NaN: {np.var(clean_data):.4f}")

Always use ddof=1 when working with samples unless you have a specific reason not to. For production code, prefer NumPy’s built-in functions over manual implementations—they’re optimized and numerically stable. When variance seems unexpectedly high, check for outliers and data quality issues before concluding your distribution has high spread.
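When data arrives as a stream and a two-pass computation isn’t possible, Welford’s online algorithm is a standard numerically stable one-pass alternative. A sketch (not a replacement for np.var when the full array fits in memory):

```python
import numpy as np

def welford_variance(data, ddof=0):
    """One-pass, numerically stable variance via Welford's algorithm."""
    mean = 0.0
    m2 = 0.0   # running sum of squared deviations from the current mean
    n = 0
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # note: uses the updated mean
    return m2 / (n - ddof)

# Same large-offset values that break the computational formula
data = np.array([1e10, 1e10 + 1, 1e10 + 2, 1e10 + 3])
print(f"Welford: {welford_variance(data):.10f}")  # 1.2500000000
print(f"NumPy:   {np.var(data):.10f}")
```

Because each update subtracts the running mean before squaring, the huge common offset cancels at every step instead of accumulating into E[X²].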
