Variance: Formula and Examples
Key Insights
• Variance measures how spread out data points are from the mean—use population variance (divide by N) when you have complete data, and sample variance (divide by n-1) when working with a subset to avoid underestimating spread.
• The (n-1) denominator in sample variance, called Bessel’s correction, compensates for the fact that sample means tend to be closer to sample points than the true population mean, preventing systematic underestimation of variability.
• Variance is measured in squared units, making it less intuitive than standard deviation but computationally essential for many statistical operations; always check for outliers since a single extreme value can dramatically inflate variance.
Introduction to Variance
Variance quantifies how much individual data points deviate from the mean. While the mean tells you the central tendency of your data, variance tells you how reliable that central tendency is. A dataset with low variance clusters tightly around the mean; high variance indicates wide dispersion.
Understanding variance is critical for machine learning model evaluation, financial risk assessment, and quality control. Many algorithms—from linear regression to principal component analysis—rely on variance calculations. In feature engineering, you’ll often remove low-variance features since they contribute little information.
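As a minimal sketch of that last point (with made-up feature data and an arbitrary 0.1 threshold), low-variance columns can be filtered from a feature matrix with plain NumPy:

```python
import numpy as np

# Hypothetical feature matrix: 4 samples x 3 features.
# The middle column is nearly constant, so it carries little information.
X = np.array([[1.0, 5.0, 10.0],
              [2.0, 5.0, 30.0],
              [3.0, 5.1, 20.0],
              [4.0, 5.0, 40.0]])

# Sample variance of each feature (column-wise, ddof=1)
feature_vars = X.var(axis=0, ddof=1)
print(f"Feature variances: {feature_vars}")

# Keep only features whose variance exceeds the threshold
threshold = 0.1
X_reduced = X[:, feature_vars > threshold]
print(f"Kept {X_reduced.shape[1]} of {X.shape[1]} features")
```

Libraries like scikit-learn offer the same idea as a ready-made transformer, but the underlying computation is just a column-wise variance.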
Variance relates directly to standard deviation: standard deviation is simply the square root of variance. While standard deviation has the advantage of being in the same units as your data, variance is mathematically cleaner for many operations, particularly when combining multiple sources of variation.
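The relationship is easy to confirm numerically (illustrative data):

```python
import numpy as np

data = np.array([4.0, 8.0, 6.0, 10.0])

var = data.var(ddof=1)
std = data.std(ddof=1)

# Standard deviation is simply the square root of variance
print(f"Variance: {var:.2f}, Std dev: {std:.2f}")
print(f"sqrt(variance) == std: {np.isclose(np.sqrt(var), std)}")
```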
import numpy as np
import matplotlib.pyplot as plt
# Two datasets with identical means but different variances
low_variance = np.array([48, 49, 50, 51, 52])
high_variance = np.array([20, 40, 50, 60, 80])
print(f"Low variance mean: {low_variance.mean()}")
print(f"High variance mean: {high_variance.mean()}")
print(f"Low variance: {low_variance.var(ddof=1):.2f}")
print(f"High variance: {high_variance.var(ddof=1):.2f}")
# Output:
# Low variance mean: 50.0
# High variance mean: 50.0
# Low variance: 2.50
# High variance: 500.00
Both datasets center around 50, but their spread differs dramatically. This illustrates why mean alone is insufficient for understanding your data.
The Variance Formula Explained
The population variance formula is:
σ² = Σ(xi - μ)² / N
Where:
- σ² (sigma squared) is the population variance
- xi represents each individual data point
- μ (mu) is the population mean
- N is the total number of data points
For sample variance, we use:
s² = Σ(xi - x̄)² / (n-1)
Where:
- s² is the sample variance
- x̄ (x-bar) is the sample mean
- n is the sample size
- (n-1) is the degrees of freedom
The (n-1) denominator is Bessel’s correction. When you calculate a sample mean, you’re using the data itself to estimate the population mean. This creates a dependency: your sample mean is guaranteed to be closer to your sample points than the true population mean would be. Dividing by (n-1) instead of n inflates the variance slightly, correcting this systematic underestimation.
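A quick simulation makes this bias visible: drawing many small samples from a population with known variance, the divide-by-n estimator systematically undershoots, while the divide-by-(n-1) estimator does not (the population parameters, sample size, and trial count here are arbitrary choices for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(42)
true_var = 25.0  # population: normal with std 5, so variance 25

n, trials = 5, 20000
samples = rng.normal(loc=0.0, scale=5.0, size=(trials, n))

biased = samples.var(axis=1, ddof=0).mean()    # divide by n
unbiased = samples.var(axis=1, ddof=1).mean()  # divide by n-1

print(f"True variance:            {true_var}")
print(f"Mean of ddof=0 estimates: {biased:.2f}")    # tends toward (n-1)/n * 25 = 20
print(f"Mean of ddof=1 estimates: {unbiased:.2f}")  # tends toward 25
```

Averaged over many trials, the ddof=0 estimates cluster around (n-1)/n times the true variance, exactly the shortfall Bessel's correction repairs.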
The formula works by:
- Finding the mean of all values
- Calculating how far each point deviates from the mean
- Squaring those deviations (eliminating negative values and emphasizing outliers)
- Averaging the squared deviations
import numpy as np
data = np.array([10, 12, 23, 23, 16, 23, 21, 16])
# Step-by-step variance calculation
mean = data.mean()
print(f"Mean: {mean}")
deviations = data - mean
print(f"Deviations: {deviations}")
squared_deviations = deviations ** 2
print(f"Squared deviations: {squared_deviations}")
population_variance = squared_deviations.sum() / len(data)
sample_variance = squared_deviations.sum() / (len(data) - 1)
print(f"\nPopulation variance (N={len(data)}): {population_variance:.2f}")
print(f"Sample variance (n-1={len(data)-1}): {sample_variance:.2f}")
# Output:
# Mean: 18.0
# Deviations: [-8. -6. 5. 5. -2. 5. 3. -2.]
# Squared deviations: [64. 36. 25. 25. 4. 25. 9. 4.]
# Population variance (N=8): 24.00
# Sample variance (n-1=7): 27.43
Calculating Variance: Manual Examples
Let’s walk through a complete manual calculation with a small dataset representing daily website visitors: [120, 135, 142, 128, 155].
import numpy as np
visitors = np.array([120, 135, 142, 128, 155])
# Step 1: Calculate the mean
mean = visitors.sum() / len(visitors)
print(f"Step 1 - Mean: {visitors.sum()} / {len(visitors)} = {mean}")
# Step 2: Calculate deviations from mean
print("\nStep 2 - Deviations from mean:")
for value in visitors:
    deviation = value - mean
    print(f" {value} - {mean} = {deviation}")
deviations = visitors - mean
# Step 3: Square the deviations
print("\nStep 3 - Squared deviations:")
squared_devs = deviations ** 2
for dev, sq_dev in zip(deviations, squared_devs):
    print(f" ({dev})² = {sq_dev}")
# Step 4: Sum squared deviations
sum_squared = squared_devs.sum()
print(f"\nStep 4 - Sum of squared deviations: {sum_squared}")
# Step 5: Divide by (n-1) for sample variance
sample_var = sum_squared / (len(visitors) - 1)
print(f"\nStep 5 - Sample variance: {sum_squared} / {len(visitors)-1} = {sample_var}")
# Verify with NumPy
print(f"\nVerification with NumPy: {np.var(visitors, ddof=1)}")
# Output:
# Step 1 - Mean: 680 / 5 = 136.0
# Step 2 - Deviations from mean:
# 120 - 136.0 = -16.0
# 135 - 136.0 = -1.0
# 142 - 136.0 = 6.0
# 128 - 136.0 = -8.0
# 155 - 136.0 = 19.0
# Step 3 - Squared deviations:
# (-16.0)² = 256.0
# (-1.0)² = 1.0
# (6.0)² = 36.0
# (-8.0)² = 64.0
# (19.0)² = 361.0
# Step 4 - Sum of squared deviations: 718.0
# Step 5 - Sample variance: 718.0 / 4 = 179.5
# Verification with NumPy: 179.5
This manual approach helps build intuition for what variance actually measures.
Computing Variance with NumPy and Pandas
In production code, use library functions. The ddof (delta degrees of freedom) parameter controls whether you’re calculating population or sample variance.
import numpy as np
import pandas as pd
data = np.array([23, 45, 67, 34, 56, 78, 90, 12, 34, 56])
# NumPy variance calculations
pop_var = np.var(data) # ddof=0 by default (population)
sample_var = np.var(data, ddof=1) # sample variance
print(f"Population variance: {pop_var:.2f}")
print(f"Sample variance: {sample_var:.2f}")
# Multi-dimensional arrays
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
print(f"\nVariance across all elements: {np.var(matrix, ddof=1):.2f}")
print(f"Variance of each column (axis=0): {np.var(matrix, axis=0, ddof=1)}")
print(f"Variance of each row (axis=1): {np.var(matrix, axis=1, ddof=1)}")
# Pandas DataFrames
df = pd.DataFrame({
    'temperature': [72, 75, 71, 73, 76, 74],
    'humidity': [45, 50, 48, 46, 52, 49],
    'pressure': [1013, 1015, 1012, 1014, 1016, 1013]
})
print("\nDataFrame variance (sample variance by default):")
print(df.var())
print("\nVariance for specific column:")
print(f"Temperature variance: {df['temperature'].var():.2f}")
The ddof=1 parameter is critical when working with samples. NumPy defaults to ddof=0 (population variance), while Pandas defaults to ddof=1 (sample variance). This inconsistency catches many developers off guard.
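The mismatch is easy to demonstrate on the same numbers:

```python
import numpy as np
import pandas as pd

values = [2, 4, 6, 8]

arr_var = np.var(np.array(values))    # NumPy default: ddof=0 (population)
series_var = pd.Series(values).var()  # Pandas default: ddof=1 (sample)

print(f"np.var default:     {arr_var}")     # divides by N
print(f"Series.var default: {series_var}")  # divides by n-1

# They agree once ddof is made explicit
assert np.var(np.array(values), ddof=1) == pd.Series(values).var()
```

Passing ddof explicitly in both libraries avoids the surprise entirely.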
Practical Applications
Variance appears everywhere in data analysis. Here’s a financial volatility analysis:
import numpy as np
import pandas as pd
# Simulated daily stock returns (percentage)
stock_a = np.array([0.5, -0.3, 0.8, -0.2, 0.4, 0.1, -0.5, 0.6, 0.2, -0.1])
stock_b = np.array([2.1, -1.8, 3.2, -2.5, 1.9, -1.2, 2.8, -2.1, 1.5, -1.9])
var_a = np.var(stock_a, ddof=1)
var_b = np.var(stock_b, ddof=1)
std_a = np.std(stock_a, ddof=1)
std_b = np.std(stock_b, ddof=1)
print(f"Stock A - Mean return: {stock_a.mean():.2f}%, Variance: {var_a:.2f}, Std Dev: {std_a:.2f}%")
print(f"Stock B - Mean return: {stock_b.mean():.2f}%, Variance: {var_b:.2f}, Std Dev: {std_b:.2f}%")
print(f"\nStock B is {var_b/var_a:.1f}x more volatile than Stock A")
# A/B test comparison
control_group = np.array([12, 15, 13, 14, 16, 13, 15, 14, 13, 15])
treatment_group = np.array([18, 16, 19, 17, 20, 16, 18, 19, 17, 18])
print(f"\nA/B Test Results:")
print(f"Control - Mean: {control_group.mean():.1f}, Variance: {np.var(control_group, ddof=1):.2f}")
print(f"Treatment - Mean: {treatment_group.mean():.1f}, Variance: {np.var(treatment_group, ddof=1):.2f}")
In quality control, variance helps detect process inconsistencies. High variance in manufacturing measurements indicates poor process control, even if the mean is on target.
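As a sketch of that idea (with hypothetical measurement data), two production lines can hit the same target mean while one is far less consistent:

```python
import numpy as np

target = 10.0  # target part length in mm

line_a = np.array([10.01, 9.99, 10.02, 9.98, 10.00, 10.00])
line_b = np.array([10.30, 9.70, 10.25, 9.75, 10.20, 9.80])

for name, line in [("Line A", line_a), ("Line B", line_b)]:
    print(f"{name}: mean={line.mean():.3f} mm, variance={line.var(ddof=1):.5f}")
```

Both lines average exactly on target; only the variance reveals that Line B's process is out of control.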
Common Pitfalls and Best Practices
Use population variance only when you have the complete dataset. If you’re analyzing a sample from a larger population, always use sample variance (ddof=1). When in doubt, use sample variance—it’s the conservative choice.
Variance is extremely sensitive to outliers because deviations are squared:
import numpy as np
normal_data = np.array([10, 12, 11, 13, 12, 11, 10, 12])
data_with_outlier = np.array([10, 12, 11, 13, 12, 11, 10, 100])
print(f"Normal data variance: {np.var(normal_data, ddof=1):.2f}")
print(f"Data with outlier variance: {np.var(data_with_outlier, ddof=1):.2f}")
# Output:
# Normal data variance: 1.12
# Data with outlier variance: 984.84
A single outlier inflated the variance by roughly 875x. Always visualize your data and consider robust alternatives like the median absolute deviation for outlier-heavy datasets.
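A minimal NumPy sketch of the median absolute deviation (MAD), run on the same two datasets, shows why it is considered robust:

```python
import numpy as np

def mad(x):
    """Median absolute deviation: median of |x - median(x)|."""
    return np.median(np.abs(x - np.median(x)))

normal_data = np.array([10, 12, 11, 13, 12, 11, 10, 12])
data_with_outlier = np.array([10, 12, 11, 13, 12, 11, 10, 100])

print(f"MAD without outlier: {mad(normal_data)}")
print(f"MAD with outlier:    {mad(data_with_outlier)}")
# Unlike variance, MAD barely moves when the outlier appears,
# because medians ignore the magnitude of extreme values.
```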
Remember that variance uses squared units. If your data is in dollars, variance is in dollars-squared, which is difficult to interpret. This is why standard deviation (the square root of variance) is often preferred for reporting, even though variance is used internally for calculations.
For large datasets, use numerically stable algorithms. The naive one-pass formula (mean of squares minus the squared mean) suffers from catastrophic cancellation in floating point; the two-pass approach (compute the mean, then the squared deviations), which is what NumPy's var uses, is far more stable. For streaming data that cannot be held in memory, Welford's online algorithm updates the mean and variance one value at a time while remaining numerically stable.
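A minimal sketch of Welford's algorithm, which maintains a running mean and a running sum of squared deviations so the variance emerges from a single streaming pass:

```python
import numpy as np

def welford_variance(values, ddof=1):
    """Single-pass, numerically stable variance via Welford's algorithm."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # second factor uses the updated mean
    return m2 / (count - ddof)

data = [10, 12, 23, 23, 16, 23, 21, 16]
print(f"Welford: {welford_variance(data):.2f}")
print(f"NumPy:   {np.var(data, ddof=1):.2f}")
```

On this small dataset the streaming result matches the two-pass NumPy result; the payoff comes with very large magnitudes or data that arrives incrementally.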
Variance is foundational to statistics and machine learning. Master it, understand when to use population versus sample calculations, and always consider whether outliers are distorting your measure of spread. The squared units make variance less intuitive than standard deviation for interpretation, but its mathematical properties make it indispensable for analysis.