How to Calculate Variance in NumPy

Key Insights

  • NumPy’s np.var() calculates population variance by default (divides by N), so you must use ddof=1 to get sample variance for statistical inference on data samples.
  • The axis parameter lets you compute variance across specific dimensions of multi-dimensional arrays, enabling efficient batch calculations on datasets with multiple features or time series.
  • Use np.nanvar() instead of np.var() when your data contains missing values (NaN) to avoid propagating NaN through your entire result.

Introduction

Variance measures how spread out your data is from its mean. It’s one of the most fundamental statistical concepts you’ll encounter in data analysis, machine learning, and scientific computing. A low variance indicates data points cluster tightly around the mean, while a high variance signals wide dispersion.

In practice, you’ll use variance for everything from feature scaling in machine learning pipelines to quality control in manufacturing data. NumPy provides efficient, vectorized functions for calculating variance that outperform pure Python loops by orders of magnitude. Understanding how to use these functions correctly—especially the subtle differences between population and sample variance—will save you from statistical errors that can invalidate your analysis.

Understanding Variance: Population vs. Sample

Before diving into NumPy functions, you need to understand the two types of variance calculations and when to use each.

Population variance (σ²) is used when you have data for an entire population. You divide the sum of squared deviations by N, the total number of data points:

$$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$$

Sample variance (s²) is used when you’re working with a sample from a larger population and want to estimate the population’s variance. You divide by N-1 instead of N:

$$s^2 = \frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N-1}$$

The N-1 denominator is called Bessel’s correction. It compensates for the fact that sample variance calculated with N systematically underestimates population variance. This happens because your sample mean is closer to your sample data than the true population mean would be.
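You can see Bessel's correction at work with a quick simulation sketch: repeatedly draw small samples from a distribution whose variance is known, then average each estimator over many trials (the distribution and trial counts below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0  # N(0, 2) has variance sigma^2 = 2^2 = 4

pop_estimates, sample_estimates = [], []
for _ in range(20_000):
    sample = rng.normal(0, 2, size=5)  # small sample: the bias is visible
    pop_estimates.append(np.var(sample))             # divides by N
    sample_estimates.append(np.var(sample, ddof=1))  # divides by N-1

print(f"True variance:        {true_var}")
print(f"Mean ddof=0 estimate: {np.mean(pop_estimates):.3f}")    # ~3.2, biased low
print(f"Mean ddof=1 estimate: {np.mean(sample_estimates):.3f}") # ~4.0, unbiased
```

With N=5, the ddof=0 estimator averages roughly (N-1)/N = 80% of the true variance, which is exactly the bias that dividing by N-1 removes.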

Here’s a manual implementation to make the formulas concrete:

import numpy as np

def manual_population_variance(data):
    """Calculate population variance (divide by N)."""
    n = len(data)
    mean = sum(data) / n
    squared_deviations = [(x - mean) ** 2 for x in data]
    return sum(squared_deviations) / n

def manual_sample_variance(data):
    """Calculate sample variance (divide by N-1)."""
    n = len(data)
    mean = sum(data) / n
    squared_deviations = [(x - mean) ** 2 for x in data]
    return sum(squared_deviations) / (n - 1)

# Test with sample data
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(f"Population variance: {manual_population_variance(data)}")  # 4.0
print(f"Sample variance: {manual_sample_variance(data)}")          # 4.571428...

The difference between these two values might seem small, but it becomes significant with smaller sample sizes and matters greatly for statistical inference.

Using numpy.var() for Basic Variance Calculation

NumPy’s np.var() function handles variance calculation efficiently on arrays of any size. The basic syntax is straightforward:

import numpy as np

# Create sample data
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Calculate variance
variance = np.var(data)
print(f"Variance: {variance}")  # Output: 4.0

The function accepts lists, tuples, or NumPy arrays and returns a scalar for 1D input. For larger datasets, NumPy’s vectorized operations provide dramatic performance improvements:

import numpy as np
import time

# Compare performance on large dataset
large_data = np.random.randn(1_000_000)

# NumPy approach
start = time.time()
np_variance = np.var(large_data)
np_time = time.time() - start

# Pure Python approach
start = time.time()
mean = sum(large_data) / len(large_data)
py_variance = sum((x - mean) ** 2 for x in large_data) / len(large_data)
py_time = time.time() - start

print(f"NumPy time: {np_time:.4f}s")
print(f"Python time: {py_time:.4f}s")
print(f"NumPy is {py_time/np_time:.1f}x faster")

On most systems, NumPy will be 50-100x faster for this operation. This performance gap widens with larger datasets.

The ddof Parameter: Controlling Degrees of Freedom

Here’s where many developers make mistakes. By default, np.var() calculates population variance with ddof=0. If you’re analyzing a sample and want to estimate population variance, you must explicitly set ddof=1.

The ddof parameter stands for “delta degrees of freedom.” The denominator in the variance calculation becomes N - ddof.

import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Population variance (default): divides by N
pop_variance = np.var(data)  # ddof=0 is implicit
print(f"Population variance (ddof=0): {pop_variance}")  # 4.0

# Sample variance: divides by N-1
sample_variance = np.var(data, ddof=1)
print(f"Sample variance (ddof=1): {sample_variance}")   # 4.571428...

# Verify the relationship
n = len(data)
print(f"Ratio check: {sample_variance * (n-1) / n}")   # Should equal pop_variance
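The N - ddof relationship holds for any ddof value, not just 0 and 1, which you can verify against a manual calculation:

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
n = len(data)
deviations_sq = (data - data.mean()) ** 2

# The denominator is N - ddof for any ddof value
for ddof in range(3):
    manual = deviations_sq.sum() / (n - ddof)
    print(f"ddof={ddof}: np.var={np.var(data, ddof=ddof):.6f}, manual={manual:.6f}")
```

In practice only ddof=0 and ddof=1 are common; larger values occasionally appear when additional parameters have been estimated from the data.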

When to use which:

  • Use ddof=0 (default) when your data represents the entire population
  • Use ddof=1 when your data is a sample and you’re estimating population variance
  • Use ddof=1 for most statistical inference and hypothesis testing

A practical rule: if you’re unsure, use ddof=1. Overestimating variance slightly is usually safer than underestimating it, especially with smaller samples.

import numpy as np

# Demonstrate the impact of sample size on the difference
for n in [5, 10, 50, 100, 1000]:
    data = np.random.randn(n)
    pop_var = np.var(data)
    sample_var = np.var(data, ddof=1)
    pct_diff = (sample_var - pop_var) / pop_var * 100
    print(f"N={n:4d}: Pop={pop_var:.4f}, Sample={sample_var:.4f}, Diff={pct_diff:.2f}%")

Notice how the percentage difference decreases as sample size increases. With small samples, the distinction matters more.
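The gap is deterministic, not random: both estimators share the same sum of squared deviations, so sample variance is always population variance scaled by exactly n/(n-1). A quick check:

```python
import numpy as np

rng = np.random.default_rng(1)
for n in [5, 10, 100]:
    data = rng.normal(size=n)
    # Same numerator, denominators n-1 vs n -> ratio is exactly n/(n-1)
    ratio = np.var(data, ddof=1) / np.var(data)
    print(f"n={n:3d}: ratio={ratio:.6f}, n/(n-1)={n / (n - 1):.6f}")
```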

Calculating Variance Along Axes in Multi-Dimensional Arrays

Real-world data often comes in multi-dimensional arrays—think of spreadsheets with rows as observations and columns as features. The axis parameter lets you calculate variance along specific dimensions.

import numpy as np

# Create a 2D array: 4 samples, 3 features
data = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12]
])

print("Original data:")
print(data)
print(f"Shape: {data.shape}")  # (4, 3)

# Variance of all elements (flattened)
total_var = np.var(data)
print(f"\nTotal variance (all elements): {total_var:.2f}")

# Variance along axis 0 (across rows, per column/feature)
var_axis0 = np.var(data, axis=0)
print(f"Variance per feature (axis=0): {var_axis0}")  # [11.25, 11.25, 11.25]

# Variance along axis 1 (across columns, per row/sample)
var_axis1 = np.var(data, axis=1)
print(f"Variance per sample (axis=1): {var_axis1}")   # [0.67, 0.67, 0.67, 0.67]

Understanding axis orientation is crucial:

  • axis=0: Collapse rows, compute variance down each column
  • axis=1: Collapse columns, compute variance across each row
  • axis=None (default): Flatten array first, compute single variance

Here’s a more realistic example with feature scaling:

import numpy as np

# Simulated dataset: 100 samples, 4 features with different scales
np.random.seed(42)
features = np.column_stack([
    np.random.normal(0, 1, 100),      # Feature 1: small variance
    np.random.normal(0, 10, 100),     # Feature 2: medium variance
    np.random.normal(0, 100, 100),    # Feature 3: large variance
    np.random.normal(0, 0.1, 100)     # Feature 4: tiny variance
])

# Calculate variance per feature for standardization
feature_variances = np.var(features, axis=0, ddof=1)
feature_stds = np.sqrt(feature_variances)

print("Feature variances:", feature_variances.round(2))
print("Feature std devs:", feature_stds.round(2))

# Standardize features (zero mean, unit variance)
feature_means = np.mean(features, axis=0)
standardized = (features - feature_means) / feature_stds

# Verify standardization worked
print("Standardized variances:", np.var(standardized, axis=0, ddof=1).round(2))
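The same axis pattern extends beyond 2D. The sketch below (with made-up shapes) treats a 3D array as a batch of multichannel time series; np.var() also accepts a tuple of axes to reduce over several dimensions at once, and keepdims=True preserves the reduced axes for broadcasting:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical batch: 8 recordings, 3 channels, 500 time steps
batch = rng.normal(size=(8, 3, 500))

# Variance over time, per recording and channel -> shape (8, 3)
per_recording = np.var(batch, axis=-1, ddof=1)
print(per_recording.shape)  # (8, 3)

# Variance over recordings AND time, per channel -> shape (3,)
per_channel = np.var(batch, axis=(0, 2), ddof=1)
print(per_channel.shape)  # (3,)

# keepdims=True keeps reduced axes as size 1, so the result
# broadcasts back against the original array
centered = batch - np.mean(batch, axis=-1, keepdims=True)
print(centered.shape)  # (8, 3, 500)
```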

Handling Missing Data and Edge Cases

Real datasets often contain missing values. NumPy represents these as np.nan, and standard np.var() propagates NaN through calculations:

import numpy as np

data_with_nan = np.array([1, 2, np.nan, 4, 5])

# Standard var propagates NaN
result = np.var(data_with_nan)
print(f"np.var with NaN: {result}")  # nan

# np.nanvar ignores NaN values
result_clean = np.nanvar(data_with_nan)
print(f"np.nanvar with NaN: {result_clean}")  # 2.5

# Equivalent to removing NaN first
clean_data = data_with_nan[~np.isnan(data_with_nan)]
print(f"Manual NaN removal: {np.var(clean_data)}")  # 2.5

The np.nanvar() function accepts the same parameters as np.var(), including ddof and axis:

import numpy as np

# 2D array with scattered NaN values
data = np.array([
    [1, np.nan, 3],
    [4, 5, 6],
    [np.nan, 8, 9]
])

# Variance per column, ignoring NaN
col_variances = np.nanvar(data, axis=0, ddof=1)
print(f"Column variances (ignoring NaN): {col_variances}")

Handle edge cases defensively:

import numpy as np

import warnings

# Empty array: np.var() does not raise an exception -- it
# emits a RuntimeWarning and returns nan
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    print(f"Empty array variance: {np.var(np.array([]))}")  # nan

# Single element: variance is 0 (no spread)
single = np.array([5])
print(f"Single element variance: {np.var(single)}")  # 0.0

# Single element with ddof=1: undefined (0/0, since N - ddof = 0)
print(f"Single element with ddof=1: {np.var(single, ddof=1)}")  # nan, with a RuntimeWarning

Conclusion

NumPy’s variance functions provide everything you need for efficient statistical calculations. Remember these key points:

  1. Use ddof=1 for sample variance when estimating population parameters
  2. Use the axis parameter for multi-dimensional calculations
  3. Use np.nanvar() when your data contains missing values

For related calculations, explore np.std() for standard deviation (the square root of variance), np.mean() for the arithmetic mean, and np.cov() for covariance matrices when analyzing relationships between multiple variables. These functions share similar interfaces, so the patterns you’ve learned here transfer directly.
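As a quick sanity check of those relationships: standard deviation is the square root of variance (with matching ddof), and the diagonal of a covariance matrix holds each variable's sample variance. Note that np.cov() treats rows as variables by default, so observations-in-rows data needs rowvar=False:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))  # 100 observations, 3 variables

# std is the square root of variance (ddof must match)
assert np.allclose(np.std(x, axis=0, ddof=1),
                   np.sqrt(np.var(x, axis=0, ddof=1)))

# np.cov defaults to N-1 normalization; its diagonal holds
# each variable's sample variance
cov = np.cov(x, rowvar=False)
assert np.allclose(np.diag(cov), np.var(x, axis=0, ddof=1))
print("Relationships verified")
```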
