NumPy - np.std() and np.var()
Key Insights
- NumPy’s np.std() and np.var() calculate standard deviation and variance with a configurable degrees-of-freedom parameter (ddof) that determines whether you’re working with population or sample statistics
- Both functions support multi-dimensional operations with axis specification, enabling efficient statistical analysis across rows, columns, or the entire dataset without explicit loops
- Understanding when to use ddof=0 (population) versus ddof=1 (sample) is critical for accurate statistical inference, especially when working with subsets of larger datasets
Understanding Variance and Standard Deviation
Variance measures how spread out data points are from their mean. Standard deviation is simply the square root of variance, providing a measure in the same units as the original data. NumPy implements both through np.var() and np.std().
import numpy as np
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
# Population variance and standard deviation
pop_var = np.var(data)
pop_std = np.std(data)
print(f"Population variance: {pop_var}") # 4.0
print(f"Population std dev: {pop_std}") # 2.0
# Sample variance and standard deviation
sample_var = np.var(data, ddof=1)
sample_std = np.std(data, ddof=1)
print(f"Sample variance: {sample_var}") # 4.571428571428571
print(f"Sample std dev: {sample_std}") # 2.138089935299395
The ddof (delta degrees of freedom) parameter controls the divisor used in calculation. For population statistics, use ddof=0 (default). For sample statistics, use ddof=1.
The Mathematics Behind ddof
Population variance divides by N (the total count), while sample variance divides by N-1. This adjustment, known as Bessel’s correction, compensates for the bias introduced when estimating population parameters from a sample.
import numpy as np
data = np.array([10, 12, 23, 23, 16, 23, 21, 16])
n = len(data)
mean = np.mean(data)
# Manual population variance calculation
pop_var_manual = np.sum((data - mean) ** 2) / n
pop_var_numpy = np.var(data, ddof=0)
print(f"Manual population variance: {pop_var_manual}")
print(f"NumPy population variance: {pop_var_numpy}")
print(f"Match: {np.isclose(pop_var_manual, pop_var_numpy)}")
# Manual sample variance calculation
sample_var_manual = np.sum((data - mean) ** 2) / (n - 1)
sample_var_numpy = np.var(data, ddof=1)
print(f"\nManual sample variance: {sample_var_manual}")
print(f"NumPy sample variance: {sample_var_numpy}")
print(f"Match: {np.isclose(sample_var_manual, sample_var_numpy)}")
When you have the entire population, use ddof=0. When working with a sample to infer population statistics, use ddof=1.
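Bessel’s correction can also be seen empirically. The following sketch (my addition, using a seeded random generator rather than data from the original text) draws many small samples from a population with known variance 4.0 and averages the two estimators:

```python
import numpy as np

# Draw 10,000 samples of size 5 from a normal population with std=2 (variance=4),
# then compare the average ddof=0 and ddof=1 variance estimates.
rng = np.random.default_rng(42)
samples = rng.normal(loc=0, scale=2, size=(10_000, 5))

biased = np.var(samples, axis=1, ddof=0).mean()
unbiased = np.var(samples, axis=1, ddof=1).mean()

print(f"True population variance: 4.0")
print(f"Mean of ddof=0 estimates: {biased:.3f}")    # systematically low, near 4 * (n-1)/n = 3.2
print(f"Mean of ddof=1 estimates: {unbiased:.3f}")  # close to 4.0
```

The ddof=0 estimates cluster around 4 × (n−1)/n = 3.2, while the ddof=1 estimates average out near the true variance, which is exactly what Bessel’s correction is designed to achieve.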
Multi-Dimensional Operations with Axis Parameter
The axis parameter enables statistical calculations across specific dimensions without flattening arrays or writing loops.
import numpy as np
# 2D array: 4 samples, 3 features each
data = np.array([
[23, 45, 12],
[67, 34, 89],
[12, 78, 45],
[90, 23, 67]
])
# Variance across all elements
total_var = np.var(data)
print(f"Total variance: {total_var:.2f}")
# Variance for each feature (column-wise, axis=0)
feature_var = np.var(data, axis=0)
print(f"Feature variances: {feature_var}")
# Variance for each sample (row-wise, axis=1)
sample_var = np.var(data, axis=1)
print(f"Sample variances: {sample_var}")
# Standard deviation for each feature
feature_std = np.std(data, axis=0)
print(f"Feature std devs: {feature_std}")
For a 2D array, axis=0 operates down columns (across rows), while axis=1 operates across columns (down rows). This pattern extends to higher dimensions.
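To illustrate the higher-dimensional case, here is a small sketch (my addition, not from the original text) using a 3-D array; note that axis also accepts a tuple to reduce over several dimensions at once:

```python
import numpy as np

# A 3-D array: 2 batches, each with 3 rows and 4 columns
data = np.arange(24, dtype=float).reshape(2, 3, 4)

print(np.std(data, axis=0).shape)       # (3, 4): collapses the batch axis
print(np.std(data, axis=1).shape)       # (2, 4): collapses the row axis
print(np.std(data, axis=(1, 2)).shape)  # (2,): one value per batch
```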
Working with Missing Data
Real-world datasets often contain missing values. NumPy provides np.nanvar() and np.nanstd() to handle NaN values gracefully.
import numpy as np
data = np.array([12, 15, np.nan, 18, 21, np.nan, 24, 27])
# Standard functions return nan
regular_std = np.std(data)
print(f"Regular std (with NaN): {regular_std}")
# Nan-aware functions ignore NaN values
nan_std = np.nanstd(data, ddof=1)
nan_var = np.nanvar(data, ddof=1)
print(f"NaN-aware std: {nan_std:.2f}")
print(f"NaN-aware var: {nan_var:.2f}")
# Multi-dimensional with NaN
matrix = np.array([
[1.0, 2.0, np.nan],
[4.0, np.nan, 6.0],
[7.0, 8.0, 9.0]
])
col_std = np.nanstd(matrix, axis=0)
print(f"Column std devs (ignoring NaN): {col_std}")
The nanvar() and nanstd() functions calculate statistics by excluding NaN values from both the numerator and denominator, effectively reducing the count of valid observations.
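This exclusion can be verified manually. The sketch below (my addition) drops the NaN entries with a boolean mask and shows that the regular functions applied to the remaining values match the NaN-aware results:

```python
import numpy as np

data = np.array([12, 15, np.nan, 18, 21, np.nan, 24, 27])

# Keep only the valid observations: 6 of the original 8 values remain
valid = data[~np.isnan(data)]
manual_std = np.std(valid, ddof=1)

print(len(valid))                                        # 6
print(np.isclose(manual_std, np.nanstd(data, ddof=1)))   # True
```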
Weighted Variance and Standard Deviation
When observations have different importance, use weighted statistics. NumPy doesn’t provide built-in weighted variance, but implementation is straightforward.
import numpy as np
def weighted_var(values, weights, ddof=0):
    """Calculate weighted variance."""
    average = np.average(values, weights=weights)
    variance = np.average((values - average) ** 2, weights=weights)
    if ddof != 0:
        # Apply Bessel's correction for weighted samples
        sum_weights = np.sum(weights)
        sum_weights_squared = np.sum(weights ** 2)
        variance *= sum_weights / (sum_weights - ddof * sum_weights_squared / sum_weights)
    return variance

def weighted_std(values, weights, ddof=0):
    """Calculate weighted standard deviation."""
    return np.sqrt(weighted_var(values, weights, ddof))
# Example: test scores with different credit weights
scores = np.array([85, 92, 78, 88, 95])
credits = np.array([3, 4, 3, 2, 4])
w_var = weighted_var(scores, credits)
w_std = weighted_std(scores, credits)
print(f"Weighted variance: {w_var:.2f}")
print(f"Weighted std dev: {w_std:.2f}")
# Compare with unweighted
print(f"Unweighted std dev: {np.std(scores):.2f}")
Weighted statistics are essential in scenarios like portfolio analysis, survey data with sampling weights, or aggregating metrics with different confidence levels.
Performance Considerations
NumPy’s compiled C implementation delivers significant performance advantages over pure Python, especially for large datasets.
import numpy as np
import time
# Large dataset
data = np.random.randn(1000000)
# NumPy approach
start = time.perf_counter()
np_std = np.std(data)
np_time = time.perf_counter() - start
# Pure Python approach
start = time.perf_counter()
mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)
py_std = variance ** 0.5
py_time = time.perf_counter() - start
print(f"NumPy time: {np_time:.4f}s")
print(f"Python time: {py_time:.4f}s")
print(f"Speedup: {py_time / np_time:.1f}x")
print(f"Results match: {np.isclose(np_std, py_std)}")
For production systems processing large datasets or requiring real-time statistics, NumPy’s vectorized operations provide orders of magnitude performance improvements.
Practical Application: Outlier Detection
Standard deviation forms the basis for statistical outlier detection through z-scores.
import numpy as np
def detect_outliers(data, threshold=3):
    """Detect outliers using z-score method."""
    mean = np.mean(data)
    std = np.std(data)
    z_scores = np.abs((data - mean) / std)
    return np.where(z_scores > threshold)[0]
# Dataset with outliers
measurements = np.array([
23, 25, 24, 26, 25, 24, 23, 25,
24, 26, 95, 24, 25, 23, 24
])
outlier_indices = detect_outliers(measurements, threshold=2.5)
print(f"Outlier indices: {outlier_indices}")
print(f"Outlier values: {measurements[outlier_indices]}")
# Robust statistics excluding outliers
clean_data = np.delete(measurements, outlier_indices)
print(f"\nOriginal mean: {np.mean(measurements):.2f}")
print(f"Original std: {np.std(measurements):.2f}")
print(f"Cleaned mean: {np.mean(clean_data):.2f}")
print(f"Cleaned std: {np.std(clean_data):.2f}")
This z-score approach assumes normal distribution. For non-normal data, consider robust alternatives like median absolute deviation (MAD) or interquartile range (IQR) methods.
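As a hedged sketch of the MAD alternative (my addition; the 3.5 threshold and the 0.6745 scaling constant follow a common modified-z-score convention, not anything stated above):

```python
import numpy as np

def detect_outliers_mad(data, threshold=3.5):
    """Detect outliers using the modified z-score based on median absolute deviation."""
    median = np.median(data)
    mad = np.median(np.abs(data - median))
    # 0.6745 rescales MAD so the score is comparable to a standard z-score
    modified_z = 0.6745 * (data - median) / mad
    return np.where(np.abs(modified_z) > threshold)[0]

measurements = np.array([
    23, 25, 24, 26, 25, 24, 23, 25,
    24, 26, 95, 24, 25, 23, 24
])
print(detect_outliers_mad(measurements))  # [10] -- the 95 reading
```

Because the median and MAD are barely affected by the extreme value, this method flags the same outlier without the outlier itself inflating the detection threshold, which is the weakness of the plain z-score approach.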
Keeping Dimensions with keepdims
The keepdims parameter preserves array dimensions after reduction operations, simplifying broadcasting operations.
import numpy as np
data = np.array([
[10, 20, 30],
[40, 50, 60],
[70, 80, 90]
])
# Without keepdims (returns 1D array)
std_no_keep = np.std(data, axis=1)
print(f"Shape without keepdims: {std_no_keep.shape}")
print(std_no_keep)
# With keepdims (preserves 2D structure)
std_keep = np.std(data, axis=1, keepdims=True)
print(f"\nShape with keepdims: {std_keep.shape}")
print(std_keep)
# Enables direct broadcasting for normalization
normalized = (data - np.mean(data, axis=1, keepdims=True)) / std_keep
print(f"\nNormalized data shape: {normalized.shape}")
print(normalized)
Using keepdims=True eliminates the need for manual reshaping when performing element-wise operations between the original array and computed statistics.
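One related detail worth noting (my addition, not from the original text): for axis=0 the reduced 1-D result already broadcasts against the 2-D array under NumPy’s broadcasting rules, so keepdims is strictly required only for reductions over trailing-to-leading axis mismatches like axis=1 above. A column-wise standardization works without it:

```python
import numpy as np

data = np.array([
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90]
])

# Shape (3,) broadcasts against shape (3, 3) along axis=0 without keepdims
col_standardized = (data - data.mean(axis=0)) / data.std(axis=0)
print(col_standardized.mean(axis=0))  # each column now has mean ~0
print(col_standardized.std(axis=0))   # each column now has std 1
```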