NumPy - np.median() with Examples
Key Insights
- np.median() computes the 50th percentile of array elements, returning a middle value that, unlike mean-based statistics, is robust to outliers
- The function supports multi-dimensional arrays with axis-specific calculations, NaN handling through np.nanmedian(), and memory-efficient computation for large datasets
- Understanding how the median is calculated for odd- and even-length arrays, and leveraging the keepdims parameter, enables sophisticated data analysis workflows
Basic Median Calculation
The np.median() function calculates the median value of array elements. For arrays with odd length, it returns the middle element. For even-length arrays, it returns the average of the two middle elements.
import numpy as np
# Odd number of elements
arr_odd = np.array([1, 3, 5, 7, 9])
median_odd = np.median(arr_odd)
print(f"Median (odd): {median_odd}") # Output: 5.0
# Even number of elements
arr_even = np.array([1, 3, 5, 7, 9, 11])
median_even = np.median(arr_even)
print(f"Median (even): {median_even}") # Output: 6.0 (average of 5 and 7)
# Unsorted array - no pre-sorting required
arr_unsorted = np.array([9, 1, 5, 3, 7])
median_unsorted = np.median(arr_unsorted)
print(f"Median (unsorted): {median_unsorted}") # Output: 5.0
The function handles unsorted data automatically, eliminating the need for manual sorting. It works on an internal copy of the data, so the original array is not modified unless you pass overwrite_input=True.
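To check that the input really is left untouched, here is a minimal sketch. The array values and variable names are illustrative; overwrite_input is an existing np.median() parameter that lets NumPy reuse the input buffer.

```python
import numpy as np

arr = np.array([9, 1, 5, 3, 7])
snapshot = arr.copy()

median_default = np.median(arr)
# Default call: the input keeps its original order
print(np.array_equal(arr, snapshot))  # True

# overwrite_input=True allows NumPy to use the input as scratch space,
# so the element order of `scratch` afterwards should not be relied on
scratch = arr.copy()
median_inplace = np.median(scratch, overwrite_input=True)
print(median_default == median_inplace)  # True
```

The overwrite option can save a copy on large arrays, but only when the input is no longer needed in its original order.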
Multi-Dimensional Arrays and Axis Parameter
Computing medians along specific axes enables column-wise or row-wise analysis of matrices and higher-dimensional tensors.
# 2D array
data_2d = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
# Median of all elements (flattened)
median_all = np.median(data_2d)
print(f"Overall median: {median_all}") # Output: 5.0
# Median along axis 0 (column-wise)
median_cols = np.median(data_2d, axis=0)
print(f"Column medians: {median_cols}") # Output: [4. 5. 6.]
# Median along axis 1 (row-wise)
median_rows = np.median(data_2d, axis=1)
print(f"Row medians: {median_rows}") # Output: [2. 5. 8.]
# 3D array example
data_3d = np.random.randint(0, 100, size=(3, 4, 5))
median_axis2 = np.median(data_3d, axis=2)
print(f"Shape after axis=2 median: {median_axis2.shape}") # Output: (3, 4)
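The axis parameter follows NumPy’s standard convention: the named axis is the one that gets collapsed. axis=0 reduces over the rows (moving down each column) and yields one result per column, while axis=1 reduces across the columns (moving along each row) and yields one result per row.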
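A non-square array makes the axis convention easier to see, because the result shape shows which dimension was collapsed. The arrays below are illustrative only. The last call passes a tuple of axes, which np.median() also accepts.

```python
import numpy as np

rect = np.array([[1, 2, 3],
                 [4, 5, 6]])  # shape (2, 3)

per_column = np.median(rect, axis=0)  # collapses the 2 rows
per_row = np.median(rect, axis=1)     # collapses the 3 columns
print(per_column)  # [2.5 3.5 4.5]
print(per_row)     # [2. 5.]

# A tuple of axes reduces over several dimensions at once
cube = np.arange(24).reshape(2, 3, 4)
per_last_axis = np.median(cube, axis=(0, 1))
print(per_last_axis.shape)  # (4,)
```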
Handling NaN Values
Standard np.median() propagates NaN values through calculations. Use np.nanmedian() to ignore NaN values during computation.
# Array with NaN values
data_with_nan = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
# Standard median - returns NaN
median_standard = np.median(data_with_nan)
print(f"Standard median: {median_standard}") # Output: nan
# NaN-aware median
median_nanaware = np.nanmedian(data_with_nan)
print(f"NaN-aware median: {median_nanaware}") # Output: 3.0
# Multi-dimensional with NaN
data_2d_nan = np.array([[1.0, np.nan, 3.0],
[4.0, 5.0, np.nan],
[7.0, 8.0, 9.0]])
median_cols_nan = np.nanmedian(data_2d_nan, axis=0)
print(f"Column medians (NaN ignored): {median_cols_nan}") # Output: [4. 6.5 6.]
This distinction matters significantly in real-world datasets where missing values are common. Using the wrong function can silently corrupt entire analyses.
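One way to avoid picking the wrong function by accident is to count missing values first and make the choice explicit. This is a sketch with hypothetical sample data, not a required pattern.

```python
import numpy as np

# Hypothetical readings with two missing entries
readings = np.array([[1.0, np.nan, 3.0],
                     [4.0, 5.0, np.nan]])

nan_count = int(np.isnan(readings).sum())
print(f"Missing values: {nan_count}")  # Missing values: 2

# Choose the NaN-aware variant only when it is actually needed
if nan_count > 0:
    col_medians = np.nanmedian(readings, axis=0)
else:
    col_medians = np.median(readings, axis=0)

print(col_medians)  # [2.5 5.  3. ]
```

Reporting the count alongside the result also makes it visible how much data each median is based on.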
keepdims Parameter for Broadcasting
The keepdims parameter retains the reduced axis as a dimension of size one, so the result keeps the same number of dimensions as the input and broadcasts correctly against it.
data = np.array([[10, 20, 30],
[40, 50, 60],
[70, 80, 90]])
# Without keepdims (default)
median_no_keepdims = np.median(data, axis=1)
print(f"Shape without keepdims: {median_no_keepdims.shape}") # Output: (3,)
print(median_no_keepdims) # Output: [20. 50. 80.]
# With keepdims
median_keepdims = np.median(data, axis=1, keepdims=True)
print(f"Shape with keepdims: {median_keepdims.shape}") # Output: (3, 1)
print(median_keepdims) # Output: [[20.] [50.] [80.]]
# Broadcasting application - subtract row medians from each row
centered_data = data - median_keepdims
print("Centered data:")
print(centered_data)
# Output:
# [[-10. 0. 10.]
# [-10. 0. 10.]
# [-10. 0. 10.]]
This pattern enables vectorized operations without explicit loops, maintaining NumPy’s performance advantages.
Weighted Median Calculation
NumPy doesn’t provide built-in weighted median functionality, but you can implement it efficiently using NumPy operations.
def weighted_median(data, weights):
    """
    Calculate weighted median.
    Args:
        data: numpy array of values
        weights: numpy array of weights (same shape as data)
    Returns:
        Weighted median value
    """
    # Sort data and weights by data values
    sorted_indices = np.argsort(data)
    sorted_data = data[sorted_indices]
    sorted_weights = weights[sorted_indices]
    # Calculate cumulative sum of weights
    cumsum = np.cumsum(sorted_weights)
    # Find the point where cumulative weight crosses 50%
    midpoint = 0.5 * cumsum[-1]
    # Find index where cumsum >= midpoint
    median_idx = np.where(cumsum >= midpoint)[0][0]
    return sorted_data[median_idx]
# Example usage
values = np.array([1, 2, 3, 4, 5])
weights = np.array([1, 1, 3, 1, 1]) # Value 3 has higher weight
wmedian = weighted_median(values, weights)
print(f"Weighted median: {wmedian}") # Output: 3
print(f"Regular median: {np.median(values)}") # Output: 3.0
Weighted medians prove essential in scenarios like survey analysis where responses have different reliability scores or sample sizes.
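One caveat with a cumulative-weight approach: it returns an actual data value, so with equal weights on an even-length array it gives the lower of the two middle elements rather than their average. The sketch below uses a hypothetical helper, weighted_median_lower, which applies the same idea with np.searchsorted.

```python
import numpy as np

def weighted_median_lower(data, weights):
    """Hypothetical helper: lower weighted median via cumulative weights."""
    data = np.asarray(data)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(data)
    cumulative = np.cumsum(weights[order])
    # First position whose cumulative weight reaches half of the total
    idx = np.searchsorted(cumulative, 0.5 * cumulative[-1])
    return data[order][idx]

values = np.array([1, 2, 3, 4])
equal_weights = np.ones(4)

lower = weighted_median_lower(values, equal_weights)
plain = np.median(values)
print(lower)  # 2
print(plain)  # 2.5
```

If agreement with np.median() under equal weights matters, the two neighbouring values can be averaged whenever the cumulative weight lands exactly on the midpoint.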
Performance Optimization Strategies
Understanding computational complexity and memory usage helps optimize median calculations for large datasets.
import time
# Performance comparison for large arrays
sizes = [10**4, 10**5, 10**6]
for size in sizes:
    data_large = np.random.randn(size)
    # Time median calculation
    start = time.perf_counter()
    median_result = np.median(data_large)
    elapsed = time.perf_counter() - start
    print(f"Size: {size:>8}, Time: {elapsed:.6f}s, Median: {median_result:.4f}")
# Output (approximate):
# Size: 10000, Time: 0.000234s, Median: -0.0012
# Size: 100000, Time: 0.002341s, Median: 0.0023
# Size: 1000000, Time: 0.024567s, Median: -0.0001
# Equivalent computation via the 50th percentile
# (the method= keyword requires NumPy 1.22 or later; 'linear' is the default)
data_huge = np.random.randn(10**7)
median_percentile = np.percentile(data_huge, 50, method='linear')
median_direct = np.median(data_huge)
print(f"Percentile method: {median_percentile:.6f}")
print(f"Direct method: {median_direct:.6f}")
The np.median() function is built on np.partition(), a selection routine, so it runs in roughly linear time on average rather than the O(n log n) of a full sort. For repeated median calculations on similar data, consider pre-sorting or using approximate algorithms.
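A median does not need a fully sorted array, only the middle element or elements in their final positions. The sketch below illustrates this with np.partition(). The helper median_via_partition is hypothetical, handles only non-empty 1-D input, and is meant as an illustration rather than a replacement for np.median().

```python
import numpy as np

def median_via_partition(values):
    """Hypothetical helper: exact median of a non-empty 1-D array."""
    values = np.asarray(values)
    n = values.size
    mid = n // 2
    if n % 2:
        # Odd length: only the middle element has to be in place
        return float(np.partition(values, mid)[mid])
    # Even length: place both middle elements, then average them
    part = np.partition(values, [mid - 1, mid])
    return 0.5 * (part[mid - 1] + part[mid])

data = np.array([9, 1, 5, 3, 7, 11])
selected = median_via_partition(data)
print(selected)  # 6.0
```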
Real-World Application: Outlier Detection
Median-based statistics provide more robust outlier detection than mean-based approaches, because a few extreme values barely move the median.
# Generate data with outliers
np.random.seed(42)
normal_data = np.random.normal(50, 10, 95)
outliers = np.array([150, 200, -50, 180, 160])
data_with_outliers = np.concatenate([normal_data, outliers])
# Compare mean vs median
mean_value = np.mean(data_with_outliers)
median_value = np.median(data_with_outliers)
print(f"Mean: {mean_value:.2f}") # Pulled upward by the extreme values
print(f"Median: {median_value:.2f}") # Stays close to the centre of the normal data (around 50)
# Median Absolute Deviation (MAD) for outlier detection
mad = np.median(np.abs(data_with_outliers - median_value))
modified_z_score = 0.6745 * (data_with_outliers - median_value) / mad
# Identify outliers (|modified_z_score| > 3.5)
outlier_mask = np.abs(modified_z_score) > 3.5
outlier_indices = np.where(outlier_mask)[0]
print(f"Detected outliers at indices: {outlier_indices}")
print(f"Outlier values: {data_with_outliers[outlier_mask]}")
This MAD-based approach provides a robust alternative to standard deviation-based methods, particularly when data contains extreme values that would inflate variance estimates.
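The same steps can be wrapped in a reusable function. The sketch below uses a hypothetical helper, mad_outlier_mask, and adds one guard that is an assumption on top of the steps above: when more than half the values are identical the MAD is zero, and the helper then flags nothing instead of dividing by zero.

```python
import numpy as np

def mad_outlier_mask(values, threshold=3.5):
    """Hypothetical helper: boolean mask of modified-z-score outliers."""
    values = np.asarray(values, dtype=float)
    center = np.median(values)
    mad = np.median(np.abs(values - center))
    if mad == 0:
        # Assumption: with zero spread, treat no point as an outlier
        return np.zeros(values.shape, dtype=bool)
    modified_z = 0.6745 * (values - center) / mad
    return np.abs(modified_z) > threshold

sample = np.array([10.0, 11.0, 9.0, 10.5, 9.5, 100.0])
mask = mad_outlier_mask(sample)
print(sample[mask])  # [100.]
```

How to treat the zero-MAD case depends on the data; flagging every value that differs from the median is another reasonable choice.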
Practical Considerations
Always verify data type compatibility. For integer arrays, np.median() returns a float64 result even when the median is a whole number. np.median() already relies on partial selection internally, and np.partition() exposes the same mechanism when you need a specific order statistic without a full sort. When working with time-series data, apply rolling median calculations using stride tricks or pandas integration for smoothing; a centered window avoids the lag that a trailing window introduces. The median’s resistance to outliers makes it superior to the mean for skewed distributions, but this same property means it discards information about distribution tails that might be analytically important.
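As an illustration of the rolling median mentioned above, here is a minimal sketch using numpy.lib.stride_tricks.sliding_window_view, which is available in NumPy 1.20 and later. The signal values and the window size of 3 are arbitrary choices for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Hypothetical signal with a single spike at index 2
signal = np.array([1.0, 2.0, 50.0, 3.0, 4.0, 5.0, 6.0])

# Each row of `windows` is one length-3 window over the signal
windows = sliding_window_view(signal, window_shape=3)
rolling = np.median(windows, axis=1)

print(windows.shape)  # (5, 3)
print(rolling)        # [2. 3. 4. 4. 5.]
```

The output is shorter than the input by the window size minus one. Aligning each result with the middle of its window gives centered smoothing, while aligning it with the window's last element gives a trailing (lagged) filter.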