How to Calculate the Median in NumPy
Key Insights
- NumPy’s np.median() function calculates the middle value of sorted data, making it essential for statistical analysis where outliers could skew your results.
- Use the axis parameter to compute medians along specific dimensions of multi-dimensional arrays, and switch to np.nanmedian() when your data contains missing values.
- The median costs more to compute than the mean, because it has to locate the middle of the ordered data instead of summing in a single pass. NumPy does this with partition-based selection rather than a full sort, and the median's robustness against outliers makes it the better choice for real-world datasets with anomalies.
Introduction
The median represents the middle value in a sorted dataset. If you have an odd number of values, it’s the exact center element. With an even number, it’s the average of the two center elements. This simple concept becomes surprisingly powerful in practice.
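The odd and even cases can be written out by hand. The sketch below is only illustrative: the helper name manual_median is made up for this example, and it assumes a non-empty sequence of numbers. In practice np.median() does this work for you.

```python
import numpy as np

def manual_median(values):
    """Median by definition: sort, then take the middle (hypothetical helper)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single center element
        return float(ordered[mid])
    # Even count: average of the two center elements
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_values = [7, 1, 5]       # sorted: 1, 5, 7
even_values = [7, 1, 5, 3]   # sorted: 1, 3, 5, 7

odd_result = manual_median(odd_values)     # 5.0
even_result = manual_median(even_values)   # 4.0

print(odd_result, np.median(odd_values))
print(even_result, np.median(even_values))
```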
Why should you care about the median when you could just use the mean? Because real-world data is messy. A single outlier—a sensor malfunction reading 999999, a billionaire in your salary survey, or a bot generating thousands of page views—can completely distort your mean. The median shrugs off these anomalies and gives you a representative value.
NumPy provides efficient, vectorized functions for median calculations that handle everything from simple lists to complex multi-dimensional arrays. Let’s explore how to use them effectively.
Basic Median Calculation with np.median()
The np.median() function is your primary tool for calculating medians in NumPy. Its syntax is straightforward:
import numpy as np
# Simple 1D array with odd number of elements
temperatures = np.array([72, 68, 75, 71, 69, 74, 70])
median_temp = np.median(temperatures)
print(f"Median temperature: {median_temp}") # Output: 71.0
# Even number of elements - returns average of two middle values
prices = np.array([29.99, 45.50, 32.00, 55.75, 28.50, 41.25])
median_price = np.median(prices)
print(f"Median price: {median_price}") # Output: 36.625
Notice that np.median() always returns a float, even when working with integer arrays. This ensures precision when averaging the two middle values in even-length arrays.
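A quick way to confirm the return type for integer input is to inspect the result directly. This is a small sketch with made-up values:

```python
import numpy as np

int_values = np.array([1, 2, 3, 4])     # integer dtype
even_median = np.median(int_values)     # middle values are 2 and 3
odd_median = np.median(np.array([1, 2, 3]))

# Both results are NumPy floating-point scalars, not integers
even_is_float = isinstance(even_median, np.floating)
odd_is_float = isinstance(odd_median, np.floating)

print(even_median, type(even_median).__name__)
print(odd_median, type(odd_median).__name__)
```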
You can also pass Python lists directly—NumPy converts them internally:
# Works with regular Python lists
response_times = [120, 95, 180, 88, 102, 115, 98]
median_response = np.median(response_times)
print(f"Median response time: {median_response}ms") # Output: 102.0
However, if you’re performing multiple operations, convert to a NumPy array first to avoid repeated conversion overhead.
Median of Multi-Dimensional Arrays
Real data often comes in matrices or higher-dimensional structures. The axis parameter lets you compute medians along specific dimensions.
import numpy as np
# Sales data: 4 quarters (rows) x 3 products (columns)
sales = np.array([
[150, 200, 175], # Q1
[180, 190, 160], # Q2
[165, 210, 185], # Q3
[170, 195, 170] # Q4
])
# Median across all values (flattened)
overall_median = np.median(sales)
print(f"Overall median: {overall_median}") # Output: 177.5
# Median along axis=0 (down columns) - median for each product
product_medians = np.median(sales, axis=0)
print(f"Product medians: {product_medians}") # Output: [167.5 197.5 172.5]
# Median along axis=1 (across rows) - median for each quarter
quarter_medians = np.median(sales, axis=1)
print(f"Quarter medians: {quarter_medians}") # Output: [175. 180. 185. 170.]
The axis parameter follows NumPy’s standard convention:
- axis=None (default): Flatten the array and compute a single median
- axis=0: Collapse the rows (result has one value per column)
- axis=1: Collapse the columns (result has one value per row)
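One way to keep the convention straight is to look at the shape of each result: the axis you pass is the one that disappears. A small sketch on a made-up 4 x 3 array:

```python
import numpy as np

grid = np.arange(12).reshape(4, 3)      # 4 rows x 3 columns, values 0..11

flat = np.median(grid)                  # single value for the whole array
per_column = np.median(grid, axis=0)    # the 4 rows collapse -> shape (3,)
per_row = np.median(grid, axis=1)       # the 3 columns collapse -> shape (4,)

print(grid.shape, per_column.shape, per_row.shape)
```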
For higher-dimensional arrays, you can specify multiple axes as a tuple:
# 3D array: 2 stores x 4 quarters x 3 products
store_sales = np.array([
[[150, 200, 175], [180, 190, 160], [165, 210, 185], [170, 195, 170]],
[[140, 185, 165], [175, 200, 155], [160, 195, 180], [165, 190, 168]]
])
# Median across quarters and products for each store
store_medians = np.median(store_sales, axis=(1, 2))
print(f"Store medians: {store_medians}") # Output: [177.5 171.5]
Handling Missing Data with np.nanmedian()
Missing data is inevitable in real-world applications. NumPy represents missing values as np.nan (Not a Number), but these values propagate through standard calculations:
import numpy as np
# Sensor readings with missing values
readings = np.array([23.5, 24.1, np.nan, 22.8, 25.0, np.nan, 24.5])
# Standard median returns NaN if any NaN exists
standard_median = np.median(readings)
print(f"np.median result: {standard_median}") # Output: nan
# nanmedian ignores NaN values
safe_median = np.nanmedian(readings)
print(f"np.nanmedian result: {safe_median}") # Output: 24.1
The np.nanmedian() function ignores NaN values and computes the median of the remaining elements. This works identically with multi-dimensional arrays:
# Survey responses with missing data
survey_data = np.array([
[4, 5, np.nan, 3],
[5, np.nan, 4, 4],
[3, 4, 5, np.nan]
])
# Median per question (column), ignoring missing responses
question_medians = np.nanmedian(survey_data, axis=0)
print(f"Question medians: {question_medians}") # Output: [4. 4.5 4.5 3.5]
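For a 1D array, ignoring NaN values amounts to filtering them out before taking the median. The sketch below shows that equivalence with a boolean mask, reusing the sensor readings from earlier. It is an illustration of the idea, not a description of how np.nanmedian() is implemented.

```python
import numpy as np

readings = np.array([23.5, 24.1, np.nan, 22.8, 25.0, np.nan, 24.5])

is_missing = np.isnan(readings)          # True where a reading is NaN
valid = readings[~is_missing]            # keep only the real readings

masked_median = np.median(valid)         # median of the filtered values
nan_median = np.nanmedian(readings)      # same result in one call
missing_count = int(is_missing.sum())    # how many readings were dropped

print(masked_median, nan_median, missing_count)
```

Counting the missing entries alongside the median is worth doing: a median computed from a handful of surviving values deserves less trust than one computed from the full sample.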
One important consideration: if an entire row or column contains only NaN values, np.nanmedian() returns NaN for that slice and emits a RuntimeWarning. Handle this case explicitly if it’s possible in your data:
import warnings
data_with_empty_column = np.array([
[1, np.nan],
[2, np.nan],
[3, np.nan]
])
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    result = np.nanmedian(data_with_empty_column, axis=0)
print(f"Result: {result}") # Output: [ 2. nan]
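Suppressing the warning still leaves NaN in the result. An alternative is to detect all-NaN columns up front and compute medians only where data exists. In this sketch the fallback value of 0.0 is an assumption for illustration; choose whatever placeholder makes sense for your data.

```python
import numpy as np

data = np.array([
    [1.0, np.nan],
    [2.0, np.nan],
    [3.0, np.nan],
])

# True for every column that contains no valid values at all
all_nan = np.isnan(data).all(axis=0)

fallback = 0.0  # assumed placeholder for columns without data
column_medians = np.full(data.shape[1], fallback)

# Compute medians only for columns with at least one real value,
# so no all-NaN slice is ever passed to np.nanmedian
column_medians[~all_nan] = np.nanmedian(data[:, ~all_nan], axis=0)

print(all_nan, column_medians)
```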
Median vs Mean: When to Use Each
The median’s resistance to outliers is its defining advantage. Here’s a concrete example that demonstrates why this matters:
import numpy as np
# Employee salaries at a small company
salaries = np.array([
45000, 52000, 48000, 55000, 51000, # Regular employees
47000, 53000, 49000, 54000, 50000,
2500000 # CEO salary
])
mean_salary = np.mean(salaries)
median_salary = np.median(salaries)
print(f"Mean salary: ${mean_salary:,.2f}") # Output: $273,090.91
print(f"Median salary: ${median_salary:,.2f}") # Output: $51,000.00
The mean suggests employees earn over $270,000—completely misleading. The median correctly represents what a typical employee earns.
Use the mean when:
- Your data is normally distributed without significant outliers
- You need to calculate totals (mean × count = sum)
- Outliers represent valid, important data points
Use the median when:
- Your data is skewed or contains outliers
- You want a “typical” value that represents the center
- You’re analyzing income, house prices, response times, or other naturally skewed distributions
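To see the difference in sensitivity directly, hold a sample fixed and let a single extra value grow. The figures below are made up for illustration:

```python
import numpy as np

base = np.array([45000, 47000, 48000, 49000, 50000, 51000, 52000], dtype=float)

results = []
for outlier in (60_000, 600_000, 6_000_000):
    sample = np.append(base, outlier)   # 8 values: 7 typical plus 1 extreme
    results.append((outlier, np.mean(sample), np.median(sample)))

for outlier, mean_val, median_val in results:
    print(f"outlier={outlier:>9,}  mean={mean_val:>12,.2f}  median={median_val:,.2f}")
```

The mean climbs with every step, while the median stays at 49,500 because the two middle values of the sorted sample never change.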
For a robust analysis, calculate both and compare them. A large divergence signals skewed data:
def analyze_distribution(data, name="Data"):
    mean_val = np.mean(data)
    median_val = np.median(data)
    skew_indicator = (mean_val - median_val) / median_val * 100
    print(f"{name}:")
    print(f" Mean: {mean_val:.2f}")
    print(f" Median: {median_val:.2f}")
    print(f" Divergence: {skew_indicator:+.1f}%")
    if abs(skew_indicator) > 10:
        print(" ⚠️ Significant skew detected - prefer median")
analyze_distribution(salaries, "Salaries")
Performance Considerations
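Computing the median costs more than computing the mean. The mean needs a single O(n) pass over the data, while the median has to locate the middle of the ordered values. NumPy does this with partition-based selection rather than a full O(n log n) sort, so the cost grows roughly linearly with n, but with a larger constant factor and an extra working copy of the data. For most applications, this difference is negligible, but it matters with very large datasets or performance-critical code.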
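Finding the middle element does not strictly require ordering everything. As a sketch of the idea, np.partition() places the k-th smallest value at index k without sorting the rest. An odd-length array is assumed here to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=1001)      # odd length: exactly one middle element

k = values.size // 2
# After partitioning, index k holds the value a full sort would put there;
# elements before it are <= that value and elements after it are >= it
partitioned = np.partition(values, k)
median_via_partition = partitioned[k]

print(median_via_partition, np.median(values))
```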
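The out parameter lets you write results into a pre-allocated array instead of allocating a new one on every call, which can reduce memory allocation overhead in loops. The output array must match the shape of the result, so for a 1D input (where the median is a single value) it has to be zero-dimensional:
import numpy as np
data_batches = [np.random.randn(1000) for _ in range(100)]
# Pre-allocate a 0-d output array (the median of a 1D array is a scalar)
result = np.empty(())
# Reuse the same output array
medians = []
for batch in data_batches:
    np.median(batch, out=result)
    medians.append(float(result))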
For multi-dimensional operations, size the output array appropriately:
# 2D data: 1000 samples x 50 features
large_data = np.random.randn(1000, 50)
# Pre-allocate for per-feature medians
feature_medians = np.empty(50)
np.median(large_data, axis=0, out=feature_medians)
If you need the median repeatedly on streaming data, consider approximate algorithms or maintaining a sorted structure. NumPy’s np.median() recomputes from scratch each time.
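One common way to maintain a median incrementally is the two-heap technique: keep the smaller half of the values in a max-heap and the larger half in a min-heap. The class below is a minimal sketch of that idea using Python's heapq module. RunningMedian is a hypothetical helper written for this example, not part of NumPy.

```python
import heapq
import numpy as np

class RunningMedian:
    """Streaming median using two heaps (hypothetical helper, not a NumPy API)."""

    def __init__(self):
        self._low = []    # max-heap of the smaller half (values stored negated)
        self._high = []   # min-heap of the larger half

    def add(self, value):
        # Push onto the lower half, then move its largest element up
        heapq.heappush(self._low, -value)
        heapq.heappush(self._high, -heapq.heappop(self._low))
        # Rebalance so the lower half is never smaller than the upper half
        if len(self._high) > len(self._low):
            heapq.heappush(self._low, -heapq.heappop(self._high))

    def median(self):
        if not self._low:
            raise ValueError("no values added yet")
        if len(self._low) > len(self._high):
            return float(-self._low[0])           # odd count: top of lower half
        return (-self._low[0] + self._high[0]) / 2  # even count: average of tops

stream = [120, 95, 180, 88, 102, 115, 98]
tracker = RunningMedian()
running = []
for value in stream:
    tracker.add(value)
    running.append(tracker.median())

print(running)
```

Each insertion costs O(log n) and reading the median is O(1), at the price of keeping every value in memory. Approximate methods trade exactness for a smaller footprint.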
The keepdims parameter preserves the original array’s dimensionality, which is useful for broadcasting:
data = np.random.randn(100, 50)
medians = np.median(data, axis=0, keepdims=True)
centered_data = data - medians # Broadcasting works correctly
print(f"Centered data shape: {centered_data.shape}") # Output: (100, 50)
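The difference shows up most clearly with axis=1, where the result without keepdims no longer lines up with the original array. A short sketch on a made-up 3 x 4 array:

```python
import numpy as np

data = np.arange(12, dtype=float).reshape(3, 4)   # 3 rows x 4 columns

row_medians_flat = np.median(data, axis=1)                  # shape (3,)
row_medians_kept = np.median(data, axis=1, keepdims=True)   # shape (3, 1)

# (3, 4) - (3, 1) broadcasts each row's median across that row
centered = data - row_medians_kept

try:
    data - row_medians_flat     # (3, 4) - (3,) has mismatched trailing axes
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

print(row_medians_flat.shape, row_medians_kept.shape, broadcast_failed)
```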
Conclusion
NumPy provides two essential functions for median calculations: np.median() for standard arrays and np.nanmedian() for data with missing values. Use the axis parameter to compute medians along specific dimensions of multi-dimensional arrays, and leverage keepdims when you need to preserve array shapes for broadcasting.
Choose the median over the mean when working with skewed distributions or datasets containing outliers, which describes most real-world data. When performance matters, pre-allocate output arrays using the out parameter and remember that the median costs more to compute than the mean, which can add up on very large datasets.
The median is a fundamental statistical tool that belongs in every data practitioner’s toolkit. Master these NumPy functions, and you’ll write more robust analyses that accurately represent your data.