How to Calculate the Median in Python

Key Insights

  • The median is more robust than the mean for skewed data or datasets with outliers—use it when you need a “typical” value that isn’t distorted by extremes.
  • Python’s built-in statistics module handles most use cases cleanly, but NumPy offers 10-100x better performance for large datasets.
  • Choose your tool based on context: statistics for simple scripts, NumPy for numerical computing, and Pandas when working with tabular data.

Introduction to the Median

The median is the middle value in a sorted dataset. Unlike the mean, which sums all values and divides by count, the median simply finds the centerpoint. This makes it resistant to outliers—a property statisticians call “robustness.”

Consider salary data. If you have five employees earning $50K, $55K, $60K, $65K, and $500K, the mean salary is $146K. That’s misleading—most employees earn far less. The median is $60K, which better represents the typical employee’s compensation.
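To see the effect in code, here's a quick sketch of that salary example using the standard-library statistics module (covered in detail below):

```python
import statistics

# Hypothetical salaries from the example above, in dollars
salaries = [50_000, 55_000, 60_000, 65_000, 500_000]

# One outlier drags the mean far above what most employees actually earn
print(statistics.mean(salaries))    # 146000
print(statistics.median(salaries))  # 60000
```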

Use the median when:

  • Your data is skewed (income, home prices, response times)
  • Outliers are present and shouldn’t dominate your analysis
  • You want a “typical” value rather than an average
  • You’re dealing with ordinal data where arithmetic means don’t make sense

Python gives you several ways to calculate the median. Let’s work through them from first principles to production-ready solutions.

Calculating Median Manually

Before reaching for libraries, understand the algorithm. The median calculation follows two rules:

  1. For odd-length lists: return the middle element
  2. For even-length lists: return the average of the two middle elements

Here’s a pure Python implementation:

def calculate_median(values):
    """Calculate the median of a list of numbers."""
    if not values:
        raise ValueError("Cannot calculate median of empty list")
    
    sorted_values = sorted(values)
    n = len(sorted_values)
    mid = n // 2
    
    if n % 2 == 1:
        # Odd length: return middle element
        return sorted_values[mid]
    else:
        # Even length: return average of two middle elements
        return (sorted_values[mid - 1] + sorted_values[mid]) / 2


# Test with odd-length list
odd_data = [3, 1, 4, 1, 5]
print(f"Odd list {sorted(odd_data)}: median = {calculate_median(odd_data)}")
# Output: Odd list [1, 1, 3, 4, 5]: median = 3

# Test with even-length list
even_data = [3, 1, 4, 1, 5, 9]
print(f"Even list {sorted(even_data)}: median = {calculate_median(even_data)}")
# Output: Even list [1, 1, 3, 4, 5, 9]: median = 3.5

This implementation has O(n log n) time complexity due to sorting. For most applications, that’s fine. If you need O(n) performance, look into the “median of medians” algorithm—but you’ll rarely need it in practice.
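If you're curious what that looks like, here's a minimal sketch of the expected-O(n) idea using randomized quickselect; the deterministic median-of-medians algorithm differs only in how the pivot is chosen. The function names here are illustrative, not standard:

```python
import random

def _kth_smallest(values, k):
    """Return the k-th smallest element (0-indexed) in expected O(n) time."""
    pivot = random.choice(values)
    lows = [v for v in values if v < pivot]
    pivots = [v for v in values if v == pivot]
    highs = [v for v in values if v > pivot]
    if k < len(lows):
        return _kth_smallest(lows, k)
    if k < len(lows) + len(pivots):
        return pivot
    return _kth_smallest(highs, k - len(lows) - len(pivots))

def median_quickselect(values):
    """Median without fully sorting the input."""
    n = len(values)
    if n == 0:
        raise ValueError("Cannot calculate median of empty list")
    if n % 2 == 1:
        return _kth_smallest(values, n // 2)
    return (_kth_smallest(values, n // 2 - 1) + _kth_smallest(values, n // 2)) / 2

print(median_quickselect([3, 1, 4, 1, 5]))     # 3
print(median_quickselect([3, 1, 4, 1, 5, 9]))  # 3.5
```

Note that the even-length case runs two selections, and the list comprehensions allocate on every level of recursion, so for small inputs sorting is often faster anyway.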

The manual approach is educational, but don’t use it in production. Python’s standard library handles edge cases you might miss.

Using the statistics Module

Python 3.4 introduced the statistics module, which provides a clean, tested implementation. It’s part of the standard library—no installation required.

import statistics

# Basic usage
data = [2, 3, 5, 7, 11, 13, 17]
print(statistics.median(data))  # Output: 7

# Works with floats
float_data = [1.5, 2.7, 3.2, 4.8]
print(statistics.median(float_data))  # Output: 2.95

# Works with Decimal for financial calculations
from decimal import Decimal
prices = [Decimal("19.99"), Decimal("24.99"), Decimal("29.99")]
print(statistics.median(prices))  # Output: 24.99

The module provides three additional median functions for specific use cases:

import statistics

even_data = [1, 2, 3, 4]

# Standard median (averages middle two for even-length)
print(statistics.median(even_data))  # Output: 2.5

# median_low: returns lower of two middle values
print(statistics.median_low(even_data))  # Output: 2

# median_high: returns higher of two middle values
print(statistics.median_high(even_data))  # Output: 3

# median_grouped: for continuous data in class intervals
# Assumes data points are centers of class intervals
continuous_data = [52, 52, 53, 54, 54, 54, 55, 55, 56]
print(statistics.median_grouped(continuous_data))  # Output: 54.0
print(statistics.median_grouped(continuous_data, interval=2))  # Output: 54.0

Use median_low or median_high when you need the median to be an actual value from your dataset—useful for ordinal data or when interpolation doesn’t make sense. median_grouped is specialized for binned continuous data, common in survey analysis.
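As an illustration with made-up ordinal data (hypothetical Likert-scale survey responses, where an interpolated value like 3.5 isn't a real answer category):

```python
import statistics

# 1 = strongly disagree ... 5 = strongly agree (hypothetical responses)
responses = [1, 2, 2, 3, 4, 4, 5, 5]

print(statistics.median(responses))       # 3.5 -- not an actual category
print(statistics.median_low(responses))   # 3
print(statistics.median_high(responses))  # 4
```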

The statistics module raises StatisticsError for empty sequences:

import statistics

try:
    statistics.median([])
except statistics.StatisticsError as e:
    print(f"Error: {e}")  # Output: Error: no median for empty data

Using NumPy for Performance

When you’re working with large datasets or doing numerical computing, NumPy is the right choice. It’s implemented in C and operates on contiguous memory blocks, making it dramatically faster than pure Python.

import numpy as np

# Basic usage
data = np.array([3, 1, 4, 1, 5, 9, 2, 6])
print(np.median(data))  # Output: 3.5

# Works on multi-dimensional arrays
matrix = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

# Median of all elements
print(np.median(matrix))  # Output: 5.0

# Median along axis 0 (columns)
print(np.median(matrix, axis=0))  # Output: [4. 5. 6.]

# Median along axis 1 (rows)
print(np.median(matrix, axis=1))  # Output: [2. 5. 8.]

Real-world data often contains missing values. NumPy represents these as NaN, and np.median() will return NaN if any value is missing. Use np.nanmedian() to ignore them:

import numpy as np

# Data with missing values
sensor_readings = np.array([23.1, 24.5, np.nan, 22.8, 25.2, np.nan, 24.1])

# Regular median propagates NaN
print(np.median(sensor_readings))  # Output: nan

# nanmedian ignores NaN values
print(np.nanmedian(sensor_readings))  # Output: 24.1

NumPy also supports the keepdims parameter to maintain array dimensions after reduction:

import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6]])

# Without keepdims
result = np.median(data, axis=1)
print(result.shape)  # Output: (2,)

# With keepdims - useful for broadcasting
result = np.median(data, axis=1, keepdims=True)
print(result.shape)  # Output: (2, 1)
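One place keepdims pays off is broadcasting the result back against the original array, for example to center each row on its own median (a sketch):

```python
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6]])

# The (2, 1) row medians broadcast against the (2, 3) input
centered = data - np.median(data, axis=1, keepdims=True)
print(centered)
# [[-1.  0.  1.]
#  [-1.  0.  1.]]
```

Without keepdims, subtracting the (2,) result would broadcast along the wrong axis (or fail for non-square shapes).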

Median with Pandas DataFrames

Pandas builds on NumPy and adds the tabular data structures you need for real-world analysis. The .median() method works on both Series and DataFrames.

import pandas as pd

# Create sample sales data
df = pd.DataFrame({
    'product': ['Widget', 'Widget', 'Widget', 'Gadget', 'Gadget', 'Gadget'],
    'region': ['North', 'South', 'North', 'South', 'North', 'South'],
    'sales': [150, 200, 175, 300, 250, 280],
    'units': [10, 15, 12, 8, 7, 9]
})

print(df)

Calculate the median of each numeric column (numeric_only=True excludes the text columns):

# Median of each numeric column
print(df.median(numeric_only=True))
# Output:
# sales    225.0
# units      9.5
# dtype: float64

# Median of a single column
print(df['sales'].median())  # Output: 225.0

The real power comes with groupby():

# Median sales by product
print(df.groupby('product')['sales'].median())
# Output:
# product
# Gadget    280.0
# Widget    175.0
# Name: sales, dtype: float64

# Median of all numeric columns by product
print(df.groupby('product').median(numeric_only=True))
# Output:
#          sales  units
# product              
# Gadget   280.0    8.0
# Widget   175.0   12.0

# Multiple grouping levels
print(df.groupby(['product', 'region']).median(numeric_only=True))

Pandas handles missing values gracefully by default:

import pandas as pd
import numpy as np

# Data with missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, np.nan]
})

# Median ignores NaN by default
print(df.median())
# Output:
# A    3.0
# B    3.0
# dtype: float64

# Use skipna=False to propagate NaN
print(df.median(skipna=False))
# Output:
# A    NaN
# B    NaN
# dtype: float64
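This default also enables a common cleanup pattern: imputing missing values with each column's median via fillna(). A sketch using the same hypothetical DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, np.nan]
})

# df.median() returns a per-column Series; fillna() matches it by column name
filled = df.fillna(df.median())
print(filled)
```

Median imputation is popular for the same robustness reason discussed earlier: unlike mean imputation, it isn't skewed by outliers in the observed values.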

Performance Considerations

Your choice of tool should match your data size and context. Here’s a practical comparison:

import time
import statistics
import numpy as np

def benchmark(func, data, iterations=100):
    start = time.perf_counter()
    for _ in range(iterations):
        func(data)
    elapsed = time.perf_counter() - start
    return elapsed / iterations * 1000  # milliseconds

# Small dataset (100 elements)
small_list = list(range(100))
small_array = np.array(small_list)

print("Small dataset (100 elements):")
print(f"  statistics.median: {benchmark(statistics.median, small_list):.4f} ms")
print(f"  np.median:         {benchmark(np.median, small_array):.4f} ms")

# Large dataset (1,000,000 elements)
large_list = list(range(1_000_000))
large_array = np.array(large_list)

print("\nLarge dataset (1,000,000 elements):")
print(f"  statistics.median: {benchmark(statistics.median, large_list, iterations=5):.2f} ms")
print(f"  np.median:         {benchmark(np.median, large_array, iterations=5):.2f} ms")

Typical results show NumPy is 10-100x faster for large datasets. However, for small datasets or one-off calculations, the difference is negligible—and statistics has no dependencies.

Choose based on context:

  • Pure Python / statistics: Simple scripts, small datasets, no NumPy dependency
  • NumPy: Numerical computing, large arrays, need axis operations or NaN handling
  • Pandas: Tabular data, grouped aggregations, data analysis workflows

Conclusion

Python offers multiple paths to the median, each suited to different contexts. Start with statistics.median() for simple cases—it’s in the standard library and handles edge cases correctly. Move to NumPy when performance matters or when you’re already in a numerical computing context. Use Pandas when you’re working with DataFrames and need grouped aggregations.

The median itself is a fundamental tool for understanding your data. Unlike the mean, it tells you where the center of your data actually lies, unswayed by extreme values. Master these implementations, and you’ll have a robust statistical tool ready for whatever data comes your way.
