How to Calculate Percentiles in NumPy
Key Insights
- NumPy’s np.percentile() function provides flexible percentile calculations with multiple interpolation methods to handle non-integer indices—choosing the right method depends on whether you need conservative estimates, liberal estimates, or smooth interpolation.
- The axis parameter transforms percentile calculations from single-array operations to powerful multi-dimensional analysis, enabling you to compute statistics across rows, columns, or any axis of your data.
- Combining the 25th and 75th percentiles into an Interquartile Range (IQR) calculation gives you a robust, outlier-resistant method for detecting anomalous data points that’s superior to standard deviation for skewed distributions.
What Percentiles Tell You About Your Data
Percentiles divide your data into 100 equal parts, answering the question: “What value falls below X% of my observations?” The median is the 50th percentile—half the data falls below it. The 90th percentile tells you the value that 90% of your data falls below.
This matters for practical data analysis. When you’re analyzing response times, the 95th percentile tells you the worst-case experience for most users. When examining salaries, percentiles reveal where someone stands relative to their peers without being skewed by extreme outliers. When monitoring system performance, percentile-based alerts catch degradation that mean-based metrics miss entirely.
NumPy provides two primary functions for this: np.percentile(), which takes percentiles on a 0-100 scale, and np.quantile(), which takes quantiles on a 0-1 scale. They’re otherwise functionally identical.
Basic Percentile Calculation with np.percentile()
The core syntax is straightforward. Pass your array and the percentile(s) you want:
import numpy as np
# Sample dataset: daily website response times in milliseconds
response_times = np.array([45, 52, 48, 61, 55, 49, 58, 47, 53, 250, 51, 54, 46, 50, 57])
# Calculate single percentile
median = np.percentile(response_times, 50)
print(f"Median response time: {median} ms") # 52.0 ms
# Calculate multiple percentiles at once
quartiles = np.percentile(response_times, [25, 50, 75])
print(f"Q1: {quartiles[0]} ms") # 48.5 ms
print(f"Q2: {quartiles[1]} ms") # 52.0 ms
print(f"Q3: {quartiles[2]} ms") # 56.0 ms
# Calculate common percentiles for performance monitoring
performance_percentiles = np.percentile(response_times, [50, 90, 95, 99])
print(f"p50: {performance_percentiles[0]} ms") # 52.0 ms
print(f"p90: {performance_percentiles[1]} ms") # ~59.8 ms
print(f"p95: {performance_percentiles[2]} ms") # ~117.7 ms
print(f"p99: {performance_percentiles[3]} ms") # ~223.5 ms
Notice how the mean of this dataset would be heavily influenced by that 250ms outlier, but the median stays at 52ms. That’s the power of percentiles for real-world data.
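To make that concrete, here is a quick check of how much the single 250 ms value shifts the mean while leaving the median untouched (same dataset as above):

```python
import numpy as np

response_times = np.array([45, 52, 48, 61, 55, 49, 58, 47, 53, 250,
                           51, 54, 46, 50, 57])

mean = np.mean(response_times)              # dragged upward by the 250 ms outlier
median = np.percentile(response_times, 50)  # unaffected by it

print(f"Mean:   {mean:.1f} ms")   # 65.1 ms
print(f"Median: {median:.1f} ms") # 52.0 ms
```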
Understanding Interpolation Methods
Here’s where most tutorials gloss over important details. When you ask for the 25th percentile of 15 data points, the exact position (0.25 × 14 = 3.5) falls between indices. NumPy must interpolate.
The method parameter controls this behavior:
import numpy as np
data = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
methods = ['linear', 'lower', 'higher', 'nearest', 'midpoint']
print("33rd percentile with different interpolation methods:")
for method in methods:
    result = np.percentile(data, 33, method=method)
    print(f"  {method:10}: {result}")
# Output:
#   linear    : 39.7
#   lower     : 30
#   higher    : 40
#   nearest   : 40
#   midpoint  : 35.0
When should you use each?
- linear (default): Best for continuous data where smooth interpolation makes sense. Use this for measurements, times, and most numerical data.
- lower: Conservative estimate. Use when you need the actual data point that’s at or below the percentile. Good for discrete counts or when underestimating is safer.
- higher: Liberal estimate. Use when overestimating is the safer choice, like capacity planning.
- nearest: Returns the closest actual data point. Use when you need a value that actually exists in your dataset.
- midpoint: Average of lower and higher. A balanced approach when you want to stay close to actual values but allow some interpolation.
For most applications, stick with linear. Switch to lower or higher when you’re making decisions where the direction of error matters.
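To see what the default linear method does under the hood, here is a sketch that reproduces it by hand for the 33rd percentile of the ten-point array above:

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
q = 33

# Fractional position within the sorted array: q/100 * (n - 1)
sorted_data = np.sort(data)
pos = (q / 100) * (len(sorted_data) - 1)  # 0.33 * 9 = 2.97
lower_idx = int(np.floor(pos))            # index 2 -> value 30
frac = pos - lower_idx                    # 0.97

# Linear interpolation between the two neighboring values
manual = sorted_data[lower_idx] + frac * (sorted_data[lower_idx + 1] - sorted_data[lower_idx])

print(round(manual, 2))                   # 39.7
print(round(np.percentile(data, q), 2))   # 39.7 -- matches the default method
```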
Working with Multi-dimensional Arrays
Real data rarely comes as flat arrays. The axis parameter lets you calculate percentiles across specific dimensions:
import numpy as np
# Monthly sales data: 4 products (rows) × 6 months (columns)
sales_data = np.array([
[120, 135, 142, 138, 155, 160], # Product A
[85, 92, 88, 95, 102, 98], # Product B
[200, 215, 225, 210, 240, 235], # Product C
[45, 52, 48, 55, 58, 62] # Product D
])
# Median sales for each product (across months)
product_medians = np.percentile(sales_data, 50, axis=1)
print("Median monthly sales per product:")
print(f" Product A: {product_medians[0]}") # 140.0
print(f" Product B: {product_medians[1]}") # 93.5
print(f" Product C: {product_medians[2]}") # 220.0
print(f" Product D: {product_medians[3]}") # 53.5
# 75th percentile sales for each month (across products)
monthly_p75 = np.percentile(sales_data, 75, axis=0)
print(f"\n75th percentile by month: {monthly_p75}")
# [140.0, 155.0, 162.75, 156.0, 176.25, 178.75]
# Multiple percentiles across an axis
product_distribution = np.percentile(sales_data, [25, 50, 75], axis=1)
print(f"\nProduct quartiles shape: {product_distribution.shape}") # (3, 4)
# First dimension is percentiles, second is products
The axis logic follows NumPy conventions: axis=0 operates down columns (across rows), axis=1 operates across columns (within each row). Think of it as “collapse this axis.”
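Two related details worth knowing: with the default axis=None, the array is flattened before the percentile is computed, and keepdims=True preserves the collapsed axis as size 1 so the result broadcasts back against the original array. A small sketch using the same sales data:

```python
import numpy as np

sales_data = np.array([
    [120, 135, 142, 138, 155, 160],
    [85,  92,  88,  95, 102,  98],
    [200, 215, 225, 210, 240, 235],
    [45,  52,  48,  55,  58,  62]
])

# Default axis=None: flattens all 24 values into one scalar
overall_median = np.percentile(sales_data, 50)
print(overall_median)     # 111.0

# keepdims=True keeps the collapsed axis, so shapes stay broadcast-compatible
row_medians = np.percentile(sales_data, 50, axis=1, keepdims=True)
print(row_medians.shape)  # (4, 1)

# Useful, e.g., to center each product's sales on its own median
centered = sales_data - row_medians
print(centered.shape)     # (4, 6)
```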
np.nanpercentile() for Handling Missing Data
Production data has gaps. Sensors fail, users skip fields, imports corrupt. Standard np.percentile() propagates NaN values, ruining your calculations:
import numpy as np
# Sensor readings with missing data points
sensor_readings = np.array([23.5, 24.1, np.nan, 23.8, 25.2, np.nan, 24.5, 23.9, 24.8, 25.0])
# Standard percentile fails with NaN
standard_result = np.percentile(sensor_readings, 50)
print(f"np.percentile with NaN: {standard_result}") # nan
# nanpercentile ignores NaN values
clean_result = np.nanpercentile(sensor_readings, 50)
print(f"np.nanpercentile with NaN: {clean_result}") # 24.3
# Works with multiple percentiles and axes too
readings_2d = np.array([
[23.5, 24.1, np.nan, 23.8],
[25.2, np.nan, 24.5, 23.9],
[24.8, 25.0, 24.2, np.nan]
])
row_medians = np.nanpercentile(readings_2d, 50, axis=1)
print(f"Row medians (ignoring NaN): {row_medians}") # [23.8, 24.5, 24.8]
col_medians = np.nanpercentile(readings_2d, 50, axis=0)
print(f"Column medians (ignoring NaN): {col_medians}") # [24.8, 24.55, 24.35, 23.85]
Use np.nanpercentile() by default when working with data from external sources. The performance overhead is negligible, and it prevents silent failures.
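One edge case to plan for: when an entire slice is NaN, np.nanpercentile() has nothing left to compute, so it returns nan for that slice and emits an "All-NaN slice encountered" RuntimeWarning. A small defensive sketch:

```python
import numpy as np
import warnings

readings = np.array([
    [23.5, 24.1, 23.8],
    [np.nan, np.nan, np.nan],  # sensor offline all day
])

with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)  # suppress the all-NaN warning
    row_medians = np.nanpercentile(readings, 50, axis=1)

print(row_medians)             # [23.8  nan]
valid = ~np.isnan(row_medians)  # mask of rows that had any data at all
print(row_medians[valid])      # [23.8]
```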
Practical Use Case: Outlier Detection with IQR
The Interquartile Range (IQR) method uses the 25th and 75th percentiles to define “normal” data bounds. Points beyond 1.5× IQR from the quartiles are flagged as outliers. This approach is robust against the very outliers it’s trying to detect:
import numpy as np
def detect_outliers_iqr(data, multiplier=1.5):
    """
    Detect outliers using the IQR method.

    Parameters:
        data: array-like, input data (can contain NaN)
        multiplier: float, IQR multiplier for bounds (default 1.5)

    Returns:
        dict with bounds, outlier mask, and outlier values
    """
    data = np.asarray(data)  # accept any array-like input
    q1, q3 = np.nanpercentile(data, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (multiplier * iqr)
    upper_bound = q3 + (multiplier * iqr)
    outlier_mask = (data < lower_bound) | (data > upper_bound)
    return {
        'q1': q1,
        'q3': q3,
        'iqr': iqr,
        'lower_bound': lower_bound,
        'upper_bound': upper_bound,
        'outlier_mask': outlier_mask,
        'outliers': data[outlier_mask],
        'clean_data': data[~outlier_mask & ~np.isnan(data)]
    }
# Example: API response times with anomalies
response_times = np.array([
45, 52, 48, 61, 55, 49, 58, 47, 53, 51, 54, 46, 50, 57,
52, 49, 55, 48, 53, 50, 56, 47, 54, 51, 49,
850, # Server hiccup
52, 48, 55, 51,
1200, # Database timeout
49, 53, 50, 47
])
results = detect_outliers_iqr(response_times)
print(f"Q1: {results['q1']} ms")
print(f"Q3: {results['q3']} ms")
print(f"IQR: {results['iqr']} ms")
print(f"Acceptable range: [{results['lower_bound']:.1f}, {results['upper_bound']:.1f}] ms")
print(f"Outliers detected: {results['outliers']}")
print(f"Clean data points: {len(results['clean_data'])}/{len(response_times)}")
Use a multiplier of 1.5 for standard outlier detection. Increase to 3.0 for “extreme” outliers only. This method works particularly well for skewed distributions where standard deviation-based methods fail.
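To illustrate the difference the multiplier makes, here is a sketch with hypothetical latencies containing one borderline value (75 ms) and one extreme value (850 ms); the helper below is a trimmed-down version of the idea, not the full function above:

```python
import numpy as np

def iqr_bounds(data, multiplier):
    """Return (lower, upper) IQR-based outlier bounds."""
    q1, q3 = np.nanpercentile(data, [25, 75])
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

# Hypothetical latencies: one borderline value and one extreme one
latencies = np.array([45, 52, 48, 61, 55, 49, 58, 47, 53, 51,
                      54, 46, 50, 57, 75, 850])

lo, hi = iqr_bounds(latencies, 1.5)   # standard: bounds [36.0, 70.0]
print(latencies[(latencies < lo) | (latencies > hi)])   # [ 75 850]

lo, hi = iqr_bounds(latencies, 3.0)   # extreme only: bounds [23.25, 82.75]
print(latencies[(latencies < lo) | (latencies > hi)])   # [850]
```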
Performance Tips and Best Practices
Prefer np.quantile() for normalized inputs. If you’re already working with proportions (0-1 scale), use np.quantile() directly instead of multiplying by 100 for np.percentile(). They use the same underlying implementation.
# These are equivalent
np.percentile(data, 95)
np.quantile(data, 0.95)
Pre-sort for repeated calculations. np.percentile() partitions its input on every call, so if you’re making many separate percentile calls on the same data, sorting once up front keeps each call’s partitioning work cheap:
sorted_data = np.sort(data)
q25, q50, q75 = np.percentile(sorted_data, [25, 50, 75])  # or batch them in one call
Use out parameter for memory efficiency. When processing large datasets in loops, reuse output arrays:
result = np.empty(3)
for batch in data_batches:
    np.percentile(batch, [25, 50, 75], out=result)
    process(result)
Know when percentiles aren’t the answer. Percentiles excel at understanding distribution shape and finding robust central tendencies. But for hypothesis testing, use proper statistical tests. For finding exact thresholds in sorted data, direct indexing on sorted arrays is faster than percentile calculations.
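The direct-indexing point can be shown concretely: on an already-sorted array, the 'lower' order statistic is a plain index lookup, matching what np.percentile(..., method='lower') computes without any further partitioning. A sketch with randomly generated data:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 1000, size=10_000)
sorted_data = np.sort(data)

# For sorted data, the 'lower' 95th percentile is just an index lookup
q = 95
idx = int((q / 100) * (len(sorted_data) - 1))
threshold = sorted_data[idx]

# Same value np.percentile would return with method='lower'
assert threshold == np.percentile(data, q, method='lower')
```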
Percentiles are a foundational tool for exploratory data analysis and robust statistics. Master np.percentile() and np.nanpercentile(), understand the interpolation methods, and you’ll handle most real-world statistical summarization tasks efficiently.