NumPy - np.percentile() and np.quantile()
Key Insights
- np.percentile() and np.quantile() compute the same statistical measures but use different scales: percentiles (0-100) versus quantiles (0-1). Quantile notation is more mathematically standard, while percentile remains more intuitive for most use cases
- Both functions support multiple interpolation methods that can significantly change results on discrete datasets, with 'linear' as the default but 'midpoint' and 'nearest' offering advantages for specific data distributions
- Multidimensional array operations with the axis parameter enable efficient statistical analysis across specific dimensions without explicit loops, which is critical for large-scale data processing
Understanding Percentiles and Quantiles
Percentiles and quantiles represent the same statistical concept with different scaling conventions. A percentile divides data into 100 equal parts (0-100 scale), while a quantile uses a 0-1 scale. The 75th percentile equals the 0.75 quantile—both identify the value below which 75% of observations fall.
import numpy as np
data = np.array([1, 3, 5, 7, 9, 11, 13, 15, 17, 19])
# These produce identical results
percentile_75 = np.percentile(data, 75)
quantile_75 = np.quantile(data, 0.75)
print(f"75th percentile: {percentile_75}") # 14.5
print(f"0.75 quantile: {quantile_75}") # 14.5
print(f"Equal: {percentile_75 == quantile_75}") # True
For multiple statistics simultaneously, pass arrays:
# Calculate quartiles (25th, 50th, 75th percentiles)
quartiles_p = np.percentile(data, [25, 50, 75])
quartiles_q = np.quantile(data, [0.25, 0.5, 0.75])
print(f"Quartiles (percentile): {quartiles_p}") # [ 5.5 10.  14.5]
print(f"Quartiles (quantile): {quartiles_q}") # [ 5.5 10.  14.5]
Interpolation Methods Matter
When the desired percentile falls between data points, NumPy must interpolate. The method parameter (NumPy 1.22+) or legacy interpolation parameter controls this behavior. Different methods produce different results on discrete data.
small_dataset = np.array([1, 2, 3, 4])
methods = ['linear', 'lower', 'higher', 'nearest', 'midpoint']
for method in methods:
    result = np.percentile(small_dataset, 75, method=method)
    print(f"{method:8s}: {result}")
# Output:
# linear : 3.25
# lower : 3.0
# higher : 4.0
# nearest : 3.0
# midpoint: 3.5
Linear interpolation (default) calculates proportional values between points. For the 75th percentile with 4 values, it finds the position at index 2.25 and interpolates between indices 2 and 3.
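This linear method can be reproduced by hand as a sanity check. The helper below is illustrative only (manual_percentile is not a NumPy function), and it assumes the input array is already sorted:

```python
import numpy as np

def manual_percentile(sorted_vals, q):
    """Reproduce NumPy's default 'linear' method on already-sorted data."""
    pos = (len(sorted_vals) - 1) * q / 100   # fractional index into the data
    lo = int(np.floor(pos))
    hi = int(np.ceil(pos))
    frac = pos - lo                           # distance past the lower index
    return sorted_vals[lo] + frac * (sorted_vals[hi] - sorted_vals[lo])

data = np.array([1, 2, 3, 4])
print(manual_percentile(data, 75))   # 3.25
print(np.percentile(data, 75))       # 3.25
```

For the 75th percentile of four values, pos = 3 * 0.75 = 2.25, so the result is data[2] plus a quarter of the gap to data[3].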
Lower and higher return actual data values—lower gives the largest value below the percentile, higher gives the smallest above it. Use these when you need actual observed values rather than interpolated estimates.
Midpoint averages the lower and higher values, useful for symmetric distributions or when you want a compromise between boundary values.
Nearest rounds to the closest data point, appropriate when interpolation doesn’t make sense (categorical data represented numerically, or when only actual measurements are valid).
# Practical example: Response time SLAs
response_times = np.array([45, 52, 61, 73, 89, 102, 156, 234, 445, 1203])
# P95 for SLA reporting - use actual observed value
p95_actual = np.percentile(response_times, 95, method='higher')
print(f"P95 (actual): {p95_actual}ms") # 1203ms
# P95 for capacity planning - interpolated estimate
p95_interpolated = np.percentile(response_times, 95, method='linear')
print(f"P95 (interpolated): {p95_interpolated:.1f}ms") # 861.9ms
Working with Multidimensional Arrays
The axis parameter enables statistical calculations across specific dimensions without reshaping data or writing loops. This is critical for performance with large datasets.
# Simulated daily metrics: 7 days × 24 hours
hourly_requests = np.random.randint(100, 1000, size=(7, 24))
# Calculate median requests for each hour across all days
median_by_hour = np.percentile(hourly_requests, 50, axis=0)
print(f"Median by hour shape: {median_by_hour.shape}") # (24,)
print(f"Median at hour 12: {median_by_hour[12]:.1f}")
# Calculate P95 for each day across all hours
p95_by_day = np.percentile(hourly_requests, 95, axis=1)
print(f"P95 by day shape: {p95_by_day.shape}") # (7,)
print(f"P95 for day 3: {p95_by_day[3]:.1f}")
For multiple dimensions simultaneously, pass a tuple:
# 3D array: 4 servers × 7 days × 24 hours
server_metrics = np.random.randint(50, 500, size=(4, 7, 24))
# Calculate percentiles across days and hours (axis 1 and 2)
# Result: one value per server
p99_per_server = np.percentile(server_metrics, 99, axis=(1, 2))
print(f"P99 per server: {p99_per_server}") # Shape: (4,)
# Calculate percentiles across servers and days (axis 0 and 1)
# Result: one value per hour
p50_per_hour = np.percentile(server_metrics, 50, axis=(0, 1))
print(f"P50 per hour shape: {p50_per_hour.shape}") # (24,)
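When a percentile result needs to broadcast back against the original array, for example to flag readings above each server's own P90, the keepdims parameter preserves the reduced axes as size-1 dimensions. A short sketch reusing the same simulated shape as above:

```python
import numpy as np

rng = np.random.default_rng(0)
server_metrics = rng.integers(50, 500, size=(4, 7, 24))

# keepdims=True returns shape (4, 1, 1) instead of (4,),
# so the result broadcasts directly against the (4, 7, 24) array
p90 = np.percentile(server_metrics, 90, axis=(1, 2), keepdims=True)
print(p90.shape)                    # (4, 1, 1)

above_p90 = server_metrics > p90    # boolean mask, shape (4, 7, 24)
print(above_p90.shape)              # (4, 7, 24)
```

Without keepdims, the (4,) result would have to be reshaped manually before the comparison.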
Handling Edge Cases and Missing Data
NumPy’s standard percentile functions propagate NaN values; the dedicated np.nanpercentile and np.nanquantile variants ignore them.
data_with_nan = np.array([1.5, 2.3, np.nan, 4.1, 5.7, np.nan, 8.2])
# Standard percentile propagates NaN
result_standard = np.percentile(data_with_nan, 50)
print(f"Standard result: {result_standard}") # nan
# Use nanpercentile to ignore NaN values
result_nanpercentile = np.nanpercentile(data_with_nan, 50)
print(f"nanpercentile result: {result_nanpercentile}") # 4.1
# Same for quantile
result_nanquantile = np.nanquantile(data_with_nan, 0.5)
print(f"nanquantile result: {result_nanquantile}") # 4.1
Empty arrays or all-NaN slices require explicit handling:
empty_data = np.array([])
all_nan_data = np.array([np.nan, np.nan, np.nan])
try:
    np.percentile(empty_data, 50)
except IndexError as e:
    print(f"Empty array error: {e}")
# Returns NaN for all-NaN input
result = np.nanpercentile(all_nan_data, 50)
print(f"All-NaN result: {result}") # nan
# Check before processing
if len(data_with_nan) > 0 and not np.all(np.isnan(data_with_nan)):
    valid_percentile = np.nanpercentile(data_with_nan, 50)
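np.nanpercentile also accepts the axis parameter, which helps when NaNs are scattered unevenly across columns: each column's NaNs are ignored independently. A small sketch:

```python
import numpy as np

matrix = np.array([
    [1.0, np.nan, 3.0],
    [4.0, 5.0, np.nan],
    [7.0, 8.0, 9.0],
])

# Column medians, dropping NaN separately per column:
# col 0 -> median of [1, 4, 7], col 1 -> [5, 8], col 2 -> [3, 9]
col_medians = np.nanpercentile(matrix, 50, axis=0)
print(col_medians)   # [4.  6.5 6. ]
```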
Performance Optimization Strategies
For repeated percentile calculations, consider these optimization techniques:
import time
# Large dataset
large_data = np.random.randn(10_000_000)
# Single percentile calculation
start = time.perf_counter()
p50 = np.percentile(large_data, 50)
single_time = time.perf_counter() - start
# Multiple percentiles in one call (more efficient)
start = time.perf_counter()
p25, p50, p75 = np.percentile(large_data, [25, 50, 75])
batch_time = time.perf_counter() - start
print(f"Single call: {single_time:.4f}s")
print(f"Batch call: {batch_time:.4f}s")
print(f"Speedup: {(single_time * 3) / batch_time:.2f}x")
Pre-sorting data when calculating multiple percentiles on the same dataset:
# For multiple percentile calculations on the same data
data = np.random.randn(1_000_000)
# Method 1: Repeated calls (sorts each time)
start = time.perf_counter()
for q in [10, 20, 30, 40, 50, 60, 70, 80, 90]:
    _ = np.percentile(data, q)
unsorted_time = time.perf_counter() - start
# Method 2: Sort once, then calculate
start = time.perf_counter()
sorted_data = np.sort(data)
for q in [10, 20, 30, 40, 50, 60, 70, 80, 90]:
    # Manual index lookup on sorted data; this approximates the 'lower'
    # method rather than the default linear interpolation
    idx = int(len(sorted_data) * q / 100)
    _ = sorted_data[idx]
sorted_time = time.perf_counter() - start
print(f"Unsorted: {unsorted_time:.4f}s")
print(f"Pre-sorted: {sorted_time:.4f}s")
print(f"Speedup: {unsorted_time / sorted_time:.2f}x")
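When only a single order statistic is needed, np.partition can be cheaper than a full sort: it only guarantees that the element at the given index lands in its sorted position. Note that this reproduces the 'lower' method, not the default linear interpolation (a sketch):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.standard_normal(1_000_000)

q = 90
# Index used by the 'lower' method: floor of the linear position (n-1)*q/100
k = int(np.floor((len(data) - 1) * q / 100))

# Partial sort: only element k is guaranteed to be in its final place (O(n) average)
p90_partition = np.partition(data, k)[k]

p90_lower = np.percentile(data, q, method='lower')
print(p90_partition == p90_lower)   # True
```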
Real-World Application: Outlier Detection
Percentiles form the basis for robust outlier detection using the Interquartile Range (IQR) method:
def detect_outliers_iqr(data, multiplier=1.5):
    """Detect outliers using the IQR method."""
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - multiplier * iqr
    upper_bound = q3 + multiplier * iqr
    outliers = (data < lower_bound) | (data > upper_bound)
    return {
        'outlier_mask': outliers,
        'outlier_values': data[outliers],
        'bounds': (lower_bound, upper_bound),
        'q1': q1,
        'q3': q3,
        'iqr': iqr
    }
# Example: API response times with anomalies
response_times = np.concatenate([
    np.random.normal(200, 50, 1000),  # Normal traffic
    np.random.normal(2000, 200, 50)   # Anomalies
])
results = detect_outliers_iqr(response_times)
print(f"Detected {results['outlier_mask'].sum()} outliers")
print(f"Bounds: {results['bounds']}")
print(f"IQR: {results['iqr']:.2f}")
print(f"Sample outliers: {results['outlier_values'][:5]}")
This implementation provides a non-parametric approach to outlier detection: because it assumes no particular distribution, it stays robust on real-world data whose distribution is unknown.
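A related percentile-based technique is winsorization: instead of removing outliers, cap them at percentile bounds with np.clip. A sketch along the same lines as the example above:

```python
import numpy as np

rng = np.random.default_rng(7)
values = np.concatenate([
    rng.normal(200, 50, 1000),   # normal traffic
    rng.normal(2000, 200, 50),   # injected anomalies
])

# Cap everything outside the 1st-99th percentile band
low, high = np.percentile(values, [1, 99])
winsorized = np.clip(values, low, high)

print(values.max() > winsorized.max())   # True: extremes were capped
```

This keeps the sample size intact, which matters for downstream statistics that are sensitive to missing observations.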