How to Calculate Percentiles in Python

Key Insights

  • NumPy’s percentile() and quantile() functions are the fastest options for calculating percentiles, differing only in their input scale (0-100 vs 0-1)
  • The interpolation method you choose matters significantly when your percentile falls between data points—linear is the default, but lower or higher may be more appropriate for discrete data
  • For DataFrames with missing values, always use np.nanpercentile() or Pandas’ built-in quantile() method to avoid corrupted results

Introduction to Percentiles

Percentiles divide your data into 100 equal parts, telling you what percentage of values fall below a given threshold. The 90th percentile means 90% of your data points are at or below that value. Simple concept, powerful applications.

You’ll use percentiles constantly in real-world data analysis. Response time monitoring? You care about the 95th and 99th percentiles, not the average. Salary negotiations? Knowing where you fall in the distribution matters more than the mean. Identifying outliers? The interquartile range (P75 - P25) is your go-to tool.

Python gives you multiple ways to calculate percentiles. I’ll cover them all, explain when to use each, and show you the gotchas that trip up even experienced developers.

Using NumPy’s percentile() and quantile()

NumPy is your first choice for percentile calculations. It’s fast, well-tested, and handles the edge cases correctly.

The percentile() function takes values from 0 to 100. The quantile() function takes values from 0 to 1. That’s the only difference.

import numpy as np

# Sample dataset: response times in milliseconds
response_times = np.array([45, 67, 89, 102, 115, 128, 145, 167, 189, 234, 278, 312, 456])

# Calculate quartiles using percentile (0-100 scale)
p25 = np.percentile(response_times, 25)
p50 = np.percentile(response_times, 50)  # This is the median
p75 = np.percentile(response_times, 75)

print(f"25th percentile: {p25} ms")
print(f"50th percentile (median): {p50} ms")
print(f"75th percentile: {p75} ms")

# Same calculation using quantile (0-1 scale)
q25 = np.quantile(response_times, 0.25)
q50 = np.quantile(response_times, 0.50)
q75 = np.quantile(response_times, 0.75)

print(f"\nUsing quantile - Q1: {q25}, Q2: {q50}, Q3: {q75}")

# Calculate multiple percentiles at once
percentiles = np.percentile(response_times, [10, 50, 90, 95, 99])
print(f"\nP10, P50, P90, P95, P99: {percentiles}")

Output:

25th percentile: 102.0 ms
50th percentile (median): 145.0 ms
75th percentile: 234.0 ms

Using quantile - Q1: 102.0, Q2: 145.0, Q3: 234.0

P10, P50, P90, P95, P99: [ 71.4  145.   305.2  369.6  438.72]

Pass a list of percentiles to calculate multiple values in one call. This is more efficient than calling the function repeatedly.
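For instance, a latency dashboard can pull all its tail metrics in one vectorized call. The sample below is made-up data for illustration:

```python
import numpy as np

# Hypothetical latency sample (milliseconds) for illustration
latencies = np.array([12, 18, 25, 31, 47, 52, 68, 74, 90, 110])

# One call sorts the data once and returns the requested percentiles
# in the same order they were asked for
p50, p90, p99 = np.percentile(latencies, [50, 90, 99])

print(p50, p90, p99)
```

Unpacking the result directly into named variables keeps the call site readable when you only need a handful of fixed percentiles.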

Interpolation Methods Explained

Here’s where things get interesting. What happens when the percentile you want falls between two data points? NumPy needs to interpolate, and you have options.

The method parameter (called interpolation in older NumPy versions) controls this behavior:

  • linear (default): Linear interpolation between adjacent points
  • lower: Return the lower of the two adjacent values
  • higher: Return the higher of the two adjacent values
  • midpoint: Average of the two adjacent values
  • nearest: Return the nearest value

import numpy as np

# Small dataset to clearly show interpolation differences
data = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

# Calculate 35th percentile with different methods
methods = ['linear', 'lower', 'higher', 'midpoint', 'nearest']

print("35th percentile with different interpolation methods:")
print("-" * 50)

for method in methods:
    result = np.percentile(data, 35, method=method)
    print(f"{method:12}: {result}")

print("\n75th percentile comparison:")
print("-" * 50)

for method in methods:
    result = np.percentile(data, 75, method=method)
    print(f"{method:12}: {result}")

Output:

35th percentile with different interpolation methods:
--------------------------------------------------
linear      : 41.5
lower       : 40
higher      : 50
midpoint    : 45.0
nearest     : 40

75th percentile comparison:
--------------------------------------------------
linear      : 77.5
lower       : 70
higher      : 80
midpoint    : 75.0
nearest     : 80

When should you use each method? Use linear for continuous data like temperatures or prices. Use lower or higher when you need actual observed values (useful for discrete data like test scores or counts). Use nearest when you want the closest real observation.
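To see why this matters for discrete data, here is a small sketch with made-up whole-point test scores: linear can report a score nobody actually earned, while lower and higher always return an observed value.

```python
import numpy as np

# Hypothetical whole-point test scores
scores = np.array([55, 60, 70, 70, 80, 85, 90, 95])

linear = np.percentile(scores, 20)                   # may interpolate
lower = np.percentile(scores, 20, method='lower')    # always an observed score
higher = np.percentile(scores, 20, method='higher')  # always an observed score

print(linear, lower, higher)
```

Here linear yields 64.0, a score that appears nowhere in the data, while lower and higher return the real observations 60 and 70.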

Calculating Percentiles with Pandas

Pandas wraps NumPy’s functionality with a more convenient API for DataFrames and Series. The quantile() method uses the 0-1 scale.

import pandas as pd
import numpy as np

# Create a sample DataFrame
np.random.seed(42)
df = pd.DataFrame({
    'salary': np.random.normal(75000, 15000, 100).astype(int),
    'experience_years': np.random.randint(1, 20, 100),
    'performance_score': np.random.uniform(2.5, 5.0, 100).round(2)
})

# Calculate specific percentiles for each column
percentiles = df.quantile([0.25, 0.50, 0.75, 0.90])
print("Percentiles across all columns:")
print(percentiles)

# Calculate percentile for a single column
salary_p90 = df['salary'].quantile(0.90)
print(f"\n90th percentile salary: ${salary_p90:,.0f}")

# Use describe() for quick quartile summary
print("\nQuick summary with describe():")
print(df.describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.95]))

The describe() method is incredibly useful for exploratory analysis. By default, it shows the 25th, 50th, and 75th percentiles. Pass a custom list to see different percentiles.

# Group-wise percentile calculation
df['experience_band'] = pd.cut(df['experience_years'], 
                                bins=[0, 5, 10, 20], 
                                labels=['Junior', 'Mid', 'Senior'])

# 75th percentile salary by experience band
salary_by_experience = df.groupby('experience_band')['salary'].quantile(0.75)
print("\n75th percentile salary by experience:")
print(salary_by_experience)

This pattern—grouping then calculating percentiles—appears constantly in business analytics.
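As a sketch of the same idea (the column names and figures below are invented), you can also request several quantiles per group in a single pass and pivot the result into a readable table:

```python
import pandas as pd

# Hypothetical salaries by department
df = pd.DataFrame({
    'dept': ['eng', 'eng', 'eng', 'sales', 'sales', 'sales'],
    'salary': [70000, 85000, 95000, 50000, 60000, 80000],
})

# Passing a list of quantiles to a groupby yields a MultiIndex Series;
# unstack() pivots the quantile level into columns
summary = df.groupby('dept')['salary'].quantile([0.25, 0.5, 0.75]).unstack()
print(summary)
```

Each row is a department and each column a quantile, which is usually the shape you want for a report.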

Manual Percentile Calculation

Understanding the math helps you debug edge cases and implement custom logic when needed.

import numpy as np

def calculate_percentile(data, percentile, method='linear'):
    """
    Calculate percentile manually.
    
    Args:
        data: Array-like of numeric values
        percentile: Value between 0 and 100
        method: 'linear', 'lower', 'higher', 'nearest', 'midpoint'
    
    Returns:
        The calculated percentile value
    """
    sorted_data = sorted(data)
    n = len(sorted_data)
    
    # Calculate the rank (position) for this percentile
    # Using the linear interpolation formula
    rank = (percentile / 100) * (n - 1)
    
    lower_index = int(rank)
    upper_index = lower_index + 1
    fraction = rank - lower_index
    
    # Exact hit: the rank lands directly on a data point, so no
    # interpolation is needed regardless of method (this also covers the
    # top edge, where upper_index would run past the end of the array)
    if fraction == 0:
        return sorted_data[lower_index]
    
    lower_value = sorted_data[lower_index]
    upper_value = sorted_data[upper_index]
    
    if method == 'linear':
        return lower_value + fraction * (upper_value - lower_value)
    elif method == 'lower':
        return lower_value
    elif method == 'higher':
        return upper_value
    elif method == 'nearest':
        return lower_value if fraction < 0.5 else upper_value
    elif method == 'midpoint':
        return (lower_value + upper_value) / 2
    else:
        raise ValueError(f"Unknown method: {method}")

# Test against NumPy
test_data = [15, 20, 35, 40, 50]

print("Comparing manual vs NumPy implementation:")
for p in [25, 50, 75]:
    manual = calculate_percentile(test_data, p)
    numpy_result = np.percentile(test_data, p)
    print(f"P{p}: Manual={manual}, NumPy={numpy_result}, Match={manual == numpy_result}")

Note that different statistical software packages use slightly different formulas for percentile calculation. NumPy’s default matches Excel’s PERCENTILE.INC function. If you need exact compatibility with another system, check their documentation.
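If you want to stay in the standard library, statistics.quantiles exposes the same two Excel-style conventions. The snippet below checks its 'inclusive' method against NumPy's default on the same data:

```python
import statistics
import numpy as np

data = [15, 20, 35, 40, 50]

# 'inclusive' matches NumPy's default linear method (Excel's PERCENTILE.INC);
# 'exclusive' corresponds to Excel's PERCENTILE.EXC instead
stdlib_quartiles = statistics.quantiles(data, n=4, method='inclusive')
numpy_quartiles = np.percentile(data, [25, 50, 75])

print(stdlib_quartiles)           # [20.0, 35.0, 40.0]
print(numpy_quartiles.tolist())   # [20.0, 35.0, 40.0]
```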

Practical Applications

Let’s look at real-world scenarios where percentiles solve actual problems.

Outlier Detection with IQR:

import numpy as np

def detect_outliers_iqr(data, multiplier=1.5):
    """
    Detect outliers using the Interquartile Range method.
    
    Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are outliers.
    Use multiplier=3.0 for "extreme" outliers only.
    """
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    iqr = q3 - q1
    
    lower_bound = q1 - multiplier * iqr
    upper_bound = q3 + multiplier * iqr
    
    outliers = data[(data < lower_bound) | (data > upper_bound)]
    
    return {
        'q1': q1,
        'q3': q3,
        'iqr': iqr,
        'lower_bound': lower_bound,
        'upper_bound': upper_bound,
        'outliers': outliers,
        'outlier_count': len(outliers)
    }

# Example: API response times with some anomalies
response_times = np.array([
    45, 52, 48, 55, 61, 58, 49, 53, 47, 56,  # Normal range
    52, 49, 58, 54, 51, 48, 55, 53, 50, 57,
    250, 340, 15, 8, 420  # Outliers
])

results = detect_outliers_iqr(response_times)
print(f"Q1: {results['q1']}, Q3: {results['q3']}")
print(f"IQR: {results['iqr']}")
print(f"Acceptable range: [{results['lower_bound']:.1f}, {results['upper_bound']:.1f}]")
print(f"Outliers found: {results['outliers']}")

SLA Compliance Monitoring:

def check_sla_compliance(response_times, sla_threshold_ms, percentile=95):
    """
    Check if response times meet SLA requirements.
    
    Common SLAs: "95th percentile response time under 200ms"
    """
    actual_percentile = np.percentile(response_times, percentile)
    compliant = actual_percentile <= sla_threshold_ms
    
    return {
        'percentile': percentile,
        'actual_value': actual_percentile,
        'threshold': sla_threshold_ms,
        'compliant': compliant,
        'margin': sla_threshold_ms - actual_percentile
    }

# Check if we're meeting our SLA
sla_result = check_sla_compliance(response_times, sla_threshold_ms=100, percentile=95)
print(f"P{sla_result['percentile']}: {sla_result['actual_value']:.1f}ms")
print(f"SLA Compliant: {sla_result['compliant']}")

Performance Considerations

For large datasets, NumPy’s vectorized operations are significantly faster than pure Python implementations. But the real gotcha is handling missing data.

import numpy as np

# Data with NaN values
data_with_nans = np.array([10, 20, np.nan, 40, 50, np.nan, 70, 80, 90, 100])

# Regular percentile fails silently (returns nan)
regular_result = np.percentile(data_with_nans, 50)
print(f"np.percentile with NaN: {regular_result}")  # Returns nan

# Use nanpercentile to ignore NaN values
nan_safe_result = np.nanpercentile(data_with_nans, 50)
print(f"np.nanpercentile with NaN: {nan_safe_result}")  # Returns 60.0

# Pandas handles NaN automatically
import pandas as pd
series_with_nans = pd.Series(data_with_nans)
pandas_result = series_with_nans.quantile(0.50)
print(f"Pandas quantile with NaN: {pandas_result}")  # Returns 60.0

Always use np.nanpercentile() when your data might contain missing values. The regular percentile() function will return nan if any value is missing, which can silently corrupt your analysis.
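One defensive pattern is to skip NaNs but emit a warning, so missing data never passes unnoticed. This is an illustrative helper invented for this article, not part of NumPy's API:

```python
import warnings
import numpy as np

def percentile_warn_on_nan(data, q):
    """Like np.nanpercentile, but warns when NaN values are being skipped.

    Illustrative helper -- the name and behavior are this article's invention.
    """
    arr = np.asarray(data, dtype=float)
    n_missing = int(np.isnan(arr).sum())
    if n_missing:
        warnings.warn(f"Skipping {n_missing} NaN value(s) in percentile input")
    return np.nanpercentile(arr, q)

print(percentile_warn_on_nan([10, 20, np.nan, 40], 50))  # 20.0
```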

For very large datasets (millions of rows), consider approximate quantile algorithms such as t-digest or the KLL sketch (available via libraries like Apache DataSketches), or push the work into the database with functions like PostgreSQL's percentile_cont.
