How to Calculate the Interquartile Range (IQR) in Python

Key Insights

  • The interquartile range (IQR) measures statistical spread by calculating the difference between the 75th and 25th percentiles, making it resistant to outliers unlike standard deviation.
  • Python offers three main approaches: NumPy’s np.percentile() for arrays, Pandas’ .quantile() for DataFrames, and SciPy’s scipy.stats.iqr() for a convenient one-liner with built-in NaN handling.
  • The 1.5×IQR rule provides a robust method for outlier detection—any value below Q1 - 1.5×IQR or above Q3 + 1.5×IQR is typically considered an outlier.

The interquartile range is one of the most useful statistical measures you’ll encounter in data analysis. It tells you how spread out the middle 50% of your data is, and unlike variance or standard deviation, it doesn’t get thrown off by extreme values. If you’re cleaning datasets, building dashboards, or performing exploratory data analysis, you need IQR in your toolkit.

This article covers three different ways to calculate IQR in Python, when to use each approach, and how to apply IQR for practical outlier detection.

Understanding Quartiles

Before calculating IQR, you need to understand what quartiles are. Quartiles divide your sorted data into four equal parts:

  • Q1 (First Quartile): The 25th percentile. 25% of data falls below this value.
  • Q2 (Second Quartile): The 50th percentile, also known as the median.
  • Q3 (Third Quartile): The 75th percentile. 75% of data falls below this value.

The IQR is simply Q3 minus Q1. This gives you the range containing the middle half of your data.

Let’s start with a manual calculation to build intuition:

def calculate_quartiles_manual(data):
    """Calculate quartiles manually for understanding."""
    sorted_data = sorted(data)
    n = len(sorted_data)
    
    # Find median (Q2)
    if n % 2 == 0:
        q2 = (sorted_data[n // 2 - 1] + sorted_data[n // 2]) / 2
    else:
        q2 = sorted_data[n // 2]
    
    # Split data for Q1 and Q3
    lower_half = sorted_data[:n // 2]
    upper_half = sorted_data[(n + 1) // 2:]
    
    # Q1 is median of lower half
    n_lower = len(lower_half)
    if n_lower % 2 == 0:
        q1 = (lower_half[n_lower // 2 - 1] + lower_half[n_lower // 2]) / 2
    else:
        q1 = lower_half[n_lower // 2]
    
    # Q3 is median of upper half
    n_upper = len(upper_half)
    if n_upper % 2 == 0:
        q3 = (upper_half[n_upper // 2 - 1] + upper_half[n_upper // 2]) / 2
    else:
        q3 = upper_half[n_upper // 2]
    
    return q1, q2, q3

# Example with small dataset
data = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
q1, q2, q3 = calculate_quartiles_manual(data)
iqr = q3 - q1

print(f"Q1: {q1}, Q2 (median): {q2}, Q3: {q3}")
print(f"IQR: {iqr}")
# Output: Q1: 36, Q2 (median): 40.5, Q3: 43
# Output: IQR: 7

This manual approach helps you understand what’s happening, but you should never use it in production. The libraries we’ll cover next handle edge cases, interpolation, and performance far better.
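Before reaching for third-party libraries, note that Python's standard library can do this too: `statistics.quantiles()` (available since Python 3.8) returns the three cut points directly. A minimal sketch, using the same dataset as above:

```python
import statistics

data = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]

# statistics.quantiles returns the cut points [Q1, Q2, Q3] for n=4.
# method='inclusive' treats the data as the full population and matches
# NumPy's default linear interpolation.
q1, q2, q3 = statistics.quantiles(data, n=4, method='inclusive')
iqr = q3 - q1

print(f"Q1: {q1}, Q2: {q2}, Q3: {q3}, IQR: {iqr}")
# Output: Q1: 36.75, Q2: 40.5, Q3: 42.75, IQR: 6.0
```

The default `method='exclusive'` uses a different convention and gives wider quartiles on small samples, so pick the method deliberately when comparing results across tools.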

Calculating IQR with NumPy

NumPy provides two functions for percentile calculations: np.percentile() and np.quantile(). The difference is simple—percentile() takes values from 0-100, while quantile() takes values from 0-1.

import numpy as np

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])

# Using percentile (0-100 scale)
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1

print(f"Q1: {q1}, Q3: {q3}, IQR: {iqr}")
# Output: Q1: 36.75, Q3: 42.75, IQR: 6.0

# Using quantile (0-1 scale)
q1 = np.quantile(data, 0.25)
q3 = np.quantile(data, 0.75)
iqr = q3 - q1

print(f"Q1: {q1}, Q3: {q3}, IQR: {iqr}")
# Output: Q1: 36.75, Q3: 42.75, IQR: 6.0

Notice the results differ slightly from our manual calculation. This is because NumPy uses linear interpolation by default. When the percentile falls between two data points, it interpolates.
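To see exactly what linear interpolation does, you can reproduce Q1 by hand: the percentile sits at fractional index (n − 1) × 0.25 in the sorted array, and NumPy interpolates between the two neighboring values. A short sketch:

```python
import numpy as np

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])

# The 25th percentile sits at fractional index (n - 1) * 0.25 = 2.25.
pos = (len(data) - 1) * 0.25
lower, frac = int(pos), pos - int(pos)

# Interpolate between data[2] = 36 and data[3] = 39
q1_manual = data[lower] + frac * (data[lower + 1] - data[lower])

print(q1_manual)                 # 36.75
print(np.percentile(data, 25))   # 36.75
```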

The method parameter controls interpolation behavior:

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])

# Different interpolation methods
methods = ['linear', 'lower', 'higher', 'midpoint', 'nearest']

for method in methods:
    q1 = np.percentile(data, 25, method=method)
    q3 = np.percentile(data, 75, method=method)
    iqr = q3 - q1
    print(f"{method:10}: Q1={q1:5.2f}, Q3={q3:5.2f}, IQR={iqr:.2f}")

# Output:
# linear    : Q1=36.75, Q3=42.75, IQR=6.00
# lower     : Q1=36.00, Q3=42.00, IQR=6.00
# higher    : Q1=39.00, Q3=43.00, IQR=4.00
# midpoint  : Q1=37.50, Q3=42.50, IQR=5.00
# nearest   : Q1=36.00, Q3=43.00, IQR=7.00

For most applications, the default linear method works well. Use lower or higher when you need values that actually exist in your dataset.

Calculating IQR with Pandas

Pandas is the natural choice when working with DataFrames. The .quantile() method works on both Series and DataFrames.

import pandas as pd
import numpy as np

# Create sample DataFrame
df = pd.DataFrame({
    'sales': [120, 150, 180, 200, 250, 300, 350, 400, 450, 1200],
    'region': ['East', 'East', 'East', 'West', 'West', 
               'West', 'North', 'North', 'North', 'North'],
    'quarter': ['Q1', 'Q2', 'Q3', 'Q1', 'Q2', 'Q3', 'Q1', 'Q2', 'Q3', 'Q4']
})

# IQR for a single column
q1 = df['sales'].quantile(0.25)
q3 = df['sales'].quantile(0.75)
iqr = q3 - q1

print(f"Sales IQR: {iqr}")
# Output: Sales IQR: 202.5

# Get multiple quantiles at once
quantiles = df['sales'].quantile([0.25, 0.5, 0.75])
print(quantiles)
# Output:
# 0.25    185.0
# 0.50    275.0
# 0.75    387.5

The real power of Pandas shows when calculating IQR for grouped data:

# IQR by region
def calculate_iqr(series):
    return series.quantile(0.75) - series.quantile(0.25)

regional_iqr = df.groupby('region')['sales'].apply(calculate_iqr)
print("IQR by region:")
print(regional_iqr)
# Output:
# region
# East     30.0
# North    250.0
# West     50.0

# Full quartile summary by region
regional_stats = df.groupby('region')['sales'].agg(
    Q1=lambda x: x.quantile(0.25),
    Median=lambda x: x.quantile(0.50),
    Q3=lambda x: x.quantile(0.75),
    IQR=lambda x: x.quantile(0.75) - x.quantile(0.25)
)
print("\nFull statistics by region:")
print(regional_stats)

This pattern is invaluable when analyzing data across categories—comparing spread across regions, time periods, or customer segments.
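For a quick look without writing any quantile calls, the quartiles are also part of the output of Pandas' `describe()`. A minimal sketch using the sales column from above:

```python
import pandas as pd

df = pd.DataFrame({
    'sales': [120, 150, 180, 200, 250, 300, 350, 400, 450, 1200],
})

# describe() reports count, mean, std, min/max, and the
# 25th, 50th, and 75th percentiles by default
summary = df['sales'].describe()
iqr = summary['75%'] - summary['25%']

print(f"IQR from describe(): {iqr}")
# Output: IQR from describe(): 202.5
```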

Using SciPy’s Built-in IQR Function

SciPy provides scipy.stats.iqr() for direct IQR calculation. It’s the cleanest option when you just need the IQR value.

from scipy import stats
import numpy as np

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])

# One-liner IQR
iqr = stats.iqr(data)
print(f"IQR: {iqr}")
# Output: IQR: 6.0

# Custom percentile range (useful for other quantile ranges)
iqr_custom = stats.iqr(data, rng=(10, 90))  # 10th to 90th percentile
print(f"10-90 range: {iqr_custom}")

SciPy shines with its nan_policy parameter for handling missing data:

# Data with NaN values
data_with_nan = np.array([7, 15, np.nan, 39, 40, 41, np.nan, 43, 47, 49])

# The default policy is 'propagate', which silently returns NaN
# Use nan_policy to control the behavior explicitly

# Propagate NaN (returns NaN if any NaN present)
result = stats.iqr(data_with_nan, nan_policy='propagate')
print(f"propagate: {result}")  # Output: nan

# Raise error if NaN present
try:
    result = stats.iqr(data_with_nan, nan_policy='raise')
except ValueError as e:
    print(f"raise: {e}")

# Omit NaN values from calculation
result = stats.iqr(data_with_nan, nan_policy='omit')
print(f"omit: {result}")  # Output: omit: 11.0

The scale parameter normalizes the IQR, which is useful for comparing distributions:

# Scale IQR to standard deviation of normal distribution
iqr_scaled = stats.iqr(data, scale='normal')
print(f"Scaled IQR (normal): {iqr_scaled:.4f}")
# This equals IQR / 1.349, useful for robust scale estimation
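You can sanity-check this scaling on synthetic data: for a normally distributed sample, the scaled IQR should land close to the true standard deviation, even though it ignores the tails. A quick sketch (the seed and sample size here are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=2.0, size=100_000)

# For normal data, IQR / 1.349 approximates the standard deviation
print(stats.iqr(sample, scale='normal'))  # close to 2.0
print(sample.std())                       # close to 2.0
```

This is why the scaled IQR is popular as a robust scale estimator: outliers barely move it, while they can inflate the sample standard deviation badly.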

Practical Application: Outlier Detection

The most common use of IQR is detecting outliers using the 1.5×IQR rule. Values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR are flagged as outliers.

import numpy as np
import pandas as pd
from scipy import stats

def detect_outliers_iqr(data, multiplier=1.5):
    """
    Detect outliers using the IQR method.
    
    Parameters:
    -----------
    data : array-like
        Input data
    multiplier : float
        IQR multiplier for bounds (1.5 for outliers, 3.0 for extreme outliers)
    
    Returns:
    --------
    dict with bounds, outlier mask, and outlier values
    """
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    iqr = q3 - q1
    
    lower_bound = q1 - (multiplier * iqr)
    upper_bound = q3 + (multiplier * iqr)
    
    outlier_mask = (data < lower_bound) | (data > upper_bound)
    
    return {
        'q1': q1,
        'q3': q3,
        'iqr': iqr,
        'lower_bound': lower_bound,
        'upper_bound': upper_bound,
        'outlier_mask': outlier_mask,
        'outliers': data[outlier_mask],
        'clean_data': data[~outlier_mask]
    }

# Example usage
np.random.seed(42)
data = np.concatenate([
    np.random.normal(100, 15, 100),  # Normal data
    np.array([5, 10, 200, 250])       # Outliers
])

results = detect_outliers_iqr(data)

print(f"Q1: {results['q1']:.2f}, Q3: {results['q3']:.2f}")
print(f"IQR: {results['iqr']:.2f}")
print(f"Bounds: [{results['lower_bound']:.2f}, {results['upper_bound']:.2f}]")
print(f"Number of outliers: {len(results['outliers'])}")
print(f"Outlier values: {results['outliers']}")

For DataFrame columns, here’s a reusable function:

def remove_outliers_df(df, columns, multiplier=1.5):
    """Remove outliers from specified DataFrame columns."""
    df_clean = df.copy()
    
    for col in columns:
        q1 = df_clean[col].quantile(0.25)
        q3 = df_clean[col].quantile(0.75)
        iqr = q3 - q1
        
        lower = q1 - (multiplier * iqr)
        upper = q3 + (multiplier * iqr)
        
        before = len(df_clean)
        df_clean = df_clean[(df_clean[col] >= lower) & (df_clean[col] <= upper)]
        after = len(df_clean)
        
        print(f"{col}: removed {before - after} outliers")
    
    return df_clean

# Usage
df = pd.DataFrame({
    'price': [100, 110, 105, 95, 500, 108, 102, 10],
    'quantity': [5, 6, 4, 5, 5, 6, 100, 5]
})

df_clean = remove_outliers_df(df, ['price', 'quantity'])
print(df_clean)

Summary and Best Practices

Choose your library based on your data structure and needs:

  • NumPy: Best for raw arrays and when you need control over interpolation methods. Fastest for large numerical arrays.
  • Pandas: Use when working with DataFrames, especially for grouped calculations. The .quantile() method integrates naturally with other Pandas operations.
  • SciPy: Ideal for quick one-liner calculations and when you need built-in NaN handling. The scipy.stats.iqr() function is the most readable option.

For large datasets (millions of rows), NumPy typically offers the best performance. Pandas adds overhead but provides convenience for structured data. SciPy’s performance sits between the two.
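If performance matters for your workload, measure it rather than guessing. A rough micro-benchmark sketch (absolute timings vary by machine; the array size and repeat count here are arbitrary):

```python
import timeit
import numpy as np
import pandas as pd
from scipy import stats

arr = np.random.default_rng(0).normal(size=1_000_000)
ser = pd.Series(arr)

# np.percentile accepts a list of percentiles in a single call
numpy_t = timeit.timeit(lambda: np.subtract(*np.percentile(arr, [75, 25])), number=10)
pandas_t = timeit.timeit(lambda: ser.quantile(0.75) - ser.quantile(0.25), number=10)
scipy_t = timeit.timeit(lambda: stats.iqr(arr), number=10)

print(f"NumPy:  {numpy_t:.3f}s")
print(f"Pandas: {pandas_t:.3f}s")
print(f"SciPy:  {scipy_t:.3f}s")
```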

Remember that IQR-based outlier detection is just one method. It works well for roughly symmetric distributions but may be too aggressive or too lenient for heavily skewed data. Always visualize your data with box plots before blindly removing outliers.
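A box plot shows the IQR and the 1.5×IQR whiskers directly, so it is the natural first check. A minimal matplotlib sketch, assuming matplotlib is installed (the headless `Agg` backend line is only needed when no display is available):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this when running interactively
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(100, 15, 100), [5, 10, 200, 250]])

# The box spans Q1 to Q3, whiskers extend 1.5 * IQR past the box by
# default, and points beyond the whiskers are drawn as individual outliers
fig, ax = plt.subplots()
ax.boxplot(data, vert=False)
ax.set_xlabel("value")
fig.savefig("sales_boxplot.png")
```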
