How to Calculate Quartiles in Python
Key Insights
- NumPy’s np.quantile() and Pandas’ .quantile() both default to linear interpolation, but they name the option differently (method in NumPy, interpolation in Pandas) and other tools may follow other rules, so always specify the method explicitly for reproducibility.
- The interquartile range (IQR) is the foundation for the most widely used outlier detection method: any value below Q1 - 1.5×IQR or above Q3 + 1.5×IQR is considered an outlier.
- Box plots are the standard visualization for quartile data, and both Matplotlib and Seaborn calculate quartiles automatically—but you should understand the underlying math before relying on automated tools.
Introduction to Quartiles
Quartiles divide your dataset into four equal parts. Q1 (the 25th percentile) marks where 25% of your data falls below. Q2 (the 50th percentile) is your median. Q3 (the 75th percentile) marks where 75% of your data falls below. The spread between Q1 and Q3—called the interquartile range (IQR)—tells you where the middle 50% of your data lives.
Why should you care? Quartiles are resistant to outliers in ways that mean and standard deviation aren’t. When you’re analyzing salary data, housing prices, or any dataset with extreme values, quartiles give you a more honest picture of your data’s distribution. They’re also the foundation for box plots and the most common outlier detection method in exploratory data analysis.
Let’s walk through how to calculate quartiles in Python using NumPy, Pandas, and SciPy, then put that knowledge to work with visualizations and outlier detection.
Understanding Quartile Calculation Methods
Here’s something that trips up many developers: there’s no single “correct” way to calculate quartiles. Different statistical software packages use different interpolation methods, which is why you might get slightly different results from R, Excel, and Python.
NumPy and Pandas both support the following interpolation methods:
- linear (the default in both NumPy and Pandas): Interpolates linearly between data points
- lower: Returns the lower of the two surrounding data points
- higher: Returns the higher of the two surrounding data points
- midpoint: Returns the average of the lower and higher values
- nearest: Returns the nearest data point
For most analytical work, the linear method works fine. But if you’re comparing results with another tool or need exact reproducibility, you’ll need to match methods explicitly. Excel is a common source of confusion: QUARTILE and QUARTILE.INC use the same “inclusive” interpolation as NumPy’s linear default, while QUARTILE.EXC uses the “exclusive” method, so don’t expect identical results from the latter without adjustment.
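To see what the linear rule actually does, here is a minimal sketch that reimplements it by hand and checks it against NumPy. The helper name quantile_at is hypothetical, the method keyword assumes NumPy 1.22 or later, and pairing the exclusive (n + 1)·q rule with NumPy's "weibull" method is my reading of the two definitions rather than something either tool promises.

```python
import math
import numpy as np

def quantile_at(sorted_values, position):
    """Linearly interpolate at a fractional 0-based position (hypothetical helper)."""
    lo = math.floor(position)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = position - lo
    return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])

values = sorted([120, 145, 167, 189, 203, 215, 234, 256, 278, 312,
                 345, 389, 412, 456, 523, 589, 634, 712, 834, 1023])
n, q = len(values), 0.25

# Inclusive rule (NumPy's default "linear"): 0-based position = (n - 1) * q
q1_inclusive = quantile_at(values, (n - 1) * q)

# Exclusive rule: 1-based rank (n + 1) * q, i.e. 0-based position (n + 1) * q - 1
q1_exclusive = quantile_at(values, (n + 1) * q - 1)

print(q1_inclusive, np.quantile(values, q, method="linear"))
print(q1_exclusive, np.quantile(values, q, method="weibull"))
```

For this dataset the inclusive rule lands at position 4.75 and gives 212.0, while the exclusive rule lands at position 4.25 and gives 206.0. The sketch does not guard against positions that fall outside the data, which the exclusive rule can produce on very small samples.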
Calculating Quartiles with NumPy
NumPy provides two functions for quartile calculations: np.quantile() (which takes values from 0 to 1) and np.percentile() (which takes values from 0 to 100). They’re functionally identical—use whichever feels more intuitive.
import numpy as np
# Sample dataset: daily website visitors
visitors = np.array([120, 145, 167, 189, 203, 215, 234, 256, 278, 312,
345, 389, 412, 456, 523, 589, 634, 712, 834, 1023])
# Calculate quartiles using np.quantile()
q1 = np.quantile(visitors, 0.25)
q2 = np.quantile(visitors, 0.50) # This is the median
q3 = np.quantile(visitors, 0.75)
print(f"Q1 (25th percentile): {q1}")
print(f"Q2 (Median): {q2}")
print(f"Q3 (75th percentile): {q3}")
print(f"IQR: {q3 - q1}")
Output:
Q1 (25th percentile): 212.0
Q2 (Median): 328.5
Q3 (75th percentile): 539.5
IQR: 327.5
You can calculate all quartiles at once by passing a list:
# Calculate all quartiles in one call
quartiles = np.quantile(visitors, [0.25, 0.50, 0.75])
print(f"Quartiles: {quartiles}")
# Using np.percentile() instead
quartiles_pct = np.percentile(visitors, [25, 50, 75])
print(f"Quartiles (percentile): {quartiles_pct}")
To specify an interpolation method (the method keyword requires NumPy 1.22 or later; older releases call it interpolation):
# Compare different interpolation methods
methods = ['linear', 'lower', 'higher', 'midpoint', 'nearest']
for method in methods:
    q1 = np.quantile(visitors, 0.25, method=method)
    print(f"Q1 ({method}): {q1}")
Output:
Q1 (linear): 212.0
Q1 (lower): 203
Q1 (higher): 215
Q1 (midpoint): 209.0
Q1 (nearest): 215
Notice how the results differ. Q1 sits at position 4.75 in the sorted data, between 203 and 215, so linear interpolation gives 212.0, while the lower method gives 203. This is why explicit method specification matters when reproducibility is important.
Calculating Quartiles with Pandas
Pandas makes quartile calculations straightforward, especially when working with DataFrames. The .quantile() method works on both Series and DataFrame objects.
import pandas as pd
import numpy as np
# Create a sample DataFrame
df = pd.DataFrame({
'revenue': [12000, 15000, 18000, 22000, 25000, 28000, 32000,
38000, 45000, 52000, 61000, 78000, 95000, 120000],
'customers': [45, 52, 61, 73, 82, 91, 105, 118, 134, 152, 178, 201, 245, 312],
'satisfaction': [3.2, 3.5, 3.8, 4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0]
})
# Calculate quartiles for a single column
revenue_quartiles = df['revenue'].quantile([0.25, 0.50, 0.75])
print("Revenue Quartiles:")
print(revenue_quartiles)
Output:
Revenue Quartiles:
0.25    22750.0
0.50    35000.0
0.75    58750.0
Name: revenue, dtype: float64
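Pandas exposes the same interpolation choices as NumPy under a different keyword. The sketch below is an illustration rather than an exhaustive comparison: it assumes NumPy 1.22 or later for the method keyword, and the results dictionary is just a name chosen for this example.

```python
import numpy as np
import pandas as pd

revenue = pd.Series([12000, 15000, 18000, 22000, 25000, 28000, 32000,
                     38000, 45000, 52000, 61000, 78000, 95000, 120000])

# Pandas spells the option `interpolation`; NumPy (1.22+) spells it `method`
results = {}
for name in ["linear", "lower", "higher", "midpoint", "nearest"]:
    pandas_q1 = revenue.quantile(0.25, interpolation=name)
    numpy_q1 = np.quantile(revenue.to_numpy(), 0.25, method=name)
    results[name] = (float(pandas_q1), float(numpy_q1))
    print(f"{name:>8}: pandas={pandas_q1}  numpy={numpy_q1}")
```

When the same rule is named on both sides, the two libraries should agree, which makes passing it explicitly a cheap way to keep mixed NumPy and Pandas code consistent.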
For multiple columns at once:
# Calculate quartiles for all numeric columns
all_quartiles = df.quantile([0.25, 0.50, 0.75])
print("\nQuartiles for all columns:")
print(all_quartiles)
Output:
Quartiles for all columns:
      revenue  customers  satisfaction
0.25  22750.0      75.25         4.025
0.50  35000.0     111.50         4.350
0.75  58750.0     171.50         4.675
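If you only need a quick look, describe() reports the same quartiles alongside the count, mean, and extremes. This is a small sketch on the revenue column alone; the variable names are illustrative.

```python
import pandas as pd

revenue = pd.Series([12000, 15000, 18000, 22000, 25000, 28000, 32000,
                     38000, 45000, 52000, 61000, 78000, 95000, 120000],
                    name="revenue")

# describe() labels the quartiles "25%", "50%" and "75%"
summary = revenue.describe()
print(summary[["25%", "50%", "75%"]])

# The percentiles argument swaps in other cut points (the median is always kept)
tails = revenue.describe(percentiles=[0.1, 0.9])
print(tails[["10%", "50%", "90%"]])
```

describe() uses linear interpolation, so reach for .quantile() when you need a different method.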
Handling missing values is straightforward—Pandas excludes NaN values by default:
# Dataset with missing values
df_with_nulls = pd.DataFrame({
'values': [10, 20, np.nan, 40, 50, np.nan, 70, 80, 90, 100]
})
# Quartiles automatically exclude NaN
quartiles = df_with_nulls['values'].quantile([0.25, 0.50, 0.75])
print("Quartiles (NaN excluded automatically):")
print(quartiles)
# Explicit handling if needed
clean_quartiles = df_with_nulls['values'].dropna().quantile([0.25, 0.50, 0.75])
print("\nQuartiles (explicit dropna):")
print(clean_quartiles)
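NumPy behaves differently here, which matters if you move between the two libraries. A short sketch of the contrast, using the same values as above:

```python
import numpy as np

values = np.array([10, 20, np.nan, 40, 50, np.nan, 70, 80, 90, 100])

# np.quantile lets NaN propagate, so every result comes back as nan
plain = np.quantile(values, [0.25, 0.50, 0.75])

# np.nanquantile skips NaN, like the Pandas default
nan_aware = np.nanquantile(values, [0.25, 0.50, 0.75])

print("np.quantile:   ", plain)
print("np.nanquantile:", nan_aware)
```

With the two missing entries ignored, eight values remain and the quartiles come out as 35.0, 60.0, and 82.5.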
Using SciPy for Quartile Calculations
SciPy’s scipy.stats module provides iqr() for calculating the interquartile range directly, plus integration with other statistical functions.
from scipy import stats
import numpy as np
data = np.array([23, 45, 56, 67, 78, 89, 92, 103, 115, 128,
142, 156, 178, 195, 234, 267, 312, 389, 456, 523])
# Calculate IQR directly
iqr_value = stats.iqr(data)
print(f"Interquartile Range: {iqr_value}")
# Calculate IQR with specific percentiles (useful for custom ranges)
iqr_custom = stats.iqr(data, rng=(10, 90)) # 10th to 90th percentile
print(f"10-90 Percentile Range: {iqr_custom}")
# Combine with other statistical measures
print(f"\nDescriptive Statistics:")
print(f"Mean: {np.mean(data):.2f}")
print(f"Median: {np.median(data):.2f}")
print(f"Std Dev: {np.std(data):.2f}")
print(f"IQR: {iqr_value:.2f}")
print(f"Skewness: {stats.skew(data):.2f}")
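As a sanity check, stats.iqr() with its defaults should match Q3 minus Q1 from NumPy, since both use linear interpolation. The sketch below also shows the nan_policy argument; the appended NaN is added purely for illustration.

```python
import numpy as np
from scipy import stats

data = np.array([23, 45, 56, 67, 78, 89, 92, 103, 115, 128,
                 142, 156, 178, 195, 234, 267, 312, 389, 456, 523])

# Manual IQR from NumPy percentiles
q1, q3 = np.percentile(data, [25, 75])
manual_iqr = q3 - q1

# SciPy's one-call equivalent
scipy_iqr = stats.iqr(data)
print(manual_iqr, scipy_iqr)

# With missing values, ask SciPy to drop them instead of returning nan
with_gap = np.append(data.astype(float), np.nan)
iqr_omit = stats.iqr(with_gap, nan_policy="omit")
print(iqr_omit)
```

For this dataset Q1 is 86.25 and Q3 is 242.25, so all three values are 156.0.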
Visualizing Quartiles with Box Plots
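Box plots are the canonical visualization for quartile data. The box spans from Q1 to Q3, with a line at the median (Q2). By default in both Matplotlib and Seaborn, each whisker extends to the most extreme data point within 1.5×IQR of the box, and anything beyond the whiskers appears as an individual outlier point.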
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Sample data
np.random.seed(42)
data = np.concatenate([
np.random.normal(50, 10, 100),
np.random.normal(80, 15, 100),
[150, 155, 5, 2] # Adding some outliers
])
# Basic Matplotlib box plot
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Matplotlib version
bp = axes[0].boxplot(data, patch_artist=True)
bp['boxes'][0].set_facecolor('lightblue')
axes[0].set_title('Matplotlib Box Plot')
axes[0].set_ylabel('Value')
# Calculate and annotate quartiles
q1, q2, q3 = np.percentile(data, [25, 50, 75])
axes[0].annotate(f'Q1: {q1:.1f}', xy=(1.15, q1), fontsize=9)
axes[0].annotate(f'Q2: {q2:.1f}', xy=(1.15, q2), fontsize=9)
axes[0].annotate(f'Q3: {q3:.1f}', xy=(1.15, q3), fontsize=9)
# Seaborn version
sns.boxplot(y=data, ax=axes[1], color='lightgreen')
axes[1].set_title('Seaborn Box Plot')
axes[1].set_ylabel('Value')
plt.tight_layout()
plt.savefig('boxplot_comparison.png', dpi=150)
plt.show()
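To connect the picture back to the math, the sketch below reproduces the box plot's ingredients with NumPy alone, assuming the default whisker rule of 1.5×IQR. The sample and variable names are illustrative and separate from the plotted data above.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(50, 10, 200), [150, 155, 5, 2]])

# Box edges
q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1

# Fences: the furthest a whisker is allowed to reach
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences
inside = sample[(sample >= lower_fence) & (sample <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as an individual point
fliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print(f"Box: {q1:.1f} to {q3:.1f}")
print(f"Whiskers: {whisker_low:.1f} to {whisker_high:.1f}")
print(f"Points drawn individually: {np.sort(fliers)}")
```

Note that the whiskers end on actual data points, not on the fences themselves, which is why they are usually shorter than 1.5×IQR.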
For comparing multiple groups:
# Multi-group box plot
import pandas as pd  # needed in addition to the imports above
df = pd.DataFrame({
'value': np.concatenate([
np.random.normal(50, 10, 100),
np.random.normal(65, 12, 100),
np.random.normal(55, 8, 100)
]),
'group': ['A'] * 100 + ['B'] * 100 + ['C'] * 100
})
plt.figure(figsize=(10, 6))
sns.boxplot(x='group', y='value', data=df, palette='Set2')
plt.title('Quartile Distribution by Group')
plt.savefig('grouped_boxplot.png', dpi=150)
plt.show()
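The numbers behind a grouped box plot are easy to tabulate with groupby. This is a sketch on freshly generated data (the groups DataFrame and the Q1/Q2/Q3 column labels are names chosen for this example), so the values will not match the plot above exactly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
groups = pd.DataFrame({
    "value": np.concatenate([
        rng.normal(50, 10, 100),
        rng.normal(65, 12, 100),
        rng.normal(55, 8, 100),
    ]),
    "group": ["A"] * 100 + ["B"] * 100 + ["C"] * 100,
})

# One row per group, one column per quartile
per_group = (
    groups.groupby("group")["value"]
    .quantile([0.25, 0.50, 0.75])
    .unstack()
)
per_group.columns = ["Q1", "Q2", "Q3"]
per_group["IQR"] = per_group["Q3"] - per_group["Q1"]

print(per_group.round(2))
```

A table like this is handy when you need to quote the quartiles in a report rather than read them off the chart.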
Practical Application: Outlier Detection
The IQR method is the most common approach for identifying outliers. Any value below Q1 - 1.5×IQR or above Q3 + 1.5×IQR is flagged as an outlier.
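Before wrapping the rule in a function, it helps to work through the arithmetic once. The sketch below applies it to the daily visitors data from the NumPy section, using the default linear interpolation.

```python
import numpy as np

visitors = np.array([120, 145, 167, 189, 203, 215, 234, 256, 278, 312,
                     345, 389, 412, 456, 523, 589, 634, 712, 834, 1023])

q1, q3 = np.quantile(visitors, [0.25, 0.75])  # 212.0 and 539.5
iqr = q3 - q1                                 # 327.5

lower_fence = q1 - 1.5 * iqr                  # 212.0 - 491.25 = -279.25
upper_fence = q3 + 1.5 * iqr                  # 539.5 + 491.25 = 1030.75

flagged = visitors[(visitors < lower_fence) | (visitors > upper_fence)]
print(f"Fences: {lower_fence} to {upper_fence}")
print(f"Flagged values: {flagged}")
```

Nothing is flagged here: even the 1023-visitor day sits just inside the upper fence of 1030.75, a reminder that a value can look extreme and still pass the rule when the data is heavily skewed.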
import numpy as np
import pandas as pd
def detect_outliers_iqr(data, column=None):
    """
    Detect outliers using the IQR method.
    Returns (outlier_mask, lower_bound, upper_bound), where the boolean
    mask is True for each value that falls outside the bounds.
    """
    if isinstance(data, pd.DataFrame):
        if column is None:
            raise ValueError("Column name required for DataFrame input")
        values = data[column]
    else:
        values = np.array(data)
    q1 = np.percentile(values, 25)
    q3 = np.percentile(values, 75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    outlier_mask = (values < lower_bound) | (values > upper_bound)
    return outlier_mask, lower_bound, upper_bound
# Example usage
np.random.seed(42)
sales_data = pd.DataFrame({
'daily_sales': np.concatenate([
np.random.normal(1000, 200, 95),
[50, 75, 2500, 3000, 3200] # Outliers
])
})
outliers, lower, upper = detect_outliers_iqr(sales_data, 'daily_sales')
print(f"Lower bound: ${lower:.2f}")
print(f"Upper bound: ${upper:.2f}")
print(f"Number of outliers: {outliers.sum()}")
print(f"\nOutlier values:")
print(sales_data[outliers]['daily_sales'].values)
# Filter dataset to remove outliers
clean_data = sales_data[~outliers]
print(f"\nOriginal size: {len(sales_data)}")
print(f"Clean size: {len(clean_data)}")
Output:
Lower bound: $473.47
Upper bound: $1524.89
Number of outliers: 5
Outlier values:
[ 50. 75. 2500. 3000. 3200.]
Original size: 100
Clean size: 95
For a more robust implementation that handles multiple columns:
def remove_outliers_multi(df, columns, multiplier=1.5):
    """
    Remove outliers from multiple columns using IQR method.
    A row is kept only if it is within bounds in every listed column.
    """
    # Build the mask on df's own index so it aligns even after filtering or re-indexing
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        lower = q1 - multiplier * iqr
        upper = q3 + multiplier * iqr
        col_mask = (df[col] >= lower) & (df[col] <= upper)
        mask = mask & col_mask
    return df[mask]
# Usage (df here is the revenue/customers DataFrame from the Pandas section)
clean_df = remove_outliers_multi(df, ['revenue', 'customers'])
Quartiles are fundamental to exploratory data analysis. Master these techniques, and you’ll have a solid foundation for understanding data distribution, identifying anomalies, and communicating statistical insights through visualizations. The key is consistency—pick an interpolation method, document it, and stick with it throughout your analysis.