How to Detect Outliers Using IQR in Python

Key Insights

  • The IQR method identifies outliers by calculating boundaries at 1.5× the interquartile range below Q1 and above Q3, making it resistant to extreme values that would skew mean-based approaches.
  • Unlike Z-score methods, which work best on roughly normal data, the IQR rule makes no distributional assumption. It holds up on moderately skewed data, though heavy skew can push many legitimate tail values past the fences.
  • Always visualize your outliers before removing them—what looks like an outlier statistically might be a legitimate data point that carries important information.

Introduction to Outliers and IQR

Outliers are data points that deviate significantly from the rest of your dataset. They can emerge from measurement errors, data entry mistakes, or genuinely unusual observations. Regardless of their origin, outliers wreak havoc on statistical analyses. They inflate standard deviations, skew means, and can completely derail machine learning models that assume well-behaved data.

The Interquartile Range (IQR) method offers a robust approach to outlier detection. Unlike methods based on mean and standard deviation, IQR uses the median and quartiles—statistics that aren’t influenced by extreme values. This makes IQR particularly valuable when you suspect your data already contains outliers, since those very outliers won’t corrupt your detection mechanism.
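To make the robustness claim concrete, here is a minimal sketch comparing mean and standard deviation with median and IQR on a small made-up sample, before and after one extreme value is appended. The sample values and variable names are illustrative assumptions, not data from a real study.

```python
import numpy as np

# Hypothetical sample: a tight cluster of readings
clean = np.array([10, 11, 12, 12, 13, 13, 13, 14, 14, 14, 15, 15, 15], dtype=float)
# Same sample with a single extreme value appended
dirty = np.append(clean, 1000.0)

def summarize(values):
    """Return mean, standard deviation, median, and IQR for a 1D array."""
    q1, q3 = np.percentile(values, [25, 75])
    return {
        "mean": values.mean(),
        "std": values.std(),
        "median": np.median(values),
        "iqr": q3 - q1,
    }

clean_stats = summarize(clean)
dirty_stats = summarize(dirty)

for name in ("mean", "std", "median", "iqr"):
    print(f"{name:>6}: clean={clean_stats[name]:8.2f}  with outlier={dirty_stats[name]:8.2f}")
```

The mean and standard deviation jump by orders of magnitude, while the median and IQR move by a fraction of a unit. That is why fences built from quartiles stay usable even when the data already contains outliers.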

The core idea is straightforward: calculate the range where the middle 50% of your data lives, then flag anything too far outside that range as suspicious. It’s simple, interpretable, and works across a wide variety of data distributions.

Understanding the IQR Formula

The IQR method relies on three key values: Q1 (the 25th percentile), Q3 (the 75th percentile), and the IQR itself (Q3 - Q1).

Here’s the logic:

  • Q1: 25% of data points fall below this value
  • Q3: 75% of data points fall below this value
  • IQR: The spread of the middle 50% of your data

The standard rule defines outliers as any point falling below Q1 - 1.5 × IQR (lower fence) or above Q3 + 1.5 × IQR (upper fence). The 1.5 multiplier is a convention popularized by statistician John Tukey, and it is a practical one: on normally distributed data the fences sit about 2.7 standard deviations from the mean, so roughly 99.3% of points fall inside them and only about 0.7% are flagged.

import numpy as np

# Sample data with obvious outliers
data = np.array([12, 15, 14, 10, 13, 15, 14, 100, 13, 12, 14, -50, 11, 15, 13])

# Calculate quartiles and IQR
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1

# Calculate fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"Q1: {q1}")
print(f"Q3: {q3}")
print(f"IQR: {iqr}")
print(f"Lower fence: {lower_fence}")
print(f"Upper fence: {upper_fence}")

Output:

Q1: 12.0
Q3: 14.5
IQR: 2.5
Lower fence: 8.25
Upper fence: 18.25

Any value below 8.25 or above 18.25 gets flagged. In our sample, that’s -50 and 100—exactly what we’d expect.
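The 99.3% figure quoted for the 1.5 multiplier can be checked directly. The sketch below uses the standard library's `statistics.NormalDist` to compute the theoretical quartiles of a standard normal distribution, places the fences, and measures how much probability lies between them. It assumes an idealized normal distribution, not a finite sample.

```python
from statistics import NormalDist

# Idealized standard normal distribution (mean 0, standard deviation 1)
std_normal = NormalDist(mu=0.0, sigma=1.0)

# Theoretical quartiles and IQR
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1

# Tukey-style fences with the conventional 1.5 multiplier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Probability mass that falls between the fences
coverage = std_normal.cdf(upper_fence) - std_normal.cdf(lower_fence)

print(f"Fences at {lower_fence:.3f} and {upper_fence:.3f} standard deviations")
print(f"Share of a normal distribution inside the fences: {coverage:.4f}")
```

On real samples the flagged share will vary around this value, and it can be much larger when the data is skewed or heavy-tailed.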

Implementing IQR Outlier Detection from Scratch

Let’s build a reusable function that handles the complete outlier detection workflow. This function will return both a boolean mask (useful for filtering) and the actual outlier values.

import numpy as np
import pandas as pd
from typing import Tuple, Union

def detect_outliers_iqr(
    data: Union[np.ndarray, pd.Series],
    multiplier: float = 1.5
) -> Tuple[np.ndarray, np.ndarray, float, float]:
    """
    Detect outliers using the IQR method.
    
    Parameters:
    -----------
    data : array-like
        Input data (1D array or pandas Series)
    multiplier : float
        IQR multiplier for fence calculation (default: 1.5)
    
    Returns:
    --------
    outlier_mask : boolean array indicating outliers
    outliers : array of outlier values
    lower_fence : lower boundary
    upper_fence : upper boundary
    """
    data = np.asarray(data)
    
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    iqr = q3 - q1
    
    lower_fence = q1 - multiplier * iqr
    upper_fence = q3 + multiplier * iqr
    
    outlier_mask = (data < lower_fence) | (data > upper_fence)
    outliers = data[outlier_mask]
    
    return outlier_mask, outliers, lower_fence, upper_fence


def remove_outliers_iqr(
    data: Union[np.ndarray, pd.Series],
    multiplier: float = 1.5
) -> np.ndarray:
    """Remove outliers and return cleaned data."""
    data = np.asarray(data)
    outlier_mask, _, _, _ = detect_outliers_iqr(data, multiplier)
    return data[~outlier_mask]


# Example usage
data = np.array([12, 15, 14, 10, 13, 15, 14, 100, 13, 12, 14, -50, 11, 15, 13])

mask, outliers, lower, upper = detect_outliers_iqr(data)
print(f"Outliers found: {outliers}")
print(f"Number of outliers: {len(outliers)}")
print(f"Percentage of data: {len(outliers)/len(data)*100:.1f}%")

cleaned_data = remove_outliers_iqr(data)
print(f"Original length: {len(data)}, Cleaned length: {len(cleaned_data)}")

Output:

Outliers found: [100 -50]
Number of outliers: 2
Percentage of data: 13.3%
Original length: 15, Cleaned length: 13

The multiplier parameter lets you adjust sensitivity. Use 1.5 for standard outlier detection or 3.0 to catch only extreme outliers.
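As a quick illustration of that sensitivity, the sketch below applies both multipliers to the same small sample. The helper `iqr_outliers` and the `readings` values are hypothetical, chosen so that one value is only mildly unusual and another is extreme.

```python
import numpy as np

def iqr_outliers(values, multiplier):
    """Return the values that fall outside the IQR fences for a given multiplier."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    spread = q3 - q1
    lower = q1 - multiplier * spread
    upper = q3 + multiplier * spread
    return values[(values < lower) | (values > upper)]

# Hypothetical readings: 20 is mildly unusual, 100 is extreme
readings = [10, 11, 12, 12, 13, 13, 13, 14, 14, 14, 15, 15, 15, 20, 100]

mild = iqr_outliers(readings, 1.5)     # standard fences
extreme = iqr_outliers(readings, 3.0)  # wider fences

print(f"multiplier=1.5 flags: {mild}")
print(f"multiplier=3.0 flags: {extreme}")
```

With the standard multiplier both 20 and 100 are flagged. With 3.0 only 100 remains, which is the behavior to expect when you want to keep borderline points.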

Visualizing Outliers with Box Plots

Box plots were designed to show the IQR method visually. The box spans Q1 to Q3, the line inside is the median, each whisker extends to the most extreme data point that still lies within the fences, and individual points beyond the whiskers are outliers.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Generate sample data with outliers
np.random.seed(42)
normal_data = np.random.normal(50, 10, 100)
outliers_to_add = np.array([5, 10, 95, 100, 105])
data_with_outliers = np.concatenate([normal_data, outliers_to_add])

# Create figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Standard box plot
axes[0].boxplot(data_with_outliers)  # vertical orientation is the default
axes[0].set_title('Box Plot with Outliers')
axes[0].set_ylabel('Value')

# Seaborn version with more detail
sns.boxplot(y=data_with_outliers, ax=axes[1], color='lightblue')
sns.stripplot(y=data_with_outliers, ax=axes[1], color='red', alpha=0.3, size=4)
axes[1].set_title('Box Plot with Individual Points')
axes[1].set_ylabel('Value')

plt.tight_layout()
plt.savefig('outlier_boxplot.png', dpi=150)
plt.show()

# Report the detected outliers and fences
# (detect_outliers_iqr is defined in the previous section)
mask, outliers, lower, upper = detect_outliers_iqr(data_with_outliers)
print(f"Detected outliers: {sorted(outliers)}")
print(f"Fences: [{lower:.2f}, {upper:.2f}]")

The strip plot overlay shows every data point, making it easy to see the density of your data and exactly which points fall outside the whiskers.
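In the usual Tukey-style box plot, a whisker does not end at the fence itself. It ends at the most extreme observation that still lies inside the fence. The sketch below computes those whisker ends by hand for the small sample used earlier, so you can compare them with what a plot shows. It is a plain NumPy calculation, not a call into any plotting library's internals.

```python
import numpy as np

# Same small sample used in the formula section
data = np.array([12, 15, 14, 10, 13, 15, 14, 100, 13, 12, 14, -50, 11, 15, 13])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low = inside.min()
whisker_high = inside.max()

# Points beyond the fences are drawn individually
fliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"Fences:   [{lower_fence}, {upper_fence}]")
print(f"Whiskers: [{whisker_low}, {whisker_high}]")
print(f"Fliers:   {sorted(fliers.tolist())}")
```

Here the fences are 8.25 and 18.25, but the whiskers stop at 10 and 15 because those are the last real observations inside the fences.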

Handling Outliers in Real Datasets

Real-world data analysis typically involves DataFrames with multiple columns. Here’s how to apply IQR detection across an entire dataset:

import pandas as pd
import numpy as np

def detect_outliers_dataframe(
    df: pd.DataFrame,
    columns: list = None,
    multiplier: float = 1.5
) -> pd.DataFrame:
    """
    Detect outliers in multiple DataFrame columns.
    
    Returns a DataFrame with boolean values indicating outliers.
    """
    if columns is None:
        columns = df.select_dtypes(include=[np.number]).columns.tolist()
    
    outlier_df = pd.DataFrame(index=df.index)
    
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        
        lower = q1 - multiplier * iqr
        upper = q3 + multiplier * iqr
        
        outlier_df[col] = (df[col] < lower) | (df[col] > upper)
    
    return outlier_df


def get_outlier_summary(df: pd.DataFrame, outlier_df: pd.DataFrame) -> pd.DataFrame:
    """Generate a summary of outliers per column."""
    summary = pd.DataFrame({
        'outlier_count': outlier_df.sum(),
        'outlier_percentage': (outlier_df.sum() / len(df) * 100).round(2),
        'total_rows': len(df)
    })
    return summary


# Create sample housing dataset
np.random.seed(42)
n_samples = 500

housing_data = pd.DataFrame({
    'price': np.concatenate([
        np.random.normal(300000, 50000, n_samples - 10),
        np.array([50000, 75000, 800000, 950000, 1200000,  # price outliers
                  45000, 60000, 850000, 1100000, 1500000])
    ]),
    'sqft': np.concatenate([
        np.random.normal(1800, 300, n_samples - 5),
        np.array([400, 500, 5000, 6000, 7000])  # sqft outliers
    ]),
    'bedrooms': np.concatenate([
        np.random.choice([2, 3, 4], n_samples - 3),
        np.array([0, 8, 10])  # bedroom outliers
    ])
})

# Detect outliers
outlier_flags = detect_outliers_dataframe(housing_data)
summary = get_outlier_summary(housing_data, outlier_flags)

print("Outlier Summary:")
print(summary)
print("\n" + "="*50)

# Find rows with ANY outlier
rows_with_outliers = outlier_flags.any(axis=1)
print(f"\nRows containing at least one outlier: {rows_with_outliers.sum()}")

# Remove rows with outliers in specific columns
clean_housing = housing_data[~outlier_flags['price']].copy()
print(f"Rows after removing price outliers: {len(clean_housing)}")

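Running this prints a per-column summary followed by the two row counts. The exact numbers depend on the random draws, so check these points in your own output:

  • price and sqft: the injected extremes are flagged, and a few points from the normal draws may be flagged too, since roughly 0.7% of normally distributed values fall outside the 1.5 × IQR fences.
  • bedrooms: with values drawn from 2, 3, and 4, the quartiles will almost certainly land at Q1 = 2 and Q3 = 4, giving fences of -1 and 7. The injected 8 and 10 are flagged, but 0 is not, even though a zero-bedroom listing looks suspicious. Domain rules catch what the statistics miss.
  • Row overlap: the injected outliers all sit in the last rows of each column, so many rows are flagged in more than one column. The count of rows with at least one outlier is therefore smaller than the sum of the per-column counts.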

This approach gives you flexibility: remove rows with any outlier, remove only rows with outliers in critical columns, or simply flag outliers for manual review.
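The three options can be sketched side by side on a tiny frame. The `homes` DataFrame and its values are made up for illustration, and the flags are computed inline rather than with the functions above so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical listings: one extreme price and one extreme square footage
homes = pd.DataFrame({
    "price": [300, 310, 295, 305, 320, 290, 900, 315],          # in thousands
    "sqft": [1800, 1750, 1900, 5000, 1850, 1700, 1950, 1820],
})

# Column-wise quartiles and fences (each is a Series indexed by column name)
q1 = homes.quantile(0.25)
q3 = homes.quantile(0.75)
iqr = q3 - q1
flags = (homes < q1 - 1.5 * iqr) | (homes > q3 + 1.5 * iqr)

# Option 1: drop rows that contain an outlier in any column
no_outliers_anywhere = homes[~flags.any(axis=1)]

# Option 2: drop rows only when a critical column (here, price) is an outlier
no_price_outliers = homes[~flags["price"]]

# Option 3: keep every row and add a flag for manual review
reviewed = homes.assign(needs_review=flags.any(axis=1))

print(f"Option 1 keeps {len(no_outliers_anywhere)} rows")
print(f"Option 2 keeps {len(no_price_outliers)} rows")
print(f"Option 3 marks {reviewed['needs_review'].sum()} rows for review")
```

Option 3 is the most conservative choice, since nothing is discarded until someone has looked at the flagged rows.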

When to Use IQR vs. Other Methods

The IQR method isn’t always the right choice. Here’s when to use it and when to consider alternatives:

Use IQR when:

  • Your data is univariate or you’re checking columns independently
  • You don’t know (or can’t assume) the underlying distribution
  • You need an interpretable, easy-to-explain method
  • Your dataset is small to medium-sized

Consider Z-score when:

  • Your data is approximately normally distributed
  • You need a probabilistic interpretation (e.g., “this point is 3 standard deviations away”)

Consider DBSCAN or Isolation Forest when:

  • You have multivariate data where outliers exist in combinations of features
  • You’re dealing with high-dimensional data
  • You need to detect clusters of anomalies
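To see why column-by-column checks can miss multivariate outliers, consider the hypothetical example below. Two features move together, and one point breaks that relationship while staying well within the normal range of each feature. As a simple stand-in for a real multivariate method (this is not DBSCAN or Isolation Forest), the sketch applies the same IQR rule to the difference between the two features.

```python
import numpy as np

def iqr_mask(values, multiplier=1.5):
    """Boolean mask of values outside the IQR fences."""
    q1, q3 = np.percentile(values, [25, 75])
    spread = q3 - q1
    return (values < q1 - multiplier * spread) | (values > q3 + multiplier * spread)

# Hypothetical features: y tracks x closely, with small alternating offsets
x = np.arange(1, 21, dtype=float)
y = x + np.tile([0.5, -0.5], 10)

# One point that is ordinary in x and in y, but breaks the x-y relationship
x = np.append(x, 5.0)
y = np.append(y, 16.0)

flagged_x = int(iqr_mask(x).sum())           # per-column check on x
flagged_y = int(iqr_mask(y).sum())           # per-column check on y
flagged_joint = np.flatnonzero(iqr_mask(y - x))  # check on the relationship

print(f"Flagged in x alone: {flagged_x}")
print(f"Flagged in y alone: {flagged_y}")
print(f"Flagged when x and y are considered together: index {flagged_joint.tolist()}")
```

The difference trick works here only because the relationship between the two features is known in advance. When it is not, that is the situation where methods such as Isolation Forest or DBSCAN are worth the extra complexity.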

IQR limitations to keep in mind:

  • It’s strictly univariate—a point might be normal in each dimension separately but anomalous when dimensions are combined
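  • Its fences sit the same distance from each quartile, so on heavily skewed distributions it tends to flag much of the long tail; a transformation (such as a log) may be needed first
  • The 1.5 multiplier is a convention, not a law; adjust it based on your domain knowledge

# Quick comparison: IQR vs Z-score on skewed data
# (reuses detect_outliers_iqr from the earlier section)
import numpy as np
from scipy import stats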

# Heavily right-skewed data (like income)
skewed_data = np.random.exponential(scale=50000, size=1000)
skewed_data = np.append(skewed_data, [500000, 600000])  # Add outliers

# IQR method
iqr_mask, iqr_outliers, _, _ = detect_outliers_iqr(skewed_data)

# Z-score method (3 standard deviations)
z_scores = np.abs(stats.zscore(skewed_data))
zscore_mask = z_scores > 3
zscore_outliers = skewed_data[zscore_mask]

print(f"IQR detected {len(iqr_outliers)} outliers")
print(f"Z-score detected {len(zscore_outliers)} outliers")

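On skewed data the two methods disagree, and neither is automatically right. Extreme values inflate the mean and standard deviation, so the Z-score threshold drifts outward and can mask real outliers. The IQR fences stay put, but on a long right tail like this one they typically flag far more points, many of them legitimate tail values rather than errors. Compare the two counts and inspect the flagged points before deciding which to trust.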
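The limitations list mentions transforming heavily skewed data before applying the fences. Here is a minimal sketch of that idea. It assumes lognormal-shaped data (a common rough model for income-like values, and a different shape from the exponential draws above) and builds the sample from evenly spaced quantiles rather than random draws so the result is repeatable. The `iqr_mask` helper and the parameters 10.0 and 1.0 are illustrative choices.

```python
import numpy as np
from statistics import NormalDist

def iqr_mask(values, multiplier=1.5):
    """Boolean mask of values outside the IQR fences."""
    q1, q3 = np.percentile(values, [25, 75])
    spread = q3 - q1
    return (values < q1 - multiplier * spread) | (values > q3 + multiplier * spread)

# Deterministic lognormal-shaped sample built from evenly spaced normal quantiles
n = 1000
std_normal = NormalDist()
z = np.array([std_normal.inv_cdf((i + 0.5) / n) for i in range(n)])
skewed = np.exp(10.0 + 1.0 * z)  # right-skewed, strictly positive

# IQR fences on the raw values versus on the log-transformed values
raw_flags = iqr_mask(skewed)
log_flags = iqr_mask(np.log(skewed))

print(f"Flagged on raw scale: {raw_flags.sum()} of {n}")
print(f"Flagged on log scale: {log_flags.sum()} of {n}")
```

On the raw scale the fences cut off a sizeable slice of the right tail even though no outliers were injected. After the log transform the data is roughly symmetric and only a handful of points are flagged. A log transform requires strictly positive values and is not the right fix for every skewed dataset.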

Conclusion

The IQR method provides a reliable, distribution-free approach to outlier detection. The key steps are straightforward: calculate Q1 and Q3, compute the IQR, establish fences at 1.5×IQR beyond the quartiles, and flag anything outside those boundaries.

Remember these practical guidelines:

  1. Always visualize before removing—box plots are your friend
  2. Consider the domain context; not every statistical outlier is a data quality issue
  3. Use the multiplier parameter to adjust sensitivity for your specific use case
  4. For multivariate outlier detection, graduate to methods like Isolation Forest

The complete code from this article is available in our GitHub repository. Start with the IQR method as your baseline, and only reach for more complex approaches when you have a clear reason to do so.
