How to Detect Outliers Using Z-Score in Python
Key Insights
- Z-score measures how many standard deviations a data point lies from the mean, making it effective for detecting outliers in normally distributed data with a typical threshold of ±3
- SciPy's zscore() function provides a clean one-liner solution, but understanding the underlying NumPy implementation gives you more control over edge cases and custom thresholds
- For non-normal distributions or data with existing outliers, use the modified Z-score with median and MAD (Median Absolute Deviation) instead of mean and standard deviation
Introduction to Outliers and Why They Matter
Outliers are data points that deviate significantly from the rest of your dataset. They’re not just statistical curiosities—they can wreak havoc on your machine learning models, skew your summary statistics, and lead to incorrect conclusions. A single extreme value can shift your mean dramatically, inflate your standard deviation, and cause regression models to chase noise instead of signal.
Several methods exist for detecting outliers: the Interquartile Range (IQR) method, Isolation Forest for high-dimensional data, DBSCAN clustering, and the Z-score method. Each has its place. The Z-score approach shines when your data follows a roughly normal distribution and you need a quick, interpretable metric for how “unusual” each observation is.
If your data is heavily skewed or you’re working with categorical features, look elsewhere. But for continuous, bell-curve-shaped data? Z-score detection is fast, intuitive, and gets the job done.
Understanding the Z-Score Formula
The Z-score transforms any value into a standardized measure of distance from the mean:
Z = (x - μ) / σ
Where x is your data point, μ is the population mean, and σ is the standard deviation. The result tells you how many standard deviations away from the mean your value sits.
A Z-score of 0 means the value equals the mean. A Z-score of 2 means it’s two standard deviations above the mean. A Z-score of -1.5 means it’s 1.5 standard deviations below.
For normally distributed data, roughly 68% of values fall within ±1 standard deviation, 95% within ±2, and 99.7% within ±3. This is why ±3 is the most common threshold—values beyond this range occur less than 0.3% of the time by chance alone.
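Those percentages can be verified directly from the normal CDF; a quick sketch using scipy.stats.norm (assuming SciPy is installed):

```python
from scipy.stats import norm

# Fraction of a normal distribution falling within +/- k standard deviations
for k in (1, 2, 3):
    within = norm.cdf(k) - norm.cdf(-k)
    print(f"Within ±{k}σ: {within:.4%}")

# Two-sided tail probability of |Z| > 3
tail = 2 * (1 - norm.cdf(3))
print(f"P(|Z| > 3) = {tail:.4%}")  # about 0.27%
```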
Here’s how to calculate it manually for a single value:
def calculate_zscore(value, data):
    """Calculate Z-score for a single value given a dataset."""
    mean = sum(data) / len(data)
    variance = sum((x - mean) ** 2 for x in data) / len(data)
    std_dev = variance ** 0.5
    z_score = (value - mean) / std_dev
    return z_score

# Example usage
temperatures = [72, 74, 71, 73, 75, 74, 72, 98, 73, 71]
suspicious_value = 98
z = calculate_zscore(suspicious_value, temperatures)
print(f"Z-score for {suspicious_value}: {z:.2f}")
# Output: Z-score for 98: 2.96
That temperature reading of 98 has a Z-score of 2.96, nearly three standard deviations from the mean and well worth investigating. It doesn't quite cross the ±3 cutoff, though: the outlier itself inflates the standard deviation and pulls its own Z-score down, a limitation we'll revisit later.
Implementing Z-Score Detection with NumPy
Manual loops don’t scale. NumPy’s vectorized operations let you compute Z-scores for entire arrays efficiently:
import numpy as np

def detect_outliers_zscore(data, threshold=3):
    """
    Detect outliers using Z-score method.

    Parameters:
        data: array-like, input data
        threshold: float, Z-score threshold for outlier detection

    Returns:
        outlier_indices: array of indices where outliers occur
        z_scores: array of Z-scores for all data points
    """
    data = np.array(data)
    mean = np.mean(data)
    std = np.std(data)

    # Avoid division by zero
    if std == 0:
        return np.array([]), np.zeros_like(data)

    z_scores = (data - mean) / std

    # Find indices where absolute Z-score exceeds threshold
    outlier_indices = np.where(np.abs(z_scores) > threshold)[0]
    return outlier_indices, z_scores
# Example with sensor readings
sensor_data = np.array([23.1, 22.8, 23.4, 22.9, 23.0, 45.2, 23.1, 22.7,
                        23.3, 22.8, 23.0, 22.9, -5.1, 23.2, 23.0])

# The two bad readings inflate the standard deviation enough that neither
# quite exceeds |Z| > 3, so we use a lower threshold here; this masking
# effect is discussed in the limitations section below
outliers, scores = detect_outliers_zscore(sensor_data, threshold=2)
print(f"Outlier indices: {outliers}")
print(f"Outlier values: {sensor_data[outliers]}")
print(f"Their Z-scores: {np.round(scores[outliers], 2)}")
Output:
Outlier indices: [ 5 12]
Outlier values: [45.2 -5.1]
Their Z-scores: [ 2.44 -3.  ]
The function returns both the indices (useful for removal or flagging) and all Z-scores (useful for ranking how extreme each point is).
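Once you have the indices, acting on them takes one more step. A quick sketch of two common follow-ups (the hard-coded indices here are simply the ones flagged for this sensor array):

```python
import numpy as np

sensor_data = np.array([23.1, 22.8, 23.4, 22.9, 23.0, 45.2, 23.1, 22.7,
                        23.3, 22.8, 23.0, 22.9, -5.1, 23.2, 23.0])
outlier_idx = np.array([5, 12])  # indices flagged by detect_outliers_zscore

# Option 1: drop the flagged readings entirely
cleaned = np.delete(sensor_data, outlier_idx)

# Option 2: keep the array shape and mask flagged readings as NaN
masked = sensor_data.astype(float)
masked[outlier_idx] = np.nan

print(f"Cleaned length: {len(cleaned)}")            # 13
print(f"NaN count: {int(np.isnan(masked).sum())}")  # 2
```

Masking with NaN is usually the safer choice for time series, since dropping points shifts every later index.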
Using SciPy’s Built-in Z-Score Function
Why reinvent the wheel? SciPy provides scipy.stats.zscore() with useful parameters for handling edge cases:
from scipy import stats
import numpy as np
data = np.array([23.1, 22.8, 23.4, np.nan, 23.0, 45.2, 23.1, 22.7])
# Basic usage
z_scores = stats.zscore(data, nan_policy='omit')
print(f"Z-scores: {z_scores}")

# One-liner outlier detection (small sample, so use a lower cutoff)
threshold = 2
outlier_mask = np.abs(stats.zscore(data, nan_policy='omit')) > threshold
outliers = data[outlier_mask]
print(f"Outliers: {outliers}")
The nan_policy parameter is crucial for real-world data:
- 'propagate' (default): Returns NaN if any NaN exists
- 'omit': Ignores NaN values in calculations
- 'raise': Throws an error if NaN values are present
For most practical applications, use 'omit' to handle missing data gracefully.
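A tiny illustration of the difference between the two most useful policies (a sketch; the values assume SciPy's default population standard deviation, ddof=0):

```python
import numpy as np
from scipy import stats

data = np.array([1.0, 2.0, np.nan, 3.0])

# 'propagate' (default): a single NaN poisons every Z-score
print(stats.zscore(data))  # [nan nan nan nan]

# 'omit': NaN is excluded from the mean/std and stays NaN in the output
print(stats.zscore(data, nan_policy='omit'))
```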
Working with Pandas DataFrames
Real datasets have multiple columns. Here’s how to apply Z-score detection across a DataFrame:
import pandas as pd
import numpy as np
from scipy import stats
def flag_outliers_dataframe(df, columns=None, threshold=3):
    """
    Flag outliers in specified DataFrame columns using Z-score.

    Parameters:
        df: pandas DataFrame
        columns: list of column names (None = all numeric columns)
        threshold: Z-score threshold

    Returns:
        DataFrame with additional boolean columns marking outliers
    """
    df = df.copy()
    if columns is None:
        columns = df.select_dtypes(include=[np.number]).columns.tolist()

    for col in columns:
        z_scores = np.abs(stats.zscore(df[col], nan_policy='omit'))
        df[f'{col}_is_outlier'] = z_scores > threshold
    return df

def remove_outliers_any(df, columns=None, threshold=3):
    """Remove rows where ANY specified column has an outlier."""
    df_flagged = flag_outliers_dataframe(df, columns, threshold)
    outlier_cols = [col for col in df_flagged.columns if col.endswith('_is_outlier')]
    # Keep rows where no outliers exist
    mask = ~df_flagged[outlier_cols].any(axis=1)
    return df[mask]
# Example usage
data = {
    'temperature': [22, 23, 21, 22, 85, 23, 22, 21, 23, 22],
    'humidity': [45, 47, 46, 120, 45, 46, 47, 45, 46, 45],
    'pressure': [1013, 1012, 1014, 1013, 1012, 1013, 1014, 1012, 1013, 1012]
}
df = pd.DataFrame(data)

# Flag outliers
df_flagged = flag_outliers_dataframe(df, threshold=2.5)
print("Flagged DataFrame:")
print(df_flagged)

# Remove rows with any outlier
df_clean = remove_outliers_any(df, threshold=2.5)
print(f"\nOriginal rows: {len(df)}, Clean rows: {len(df_clean)}")
This approach lets you either flag outliers for review or remove them entirely, depending on your use case.
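A third option, when you'd rather not lose rows, is to cap extreme values at the Z-score boundaries instead of dropping them. This clipping approach (often called winsorizing) isn't part of the functions above; a minimal sketch:

```python
import numpy as np
import pandas as pd

def cap_outliers(series, threshold=3):
    """Clip a numeric Series to mean +/- threshold * std instead of dropping rows."""
    mean, std = series.mean(), series.std(ddof=0)  # ddof=0 matches stats.zscore
    return series.clip(lower=mean - threshold * std, upper=mean + threshold * std)

temps = pd.Series([22, 23, 21, 22, 85, 23, 22, 21, 23, 22])
capped = cap_outliers(temps, threshold=2.5)
print(capped.round(1).tolist())  # the 85 is pulled down to ~75.6; the rest are unchanged
```

Capping preserves row alignment across columns, which matters when other features in the same row are still valid.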
Visualizing Outliers
Numbers tell part of the story. Visualization makes outliers obvious:
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
def visualize_outliers(data, threshold=3, title="Outlier Detection"):
    """Create a visualization showing detected outliers."""
    data = np.array(data)
    z_scores = stats.zscore(data)
    outlier_mask = np.abs(z_scores) > threshold

    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Scatter plot with outliers highlighted
    ax1 = axes[0]
    indices = np.arange(len(data))
    ax1.scatter(indices[~outlier_mask], data[~outlier_mask],
                c='steelblue', label='Normal', alpha=0.7)
    ax1.scatter(indices[outlier_mask], data[outlier_mask],
                c='red', s=100, label='Outlier', edgecolors='black')
    ax1.set_xlabel('Index')
    ax1.set_ylabel('Value')
    ax1.set_title('Data Points with Outliers Highlighted')
    ax1.legend()

    # Distribution with threshold lines
    ax2 = axes[1]
    ax2.hist(data, bins=30, density=True, alpha=0.7, color='steelblue')
    mean, std = np.mean(data), np.std(data)
    lower_bound = mean - threshold * std
    upper_bound = mean + threshold * std
    ax2.axvline(lower_bound, color='red', linestyle='--',
                label=f'Lower bound ({lower_bound:.1f})')
    ax2.axvline(upper_bound, color='red', linestyle='--',
                label=f'Upper bound ({upper_bound:.1f})')
    ax2.axvline(mean, color='green', linestyle='-', label=f'Mean ({mean:.1f})')
    ax2.set_xlabel('Value')
    ax2.set_ylabel('Density')
    ax2.set_title(f'Distribution with ±{threshold}σ Boundaries')
    ax2.legend()

    plt.suptitle(title)
    plt.tight_layout()
    plt.savefig('outlier_visualization.png', dpi=150)
    plt.show()

# Generate sample data with outliers
np.random.seed(42)
normal_data = np.random.normal(50, 5, 100)
outliers = np.array([20, 85, 90])
data_with_outliers = np.concatenate([normal_data, outliers])

visualize_outliers(data_with_outliers, threshold=3)
The scatter plot immediately shows which points are flagged, while the histogram reveals whether your threshold makes sense given the data distribution.
Limitations and Best Practices
Z-score detection assumes your data is normally distributed. When it isn't, the mean and standard deviation become unreliable: existing outliers pull the mean toward them and inflate the standard deviation, making extreme values appear less extreme (an effect known as masking). It's a vicious cycle.
The modified Z-score uses the median and Median Absolute Deviation (MAD) instead, making it robust to outliers in the data itself:
import numpy as np
def modified_zscore(data, threshold=3.5):
    """
    Calculate modified Z-score using median and MAD.
    More robust to outliers than standard Z-score.

    The constant 0.6745 makes MAD consistent with standard deviation
    for normally distributed data.
    """
    data = np.array(data)
    median = np.median(data)
    mad = np.median(np.abs(data - median))

    # Avoid division by zero
    if mad == 0:
        return np.zeros_like(data), np.array([])

    modified_z = 0.6745 * (data - median) / mad
    outlier_indices = np.where(np.abs(modified_z) > threshold)[0]
    return modified_z, outlier_indices

# Compare standard vs modified Z-score
contaminated_data = [10, 12, 11, 13, 12, 11, 100, 12, 11, 13, 200]

# Standard Z-score (outliers affect mean/std)
from scipy import stats
standard_z = stats.zscore(contaminated_data)

# Modified Z-score (robust to outliers)
modified_z, outliers = modified_zscore(contaminated_data)

print("Standard Z-scores:", np.round(standard_z, 2))
print("Modified Z-scores:", np.round(modified_z, 2))
print(f"Modified method outliers at indices: {outliers}")
Notice how the standard Z-score is dampened by the outliers’ influence on the mean and standard deviation, while the modified version correctly identifies the extreme values.
Threshold guidelines:
- For small datasets (n < 100): Use ±2.5 to catch more potential outliers
- For medium datasets (100 < n < 1000): Use ±3 (standard)
- For large datasets (n > 1000): Consider ±3.5 to reduce false positives
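If you want those rules of thumb in code form, a trivial helper works (the cutoffs mirror the guidelines above and are heuristics, not hard rules):

```python
def suggest_threshold(n):
    """Suggest a Z-score cutoff from sample size (heuristic, not a hard rule)."""
    if n < 100:
        return 2.5   # small datasets: cast a wider net
    if n <= 1000:
        return 3.0   # the standard default
    return 3.5       # large datasets: fewer false positives

print(suggest_threshold(50), suggest_threshold(500), suggest_threshold(50000))
# 2.5 3.0 3.5
```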
Always visualize before removing. Sometimes outliers are the most interesting part of your data—they might represent fraud, equipment failure, or breakthrough discoveries. Don’t blindly delete them.