Pandas - Handle Missing Data (Complete Guide)
Key Insights
• Missing data in Pandas appears as NaN, None, or NaT (for datetimes), and reliable detection prevents silent errors in analysis pipelines
• The right handling strategy (dropping, filling, or interpolating) depends on the data: forward fill suits time series, while mean imputation for statistical analyses calls for its own validation
• Custom missing-value indicators and categorical data need specialized handling beyond the standard methods to preserve data integrity and analysis accuracy
Detecting Missing Data
Pandas provides multiple methods to identify missing values. The isna() and isnull() methods are interchangeable and return boolean DataFrames indicating missing values.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12],
    'D': [None, 'text', np.nan, 'data']
})
# Check for missing values
print(df.isna())
print(df.isnull().sum()) # Count per column
# Check if any missing values exist
print(df.isna().any()) # Per column
print(df.isna().any().any()) # Entire DataFrame
For conditional filtering, combine boolean indexing with missing value detection:
# Rows with any missing values
missing_rows = df[df.isna().any(axis=1)]
# Rows with all values present
complete_rows = df[df.notna().all(axis=1)]
# Specific column filtering
df_filtered = df[df['B'].notna()]
Dropping Missing Data
The dropna() method removes rows or columns containing missing values. Parameters such as how, thresh, and subset control how aggressively data is dropped.
df = pd.DataFrame({
    'A': [1, np.nan, 3, np.nan, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, 2, np.nan, 4, 5],
    'D': [1, 2, 3, 4, 5]
})
# Drop rows with any missing values
df_clean = df.dropna()
# Drop columns with any missing values
df_clean_cols = df.dropna(axis=1)
# Drop rows with all missing values
df_partial = df.dropna(how='all')
# Keep rows with at least 3 non-missing values
df_thresh = df.dropna(thresh=3)
# Drop based on specific columns
df_subset = df.dropna(subset=['A', 'B'])
For in-place modifications:
df.dropna(inplace=True) # Modifies original DataFrame
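A common extension of thresh is dropping whole columns whose missing fraction exceeds a cutoff. The 50% cutoff below is an illustrative choice, not a pandas default:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, np.nan, 3, np.nan, 5],
    'B': [np.nan, np.nan, np.nan, np.nan, 5],
    'C': [1, 2, 3, 4, 5],
})

# isna().mean() gives the fraction of missing values per column
missing_frac = df.isna().mean()

# Keep only columns at or below the (assumed) 50% cutoff
df_kept = df.loc[:, missing_frac <= 0.5]
print(df_kept.columns.tolist())  # column B exceeds the cutoff and is dropped
```

Because the mask is computed once up front, the same missing_frac Series can also feed a report of which columns were removed and why.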
Filling Missing Data
Filling strategies depend on data characteristics. Use scalar values, forward/backward fill, or statistical measures.
df = pd.DataFrame({
    'price': [100, np.nan, 150, np.nan, 200],
    'quantity': [10, 20, np.nan, 40, 50],
    'category': ['A', np.nan, 'B', 'A', np.nan]
})
# Fill with scalar value
df_filled = df.fillna(0)
# Fill different columns with different values
df_filled = df.fillna({'price': df['price'].mean(),
                       'quantity': df['quantity'].median(),
                       'category': 'Unknown'})
# Forward fill - propagate last valid observation
df_ffill = df.fillna(method='ffill')
# Backward fill
df_bfill = df.fillna(method='bfill')
# Limit consecutive fills
df_limited = df.fillna(method='ffill', limit=1)
For time series data, specify direction explicitly:
dates = pd.date_range('2024-01-01', periods=5)
ts = pd.Series([10, np.nan, np.nan, 40, 50], index=dates)
# Forward fill time series
ts_filled = ts.ffill()
# Combine forward and backward fill to also cover leading gaps
ts_filled = ts.ffill().bfill()
Interpolation Methods
Interpolation estimates missing values based on surrounding data points, useful for continuous numerical data.
df = pd.DataFrame({
    'value': [1, np.nan, np.nan, 4, 5, np.nan, 7],
    'time': pd.date_range('2024-01-01', periods=7)
})
# Linear interpolation
df['linear'] = df['value'].interpolate(method='linear')
# Polynomial interpolation
df['polynomial'] = df['value'].interpolate(method='polynomial', order=2)
# Time-based interpolation
df_time = df.set_index('time')
df_time['time_interp'] = df_time['value'].interpolate(method='time')
# Limit interpolation direction
df['forward_only'] = df['value'].interpolate(method='linear',
                                             limit_direction='forward')
# Limit number of consecutive NaNs to fill
df['limited'] = df['value'].interpolate(method='linear', limit=1)
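Interpolation can also be restricted to gaps that are bounded by valid values on both sides, so leading and trailing NaNs stay untouched; pandas exposes this via the limit_area parameter:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2, np.nan, 4, np.nan])

# limit_area='inside' fills only NaNs surrounded by valid observations
inside = s.interpolate(limit_area='inside')
print(inside.tolist())  # leading and trailing NaNs remain
```

This matters when extrapolating past the edges of a series would invent data, for example before the first sensor reading.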
For spline and nearest-neighbor interpolation, pandas delegates to SciPy internally, so SciPy must be installed:
# Spline interpolation
df['spline'] = df['value'].interpolate(method='spline', order=3)
# Nearest neighbor
df['nearest'] = df['value'].interpolate(method='nearest')
Replace Custom Missing Indicators
Real-world datasets often use custom indicators for missing values, such as -999, "N/A", or empty strings.
df = pd.DataFrame({
    'temperature': [20, -999, 25, 30, -999],
    'humidity': [50, 60, -1, 70, 80],
    'status': ['OK', 'N/A', 'OK', '', 'OK']
})
# Replace custom missing indicators with NaN
df_clean = df.replace({
    -999: np.nan,
    -1: np.nan,
    'N/A': np.nan,
    '': np.nan
})
# Replace per column
df['temperature'] = df['temperature'].replace(-999, np.nan)
# Replace multiple values
df_clean = df.replace([-999, -1, 'N/A', ''], np.nan)
# Using regex for string patterns
df['status'] = df['status'].replace(r'^\s*$', np.nan, regex=True)
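When the data comes from a file, custom indicators can be converted at load time instead of after the fact, using read_csv's na_values parameter. The in-memory buffer below just stands in for a real CSV file:

```python
import io
import pandas as pd

csv_data = io.StringIO(
    "temperature,status\n"
    "20,OK\n"
    "-999,N/A\n"
    "25,OK\n"
)

# Treat -999 and N/A as missing while parsing
# ('N/A' is in pandas' default NA list; listing it is explicit but harmless)
df = pd.read_csv(csv_data, na_values=[-999, 'N/A'])
print(df.isna().sum())
```

Converting at load time keeps sentinel values out of downstream dtypes entirely, so a column like temperature parses as float with NaN rather than as integers polluted by -999.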
Handling Categorical Missing Data
Categorical data requires different strategies than numerical data to preserve category relationships.
df = pd.DataFrame({
    'category': pd.Categorical(['A', 'B', np.nan, 'A', np.nan, 'C']),
    'value': [1, 2, 3, 4, 5, 6]
})
# Create indicator variable before filling, while NaNs are still present
df['category_missing'] = df['category'].isna().astype(int)
# Add 'Missing' as an explicit category, then fill
df['category'] = df['category'].cat.add_categories(['Missing'])
df['category'] = df['category'].fillna('Missing')
# Alternative: mode imputation, applied to the still-unfilled column
# mode_value = df['category'].mode()[0]
# df['category'] = df['category'].fillna(mode_value)
For one-hot encoded data:
df = pd.DataFrame({
    'color': ['red', np.nan, 'blue', 'red', np.nan]
})
# Create dummies with missing indicator
dummies = pd.get_dummies(df['color'], dummy_na=True, prefix='color')
print(dummies)
Advanced Missing Data Patterns
Analyze missing data patterns to inform handling strategies and detect systematic issues.
df = pd.DataFrame({
    'A': [1, np.nan, 3, np.nan, 5, 6, np.nan],
    'B': [np.nan, 2, 3, np.nan, 5, 6, 7],
    'C': [1, 2, np.nan, 4, np.nan, 6, 7],
    'D': [1, 2, 3, 4, 5, 6, 7]
})
# Missing data summary
missing_summary = pd.DataFrame({
    'missing_count': df.isna().sum(),
    'missing_pct': (df.isna().sum() / len(df)) * 100
})
print(missing_summary)
# Correlation of missingness
missing_corr = df.isna().corr()
# Identify rows with two or more missing values
df['missing_count'] = df.isna().sum(axis=1)
high_missing = df[df['missing_count'] >= 2]
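Beyond per-column counts, counting the distinct row-level patterns of missingness shows whether values tend to go missing together; value_counts on the boolean mask does this directly:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, np.nan, 3, np.nan],
    'B': [np.nan, np.nan, 3, np.nan],
    'C': [1, 2, 3, 4],
})

# Each unique True/False row of the mask is one missingness pattern;
# value_counts tallies how often each pattern occurs
patterns = df.isna().value_counts()
print(patterns)
```

A pattern that dominates the counts (here, A and B missing together) suggests a systematic cause such as a joined table or a sensor outage, rather than values missing at random.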
Validation After Handling
Verify missing data handling preserved data integrity and didn’t introduce errors.
def validate_missing_handling(df_original, df_processed):
    """Validate missing data handling results."""
    # Check no new missing values introduced
    original_missing = df_original.isna().sum().sum()
    processed_missing = df_processed.isna().sum().sum()
    print(f"Original missing: {original_missing}")
    print(f"Processed missing: {processed_missing}")
    # Check shape preservation
    assert df_original.shape == df_processed.shape, "Shape changed"
    # Check data types preserved
    for col in df_original.columns:
        assert df_original[col].dtype == df_processed[col].dtype, \
            f"Type changed for {col}"
    # Statistical validation for numerical columns
    for col in df_original.select_dtypes(include=[np.number]).columns:
        orig_mean = df_original[col].mean()
        proc_mean = df_processed[col].mean()
        pct_change = abs((proc_mean - orig_mean) / orig_mean) * 100
        print(f"{col} mean change: {pct_change:.2f}%")
# Example usage
df_original = pd.DataFrame({'A': [1, np.nan, 3, 4, 5]})
df_filled = df_original.fillna(df_original['A'].mean())
validate_missing_handling(df_original, df_filled)
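A mean check alone can mislead: mean imputation leaves the column mean unchanged by construction while shrinking its spread. Comparing standard deviations before and after filling is a useful complement, sketched here on a single Series:

```python
import numpy as np
import pandas as pd

original = pd.Series([1.0, np.nan, 3.0, 4.0, 5.0])
filled = original.fillna(original.mean())

# Mean imputation preserves the mean exactly...
print(original.mean(), filled.mean())
# ...but the standard deviation shrinks, since the filled
# value sits at the center of the distribution
print(original.std(), filled.std())
```

If downstream analysis depends on variance (confidence intervals, regression standard errors), a large drop here signals that a different strategy, such as interpolation or model-based imputation, may be needed.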
Missing data handling is foundational to data analysis. Choose strategies based on data characteristics, validate results, and document decisions for reproducible analysis pipelines.