# Pandas: Handling Missing Data
Every real-world dataset has holes. Missing data shows up as `NaN` (Not a Number), `None`, or `NaT` (Not a Time) in Pandas, and how you handle these gaps directly impacts the quality of your analysis.
## Key Insights

- Missing data handling isn’t one-size-fits-all; your strategy should depend on why data is missing and how you’ll use it downstream.
- Detection comes before treatment: always profile your dataset’s missing data patterns before deciding whether to drop, fill, or interpolate.
- Pandas provides a complete toolkit, from simple deletion to sophisticated interpolation, but the wrong choice can introduce bias or destroy signal in your analysis.
## Introduction to Missing Data in Pandas
Data goes missing for many reasons: sensors fail, users skip form fields, joins don’t match, or data simply wasn’t collected. The cause matters because it should inform your strategy. Data missing completely at random is different from data missing because of a systematic issue.
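One quick way to probe whether missingness is systematic is to compare missing-rates across groups. Here is a minimal sketch with made-up data (the `region`/`income` column names are just for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical survey data: income is missing far more often in one region
df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'South', 'South'],
    'income': [50000, 62000, np.nan, np.nan, np.nan, 48000],
})

# Missing-rate of 'income' per group; a large gap between groups
# suggests the data is NOT missing completely at random
missing_rate = df['income'].isnull().groupby(df['region']).mean()
print(missing_rate)  # North 0.00, South 0.75
```

If the rates differ sharply, dropping rows would disproportionately remove one group, which is exactly the bias discussed below.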
Let’s start with a DataFrame that represents common missing data scenarios:
```python
import pandas as pd
import numpy as np

# Create a DataFrame with intentional missing values
data = {
    'customer_id': [1, 2, 3, 4, 5, 6],
    'age': [25, np.nan, 35, 42, np.nan, 29],
    'income': [50000, 62000, None, 75000, 48000, np.nan],
    'signup_date': pd.to_datetime(['2023-01-15', '2023-02-20', None,
                                   '2023-04-10', '2023-05-05', '2023-06-12']),
    'region': ['North', 'South', 'North', None, 'East', 'West']
}
df = pd.DataFrame(data)
print(df)
```
Output:

```text
   customer_id   age   income signup_date region
0            1  25.0  50000.0  2023-01-15  North
1            2   NaN  62000.0  2023-02-20  South
2            3  35.0      NaN         NaT  North
3            4  42.0  75000.0  2023-04-10   None
4            5   NaN  48000.0  2023-05-05   East
5            6  29.0      NaN  2023-06-12   West
```
Notice how Pandas normalizes `None` to `NaN` for numeric columns and `NaT` for datetime columns. This consistency makes detection straightforward.
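A quick demonstration of that normalization, and of `pd.isna()` treating all three markers uniformly:

```python
import pandas as pd
import numpy as np

# None is stored as NaN in float columns and NaT in datetime columns
s_num = pd.Series([1.0, None])
s_dt = pd.Series(pd.to_datetime(['2023-01-15', None]))
print(s_num)  # second value prints as NaN
print(s_dt)   # second value prints as NaT

# pd.isna() detects all three markers the same way
print(pd.isna([np.nan, None, pd.NaT]))  # [ True  True  True]
```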
## Detecting Missing Data
Before fixing anything, you need to understand what you’re dealing with. Pandas provides several methods for missing data detection.
The `isnull()` and `isna()` methods are aliases—they do exactly the same thing. Use whichever reads better to you. The `notnull()` method returns the inverse.
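A toy Series confirms the aliasing:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])

# isna() and isnull() return identical boolean masks
print(s.isna().equals(s.isnull()))      # True

# notnull() is the element-wise inverse of isnull()
print(s.notnull().equals(~s.isnull()))  # True
```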
```python
# Generate a comprehensive missing data report
def missing_data_report(df):
    """Create a detailed missing data summary for each column."""
    report = pd.DataFrame({
        'missing_count': df.isnull().sum(),
        'missing_percent': (df.isnull().sum() / len(df) * 100).round(2),
        'dtype': df.dtypes
    })
    report = report[report['missing_count'] > 0].sort_values(
        'missing_percent', ascending=False
    )
    return report

print(missing_data_report(df))
```
Output:

```text
             missing_count  missing_percent           dtype
age                      2            33.33         float64
income                   2            33.33         float64
signup_date              1            16.67  datetime64[ns]
region                   1            16.67          object
```
For a quick overview, `df.info()` shows non-null counts per column, and `df.isnull().sum().sum()` gives you the total missing value count across the entire DataFrame.
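For example, on a small hypothetical frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'age': [25, np.nan, 35],
    'income': [50000, 62000, None],
})

# Per-column non-null counts (the same numbers df.info() reports)
print(df.count())                # age 2, income 2

# Total missing values across the whole DataFrame
print(df.isnull().sum().sum())   # 2
```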
## Removing Missing Data with `dropna()`

Sometimes the cleanest solution is deletion. The `dropna()` method gives you fine-grained control over what gets removed.
```python
# Sample DataFrame for demonstration
df_demo = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, 2, 3, 4],
    'C': [1, np.nan, np.nan, 4],
    'D': [1, 2, 3, 4]
})

# Drop rows where ANY value is missing (default behavior)
print("Drop any missing:")
print(df_demo.dropna())

# Drop rows where ALL values are missing
print("\nDrop only if all missing:")
print(df_demo.dropna(how='all'))

# Drop columns instead of rows
print("\nDrop columns with missing values:")
print(df_demo.dropna(axis=1))

# Keep rows with at least 3 non-null values
print("\nKeep rows with at least 3 valid values:")
print(df_demo.dropna(thresh=3))

# Only consider specific columns
print("\nDrop based on columns A and D only:")
print(df_demo.dropna(subset=['A', 'D']))
```
The `thresh` parameter is particularly useful. If you have 10 columns and can tolerate a few missing values per row, `dropna(thresh=8)` keeps rows with at least 8 non-null values.
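A small sketch of `thresh` in action on a made-up 4-column frame:

```python
import pandas as pd
import numpy as np

# A row survives dropna(thresh=3) only if it has at least 3 non-null values
df = pd.DataFrame({
    'a': [1, np.nan, 3],
    'b': [1, np.nan, np.nan],
    'c': [1, 2, 3],
    'd': [1, 2, 3],
})

kept = df.dropna(thresh=3)
print(kept.index.tolist())  # [0, 2]: row 1 has only 2 valid values
```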
**Warning:** Dropping data can introduce bias. If missing values correlate with certain outcomes, you’re systematically removing a segment of your population.
## Filling Missing Data with `fillna()`

When deletion isn’t appropriate, filling is your next option. The `fillna()` method offers several approaches.
```python
# Create a fresh DataFrame
df_fill = pd.DataFrame({
    'temperature': [72.0, np.nan, 68.0, np.nan, 75.0, np.nan],
    'humidity': [45, 50, np.nan, 55, np.nan, 60],
    'category': ['A', None, 'B', 'A', None, 'C']
})

# Fill with a static value (fillna returns a new Series; assign the result)
humidity_filled = df_fill['humidity'].fillna(0)

# Fill with column statistics
df_stats = df_fill.copy()
df_stats['temperature'] = df_stats['temperature'].fillna(
    df_stats['temperature'].mean()
)
df_stats['humidity'] = df_stats['humidity'].fillna(
    df_stats['humidity'].median()
)
df_stats['category'] = df_stats['category'].fillna(
    df_stats['category'].mode()[0]
)
print("Filled with mean/median/mode:")
print(df_stats)

# Forward fill: propagate last valid value forward
print("\nForward fill:")
print(df_fill['temperature'].ffill())

# Backward fill: propagate next valid value backward
print("\nBackward fill:")
print(df_fill['temperature'].bfill())
```
Forward and backward filling work well for time-ordered data where you can reasonably assume continuity. For cross-sectional data, mean or median imputation is often more appropriate.
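When forward filling time-ordered data, the `limit` parameter keeps a long outage from being papered over with a stale reading. A sketch with hypothetical sensor values:

```python
import pandas as pd
import numpy as np

# Hypothetical sensor readings with a long outage in the middle
s = pd.Series([10.0, np.nan, np.nan, np.nan, 14.0])

# limit=1 carries each value forward at most one step,
# so most of the outage stays visibly missing
filled = s.ffill(limit=1)
print(filled)  # 10.0, 10.0, NaN, NaN, 14.0
```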
**Opinion:** Don’t blindly fill with the mean. For skewed distributions, the median is more robust. For categorical data, mode makes sense, but consider whether “Unknown” might be more honest.
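A quick illustration of why the median is safer on skewed data (the values are made up; a single outlier drags the mean):

```python
import pandas as pd
import numpy as np

# Mostly ~30, plus one extreme outlier
s = pd.Series([30, 32, 31, 29, np.nan, 400])

print(s.mean())    # 104.4 -- distorted by the outlier
print(s.median())  # 31.0  -- still a "typical" value

# Median imputation keeps the filled value in the realistic range
print(s.fillna(s.median()).iloc[4])  # 31.0
```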
## Interpolation Techniques
Interpolation estimates missing values based on surrounding data points. It’s more sophisticated than simple filling and particularly valuable for time series.
```python
# Time series data with gaps
dates = pd.date_range('2024-01-01', periods=10, freq='D')
ts_data = pd.DataFrame({
    'date': dates,
    'value': [100, np.nan, np.nan, 115, 120, np.nan, 130, np.nan, np.nan, 150]
})
ts_data.set_index('date', inplace=True)

# Linear interpolation (default)
print("Linear interpolation:")
print(ts_data['value'].interpolate())

# Polynomial interpolation (order 2; requires SciPy to be installed)
print("\nPolynomial interpolation:")
print(ts_data['value'].interpolate(method='polynomial', order=2))

# Time-based interpolation (uses actual time distances; needs a DatetimeIndex)
print("\nTime-based interpolation:")
print(ts_data['value'].interpolate(method='time'))

# Limit the number of consecutive NaNs to fill
print("\nLimited interpolation (max 1 consecutive):")
print(ts_data['value'].interpolate(limit=1))
```
Linear interpolation assumes constant rate of change between known points. Polynomial interpolation can capture curves but may produce unrealistic values at boundaries. Time-based interpolation respects actual temporal distances, which matters when your data has irregular intervals.
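A minimal example of how `method='time'` and the default linear method diverge on an irregular index (values are hypothetical):

```python
import pandas as pd
import numpy as np

# Irregularly spaced timestamps: 1 day before the gap, 9 days after it
idx = pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-11'])
s = pd.Series([0.0, np.nan, 100.0], index=idx)

# Linear interpolation ignores the index spacing: midpoint of 0 and 100
print(s.interpolate(method='linear').iloc[1])  # 50.0

# Time-based interpolation weights by elapsed time: 1 day into a 10-day gap
print(s.interpolate(method='time').iloc[1])    # ~10.0
```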
**When to use interpolation:** Time series with gradual changes (temperature, stock prices, sensor readings). **When to avoid:** Categorical data, data with sudden jumps, or when missing values span too large a gap.
## Advanced Strategies and Best Practices

Real datasets often have custom missing value markers like `-999`, `'N/A'`, or empty strings. Handle these before standard missing data operations:
```python
# Replace custom missing markers with NaN
df_custom = pd.DataFrame({
    'score': [85, -999, 90, -999, 78],
    'status': ['active', 'N/A', '', 'active', 'N/A']
})

df_clean = df_custom.replace({
    'score': {-999: np.nan},
    'status': {'N/A': np.nan, '': np.nan}
})
print(df_clean)
```
When working with grouped data, you often want to fill missing values with group-specific statistics:
```python
# Fill missing values with group means
df_grouped = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'North'],
    'sales': [100, np.nan, 150, np.nan, 120]
})

df_grouped['sales'] = df_grouped.groupby('region')['sales'].transform(
    lambda x: x.fillna(x.mean())
)
print(df_grouped)
```
Most Pandas aggregation functions have a `skipna` parameter that defaults to `True`. Be aware of this behavior:
```python
data = pd.Series([1, 2, np.nan, 4])
print(f"Sum (skipna=True): {data.sum()}")               # 7.0
print(f"Sum (skipna=False): {data.sum(skipna=False)}")  # nan
```
Here’s a complete pipeline that chains multiple operations:
```python
def clean_missing_data(df):
    """Complete missing data handling pipeline."""
    df = df.copy()
    # Step 1: Replace custom markers
    df = df.replace(['N/A', '', -999], np.nan)
    # Step 2: Drop rows missing critical fields
    df = df.dropna(subset=['customer_id'])
    # Step 3: Fill numeric columns with median
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
    # Step 4: Fill categorical columns with mode
    categorical_cols = df.select_dtypes(include=['object']).columns
    for col in categorical_cols:
        mode = df[col].mode()
        df[col] = df[col].fillna(mode.iloc[0] if not mode.empty else 'Unknown')
    # Step 5: Forward-fill datetime gaps (interpolate() does not support
    # datetime columns directly)
    datetime_cols = df.select_dtypes(include=['datetime64']).columns
    df[datetime_cols] = df[datetime_cols].ffill()
    return df
```
## Conclusion

Missing data handling requires judgment. There’s no universal solution; your strategy should depend on why data is missing, how much is missing, and what you’re trying to accomplish downstream.
Quick reference:

| Method | Use Case |
|---|---|
| `dropna()` | Missing data is rare and random; you can afford to lose rows |
| `fillna(value)` | You have a sensible default or sentinel value |
| `fillna(mean/median)` | Numeric data where central tendency is meaningful |
| `ffill()` / `bfill()` | Time-ordered data with expected continuity |
| `interpolate()` | Time series with gradual, continuous changes |
| `groupby().transform()` | Missing values should reflect group characteristics |
Start by profiling your missing data. Understand the patterns. Then choose the method that preserves the integrity of your analysis while minimizing information loss. When in doubt, document your assumptions—future you will appreciate it.