Pandas: Handling Missing Data

Every real-world dataset has holes. Missing data shows up as `NaN` (Not a Number), `None`, or `NaT` (Not a Time) in Pandas, and how you handle these gaps directly impacts the quality of your analysis.

Key Insights

  • Missing data handling isn’t one-size-fits-all—your strategy should depend on why data is missing and how you’ll use it downstream.
  • Detection comes before treatment: always profile your dataset’s missing data patterns before deciding whether to drop, fill, or interpolate.
  • Pandas provides a complete toolkit from simple deletion to sophisticated interpolation, but the wrong choice can introduce bias or destroy signal in your analysis.

Introduction to Missing Data in Pandas

Data goes missing for many reasons: sensors fail, users skip form fields, joins don’t match, or data simply wasn’t collected. The cause matters because it should inform your strategy. Data missing completely at random is different from data missing because of a systematic issue.

Let’s start with a DataFrame that represents common missing data scenarios:

import pandas as pd
import numpy as np

# Create a DataFrame with intentional missing values
data = {
    'customer_id': [1, 2, 3, 4, 5, 6],
    'age': [25, np.nan, 35, 42, np.nan, 29],
    'income': [50000, 62000, None, 75000, 48000, np.nan],
    'signup_date': pd.to_datetime(['2023-01-15', '2023-02-20', None, 
                                    '2023-04-10', '2023-05-05', '2023-06-12']),
    'region': ['North', 'South', 'North', None, 'East', 'West']
}

df = pd.DataFrame(data)
print(df)

Output:

   customer_id   age   income signup_date region
0            1  25.0  50000.0  2023-01-15  North
1            2   NaN  62000.0  2023-02-20  South
2            3  35.0      NaN         NaT  North
3            4  42.0  75000.0  2023-04-10   None
4            5   NaN  48000.0  2023-05-05   East
5            6  29.0      NaN  2023-06-12   West

Notice how Pandas normalizes None to NaN in the numeric columns and to NaT in the datetime column, while the object-dtype region column stores None as-is. The detection methods treat all three markers uniformly, which makes the next step straightforward.
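
A quick check of this normalization on minimal examples:

```python
import pandas as pd

# None in a numeric column forces an upcast: the column becomes
# float64 and the None is stored as NaN
numeric = pd.Series([1, 2, None])
print(numeric.dtype)            # float64
print(numeric.isna().tolist())  # [False, False, True]

# None in a datetime context becomes NaT
dates = pd.to_datetime(['2023-01-15', None])
print(dates)
```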

Detecting Missing Data

Before fixing anything, you need to understand what you’re dealing with. Pandas provides several methods for missing data detection.

The isnull() and isna() methods are aliases—they do exactly the same thing. Use whichever reads better to you. The notnull() method returns the inverse.
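
A minimal illustration on a single Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([25.0, np.nan, 35.0])

print(s.isnull().tolist())   # [False, True, False]
print(s.isna().tolist())     # same result (isna() is an alias)
print(s.notnull().tolist())  # [True, False, True]

# The boolean masks can filter rows directly
print(s[s.notnull()])
```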

# Generate a comprehensive missing data report
def missing_data_report(df):
    """Create a detailed missing data summary for each column."""
    report = pd.DataFrame({
        'missing_count': df.isnull().sum(),
        'missing_percent': (df.isnull().sum() / len(df) * 100).round(2),
        'dtype': df.dtypes
    })
    report = report[report['missing_count'] > 0].sort_values(
        'missing_percent', ascending=False
    )
    return report

print(missing_data_report(df))

Output:

             missing_count  missing_percent           dtype
age                      2            33.33         float64
income                   2            33.33         float64
signup_date              1            16.67  datetime64[ns]
region                   1            16.67          object

For a quick overview, df.info() shows non-null counts per column, and df.isnull().sum().sum() gives you the total missing value count across the entire DataFrame.

Removing Missing Data with dropna()

Sometimes the cleanest solution is deletion. The dropna() method gives you fine-grained control over what gets removed.

# Sample DataFrame for demonstration
df_demo = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, 2, 3, 4],
    'C': [1, np.nan, np.nan, 4],
    'D': [1, 2, 3, 4]
})

# Drop rows where ANY value is missing (default behavior)
print("Drop any missing:")
print(df_demo.dropna())

# Drop rows where ALL values are missing
print("\nDrop only if all missing:")
print(df_demo.dropna(how='all'))

# Drop columns instead of rows
print("\nDrop columns with missing values:")
print(df_demo.dropna(axis=1))

# Keep rows with at least 3 non-null values
print("\nKeep rows with at least 3 valid values:")
print(df_demo.dropna(thresh=3))

# Only consider specific columns
print("\nDrop based on columns A and D only:")
print(df_demo.dropna(subset=['A', 'D']))

The thresh parameter is particularly useful. If you have 10 columns and can tolerate a few missing values per row, dropna(thresh=8) keeps rows with at least 8 non-null values.
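
A sketch of that scenario, using a hypothetical 10-column frame where one row is missing three values:

```python
import numpy as np
import pandas as pd

# Hypothetical 10-column frame: row 0 is complete, row 1 misses 3 values
wide = pd.DataFrame(
    [list(range(10)),
     [0, 1, 2, np.nan, 4, np.nan, 6, np.nan, 8, 9]],
    columns=[f'c{i}' for i in range(10)]
)

# thresh=8 requires at least 8 non-null values per row:
# row 0 has 10 and survives, row 1 has only 7 and is dropped
print(wide.dropna(thresh=8))
```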

Warning: Dropping data can introduce bias. If missing values correlate with certain outcomes, you’re systematically removing a segment of your population.
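
A toy illustration of that bias, with hypothetical numbers where missingness is concentrated among low earners:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the rows missing 'age' are exactly the low-income rows
df_bias = pd.DataFrame({
    'income': [30000, 32000, 90000, 95000],
    'age':    [np.nan, np.nan, 45.0, 50.0],
})

print(df_bias['income'].mean())           # 61750.0
print(df_bias.dropna()['income'].mean())  # 92500.0 (only high earners remain)
```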

Filling Missing Data with fillna()

When deletion isn’t appropriate, filling is your next option. The fillna() method offers several approaches.

# Create a fresh DataFrame
df_fill = pd.DataFrame({
    'temperature': [72.0, np.nan, 68.0, np.nan, 75.0, np.nan],
    'humidity': [45, 50, np.nan, 55, np.nan, 60],
    'category': ['A', None, 'B', 'A', None, 'C']
})

# Fill with a static value (fillna returns a new object, so assign the result)
humidity_filled = df_fill['humidity'].fillna(0)

# Fill with column statistics
df_stats = df_fill.copy()
df_stats['temperature'] = df_stats['temperature'].fillna(
    df_stats['temperature'].mean()
)
df_stats['humidity'] = df_stats['humidity'].fillna(
    df_stats['humidity'].median()
)
df_stats['category'] = df_stats['category'].fillna(
    df_stats['category'].mode()[0]
)

print("Filled with mean/median/mode:")
print(df_stats)

# Forward fill: propagate last valid value forward
print("\nForward fill:")
print(df_fill['temperature'].ffill())

# Backward fill: propagate next valid value backward
print("\nBackward fill:")
print(df_fill['temperature'].bfill())

Forward and backward filling work well for time-ordered data where you can reasonably assume continuity. For cross-sectional data, mean or median imputation is often more appropriate.

Opinion: Don’t blindly fill with the mean. For skewed distributions, the median is more robust. For categorical data, mode makes sense, but consider whether “Unknown” might be more honest.
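
A sketch of that "Unknown" approach, reusing the category values from df_fill above and adding an indicator column so downstream analysis can still see which rows were imputed:

```python
import pandas as pd

# Same category values as in df_fill above
df_cat = pd.DataFrame({'category': ['A', None, 'B', 'A', None, 'C']})

# Record which rows were missing before overwriting them
df_cat['category_missing'] = df_cat['category'].isna()
df_cat['category'] = df_cat['category'].fillna('Unknown')
print(df_cat)
```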

Interpolation Techniques

Interpolation estimates missing values based on surrounding data points. It’s more sophisticated than simple filling and particularly valuable for time series.

# Time series data with gaps
dates = pd.date_range('2024-01-01', periods=10, freq='D')
ts_data = pd.DataFrame({
    'date': dates,
    'value': [100, np.nan, np.nan, 115, 120, np.nan, 130, np.nan, np.nan, 150]
})
ts_data.set_index('date', inplace=True)

# Linear interpolation (default)
print("Linear interpolation:")
print(ts_data['value'].interpolate())

# Polynomial interpolation (order 2; requires SciPy to be installed)
print("\nPolynomial interpolation:")
print(ts_data['value'].interpolate(method='polynomial', order=2))

# Time-based interpolation (uses actual time distances)
print("\nTime-based interpolation:")
print(ts_data['value'].interpolate(method='time'))

# Limit the number of consecutive NaNs to fill
print("\nLimited interpolation (max 1 consecutive):")
print(ts_data['value'].interpolate(limit=1))

Linear interpolation assumes constant rate of change between known points. Polynomial interpolation can capture curves but may produce unrealistic values at boundaries. Time-based interpolation respects actual temporal distances, which matters when your data has irregular intervals.

When to use interpolation: Time series with gradual changes (temperature, stock prices, sensor readings). When to avoid: Categorical data, data with sudden jumps, or when missing values span too large a gap.
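
Before interpolating, it can help to measure the largest gap first. The run-length trick below, grouping by a cumulative count of valid values, is a common idiom rather than a dedicated Pandas API:

```python
import numpy as np
import pandas as pd

s = pd.Series([100, np.nan, np.nan, 115, 120, np.nan, 130,
               np.nan, np.nan, np.nan, 150], dtype='float64')

# Label runs: the counter increments at every valid value, so all the
# NaNs inside one gap share the same label
gap_id = s.notna().cumsum()
max_gap = int(s.isna().groupby(gap_id).sum().max())
print(max_gap)  # 3

# Only interpolate when no gap exceeds a chosen tolerance
filled = s.interpolate() if max_gap <= 2 else s
```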

Advanced Strategies and Best Practices

Real datasets often have custom missing value markers like -999, “N/A”, or empty strings. Handle these before standard missing data operations:

# Replace custom missing markers with NaN
df_custom = pd.DataFrame({
    'score': [85, -999, 90, -999, 78],
    'status': ['active', 'N/A', '', 'active', 'N/A']
})

df_clean = df_custom.replace({
    'score': {-999: np.nan},
    'status': {'N/A': np.nan, '': np.nan}
})
print(df_clean)
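
If the data arrives as a CSV, the same markers can be normalized while parsing instead, via read_csv's na_values parameter (sketched here with an in-memory file):

```python
import io
import pandas as pd

csv_text = "score,status\n85,active\n-999,N/A\n90,\n-999,active\n"

# na_values adds custom markers on top of the parser defaults
# (which already treat empty strings as missing)
df_csv = pd.read_csv(io.StringIO(csv_text), na_values=['N/A', -999])
print(df_csv.isna().sum())
```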

When working with grouped data, you often want to fill missing values with group-specific statistics:

# Fill missing values with group means
df_grouped = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'North'],
    'sales': [100, np.nan, 150, np.nan, 120]
})

df_grouped['sales'] = df_grouped.groupby('region')['sales'].transform(
    lambda x: x.fillna(x.mean())
)
print(df_grouped)

Most Pandas aggregation functions have a skipna parameter that defaults to True. Be aware of this behavior:

data = pd.Series([1, 2, np.nan, 4])
print(f"Sum (skipna=True): {data.sum()}")      # 7.0
print(f"Sum (skipna=False): {data.sum(skipna=False)}")  # nan

Here’s a complete pipeline that chains multiple operations:

def clean_missing_data(df):
    """Complete missing data handling pipeline."""
    df = df.copy()
    
    # Step 1: Replace custom markers
    df = df.replace(['N/A', '', -999], np.nan)
    
    # Step 2: Drop rows missing critical fields
    df = df.dropna(subset=['customer_id'])
    
    # Step 3: Fill numeric columns with median
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
    
    # Step 4: Fill categorical columns with mode
    categorical_cols = df.select_dtypes(include=['object']).columns
    for col in categorical_cols:
        mode = df[col].mode()
        df[col] = df[col].fillna(mode.iloc[0] if not mode.empty else 'Unknown')
    
    # Step 5: Forward-fill datetime gaps (interpolate() does not support
    # linear interpolation on datetime columns)
    datetime_cols = df.select_dtypes(include=['datetime64']).columns
    df[datetime_cols] = df[datetime_cols].ffill()
    
    return df

Conclusion

Missing data handling requires judgment. There’s no universal solution—your strategy should depend on why data is missing, how much is missing, and what you’re trying to accomplish downstream.

Quick reference:

Method                 Use Case
dropna()               Missing data is rare and random; you can afford to lose rows
fillna(value)          You have a sensible default or sentinel value
fillna(mean/median)    Numeric data where central tendency is meaningful
ffill()/bfill()        Time-ordered data with expected continuity
interpolate()          Time series with gradual, continuous changes
groupby().transform()  Missing values should reflect group characteristics

Start by profiling your missing data. Understand the patterns. Then choose the method that preserves the integrity of your analysis while minimizing information loss. When in doubt, document your assumptions—future you will appreciate it.
