Pandas - Fill NaN Values (fillna) with Examples
Key Insights
- The fillna() method provides multiple strategies for handling missing data, including constant values, forward/backward fill, and interpolation methods that preserve data integrity
- Different fill strategies suit different data types: forward fill works well for time series, mean/median for numerical data, and mode for categorical data
- Combining fillna() with groupby operations enables context-aware imputation, where missing values are filled based on related group statistics rather than global values
Understanding Missing Data in Pandas
Pandas represents missing data using NaN (Not a Number) from NumPy, None, or pd.NA. Before filling missing values, identify them using isna() or isnull():
import pandas as pd
import numpy as np
df = pd.DataFrame({
'product': ['A', 'B', 'C', 'D', 'E'],
'price': [10.5, np.nan, 15.0, np.nan, 20.0],
'quantity': [100, 200, np.nan, 150, np.nan],
'category': ['electronics', None, 'electronics', 'furniture', 'furniture']
})
print(df.isna().sum())
# product     0
# price       2
# quantity    2
# category    1
# dtype: int64
Fill with Constant Values
The simplest approach fills all NaN values with a constant:
# Fill all NaN with 0
df_filled = df.fillna(0)
# Fill with different values per column
df_filled = df.fillna({
'price': 0.0,
'quantity': 0,
'category': 'unknown'
})
# Fill only specific columns
df['price'] = df['price'].fillna(0)
For production systems, use inplace=True cautiously: it modifies the original DataFrame directly, and current pandas guidance favors plain reassignment instead:
df.fillna(0, inplace=True) # Modifies df directly
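A minimal sketch of the reassignment pattern, which leaves the source DataFrame untouched:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'price': [10.5, np.nan, 15.0]})

# Reassignment returns a filled copy instead of mutating df
df_filled = df.fillna(0)

print(df['price'].isna().sum())         # original still has its NaN
print(df_filled['price'].isna().sum())  # the filled copy has none
```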
Forward Fill and Backward Fill
Forward fill (ffill) propagates the last valid observation forward. Backward fill (bfill) uses the next valid observation:
df = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=6),
'temperature': [20.5, np.nan, np.nan, 22.0, np.nan, 23.5]
})
# Forward fill - propagate last valid value
df['temp_ffill'] = df['temperature'].ffill()
# Backward fill - use next valid value
df['temp_bfill'] = df['temperature'].bfill()
print(df)
# date temperature temp_ffill temp_bfill
# 0 2024-01-01 20.5 20.5 20.5
# 1 2024-01-02 NaN 20.5 22.0
# 2 2024-01-03 NaN 20.5 22.0
# 3 2024-01-04 22.0 22.0 22.0
# 4 2024-01-05 NaN 22.0 23.5
# 5 2024-01-06 23.5 23.5 23.5
Limit propagation to prevent filling across large gaps:
# Fill only 1 consecutive NaN
df['temperature'].ffill(limit=1)
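To see what limit does, here is a small sketch with a three-NaN run, where only the first gap after a valid value gets filled:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# limit=1 fills only the first NaN in each consecutive run
filled = s.ffill(limit=1)
print(filled.tolist())  # [1.0, 1.0, nan, nan, 5.0]
```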
Fill with Statistical Measures
Use mean, median, or mode for numerical imputation:
df = pd.DataFrame({
'sensor_1': [10, 20, np.nan, 40, 50],
'sensor_2': [15, np.nan, 35, np.nan, 55],
'status': ['active', 'active', None, 'inactive', 'active']
})
# Fill with mean
df['sensor_1'] = df['sensor_1'].fillna(df['sensor_1'].mean())
# Fill with median (robust to outliers)
df['sensor_2'] = df['sensor_2'].fillna(df['sensor_2'].median())
# Fill categorical with mode
mode_value = df['status'].mode()[0]
df['status'] = df['status'].fillna(mode_value)
print(df)
# sensor_1 sensor_2 status
# 0 10.0 15.0 active
# 1 20.0 35.0 active
# 2 30.0 35.0 active
# 3 40.0 35.0 inactive
# 4 50.0 55.0 active
Group-Based Filling
Fill missing values based on group statistics for context-aware imputation:
df = pd.DataFrame({
'store': ['A', 'A', 'A', 'B', 'B', 'B'],
'product': ['X', 'Y', 'X', 'X', 'Y', 'X'],
'sales': [100, np.nan, 150, 200, np.nan, 250]
})
# Fill with group mean
df['sales'] = df.groupby('store')['sales'].transform(
lambda x: x.fillna(x.mean())
)
print(df)
# store product sales
# 0 A X 100.0
# 1 A Y 125.0 # Filled with store A mean
# 2 A X 150.0
# 3 B X 200.0
# 4 B Y 225.0 # Filled with store B mean
# 5 B X 250.0
For multi-level grouping:
df['sales'] = df.groupby(['store', 'product'])['sales'].transform(
lambda x: x.fillna(x.mean())
)
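Note that with the sample data above, the (store, product) groups for product 'Y' contain only NaN, so they have no mean and stay NaN after the group fill. One way to handle this, sketched here, is chaining a global fallback:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'store': ['A', 'A', 'A', 'B', 'B', 'B'],
    'product': ['X', 'Y', 'X', 'X', 'Y', 'X'],
    'sales': [100, np.nan, 150, 200, np.nan, 250]
})

# Group-level fill: groups that are entirely NaN remain NaN
df['sales'] = df.groupby(['store', 'product'])['sales'].transform(
    lambda x: x.fillna(x.mean())
)

# Global fallback for anything the group fill could not cover
df['sales'] = df['sales'].fillna(df['sales'].mean())
print(df['sales'].tolist())  # [100.0, 175.0, 150.0, 200.0, 175.0, 250.0]
```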
Method Chaining and Multiple Strategies
Combine multiple fill strategies in sequence:
df = pd.DataFrame({
'A': [1, np.nan, np.nan, 4, 5],
'B': [np.nan, 2, np.nan, np.nan, 5],
'C': ['x', None, 'y', None, 'z']
})
# Chain multiple strategies
df_filled = (df
    .ffill(limit=1)                      # Forward fill first, at most 1 per gap
    .fillna(df.mean(numeric_only=True))  # Then fill numeric columns with their means
    .fillna('missing')                   # Finally, a constant for whatever remains
)
Interpolation for Time Series
For time-based data, interpolation provides smoother filling:
df = pd.DataFrame({
'timestamp': pd.date_range('2024-01-01', periods=10, freq='h'),
'value': [10, 12, np.nan, np.nan, 20, np.nan, 25, 28, np.nan, 32]
})
# Linear interpolation
df['linear'] = df['value'].interpolate(method='linear')
# Time-based interpolation
df_indexed = df.set_index('timestamp')
df_indexed['time_interp'] = df_indexed['value'].interpolate(method='time')
print(df_indexed[['value', 'linear', 'time_interp']])
#                      value     linear  time_interp
# 2024-01-01 00:00:00   10.0  10.000000    10.000000
# 2024-01-01 01:00:00   12.0  12.000000    12.000000
# 2024-01-01 02:00:00    NaN  14.666667    14.666667
# 2024-01-01 03:00:00    NaN  17.333333    17.333333
# 2024-01-01 04:00:00   20.0  20.000000    20.000000
Polynomial interpolation for non-linear trends (this method requires SciPy to be installed):
df['value'].interpolate(method='polynomial', order=2)
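The difference between method='linear' and method='time' only appears with irregularly spaced timestamps: 'linear' treats observations as equally spaced, while 'time' weights by elapsed time. A small illustration:

```python
import pandas as pd
import numpy as np

# Irregular spacing: 1 hour before the gap, 3 hours after it
idx = pd.to_datetime(['2024-01-01 00:00', '2024-01-01 01:00', '2024-01-01 04:00'])
s = pd.Series([10.0, np.nan, 40.0], index=idx)

print(s.interpolate(method='linear').iloc[1])  # 25.0 - midpoint, spacing ignored
print(s.interpolate(method='time').iloc[1])    # 17.5 - only 1/4 of the way in time
```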
Conditional Filling
Apply different fill strategies based on conditions:
df = pd.DataFrame({
'value': [10, np.nan, 30, np.nan, 50],
'confidence': ['high', 'high', 'low', 'low', 'high']
})
# Fill based on confidence level
mask_high = df['confidence'] == 'high'
mask_low = df['confidence'] == 'low'
df.loc[mask_high, 'value'] = df.loc[mask_high, 'value'].ffill()
df.loc[mask_low, 'value'] = df.loc[mask_low, 'value'].fillna(0)
Using np.where() for inline conditional filling:
df['filled'] = np.where(
df['value'].isna() & (df['confidence'] == 'high'),
df['value'].ffill(),
df['value'].fillna(0)
)
Performance Considerations
For large datasets, choose efficient methods:
import time
df_large = pd.DataFrame({
'col': np.random.choice([1, 2, np.nan], size=1_000_000)
})
# Faster: constant fill
start = time.time()
df_large['col'].fillna(0)
print(f"Constant fill: {time.time() - start:.4f}s")
# Forward fill - also vectorized, but must track the last valid value
start = time.time()
df_large['col'].ffill()
print(f"Forward fill: {time.time() - start:.4f}s")
Use vectorized operations over iterative approaches:
# Efficient: vectorized mean
df['value'].fillna(df['value'].mean())
# Inefficient: row-by-row
for idx, row in df.iterrows():
    if pd.isna(row['value']):
        df.at[idx, 'value'] = df['value'].mean()
Handling Edge Cases
Account for completely empty columns or groups:
df = pd.DataFrame({
'A': [np.nan, np.nan, np.nan],
'B': [1, 2, 3]
})
# Check before filling: the mean of an all-NaN column is NaN, so fillna would do nothing
if df['A'].notna().any():
    df['A'] = df['A'].fillna(df['A'].mean())
else:
    df['A'] = df['A'].fillna(0)  # Fallback value
For group operations with empty groups:
df['sales'] = df.groupby('store')['sales'].transform(
lambda x: x.fillna(x.mean() if x.notna().any() else 0)
)
The choice of fill strategy depends on your data characteristics and business requirements. Time series data benefits from forward fill or interpolation, while cross-sectional data often uses statistical measures. Always validate that your chosen method preserves the underlying data distribution and doesn’t introduce bias into downstream analyses.
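One quick validation, sketched here with synthetic data: mean imputation preserves the mean but shrinks the spread, which is exactly the kind of distribution change worth checking before downstream analysis.

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(42)
values = pd.Series(rng.normal(50, 10, 1_000))
values.iloc[::10] = np.nan  # remove 10% of the observations

filled = values.fillna(values.mean())

# The imputed series is narrower than the observed one
print(f"std before: {values.std():.2f}, std after: {filled.std():.2f}")
```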