Pandas - Fill NaN Values (fillna) with Examples
Key Insights
- The fillna() method provides multiple strategies for handling missing data, including constant values, forward/backward fill, and interpolation methods that preserve data integrity
- Different fill strategies suit different data types: forward fill works well for time series, mean/median for numerical data, and mode for categorical data
- Combining fillna() with groupby operations enables context-aware imputation, where missing values are filled based on related group statistics rather than global values
Understanding Missing Data in Pandas
Pandas represents missing data using NaN (Not a Number) from NumPy, None, or pd.NA. Before filling missing values, identify them using isna() or isnull():
import pandas as pd
import numpy as np
df = pd.DataFrame({
'product': ['A', 'B', 'C', 'D', 'E'],
'price': [10.5, np.nan, 15.0, np.nan, 20.0],
'quantity': [100, 200, np.nan, 150, np.nan],
'category': ['electronics', None, 'electronics', 'furniture', 'furniture']
})
print(df.isna().sum())
# product     0
# price       2
# quantity    2
# category    1
# dtype: int64
Fill with Constant Values
The simplest approach fills all NaN values with a constant:
# Fill all NaN with 0
df_filled = df.fillna(0)
# Fill with different values per column
df_filled = df.fillna({
'price': 0.0,
'quantity': 0,
'category': 'unknown'
})
# Fill only specific columns
df['price'] = df['price'].fillna(0)
For production systems, use inplace=True cautiously: it modifies the original DataFrame directly, and current pandas guidance favors plain reassignment instead:
df.fillna(0, inplace=True) # Modifies df directly
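A minimal sketch of the reassignment pattern, which leaves the source DataFrame untouched:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'price': [10.5, np.nan, 15.0]})

# Reassignment returns a filled copy instead of mutating df
df_filled = df.fillna(0)

print(df['price'].isna().sum())         # original still has its NaN
print(df_filled['price'].isna().sum())  # the filled copy has none
```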
Forward Fill and Backward Fill
Forward fill (ffill) propagates the last valid observation forward. Backward fill (bfill) uses the next valid observation:
df = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=6),
'temperature': [20.5, np.nan, np.nan, 22.0, np.nan, 23.5]
})
# Forward fill - propagate last valid value
df['temp_ffill'] = df['temperature'].ffill()
# Backward fill - use next valid value
df['temp_bfill'] = df['temperature'].bfill()
print(df)
# date temperature temp_ffill temp_bfill
# 0 2024-01-01 20.5 20.5 20.5
# 1 2024-01-02 NaN 20.5 22.0
# 2 2024-01-03 NaN 20.5 22.0
# 3 2024-01-04 22.0 22.0 22.0
# 4 2024-01-05 NaN 22.0 23.5
# 5 2024-01-06 23.5 23.5 23.5
Limit propagation to prevent filling across large gaps:
# Fill only 1 consecutive NaN
df['temperature'].ffill(limit=1)
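To see what limit does, here is a small sketch with a three-NaN run, where only the first gap after a valid value gets filled:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# limit=1 fills only the first NaN in each consecutive run
filled = s.ffill(limit=1)
print(filled.tolist())  # [1.0, 1.0, nan, nan, 5.0]
```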
Fill with Statistical Measures
Use mean, median, or mode for numerical imputation:
df = pd.DataFrame({
'sensor_1': [10, 20, np.nan, 40, 50],
'sensor_2': [15, np.nan, 35, np.nan, 55],
'status': ['active', 'active', None, 'inactive', 'active']
})
# Fill with mean
df['sensor_1'] = df['sensor_1'].fillna(df['sensor_1'].mean())
# Fill with median (robust to outliers)
df['sensor_2'] = df['sensor_2'].fillna(df['sensor_2'].median())
# Fill categorical with mode
mode_value = df['status'].mode()[0]
df['status'] = df['status'].fillna(mode_value)
print(df)
# sensor_1 sensor_2 status
# 0 10.0 15.0 active
# 1 20.0 35.0 active
# 2 30.0 35.0 active
# 3 40.0 35.0 inactive
# 4 50.0 55.0 active
Group-Based Filling
Fill missing values based on group statistics for context-aware imputation:
df = pd.DataFrame({
'store': ['A', 'A', 'A', 'B', 'B', 'B'],
'product': ['X', 'Y', 'X', 'X', 'Y', 'X'],
'sales': [100, np.nan, 150, 200, np.nan, 250]
})
# Fill with group mean
df['sales'] = df.groupby('store')['sales'].transform(
lambda x: x.fillna(x.mean())
)
print(df)
# store product sales
# 0 A X 100.0
# 1 A Y 125.0 # Filled with store A mean
# 2 A X 150.0
# 3 B X 200.0
# 4 B Y 225.0 # Filled with store B mean
# 5 B X 250.0
For multi-level grouping:
df['sales'] = df.groupby(['store', 'product'])['sales'].transform(
lambda x: x.fillna(x.mean())
)
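Note that with the sample data above, the (store, product) groups for product 'Y' contain only NaN, so they have no mean and stay NaN after the group fill. One way to handle this, sketched here, is chaining a global fallback:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'store': ['A', 'A', 'A', 'B', 'B', 'B'],
    'product': ['X', 'Y', 'X', 'X', 'Y', 'X'],
    'sales': [100, np.nan, 150, 200, np.nan, 250]
})

# Group-level fill: groups that are entirely NaN remain NaN
df['sales'] = df.groupby(['store', 'product'])['sales'].transform(
    lambda x: x.fillna(x.mean())
)

# Global fallback for anything the group fill could not cover
df['sales'] = df['sales'].fillna(df['sales'].mean())
print(df['sales'].tolist())  # [100.0, 175.0, 150.0, 200.0, 175.0, 250.0]
```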
Method Chaining and Multiple Strategies
Combine multiple fill strategies in sequence:
df = pd.DataFrame({
'A': [1, np.nan, np.nan, 4, 5],
'B': [np.nan, 2, np.nan, np.nan, 5],
'C': ['x', None, 'y', None, 'z']
})
# Chain multiple strategies
df_filled = (df
    .ffill(limit=1)                      # Forward fill first, at most 1 per gap
    .fillna(df.mean(numeric_only=True))  # Then fill numeric columns with their means
    .fillna('missing')                   # Finally, a constant for whatever remains
)
Interpolation for Time Series
For time-based data, interpolation provides smoother filling:
df = pd.DataFrame({
'timestamp': pd.date_range('2024-01-01', periods=10, freq='h'),
'value': [10, 12, np.nan, np.nan, 20, np.nan, 25, 28, np.nan, 32]
})
# Linear interpolation
df['linear'] = df['value'].interpolate(method='linear')
# Time-based interpolation
df_indexed = df.set_index('timestamp')
df_indexed['time_interp'] = df_indexed['value'].interpolate(method='time')
print(df_indexed[['value', 'linear', 'time_interp']])
#                      value     linear  time_interp
# 2024-01-01 00:00:00   10.0  10.000000    10.000000
# 2024-01-01 01:00:00   12.0  12.000000    12.000000
# 2024-01-01 02:00:00    NaN  14.666667    14.666667
# 2024-01-01 03:00:00    NaN  17.333333    17.333333
# 2024-01-01 04:00:00   20.0  20.000000    20.000000
Polynomial interpolation for non-linear trends (this method requires SciPy to be installed):
df['value'].interpolate(method='polynomial', order=2)
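The difference between method='linear' and method='time' only appears with irregularly spaced timestamps: 'linear' treats observations as equally spaced, while 'time' weights by elapsed time. A small illustration:

```python
import pandas as pd
import numpy as np

# Irregular spacing: 1 hour before the gap, 3 hours after it
idx = pd.to_datetime(['2024-01-01 00:00', '2024-01-01 01:00', '2024-01-01 04:00'])
s = pd.Series([10.0, np.nan, 40.0], index=idx)

print(s.interpolate(method='linear').iloc[1])  # 25.0 - midpoint, spacing ignored
print(s.interpolate(method='time').iloc[1])    # 17.5 - only 1/4 of the way in time
```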
Conditional Filling
Apply different fill strategies based on conditions:
df = pd.DataFrame({
'value': [10, np.nan, 30, np.nan, 50],
'confidence': ['high', 'high', 'low', 'low', 'high']
})
# Fill based on confidence level
mask_high = df['confidence'] == 'high'
mask_low = df['confidence'] == 'low'
df.loc[mask_high, 'value'] = df.loc[mask_high, 'value'].ffill()
df.loc[mask_low, 'value'] = df.loc[mask_low, 'value'].fillna(0)
Using np.where() for inline conditional filling:
df['filled'] = np.where(
df['value'].isna() & (df['confidence'] == 'high'),
df['value'].ffill(),
df['value'].fillna(0)
)
Performance Considerations
For large datasets, choose efficient methods:
import time
df_large = pd.DataFrame({
'col': np.random.choice([1, 2, np.nan], size=1_000_000)
})
# Faster: constant fill
start = time.time()
df_large['col'].fillna(0)
print(f"Constant fill: {time.time() - start:.4f}s")
# Forward fill - also vectorized, but must track the last valid value
start = time.time()
df_large['col'].ffill()
print(f"Forward fill: {time.time() - start:.4f}s")
Use vectorized operations over iterative approaches:
# Efficient: vectorized mean
df['value'].fillna(df['value'].mean())
# Inefficient: row-by-row
for idx, row in df.iterrows():
    if pd.isna(row['value']):
        df.at[idx, 'value'] = df['value'].mean()
Handling Edge Cases
Account for completely empty columns or groups:
df = pd.DataFrame({
'A': [np.nan, np.nan, np.nan],
'B': [1, 2, 3]
})
# Check before filling: the mean of an all-NaN column is NaN, so fillna would do nothing
if df['A'].notna().any():
    df['A'] = df['A'].fillna(df['A'].mean())
else:
    df['A'] = df['A'].fillna(0)  # Fallback value
For group operations with empty groups:
df['sales'] = df.groupby('store')['sales'].transform(
lambda x: x.fillna(x.mean() if x.notna().any() else 0)
)
The choice of fill strategy depends on your data characteristics and business requirements. Time series data benefits from forward fill or interpolation, while cross-sectional data often uses statistical measures. Always validate that your chosen method preserves the underlying data distribution and doesn’t introduce bias into downstream analyses.
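One quick validation, sketched here with synthetic data: mean imputation preserves the mean but shrinks the spread, which is exactly the kind of distribution change worth checking before downstream analysis.

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(42)
values = pd.Series(rng.normal(50, 10, 1_000))
values.iloc[::10] = np.nan  # remove 10% of the observations

filled = values.fillna(values.mean())

# The imputed series is narrower than the observed one
print(f"std before: {values.std():.2f}, std after: {filled.std():.2f}")
```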