Pandas - Replace NaN Values in Column
• Pandas offers multiple methods for replacing NaN values including `fillna()`, `replace()`, and `interpolate()`, each suited for different data scenarios and replacement strategies
Key Insights
• Pandas offers multiple methods for replacing NaN values including fillna(), replace(), and interpolate(), each suited for different data scenarios and replacement strategies
• The choice between inplace modification and returning new DataFrames affects memory usage and code clarity—explicit assignment is generally preferred for maintainability
• Understanding when to use forward fill, backward fill, mean/median replacement, or custom logic depends on your data’s temporal nature and statistical requirements
Understanding NaN Values in Pandas
NaN (Not a Number) represents missing or undefined data in Pandas. These values can originate from data collection errors, merging operations, or explicit insertion. Before replacing NaN values, identify their presence:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'product': ['A', 'B', 'C', 'D', 'E'],
'price': [100, np.nan, 150, np.nan, 200],
'quantity': [5, 10, np.nan, 8, 12],
'date': pd.date_range('2024-01-01', periods=5)
})
# Check for NaN values
print(df.isna().sum())
print(f"\nTotal NaN values: {df.isna().sum().sum()}")
Output:
product 0
price 2
quantity 1
date 0
dtype: int64
Total NaN values: 3
Basic Replacement with fillna()
The fillna() method is the primary tool for NaN replacement. It accepts scalar values, dictionaries, Series, or DataFrames as replacement values.
# Replace with scalar value
df_scalar = df.copy()
df_scalar['price'] = df_scalar['price'].fillna(0)
print(df_scalar)
# Replace different columns with different values
df_dict = df.copy()
df_dict = df_dict.fillna({'price': 0, 'quantity': 1})
print(df_dict)
# Replace all NaN values in DataFrame
df_all = df.fillna(-1)
print(df_all)
The method returns a new DataFrame by default. Use inplace=True sparingly:
# Not recommended - harder to debug
df.fillna(0, inplace=True)
# Preferred - explicit and clear
df = df.fillna(0)
Statistical Replacement Methods
Replace NaN values with statistical measures like mean, median, or mode for numerical columns:
df = pd.DataFrame({
'temperature': [20.5, 21.0, np.nan, 22.5, np.nan, 23.0],
'humidity': [45, np.nan, 50, np.nan, 55, 60],
'pressure': [1013, 1012, np.nan, 1011, 1010, np.nan]
})
# Replace with mean
df_mean = df.copy()
df_mean['temperature'] = df_mean['temperature'].fillna(df_mean['temperature'].mean())
# Replace with median (more robust to outliers)
df_median = df.copy()
df_median['humidity'] = df_median['humidity'].fillna(df_median['humidity'].median())
# Apply to all numeric columns
df_all_mean = df.fillna(df.mean())
print(df_all_mean)
For categorical data, use mode:
df_cat = pd.DataFrame({
'category': ['A', 'B', np.nan, 'A', 'B', np.nan, 'A'],
'value': [1, 2, 3, 4, 5, 6, 7]
})
mode_value = df_cat['category'].mode()[0]
df_cat['category'] = df_cat['category'].fillna(mode_value)
print(df_cat)
Forward Fill and Backward Fill
For time-series or ordered data, propagate existing values forward or backward:
df_ts = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=8),
'stock_price': [100, 102, np.nan, np.nan, 105, np.nan, 108, 110]
})
# Forward fill - use last valid observation
df_ffill = df_ts.copy()
df_ffill['stock_price'] = df_ffill['stock_price'].fillna(method='ffill')
# Backward fill - use next valid observation
df_bfill = df_ts.copy()
df_bfill['stock_price'] = df_bfill['stock_price'].fillna(method='bfill')
# Limit propagation
df_limited = df_ts.copy()
df_limited['stock_price'] = df_limited['stock_price'].fillna(method='ffill', limit=1)
print("Original:\n", df_ts)
print("\nForward Fill:\n", df_ffill)
print("\nBackward Fill:\n", df_bfill)
print("\nLimited Fill:\n", df_limited)
Interpolation for Numeric Data
Interpolation estimates missing values based on surrounding data points:
df_interp = pd.DataFrame({
'x': [0, 1, 2, 3, 4, 5],
'y': [0, np.nan, np.nan, 9, 16, 25]
})
# Linear interpolation (default)
df_linear = df_interp.copy()
df_linear['y'] = df_linear['y'].interpolate()
# Polynomial interpolation
df_poly = df_interp.copy()
df_poly['y'] = df_poly['y'].interpolate(method='polynomial', order=2)
# Time-based interpolation
df_time = pd.DataFrame({
'timestamp': pd.date_range('2024-01-01', periods=6, freq='H'),
'sensor_reading': [10, np.nan, np.nan, 40, np.nan, 60]
})
df_time = df_time.set_index('timestamp')
df_time['sensor_reading'] = df_time['sensor_reading'].interpolate(method='time')
print("Linear:\n", df_linear)
print("\nTime-based:\n", df_time)
Conditional and Custom Replacement
Apply complex logic using apply(), where(), or mask():
df_cond = pd.DataFrame({
'region': ['North', 'South', 'East', 'West', 'North'],
'sales': [100, np.nan, 150, np.nan, 200],
'target': [120, 130, 140, 150, 160]
})
# Replace based on another column
df_cond['sales'] = df_cond.apply(
lambda row: row['target'] * 0.8 if pd.isna(row['sales']) else row['sales'],
axis=1
)
# Replace using where (keep values where condition is True)
df_where = df_cond.copy()
df_where['sales'] = df_where['sales'].where(df_where['sales'].notna(), df_where['target'])
# Group-based replacement
df_group = pd.DataFrame({
'category': ['A', 'A', 'B', 'B', 'A', 'B'],
'value': [10, np.nan, 20, np.nan, 30, np.nan]
})
df_group['value'] = df_group.groupby('category')['value'].transform(
lambda x: x.fillna(x.mean())
)
print(df_group)
Replacing with replace()
The replace() method handles both NaN and non-NaN value replacement:
df_replace = pd.DataFrame({
'status': ['active', np.nan, 'inactive', np.nan, 'active'],
'score': [85, np.nan, 72, np.nan, 90]
})
# Replace NaN with specific value
df_replace = df_replace.replace(np.nan, 'unknown')
# Replace multiple values including NaN
df_multi = df_replace.replace({
'unknown': 'pending',
'inactive': 'suspended'
})
print(df_multi)
Performance Considerations
For large DataFrames, choose efficient methods:
import time
# Create large DataFrame
large_df = pd.DataFrame({
'col1': np.random.choice([1, 2, np.nan], size=1000000),
'col2': np.random.choice([10, 20, np.nan], size=1000000)
})
# Method 1: fillna with dictionary (fast)
start = time.time()
result1 = large_df.fillna({'col1': 0, 'col2': 0})
print(f"fillna dict: {time.time() - start:.4f}s")
# Method 2: column-by-column (slower)
start = time.time()
result2 = large_df.copy()
result2['col1'] = result2['col1'].fillna(0)
result2['col2'] = result2['col2'].fillna(0)
print(f"Column-by-column: {time.time() - start:.4f}s")
# Method 3: apply (slowest)
start = time.time()
result3 = large_df.apply(lambda x: x.fillna(0))
print(f"apply: {time.time() - start:.4f}s")
Handling Edge Cases
Address special scenarios that commonly occur:
# Mixed type columns
df_mixed = pd.DataFrame({
'mixed': [1, 'text', np.nan, 3.5, np.nan]
})
df_mixed['mixed'] = df_mixed['mixed'].fillna('missing')
# All NaN column
df_all_nan = pd.DataFrame({
'empty': [np.nan, np.nan, np.nan]
})
# fillna won't fail, but mean() will return NaN
df_all_nan['empty'] = df_all_nan['empty'].fillna(0)
# Preserve data types
df_dtype = pd.DataFrame({
'integers': pd.Series([1, 2, np.nan, 4], dtype='Int64') # Nullable integer
})
df_dtype['integers'] = df_dtype['integers'].fillna(0)
print(df_dtype.dtypes)
Choose replacement strategies based on data characteristics: use statistical methods for random missing data, forward/backward fill for time series, and interpolation for continuous measurements. Always validate results to ensure replacements make domain sense.