Pandas - Replace NaN Values in Column

Key Insights

• Pandas offers multiple methods for replacing NaN values including fillna(), replace(), and interpolate(), each suited for different data scenarios and replacement strategies • The choice between inplace modification and returning new DataFrames affects memory usage and code clarity—explicit assignment is generally preferred for maintainability • Understanding when to use forward fill, backward fill, mean/median replacement, or custom logic depends on your data’s temporal nature and statistical requirements

Understanding NaN Values in Pandas

NaN (Not a Number) represents missing or undefined data in Pandas. These values can originate from data collection errors, merging operations, or explicit insertion. Before replacing NaN values, identify their presence:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D', 'E'],
    'price': [100, np.nan, 150, np.nan, 200],
    'quantity': [5, 10, np.nan, 8, 12],
    'date': pd.date_range('2024-01-01', periods=5)
})

# Check for NaN values
print(df.isna().sum())
print(f"\nTotal NaN values: {df.isna().sum().sum()}")

Output:

product     0
price       2
quantity    1
date        0
dtype: int64

Total NaN values: 3

Basic Replacement with fillna()

The fillna() method is the primary tool for NaN replacement. It accepts scalar values, dictionaries, Series, or DataFrames as replacement values.

# Replace with scalar value
df_scalar = df.copy()
df_scalar['price'] = df_scalar['price'].fillna(0)
print(df_scalar)

# Replace different columns with different values
df_dict = df.copy()
df_dict = df_dict.fillna({'price': 0, 'quantity': 1})
print(df_dict)

# Replace all NaN values in DataFrame
df_all = df.fillna(-1)
print(df_all)

The method returns a new DataFrame by default. Use inplace=True sparingly:

# Not recommended - harder to debug
df.fillna(0, inplace=True)

# Preferred - explicit and clear
df = df.fillna(0)

Statistical Replacement Methods

Replace NaN values with statistical measures like mean, median, or mode for numerical columns:

df = pd.DataFrame({
    'temperature': [20.5, 21.0, np.nan, 22.5, np.nan, 23.0],
    'humidity': [45, np.nan, 50, np.nan, 55, 60],
    'pressure': [1013, 1012, np.nan, 1011, 1010, np.nan]
})

# Replace with mean
df_mean = df.copy()
df_mean['temperature'] = df_mean['temperature'].fillna(df_mean['temperature'].mean())

# Replace with median (more robust to outliers)
df_median = df.copy()
df_median['humidity'] = df_median['humidity'].fillna(df_median['humidity'].median())

# Apply to all numeric columns
df_all_mean = df.fillna(df.mean())
print(df_all_mean)

For categorical data, use mode:

df_cat = pd.DataFrame({
    'category': ['A', 'B', np.nan, 'A', 'B', np.nan, 'A'],
    'value': [1, 2, 3, 4, 5, 6, 7]
})

mode_value = df_cat['category'].mode()[0]
df_cat['category'] = df_cat['category'].fillna(mode_value)
print(df_cat)

Forward Fill and Backward Fill

For time-series or ordered data, propagate existing values forward or backward:

df_ts = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=8),
    'stock_price': [100, 102, np.nan, np.nan, 105, np.nan, 108, 110]
})

# Forward fill - use last valid observation
df_ffill = df_ts.copy()
df_ffill['stock_price'] = df_ffill['stock_price'].fillna(method='ffill')

# Backward fill - use next valid observation
df_bfill = df_ts.copy()
df_bfill['stock_price'] = df_bfill['stock_price'].fillna(method='bfill')

# Limit propagation
df_limited = df_ts.copy()
df_limited['stock_price'] = df_limited['stock_price'].fillna(method='ffill', limit=1)

print("Original:\n", df_ts)
print("\nForward Fill:\n", df_ffill)
print("\nBackward Fill:\n", df_bfill)
print("\nLimited Fill:\n", df_limited)

Interpolation for Numeric Data

Interpolation estimates missing values based on surrounding data points:

df_interp = pd.DataFrame({
    'x': [0, 1, 2, 3, 4, 5],
    'y': [0, np.nan, np.nan, 9, 16, 25]
})

# Linear interpolation (default)
df_linear = df_interp.copy()
df_linear['y'] = df_linear['y'].interpolate()

# Polynomial interpolation
df_poly = df_interp.copy()
df_poly['y'] = df_poly['y'].interpolate(method='polynomial', order=2)

# Time-based interpolation
df_time = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=6, freq='H'),
    'sensor_reading': [10, np.nan, np.nan, 40, np.nan, 60]
})
df_time = df_time.set_index('timestamp')
df_time['sensor_reading'] = df_time['sensor_reading'].interpolate(method='time')

print("Linear:\n", df_linear)
print("\nTime-based:\n", df_time)

Conditional and Custom Replacement

Apply complex logic using apply(), where(), or mask():

df_cond = pd.DataFrame({
    'region': ['North', 'South', 'East', 'West', 'North'],
    'sales': [100, np.nan, 150, np.nan, 200],
    'target': [120, 130, 140, 150, 160]
})

# Replace based on another column
df_cond['sales'] = df_cond.apply(
    lambda row: row['target'] * 0.8 if pd.isna(row['sales']) else row['sales'],
    axis=1
)

# Replace using where (keep values where condition is True)
df_where = df_cond.copy()
df_where['sales'] = df_where['sales'].where(df_where['sales'].notna(), df_where['target'])

# Group-based replacement
df_group = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'A', 'B'],
    'value': [10, np.nan, 20, np.nan, 30, np.nan]
})

df_group['value'] = df_group.groupby('category')['value'].transform(
    lambda x: x.fillna(x.mean())
)
print(df_group)

Replacing with replace()

The replace() method handles both NaN and non-NaN value replacement:

df_replace = pd.DataFrame({
    'status': ['active', np.nan, 'inactive', np.nan, 'active'],
    'score': [85, np.nan, 72, np.nan, 90]
})

# Replace NaN with specific value
df_replace = df_replace.replace(np.nan, 'unknown')

# Replace multiple values including NaN
df_multi = df_replace.replace({
    'unknown': 'pending',
    'inactive': 'suspended'
})

print(df_multi)

Performance Considerations

For large DataFrames, choose efficient methods:

import time

# Create large DataFrame
large_df = pd.DataFrame({
    'col1': np.random.choice([1, 2, np.nan], size=1000000),
    'col2': np.random.choice([10, 20, np.nan], size=1000000)
})

# Method 1: fillna with dictionary (fast)
start = time.time()
result1 = large_df.fillna({'col1': 0, 'col2': 0})
print(f"fillna dict: {time.time() - start:.4f}s")

# Method 2: column-by-column (slower)
start = time.time()
result2 = large_df.copy()
result2['col1'] = result2['col1'].fillna(0)
result2['col2'] = result2['col2'].fillna(0)
print(f"Column-by-column: {time.time() - start:.4f}s")

# Method 3: apply (slowest)
start = time.time()
result3 = large_df.apply(lambda x: x.fillna(0))
print(f"apply: {time.time() - start:.4f}s")

Handling Edge Cases

Address special scenarios that commonly occur:

# Mixed type columns
df_mixed = pd.DataFrame({
    'mixed': [1, 'text', np.nan, 3.5, np.nan]
})
df_mixed['mixed'] = df_mixed['mixed'].fillna('missing')

# All NaN column
df_all_nan = pd.DataFrame({
    'empty': [np.nan, np.nan, np.nan]
})
# fillna won't fail, but mean() will return NaN
df_all_nan['empty'] = df_all_nan['empty'].fillna(0)

# Preserve data types
df_dtype = pd.DataFrame({
    'integers': pd.Series([1, 2, np.nan, 4], dtype='Int64')  # Nullable integer
})
df_dtype['integers'] = df_dtype['integers'].fillna(0)
print(df_dtype.dtypes)

Choose replacement strategies based on data characteristics: use statistical methods for random missing data, forward/backward fill for time series, and interpolation for continuous measurements. Always validate results to ensure replacements make domain sense.