How to Fill NaN Values in Pandas
Key Insights
- The fillna() method is your primary tool for handling missing data, supporting static values, dictionaries for column-specific fills, and statistical aggregations like mean or median.
- Forward fill (ffill) and backward fill (bfill) are essential for time-series data where missing values should inherit from neighboring observations.
- Always understand why data is missing before choosing a fill strategy: the wrong approach can introduce bias and corrupt your analysis.
Introduction to Missing Data in Pandas
Missing data is inevitable in real-world datasets. Whether it’s a sensor that failed to record a reading, a user who skipped a form field, or data that simply doesn’t exist for certain combinations, you’ll encounter NaN (Not a Number) values constantly. Pandas represents missing data as NaN for floating-point columns and None or NaT (Not a Time) for object and datetime columns respectively.
Ignoring missing data leads to problems. Many statistical operations skip NaN values silently, potentially skewing your results. Machine learning models typically can’t handle missing values at all. And if you’re not careful, NaN values propagate through calculations, corrupting downstream results.
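Both behaviors are easy to demonstrate in a quick check (a minimal illustration):

```python
import numpy as np
import pandas as pd

# Arithmetic with NaN propagates: the result is also NaN
print(np.nan + 1)  # nan

# Many aggregations skip NaN silently (skipna=True by default)
s = pd.Series([1.0, np.nan, 3.0])
print(s.sum())   # 4.0 -- the NaN was ignored, not treated as zero
print(s.mean())  # 2.0 -- averaged over only the two non-missing values
```

That silent skipping is convenient, but it means a mean computed over a column with many gaps can be badly unrepresentative without any warning.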
The good news: Pandas provides robust tools for detecting and filling missing values. This article covers the practical techniques you’ll use daily.
Detecting NaN Values
Before filling missing data, you need to know where it exists. Pandas offers several methods for this.
import pandas as pd
import numpy as np
# Sample dataset with missing values
df = pd.DataFrame({
    'name': ['Alice', 'Bob', None, 'Diana', 'Eve'],
    'age': [25, np.nan, 30, np.nan, 28],
    'salary': [50000, 60000, np.nan, 75000, np.nan],
    'department': ['Engineering', None, 'Sales', 'Engineering', 'Marketing']
})
# Check for missing values
print(df.isna())
This returns a boolean DataFrame showing True where values are missing. More useful is counting missing values per column:
# Count missing values per column
print(df.isna().sum())
Output:
name 1
age 2
salary 2
department 1
dtype: int64
The isnull() method is an alias for isna()—they’re identical. Use whichever you find more readable.
For a quick overview of your DataFrame including non-null counts, use info():
df.info()
This shows you the total entries and non-null count per column, making it easy to spot columns with missing data at a glance.
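It's also often useful to pull out the rows that contain any missing value; one way to do that, shown here on a small stand-in DataFrame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['Alice', 'Bob', None],
    'age': [25, np.nan, 30],
})

# Select rows where at least one column is missing
rows_with_nan = df[df.isna().any(axis=1)]
print(rows_with_nan)  # the Bob row (missing age) and the unnamed row
```

Inspecting these rows directly can reveal patterns — for example, that several fields tend to be missing together.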
Filling with Static Values Using fillna()
The fillna() method is your workhorse for replacing missing values. The simplest use case fills all NaN values with a single constant:
# Fill all NaN with zero
df_filled = df.fillna(0)
print(df_filled)
This works, but it’s rarely what you want. Filling a name column with 0 makes no sense. Instead, use a dictionary to specify different fill values per column:
# Column-specific fill values
fill_values = {
    'name': 'Unknown',
    'age': 0,
    'salary': 0,
    'department': 'Unassigned'
}
df_filled = df.fillna(fill_values)
print(df_filled)
Output:
name age salary department
0 Alice 25.0 50000.0 Engineering
1 Bob 0.0 60000.0 Unassigned
2 Unknown 30.0 0.0 Sales
3 Diana 0.0 75000.0 Engineering
4 Eve 28.0 0.0 Marketing
You can also fill a single column directly:
df['department'] = df['department'].fillna('Unassigned')
This targeted approach gives you precise control over how each column handles missing data.
Filling with Statistical Values
For numerical columns, filling with statistical measures often makes more sense than arbitrary constants. Common choices are mean, median, and mode.
# Create a DataFrame with numerical data
df_numeric = pd.DataFrame({
    'temperature': [72, 75, np.nan, 68, np.nan, 71, 74],
    'humidity': [45, np.nan, 50, 48, 52, np.nan, 47]
})
# Fill with mean
df_mean = df_numeric.copy()
df_mean['temperature'] = df_mean['temperature'].fillna(df_mean['temperature'].mean())
df_mean['humidity'] = df_mean['humidity'].fillna(df_mean['humidity'].mean())
print(df_mean)
For skewed distributions, median is often better than mean because it’s less affected by outliers:
# Fill with median
df_median = df_numeric.copy()
df_median['temperature'] = df_median['temperature'].fillna(df_median['temperature'].median())
print(df_median)
For categorical data, mode (the most frequent value) is appropriate:
# Fill categorical with mode
df['department'] = df['department'].fillna(df['department'].mode()[0])
Note the [0] index—mode() returns a Series because there can be multiple modes. We take the first one.
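One edge case worth guarding against: mode() drops NaN by default, so on an all-NaN column it returns an empty Series and indexing [0] raises a KeyError. A defensive sketch (the 'Unassigned' fallback is an assumption — pick whatever default suits your data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'department': [np.nan, np.nan]})

# mode() ignores NaN, so an all-NaN column yields an empty Series
mode_vals = df['department'].mode()
fill_value = mode_vals[0] if not mode_vals.empty else 'Unassigned'

df['department'] = df['department'].fillna(fill_value)
print(df['department'].tolist())  # ['Unassigned', 'Unassigned']
```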
A cleaner approach for filling multiple columns with their respective means:
# Fill all numeric columns with their means
numeric_cols = df_numeric.select_dtypes(include=[np.number]).columns
df_numeric[numeric_cols] = df_numeric[numeric_cols].fillna(df_numeric[numeric_cols].mean())
Forward Fill and Backward Fill
Time-series data often has a natural ordering where missing values should inherit from neighboring observations. Forward fill (ffill) propagates the last valid observation forward, while backward fill (bfill) uses the next valid observation.
# Time-series example
dates = pd.date_range('2024-01-01', periods=7, freq='D')
df_ts = pd.DataFrame({
    'date': dates,
    'stock_price': [100, np.nan, np.nan, 103, 105, np.nan, 108]
})
# Forward fill - carry last known value forward
df_ffill = df_ts.copy()
df_ffill['stock_price'] = df_ffill['stock_price'].ffill()
print("Forward Fill:")
print(df_ffill)
Output:
Forward Fill:
date stock_price
0 2024-01-01 100.0
1 2024-01-02 100.0
2 2024-01-03 100.0
3 2024-01-04 103.0
4 2024-01-05 105.0
5 2024-01-06 105.0
6 2024-01-07 108.0
# Backward fill - use next known value
df_bfill = df_ts.copy()
df_bfill['stock_price'] = df_bfill['stock_price'].bfill()
print("\nBackward Fill:")
print(df_bfill)
Output:
Backward Fill:
date stock_price
0 2024-01-01 100.0
1 2024-01-02 103.0
2 2024-01-03 103.0
3 2024-01-04 103.0
4 2024-01-05 105.0
5 2024-01-06 108.0
6 2024-01-07 108.0
You can limit how far the fill propagates:
# Limit forward fill to 1 consecutive NaN
df_limited = df_ts.copy()
df_limited['stock_price'] = df_limited['stock_price'].ffill(limit=1)
print(df_limited)
This fills only the first NaN in each gap, leaving subsequent ones as NaN. Useful when you don’t want to assume values over long gaps.
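Also note that ffill leaves leading NaNs untouched (there is no earlier value to carry forward), and bfill leaves trailing ones. A common pattern, sketched below, chains the two so every gap gets covered:

```python
import pandas as pd
import numpy as np

s = pd.Series([np.nan, 2.0, np.nan, 4.0, np.nan])

# ffill covers interior and trailing gaps; bfill then covers the leading NaN
filled = s.ffill().bfill()
print(filled.tolist())  # [2.0, 2.0, 2.0, 4.0, 4.0]
```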
Interpolation Methods
Interpolation estimates missing values based on surrounding data points. It’s particularly powerful for time-series and continuous numerical data.
# Linear interpolation
df_interp = df_ts.copy()
df_interp['stock_price'] = df_interp['stock_price'].interpolate(method='linear')
print("Linear Interpolation:")
print(df_interp)
Output:
Linear Interpolation:
date stock_price
0 2024-01-01 100.000000
1 2024-01-02 101.000000
2 2024-01-03 102.000000
3 2024-01-04 103.000000
4 2024-01-05 105.000000
5 2024-01-06 106.500000
6 2024-01-07 108.000000
Linear interpolation draws a straight line between known points. For data with datetime indices, time-based interpolation accounts for irregular intervals:
# Set date as index for time-aware interpolation
df_time = df_ts.set_index('date')
df_time['stock_price'] = df_time['stock_price'].interpolate(method='time')
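The difference shows up with irregular spacing. In this small illustration, the missing day sits one third of the way through a three-day gap, so time-based interpolation weights it accordingly while linear interpolation simply splits the difference by position:

```python
import pandas as pd
import numpy as np

idx = pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-04'])
s = pd.Series([10.0, np.nan, 40.0], index=idx)

print(s.interpolate(method='linear').iloc[1])  # 25.0 -- midpoint by position
print(s.interpolate(method='time').iloc[1])    # 20.0 -- 1 day into a 3-day gap
```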
Other interpolation methods include:
- polynomial: Fits a polynomial curve
- spline: Uses spline interpolation for smoother curves
- nearest: Uses the nearest valid value
# Polynomial interpolation (order 2)
df_poly = df_ts.copy()
df_poly['stock_price'] = df_poly['stock_price'].interpolate(method='polynomial', order=2)
Choose the method based on your data’s expected behavior. Linear works for most cases; polynomial or spline can capture curvature in trends.
Best Practices and When to Use Each Method
Choosing the right fill strategy depends on your data and analysis goals. Here’s a practical decision framework:
Use static values when:
- You have a meaningful default (e.g., 0 for counts, “Unknown” for categories)
- Missing values represent a specific condition you want to encode
Use statistical values (mean/median/mode) when:
- Data is missing at random
- You want to preserve the overall distribution
- Prefer median for skewed data and mean for roughly normal distributions
Use forward/backward fill when:
- Data has temporal ordering
- The last known value is a reasonable estimate
- Working with status data that persists until changed
Use interpolation when:
- Data represents continuous measurements
- You expect smooth transitions between points
- You're working with a time series that follows regular patterns
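One way to put this framework into code is a small dispatch helper. The fill_missing function below is a hypothetical sketch, not a pandas API — the strategy names and the literal-value fallback are assumptions you'd adapt to your own data:

```python
import pandas as pd
import numpy as np

def fill_missing(df, strategies):
    """Fill each column according to a named strategy ('mean', 'median',
    'mode', 'ffill') or a literal fill value. Returns a copy."""
    out = df.copy()
    for col, how in strategies.items():
        if how == 'mean':
            out[col] = out[col].fillna(out[col].mean())
        elif how == 'median':
            out[col] = out[col].fillna(out[col].median())
        elif how == 'mode':
            out[col] = out[col].fillna(out[col].mode()[0])
        elif how == 'ffill':
            out[col] = out[col].ffill()
        else:  # treat anything else as a literal fill value
            out[col] = out[col].fillna(how)
    return out

df = pd.DataFrame({
    'age': [25.0, np.nan, 30.0],
    'department': ['Sales', None, 'Sales'],
})
filled = fill_missing(df, {'age': 'median', 'department': 'mode'})
print(filled)  # age gap becomes 27.5, department gap becomes 'Sales'
```

Centralizing the choices in one mapping also documents them, which matters when you revisit the analysis later.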
The inplace parameter: Many tutorials show df.fillna(value, inplace=True). I recommend avoiding this. It modifies your data directly, making debugging harder and breaking method chaining. Instead, assign the result:
# Prefer this
df = df.fillna(0)
# Over this
df.fillna(0, inplace=True)
Method chaining: You can chain multiple operations cleanly:
df_clean = (df
    .fillna({'name': 'Unknown', 'department': 'Unassigned'})
    .assign(age=lambda x: x['age'].fillna(x['age'].median()))
    .assign(salary=lambda x: x['salary'].fillna(x['salary'].mean()))
)
Don’t fill blindly: Before filling, ask why data is missing. If users skip an “income” field, filling with the mean might introduce bias—those who skip may have systematically different incomes. Sometimes the right answer is to drop rows with missing data, or to encode missingness as a feature itself.
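Encoding missingness as a feature is straightforward: record an indicator column before you fill, so downstream models can still see which values were imputed. A minimal sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'income': [50000.0, np.nan, 72000.0, np.nan]})

# Flag which rows were missing *before* filling
df['income_missing'] = df['income'].isna().astype(int)
df['income'] = df['income'].fillna(df['income'].median())
print(df)  # the two imputed rows carry income_missing == 1
```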
Missing data handling isn’t just a technical step—it’s a modeling decision that affects your results. Choose your fill strategy deliberately, document your choices, and validate that your approach makes sense for your specific analysis.