How to Fill NaN with Mean in Pandas
Key Insights
- Mean imputation is fast and preserves dataset size, but it reduces variance and can distort relationships between variables—use it for data missing completely at random (MCAR) with roughly normal distributions.
- For production ML pipelines, use scikit-learn's SimpleImputer instead of raw Pandas operations to ensure consistent imputation between training and test sets.
- Group-wise mean filling using groupby().transform() often produces more accurate imputations than global means when your data has natural categorical segments.
Introduction
Missing data is inevitable. Whether you’re working with survey responses, sensor readings, or scraped web data, you’ll encounter NaN values that need handling before analysis or modeling. Mean imputation—replacing missing values with the column average—remains one of the most common strategies because it’s simple, fast, and preserves your dataset’s size.
But simplicity comes with trade-offs. Mean imputation works well when data is missing completely at random and follows a roughly normal distribution. It falls apart when you have skewed distributions, outliers pulling the mean away from typical values, or data that’s missing for systematic reasons. Before reaching for fillna(), ask yourself: why is this data missing, and will the mean actually represent a reasonable estimate?
This article covers the practical mechanics of mean imputation in Pandas, from single columns to group-wise filling, plus the scikit-learn approach you should use in production ML pipelines.
Understanding NaN Values in Pandas
Pandas uses NaN (Not a Number) from NumPy to represent missing data. For most operations, NaN values propagate—meaning any calculation involving NaN returns NaN unless you explicitly handle it. Pandas also recognizes None and pd.NA as missing values, converting them to NaN in numeric contexts.
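A quick sketch of that propagation behavior, using a small throwaway Series (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Element-wise arithmetic propagates NaN
print(s + 1)  # the middle element stays NaN

# Reductions skip NaN by default (skipna=True)...
print(s.sum())              # 4.0
# ...unless you ask them not to
print(s.sum(skipna=False))  # nan

# None is treated as missing in numeric contexts
print(pd.Series([1.0, None]).isna().tolist())  # [False, True]
```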
Before imputing, you need to understand your missing data landscape. The isna() and isnull() methods (they’re identical—use whichever you prefer) help you detect and count missing values:
import pandas as pd
import numpy as np
# Create sample DataFrame with missing values
df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D', 'E', 'F'],
    'price': [29.99, np.nan, 45.50, np.nan, 32.00, 55.00],
    'quantity': [100, 150, np.nan, 200, np.nan, 175],
    'category': ['electronics', 'electronics', 'clothing', 'clothing', 'electronics', 'clothing']
})
print(df)
print("\nMissing values per column:")
print(df.isna().sum())
print(f"\nTotal missing values: {df.isna().sum().sum()}")
print(f"Percentage missing: {df.isna().sum().sum() / df.size * 100:.1f}%")
Output:
  product  price  quantity     category
0       A  29.99     100.0  electronics
1       B    NaN     150.0  electronics
2       C  45.50       NaN     clothing
3       D    NaN     200.0     clothing
4       E  32.00       NaN  electronics
5       F  55.00     175.0     clothing
Missing values per column:
product 0
price 2
quantity 2
category 0
dtype: int64
Total missing values: 4
Percentage missing: 16.7%
This quick diagnostic tells you which columns need attention and how severe the missing data problem is. If a column has more than 30-40% missing values, mean imputation becomes increasingly questionable—you’re essentially fabricating a large portion of your data.
Filling NaN with Mean for a Single Column
The most straightforward approach combines fillna() with mean(). This calculates the column’s mean (ignoring NaN values) and uses it to replace all missing entries:
# Fill missing prices with the mean price
mean_price = df['price'].mean()
print(f"Mean price: {mean_price:.2f}")
df['price_filled'] = df['price'].fillna(mean_price)
print(df[['product', 'price', 'price_filled']])
Output:
Mean price: 40.62
  product  price  price_filled
0       A  29.99         29.99
1       B    NaN         40.62
2       C  45.50         45.50
3       D    NaN         40.62
4       E  32.00         32.00
5       F  55.00         55.00
You can also modify the column in place, though I recommend creating a new column during exploration so you can compare results:
# In-place modification (use with caution)
df['price'] = df['price'].fillna(df['price'].mean())
# Avoid the inplace=True parameter on a selected column: under pandas'
# copy-on-write behavior it can act on a temporary copy, and chained
# calls like df['quantity'].fillna(..., inplace=True) raise warnings.
# Plain assignment is the reliable pattern:
df['quantity'] = df['quantity'].fillna(df['quantity'].mean())
One subtle issue: writing df['price'].fillna(df['price'].mean()) inline recomputes the mean every time the line runs and hides the value you actually imputed. For clarity and debugging, calculate the mean once, store it in a variable, and log it before filling.
Filling NaN with Mean for Multiple or All Columns
When you have missing values across multiple numeric columns, you can apply mean imputation to all of them simultaneously:
# Reset our DataFrame
df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D', 'E', 'F'],
    'price': [29.99, np.nan, 45.50, np.nan, 32.00, 55.00],
    'quantity': [100, 150, np.nan, 200, np.nan, 175],
    'rating': [4.5, 3.8, np.nan, 4.2, 4.0, np.nan],
    'category': ['electronics', 'electronics', 'clothing', 'clothing', 'electronics', 'clothing']
})
# Fill all numeric columns with their respective means
df_filled = df.fillna(df.mean(numeric_only=True))
print(df_filled)
The numeric_only=True parameter ensures Pandas only calculates means for numeric columns, avoiding errors with string or categorical data.
For more control, explicitly select which columns to impute:
# Specify columns to fill
cols_to_fill = ['price', 'quantity']
df[cols_to_fill] = df[cols_to_fill].fillna(df[cols_to_fill].mean())
# Or use select_dtypes to automatically get numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
This explicit approach is better for production code—it makes your intentions clear and prevents accidentally imputing columns that shouldn’t be touched.
Filling NaN with Group-wise Mean
Global mean imputation treats all rows identically, but your data often has meaningful segments. A missing price for an electronics item should probably be filled with the average electronics price, not the average across all categories.
The groupby().transform() pattern handles this elegantly:
df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'price': [29.99, np.nan, 145.50, np.nan, 32.00, 155.00, np.nan, 160.00],
    'category': ['electronics', 'electronics', 'clothing', 'clothing',
                 'electronics', 'clothing', 'electronics', 'clothing']
})
# Calculate group means
print("Mean price by category:")
print(df.groupby('category')['price'].mean())
# Fill with category-specific means
df['price_filled'] = df.groupby('category')['price'].transform(
    lambda x: x.fillna(x.mean())
)
print("\nDataFrame with group-wise filling:")
print(df)
Output:
Mean price by category:
category
clothing       153.500
electronics     30.995
Name: price, dtype: float64
DataFrame with group-wise filling:
  product   price     category  price_filled
0       A   29.99  electronics        29.990
1       B     NaN  electronics        30.995
2       C  145.50     clothing       145.500
3       D     NaN     clothing       153.500
4       E   32.00  electronics        32.000
5       F  155.00     clothing       155.000
6       G     NaN  electronics        30.995
7       H  160.00     clothing       160.000
Notice how electronics products get filled with ~31 while clothing products get filled with ~153.50. This produces much more realistic imputations than the global mean of ~104.50 would.
The transform() method is crucial here—it returns a Series with the same index as the original, making it compatible with fillna(). Using apply() instead would return aggregated results that don’t align with your DataFrame.
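To make the difference concrete, here is a small sketch contrasting aggregation with transform on the same grouped column (toy data, illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'price': [29.99, np.nan, 145.50, np.nan],
    'category': ['electronics', 'electronics', 'clothing', 'clothing'],
})

# mean() aggregates: one row per group, indexed by category
agg = df.groupby('category')['price'].mean()
print(agg.shape)  # (2,)

# transform('mean') broadcasts: one row per original row, same index as df
broadcast = df.groupby('category')['price'].transform('mean')
print(broadcast.shape)  # (4,)

# Because the indexes align, fillna can consume it directly
df['price'] = df['price'].fillna(broadcast)
print(df['price'].tolist())  # [29.99, 29.99, 145.5, 145.5]
```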
Using SimpleImputer from Scikit-learn
For machine learning pipelines, raw Pandas operations have a critical flaw: they calculate the mean on whatever data you give them. If you fit a model on training data and then apply the same imputation logic to test data, you’ll use the test set’s mean—introducing data leakage.
Scikit-learn’s SimpleImputer solves this by separating the “fit” (learning the mean) from the “transform” (applying it):
from sklearn.impute import SimpleImputer
df = pd.DataFrame({
    'feature_1': [1.0, 2.0, np.nan, 4.0, 5.0],
    'feature_2': [10.0, np.nan, 30.0, np.nan, 50.0],
    'feature_3': [100.0, 200.0, 300.0, np.nan, 500.0]
})
# Create and fit the imputer
imputer = SimpleImputer(strategy='mean')
imputer.fit(df)
# Check learned statistics
print("Learned means:", imputer.statistics_)
# Transform the data
df_imputed = pd.DataFrame(
    imputer.transform(df),
    columns=df.columns,
    index=df.index  # preserve the original row index
)
print("\nImputed DataFrame:")
print(df_imputed)
The real power emerges in train/test scenarios:
# Simulate train/test split
train_df = pd.DataFrame({
    'feature_1': [1.0, 2.0, np.nan, 4.0],
    'feature_2': [10.0, np.nan, 30.0, 40.0]
})
test_df = pd.DataFrame({
    'feature_1': [np.nan, 6.0],
    'feature_2': [np.nan, 60.0]
})
# Fit on training data only
imputer = SimpleImputer(strategy='mean')
imputer.fit(train_df)
# Transform both sets using training means
train_imputed = imputer.transform(train_df)
test_imputed = imputer.transform(test_df)
print(f"Training mean for feature_1: {imputer.statistics_[0]:.2f}")
print(f"Test set uses same mean: {test_imputed[0, 0]:.2f}")
You can also integrate SimpleImputer into scikit-learn pipelines for cleaner, more maintainable code.
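A minimal sketch of that integration, assuming a standard impute-then-scale preprocessing chain (the step names and toy data are arbitrary):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Bundle imputation and scaling into a single estimator
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
])

X_train = np.array([[1.0, 10.0], [2.0, np.nan], [np.nan, 30.0]])
X_test = np.array([[np.nan, np.nan]])

# fit() learns the imputation means AND the scaling statistics
# from the training data only
pipe.fit(X_train)

# Test rows are imputed with training means, then scaled with training stats
print(pipe.transform(X_test))
```

Because the fully missing test row is imputed with the training means and those means are also the scaler's centers, it comes out as all zeros here. The key point is that nothing about the test set influences the learned preprocessing.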
Best Practices and Considerations
When mean imputation works well:
- Data is missing completely at random (MCAR)
- The distribution is roughly symmetric
- Missing values are a small percentage of the data (under 10-15%)
- You need a quick baseline before trying sophisticated methods
When to avoid it:
- Skewed distributions (use median instead)
- Presence of significant outliers
- Data missing not at random (MNAR)—the missingness itself contains information
- High percentage of missing values
Alternatives to consider:
- Median: More robust to outliers and skewed data
- Mode: For categorical data or highly discrete numeric data
- Forward/backward fill: For time series data
- K-nearest neighbors imputation: Uses similar rows to estimate missing values
- Multiple imputation: Creates several imputed datasets to capture uncertainty
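For example, a single outlier is enough to pull the mean far from typical values while leaving the median nearly untouched:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, np.nan])  # one extreme outlier

print(s.mean())    # 22.0, dragged far above typical values
print(s.median())  # 3.0, barely affected

# Filling with the median keeps the imputed value near the bulk of the data
filled = s.fillna(s.median())
print(filled.iloc[-1])  # 3.0
```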
Document everything. Record which columns you imputed, what strategy you used, and what the imputed values were. This information is critical for reproducing results and debugging model behavior later.
Mean imputation is a tool, not a solution. Use it thoughtfully, understand its limitations, and always validate that your imputed data makes domain sense.