How to Fill NaN with Mean in Pandas
Key Insights
- Mean imputation is fast and preserves dataset size, but it reduces variance and can distort relationships between variables—use it for data missing completely at random (MCAR) with roughly normal distributions.
- For production ML pipelines, use scikit-learn's SimpleImputer instead of raw Pandas operations to ensure consistent imputation between training and test sets.
- Group-wise mean filling using groupby().transform() often produces more accurate imputations than global means when your data has natural categorical segments.
Introduction
Missing data is inevitable. Whether you’re working with survey responses, sensor readings, or scraped web data, you’ll encounter NaN values that need handling before analysis or modeling. Mean imputation—replacing missing values with the column average—remains one of the most common strategies because it’s simple, fast, and preserves your dataset’s size.
But simplicity comes with trade-offs. Mean imputation works well when data is missing completely at random and follows a roughly normal distribution. It falls apart when you have skewed distributions, outliers pulling the mean away from typical values, or data that’s missing for systematic reasons. Before reaching for fillna(), ask yourself: why is this data missing, and will the mean actually represent a reasonable estimate?
This article covers the practical mechanics of mean imputation in Pandas, from single columns to group-wise filling, plus the scikit-learn approach you should use in production ML pipelines.
Understanding NaN Values in Pandas
Pandas uses NaN (Not a Number) from NumPy to represent missing data. For most operations, NaN values propagate—meaning any calculation involving NaN returns NaN unless you explicitly handle it. Pandas also recognizes None and pd.NA as missing values, converting them to NaN in numeric contexts.
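A quick sketch of that propagation behavior, using a small throwaway Series (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Element-wise arithmetic propagates NaN
print(s + 1)  # the middle element stays NaN

# Reductions skip NaN by default (skipna=True)...
print(s.sum())              # 4.0
# ...unless you ask them not to
print(s.sum(skipna=False))  # nan

# None is treated as missing in numeric contexts
print(pd.Series([1.0, None]).isna().tolist())  # [False, True]
```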
Before imputing, you need to understand your missing data landscape. The isna() and isnull() methods (they’re identical—use whichever you prefer) help you detect and count missing values:
import pandas as pd
import numpy as np
# Create sample DataFrame with missing values
df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D', 'E', 'F'],
    'price': [29.99, np.nan, 45.50, np.nan, 32.00, 55.00],
    'quantity': [100, 150, np.nan, 200, np.nan, 175],
    'category': ['electronics', 'electronics', 'clothing', 'clothing', 'electronics', 'clothing']
})
print(df)
print("\nMissing values per column:")
print(df.isna().sum())
print(f"\nTotal missing values: {df.isna().sum().sum()}")
print(f"Percentage missing: {df.isna().sum().sum() / df.size * 100:.1f}%")
Output:
  product  price  quantity     category
0       A  29.99     100.0  electronics
1       B    NaN     150.0  electronics
2       C  45.50       NaN     clothing
3       D    NaN     200.0     clothing
4       E  32.00       NaN  electronics
5       F  55.00     175.0     clothing
Missing values per column:
product 0
price 2
quantity 2
category 0
dtype: int64
Total missing values: 4
Percentage missing: 16.7%
This quick diagnostic tells you which columns need attention and how severe the missing data problem is. If a column has more than 30-40% missing values, mean imputation becomes increasingly questionable—you’re essentially fabricating a large portion of your data.
Filling NaN with Mean for a Single Column
The most straightforward approach combines fillna() with mean(). This calculates the column’s mean (ignoring NaN values) and uses it to replace all missing entries:
# Fill missing prices with the mean price
mean_price = df['price'].mean()
print(f"Mean price: {mean_price:.2f}")
df['price_filled'] = df['price'].fillna(mean_price)
print(df[['product', 'price', 'price_filled']])
Output:
Mean price: 40.62
  product  price  price_filled
0       A  29.99         29.99
1       B    NaN         40.62
2       C  45.50         45.50
3       D    NaN         40.62
4       E  32.00         32.00
5       F  55.00         55.00
You can also modify the column in place, though I recommend creating a new column during exploration so you can compare results:
# In-place modification (use with caution)
df['price'] = df['price'].fillna(df['price'].mean())
# Avoid the inplace=True parameter on a selected column: under pandas'
# copy-on-write behavior it can act on a temporary copy, and chained
# calls like df['quantity'].fillna(..., inplace=True) raise warnings.
# Plain assignment is the reliable pattern:
df['quantity'] = df['quantity'].fillna(df['quantity'].mean())
One subtle issue: writing df['price'].fillna(df['price'].mean()) inline recomputes the mean every time the line runs and hides the value you actually imputed. For clarity and debugging, calculate the mean once, store it in a variable, and log it before filling.
Filling NaN with Mean for Multiple or All Columns
When you have missing values across multiple numeric columns, you can apply mean imputation to all of them simultaneously:
# Reset our DataFrame
df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D', 'E', 'F'],
    'price': [29.99, np.nan, 45.50, np.nan, 32.00, 55.00],
    'quantity': [100, 150, np.nan, 200, np.nan, 175],
    'rating': [4.5, 3.8, np.nan, 4.2, 4.0, np.nan],
    'category': ['electronics', 'electronics', 'clothing', 'clothing', 'electronics', 'clothing']
})
# Fill all numeric columns with their respective means
df_filled = df.fillna(df.mean(numeric_only=True))
print(df_filled)
The numeric_only=True parameter ensures Pandas only calculates means for numeric columns, avoiding errors with string or categorical data.
For more control, explicitly select which columns to impute:
# Specify columns to fill
cols_to_fill = ['price', 'quantity']
df[cols_to_fill] = df[cols_to_fill].fillna(df[cols_to_fill].mean())
# Or use select_dtypes to automatically get numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
This explicit approach is better for production code—it makes your intentions clear and prevents accidentally imputing columns that shouldn’t be touched.
Filling NaN with Group-wise Mean
Global mean imputation treats all rows identically, but your data often has meaningful segments. A missing price for an electronics item should probably be filled with the average electronics price, not the average across all categories.
The groupby().transform() pattern handles this elegantly:
df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'price': [29.99, np.nan, 145.50, np.nan, 32.00, 155.00, np.nan, 160.00],
    'category': ['electronics', 'electronics', 'clothing', 'clothing',
                 'electronics', 'clothing', 'electronics', 'clothing']
})
# Calculate group means
print("Mean price by category:")
print(df.groupby('category')['price'].mean())
# Fill with category-specific means
df['price_filled'] = df.groupby('category')['price'].transform(
    lambda x: x.fillna(x.mean())
)
print("\nDataFrame with group-wise filling:")
print(df)
Output:
Mean price by category:
category
clothing       153.500
electronics     30.995
Name: price, dtype: float64
DataFrame with group-wise filling:
  product   price     category  price_filled
0       A   29.99  electronics        29.990
1       B     NaN  electronics        30.995
2       C  145.50     clothing       145.500
3       D     NaN     clothing       153.500
4       E   32.00  electronics        32.000
5       F  155.00     clothing       155.000
6       G     NaN  electronics        30.995
7       H  160.00     clothing       160.000
Notice how electronics products get filled with ~31 while clothing products get filled with ~153.50. This produces much more realistic imputations than the global mean of ~104.50 would.
The transform() method is crucial here—it returns a Series with the same index as the original, making it compatible with fillna(). Using apply() instead would return aggregated results that don’t align with your DataFrame.
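To make the difference concrete, here is a small sketch contrasting aggregation with transform on the same grouped column (toy data, illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'price': [29.99, np.nan, 145.50, np.nan],
    'category': ['electronics', 'electronics', 'clothing', 'clothing'],
})

# mean() aggregates: one row per group, indexed by category
agg = df.groupby('category')['price'].mean()
print(agg.shape)  # (2,)

# transform('mean') broadcasts: one row per original row, same index as df
broadcast = df.groupby('category')['price'].transform('mean')
print(broadcast.shape)  # (4,)

# Because the indexes align, fillna can consume it directly
df['price'] = df['price'].fillna(broadcast)
print(df['price'].tolist())  # [29.99, 29.99, 145.5, 145.5]
```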
Using SimpleImputer from Scikit-learn
For machine learning pipelines, raw Pandas operations have a critical flaw: they calculate the mean on whatever data you give them. If you fit a model on training data and then apply the same imputation logic to test data, you’ll use the test set’s mean—introducing data leakage.
Scikit-learn’s SimpleImputer solves this by separating the “fit” (learning the mean) from the “transform” (applying it):
from sklearn.impute import SimpleImputer
df = pd.DataFrame({
    'feature_1': [1.0, 2.0, np.nan, 4.0, 5.0],
    'feature_2': [10.0, np.nan, 30.0, np.nan, 50.0],
    'feature_3': [100.0, 200.0, 300.0, np.nan, 500.0]
})
# Create and fit the imputer
imputer = SimpleImputer(strategy='mean')
imputer.fit(df)
# Check learned statistics
print("Learned means:", imputer.statistics_)
# Transform the data
df_imputed = pd.DataFrame(
    imputer.transform(df),
    columns=df.columns,
    index=df.index  # preserve the original row index
)
print("\nImputed DataFrame:")
print(df_imputed)
The real power emerges in train/test scenarios:
# Simulate train/test split
train_df = pd.DataFrame({
    'feature_1': [1.0, 2.0, np.nan, 4.0],
    'feature_2': [10.0, np.nan, 30.0, 40.0]
})
test_df = pd.DataFrame({
    'feature_1': [np.nan, 6.0],
    'feature_2': [np.nan, 60.0]
})
# Fit on training data only
imputer = SimpleImputer(strategy='mean')
imputer.fit(train_df)
# Transform both sets using training means
train_imputed = imputer.transform(train_df)
test_imputed = imputer.transform(test_df)
print(f"Training mean for feature_1: {imputer.statistics_[0]:.2f}")
print(f"Test set uses same mean: {test_imputed[0, 0]:.2f}")
You can also integrate SimpleImputer into scikit-learn pipelines for cleaner, more maintainable code.
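A minimal sketch of that integration, assuming a standard impute-then-scale preprocessing chain (the step names and toy data are arbitrary):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Bundle imputation and scaling into a single estimator
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
])

X_train = np.array([[1.0, 10.0], [2.0, np.nan], [np.nan, 30.0]])
X_test = np.array([[np.nan, np.nan]])

# fit() learns the imputation means AND the scaling statistics
# from the training data only
pipe.fit(X_train)

# Test rows are imputed with training means, then scaled with training stats
print(pipe.transform(X_test))
```

Because the fully missing test row is imputed with the training means and those means are also the scaler's centers, it comes out as all zeros here. The key point is that nothing about the test set influences the learned preprocessing.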
Best Practices and Considerations
When mean imputation works well:
- Data is missing completely at random (MCAR)
- The distribution is roughly symmetric
- Missing values are a small percentage of the data (under 10-15%)
- You need a quick baseline before trying sophisticated methods
When to avoid it:
- Skewed distributions (use median instead)
- Presence of significant outliers
- Data missing not at random (MNAR)—the missingness itself contains information
- High percentage of missing values
Alternatives to consider:
- Median: More robust to outliers and skewed data
- Mode: For categorical data or highly discrete numeric data
- Forward/backward fill: For time series data
- K-nearest neighbors imputation: Uses similar rows to estimate missing values
- Multiple imputation: Creates several imputed datasets to capture uncertainty
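For example, a single outlier is enough to pull the mean far from typical values while leaving the median nearly untouched:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, np.nan])  # one extreme outlier

print(s.mean())    # 22.0, dragged far above typical values
print(s.median())  # 3.0, barely affected

# Filling with the median keeps the imputed value near the bulk of the data
filled = s.fillna(s.median())
print(filled.iloc[-1])  # 3.0
```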
Document everything. Record which columns you imputed, what strategy you used, and what the imputed values were. This information is critical for reproducing results and debugging model behavior later.
Mean imputation is a tool, not a solution. Use it thoughtfully, understand its limitations, and always validate that your imputed data makes domain sense.