How to Fill NaN with Median in Pandas


Key Insights

  • Use fillna(df['column'].median()) for single columns, but apply df.fillna(df.select_dtypes(include='number').median()) when imputing across all numeric columns to avoid errors with non-numeric data.
  • Group-wise median imputation with groupby().transform('median') produces more accurate fills by respecting categorical relationships in your data—use it when your missing values correlate with group membership.
  • Always fit your imputer on training data only and transform both train and test sets to prevent data leakage; SimpleImputer from scikit-learn makes this pattern explicit and reproducible.

Introduction

Missing data is inevitable. Whether you’re working with sensor readings, survey responses, or scraped web data, you’ll encounter NaN values that need handling before analysis or modeling. The question isn’t whether you’ll face this problem—it’s how you’ll solve it.

Median imputation stands out as a robust default strategy. Unlike mean imputation, which gets pulled toward extreme values, the median remains stable when outliers are present. If your salary column has a few executives making millions alongside typical employees, the median gives you a representative fill value while the mean would skew high.

This article covers the practical techniques for filling NaN values with medians in Pandas, from simple single-column fills to group-wise imputation and scikit-learn integration for production pipelines.

Understanding NaN Values in Pandas

Pandas uses NaN (Not a Number) from NumPy to represent missing data. For integer columns, Pandas 1.0+ introduced nullable integer types (Int64, Int32) that can hold pd.NA, but you’ll still encounter NaN in most real-world datasets.
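A quick illustration of the difference: mixing np.nan into integer data silently upcasts the column to float64, while the nullable Int64 dtype keeps integers intact and represents missing values as pd.NA.

```python
import pandas as pd
import numpy as np

# np.nan forces integer data to float64
s_float = pd.Series([1, 2, np.nan])
print(s_float.dtype)       # float64

# Nullable integer dtype holds pd.NA without casting to float
s_int = pd.Series([1, 2, pd.NA], dtype='Int64')
print(s_int.dtype)         # Int64
print(s_int.isna().sum())  # 1
```

Either way, isna() detects the missing value, so the techniques below apply to both representations.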

Detecting missing values is your first step. Here are the essential methods:

import pandas as pd
import numpy as np

# Create sample DataFrame with missing values
df = pd.DataFrame({
    'employee_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'department': ['Engineering', 'Sales', 'Engineering', 'Sales', 
                   'Engineering', 'Sales', 'Engineering', 'Sales'],
    'salary': [75000, 65000, np.nan, 70000, 82000, np.nan, 78000, 68000],
    'years_experience': [5, 3, 7, np.nan, 10, 4, np.nan, 2],
    'performance_score': [4.2, np.nan, 4.5, 3.8, np.nan, 4.0, 4.3, 3.9]
})

# Check for NaN values
print("NaN counts per column:")
print(df.isna().sum())

print("\nDataFrame info:")
df.info()  # info() prints directly and returns None, so don't wrap it in print()

print("\nQuick NaN overview:")
print(df.isnull().any())

Output:

NaN counts per column:
employee_id          0
department           0
salary               2
years_experience     2
performance_score    2

The isna() method (aliased as isnull()) returns a boolean DataFrame. Chaining .sum() gives you counts per column. The info() method shows non-null counts, which helps you quickly spot columns with missing data.

Basic Median Imputation with fillna()

The simplest approach fills a single column’s NaN values with that column’s median:

# Original data
print("Before imputation:")
print(df['salary'].values)
# [75000. 65000.    nan 70000. 82000.    nan 78000. 68000.]

# Calculate median (ignores NaN automatically)
salary_median = df['salary'].median()
print(f"\nSalary median: {salary_median}")
# Salary median: 72500.0

# Fill NaN values with median
df['salary'] = df['salary'].fillna(salary_median)

print("\nAfter imputation:")
print(df['salary'].values)
# [75000. 65000. 72500. 70000. 82000. 72500. 78000. 68000.]

A few things to note here. First, median() automatically ignores NaN values when calculating (skipna=True is the default), so you don't need to drop them first. Second, fillna() returns a new Series by default, so you must assign the result back. The inplace=True option exists but is discouraged in modern Pandas; explicit assignment is clearer and safer.

You can also chain this into a single line:

df['salary'] = df['salary'].fillna(df['salary'].median())

This works, but be careful: if you’re filling multiple columns, calling median() inside fillna() repeatedly can become inefficient. Calculate medians once, then apply.

Filling Multiple Columns at Once

When you need to impute several numeric columns, avoid repetitive single-column fills. Pandas lets you pass a Series of fill values to fillna():

# Reset our DataFrame for demonstration
df = pd.DataFrame({
    'employee_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'department': ['Engineering', 'Sales', 'Engineering', 'Sales', 
                   'Engineering', 'Sales', 'Engineering', 'Sales'],
    'salary': [75000, 65000, np.nan, 70000, 82000, np.nan, 78000, 68000],
    'years_experience': [5, 3, 7, np.nan, 10, 4, np.nan, 2],
    'performance_score': [4.2, np.nan, 4.5, 3.8, np.nan, 4.0, 4.3, 3.9]
})

# Select only numeric columns
numeric_cols = df.select_dtypes(include='number')

# Calculate medians for all numeric columns at once
medians = numeric_cols.median()
print("Medians per column:")
print(medians)

# Fill all numeric columns with their respective medians
df[numeric_cols.columns] = numeric_cols.fillna(medians)

print("\nDataFrame after imputation:")
print(df)

The select_dtypes(include='number') call is crucial. Calling df.median() on a DataFrame with non-numeric columns raises a TypeError in Pandas 2.0+ unless you pass numeric_only=True (older versions silently skipped those columns, with a deprecation warning in later 1.x releases). Being explicit about column selection prevents surprises.

An alternative approach uses apply():

# Fill each numeric column with its own median
df[numeric_cols.columns] = df[numeric_cols.columns].apply(
    lambda col: col.fillna(col.median())
)

This is slightly less efficient but more readable for some developers. Choose based on your team’s preferences.

Group-wise Median Imputation

Here’s where imputation gets interesting. A global median treats all rows equally, but your data often has structure. An engineer’s missing salary should probably be filled with the engineering median, not the company-wide median.

The transform() method combined with groupby() handles this elegantly:

# Reset DataFrame
df = pd.DataFrame({
    'employee_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'department': ['Engineering', 'Sales', 'Engineering', 'Sales', 
                   'Engineering', 'Sales', 'Engineering', 'Sales'],
    'salary': [75000, 65000, np.nan, 70000, 82000, np.nan, 78000, 68000],
    'years_experience': [5, 3, 7, np.nan, 10, 4, np.nan, 2],
    'performance_score': [4.2, np.nan, 4.5, 3.8, np.nan, 4.0, 4.3, 3.9]
})

# Check group medians
print("Median salary by department:")
print(df.groupby('department')['salary'].median())
# Engineering: 78000.0
# Sales: 68000.0

# Fill NaN with group-specific median
df['salary'] = df.groupby('department')['salary'].transform(
    lambda x: x.fillna(x.median())
)

print("\nSalary after group-wise imputation:")
print(df[['department', 'salary']])

The transform() method returns a Series with the same index as the original, making it directly assignable back to the column. The lambda function receives each group’s data and fills NaN values with that group’s median.

For multiple columns with group-wise imputation:

def fill_group_median(group):
    return group.fillna(group.median())

# Apply to multiple numeric columns grouped by department
numeric_cols = ['salary', 'years_experience', 'performance_score']
df[numeric_cols] = df.groupby('department')[numeric_cols].transform(fill_group_median)

Watch out for groups where all values are NaN—the median will be NaN, and your fill won’t work. Handle this edge case by falling back to the global median:

def fill_with_fallback(group, global_median):
    group_median = group.median()
    if pd.isna(group_median):
        return group.fillna(global_median)
    return group.fillna(group_median)

global_med = df['salary'].median()
df['salary'] = df.groupby('department')['salary'].transform(
    lambda x: fill_with_fallback(x, global_med)
)

Using SimpleImputer from Scikit-learn

When building machine learning pipelines, raw Pandas operations become problematic. You need to fit your imputation strategy on training data and apply the same transformation to test data. Scikit-learn’s SimpleImputer makes this explicit:

from sklearn.impute import SimpleImputer
import pandas as pd
import numpy as np

# Sample data split into train/test
train_df = pd.DataFrame({
    'feature_1': [10, 20, np.nan, 40, 50],
    'feature_2': [1.5, np.nan, 3.5, 4.5, 5.5]
})

test_df = pd.DataFrame({
    'feature_1': [np.nan, 25, 35],
    'feature_2': [2.0, np.nan, 4.0]
})

# Create and fit imputer on training data only
imputer = SimpleImputer(strategy='median')
imputer.fit(train_df)

# Check learned medians
print("Learned medians:", imputer.statistics_)
# [30. 4.]

# Transform both datasets
train_imputed = pd.DataFrame(
    imputer.transform(train_df),
    columns=train_df.columns,
    index=train_df.index  # preserve the original index
)

test_imputed = pd.DataFrame(
    imputer.transform(test_df),
    columns=test_df.columns,
    index=test_df.index
)

print("\nTest data after imputation:")
print(test_imputed)

The key advantage: SimpleImputer stores the medians computed during fit() and reuses them during transform(). This prevents data leakage where test set statistics accidentally influence your model.

For integration with full pipelines:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

# Now fit and predict handle imputation automatically
# pipeline.fit(X_train, y_train)
# predictions = pipeline.predict(X_test)
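To see the pipeline run end to end, here is a sketch on synthetic data (the column names, missing-value pattern, and target construction are all illustrative, not from the original example):

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])
X.iloc[::10, 0] = np.nan          # knock out every 10th value in column 'a'
y = (X['b'] > 0).astype(int)      # target driven by a fully observed column

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

# fit() imputes, scales, and trains in one call; NaNs never reach the model
pipeline.fit(X, y)
print(pipeline.score(X, y))
```

Because the imputer sits inside the pipeline, cross-validation and grid search will refit it on each training fold, which is exactly the leakage-free behavior you want.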

Best Practices and Considerations

Choose your strategy deliberately. Median works well for skewed distributions and outlier-prone data. Mean is appropriate for symmetric distributions. Mode handles categorical data. Don’t default to median without checking your data’s distribution.

Document your imputation decisions. Future you (and your teammates) need to know which columns were imputed and how. Add comments or maintain a data dictionary:

imputation_log = {
    'salary': {'method': 'median', 'value': 72500, 'rows_affected': 2},
    'years_experience': {'method': 'group_median', 'group_by': 'department'}
}

Avoid over-imputation. If a column has 80% missing values, imputing them all with the median creates artificial data that doesn’t reflect reality. Consider dropping such columns or using more sophisticated methods like multiple imputation.
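One defensive pattern is to measure the missing fraction per column and only impute below a cutoff. The 50% threshold here is an illustrative choice, not a standard; pick one that fits your domain.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'mostly_present': [1.0, 2.0, np.nan, 4.0, 5.0],
    'mostly_missing': [np.nan, np.nan, np.nan, np.nan, 1.0]
})

# isna().mean() gives the fraction of missing values per column
missing_frac = df.isna().mean()
print(missing_frac)

# Keep only columns below the threshold, then impute those
threshold = 0.5
keep = missing_frac[missing_frac < threshold].index
df = df[keep].fillna(df[keep].median())
print(df.columns.tolist())  # ['mostly_present']
```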

Prevent data leakage. Never calculate imputation statistics on your full dataset before splitting into train/test. Fit on training data, transform everything. This is non-negotiable for valid model evaluation.

Consider the downstream impact. Imputed values reduce variance in your data. If you’re imputing heavily, your model’s confidence intervals and statistical tests may be overly optimistic. Some analyses require flagging imputed values for sensitivity analysis.
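Flagging rows before filling preserves that information for later sensitivity analysis. A minimal sketch (the indicator column name is an arbitrary convention):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'salary': [75000, np.nan, 82000, np.nan]})

# Record which rows were imputed BEFORE overwriting the NaNs
df['salary_imputed'] = df['salary'].isna()
df['salary'] = df['salary'].fillna(df['salary'].median())

print(df)
```

With the flag in place, you can rerun an analysis excluding imputed rows and compare results.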

Median imputation is a solid default, but it’s still a compromise. You’re replacing unknown values with educated guesses. When missing data is substantial or non-random, invest time in understanding why it’s missing before deciding how to fill it.
