How to Perform Imputation in Python
Key Insights
- Choose your imputation method based on the missingness mechanism—simple mean imputation works for random missing data, but can severely bias your results when data is missing systematically
- Always fit your imputer on training data only and transform test data separately to prevent data leakage that inflates model performance
- Advanced methods like KNN and iterative imputation preserve relationships between variables, making them superior choices when feature correlations matter for your analysis
Introduction to Missing Data
Missing data is inevitable. Sensors fail, users skip form fields, databases corrupt, and surveys go incomplete. How you handle these gaps directly impacts the validity of your analysis and the performance of your models.
Before reaching for an imputation method, understand why your data is missing. Statisticians categorize missingness into three types:
Missing Completely at Random (MCAR): The probability of missingness is unrelated to any variable. A sensor randomly malfunctions regardless of the readings it would have recorded.
Missing at Random (MAR): Missingness depends on observed variables but not the missing value itself. Higher-income respondents might skip income questions, but the missingness relates to other observable factors like education level.
Missing Not at Random (MNAR): The missingness depends on the unobserved value. People with extreme incomes skip income questions precisely because of their income level.
For MCAR data, simple deletion or basic imputation works fine. MAR data benefits from sophisticated imputation methods that leverage relationships between variables. MNAR data requires domain expertise and often specialized techniques beyond standard imputation.
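These three mechanisms are easy to simulate, which helps build intuition about how they differ; a minimal sketch (the column names, probabilities, and distributions here are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'income': rng.normal(60_000, 15_000, 1_000),
    'education_years': rng.integers(10, 21, 1_000),
})

# MCAR: every value has the same 10% chance of being dropped
mcar = df['income'].mask(rng.random(len(df)) < 0.10)

# MAR: missingness depends on an observed variable (education),
# not on the income value itself
p_mar = np.where(df['education_years'] >= 16, 0.25, 0.05)
mar = df['income'].mask(rng.random(len(df)) < p_mar)

# MNAR: missingness depends on the unobserved income value itself
p_mnar = np.where(df['income'] > df['income'].quantile(0.80), 0.40, 0.05)
mnar = df['income'].mask(rng.random(len(df)) < p_mnar)

print(round(mcar.isna().mean(), 2),
      round(mar.isna().mean(), 2),
      round(mnar.isna().mean(), 2))
```

Comparing the distribution of the remaining values against the full column shows why the distinction matters: the MCAR sample stays representative, while the MNAR sample is systematically missing its upper tail.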
Detecting Missing Values in Python
Before imputing, you need a clear picture of what’s missing and where. Pandas provides the essential tools.
import pandas as pd
import numpy as np
import missingno as msno
import matplotlib.pyplot as plt
# Create sample dataset with missing values
np.random.seed(42)
df = pd.DataFrame({
'age': [25, 30, np.nan, 45, 50, np.nan, 35, 40, 28, 55],
'income': [50000, np.nan, 75000, np.nan, 90000, 60000, np.nan, 85000, 45000, 95000],
'education': ['Bachelor', 'Master', 'PhD', np.nan, 'Bachelor', 'Master', 'PhD', np.nan, 'Bachelor', 'Master'],
'satisfaction': [7, 8, np.nan, 6, 9, np.nan, 7, 8, np.nan, 9]
})
# Basic missing data summary
print("Missing values per column:")
print(df.isnull().sum())
print("\nPercentage missing:")
print((df.isnull().sum() / len(df) * 100).round(2))
# Detailed info including non-null counts
print("\nDataFrame info:")
df.info()
# Visualize missing data patterns
msno.matrix(df)
plt.title('Missing Data Pattern')
plt.tight_layout()
plt.savefig('missing_pattern.png', dpi=150)
plt.show()
# Correlation of missingness between columns
msno.heatmap(df)
plt.title('Missingness Correlation')
plt.tight_layout()
plt.show()
The missingno library's matrix plot reveals patterns at a glance: white gaps within each column's bar mark missing rows, and gaps that align horizontally across columns suggest correlated missingness. The heatmap quantifies whether columns tend to be missing together, which hints at the underlying missingness mechanism.
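If you prefer to avoid the extra dependency, a pure-pandas check gives similar information by correlating the missingness indicators directly; a sketch with a small illustrative frame:

```python
import numpy as np
import pandas as pd

# Small illustrative frame with deliberately co-missing columns
df = pd.DataFrame({
    'age': [25, np.nan, 35, np.nan, 50],
    'income': [50000, np.nan, 75000, np.nan, 90000],
    'satisfaction': [7, 8, np.nan, 6, 9],
})

# Correlate the 0/1 missingness indicators: values near 1 mean two
# columns tend to be missing in the same rows
miss_corr = df.isnull().astype(int).corr()
print(miss_corr.round(2))

# Rows with at least one missing value
print(df[df.isnull().any(axis=1)])
```

Here `age` and `income` are always missing together (correlation 1.0), while `satisfaction` goes missing in different rows, the same signal `msno.heatmap` conveys graphically.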
Simple Imputation Techniques
For straightforward cases, scikit-learn’s SimpleImputer handles the heavy lifting. Mean imputation works for normally distributed numerical data, median for skewed distributions, and mode for categorical variables.
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Separate numerical and categorical columns
numerical_cols = ['age', 'income', 'satisfaction']
categorical_cols = ['education']
# Create imputers for each type
numerical_imputer = SimpleImputer(strategy='median')
categorical_imputer = SimpleImputer(strategy='most_frequent')
# Apply imputers using ColumnTransformer
preprocessor = ColumnTransformer(
transformers=[
('num', numerical_imputer, numerical_cols),
('cat', categorical_imputer, categorical_cols)
],
remainder='passthrough'
)
# Fit and transform
df_imputed_array = preprocessor.fit_transform(df)
# Convert back to DataFrame with proper column names
# (ColumnTransformer returns a NumPy array; mixing numeric and string
# columns produces object dtype, so cast the numerics back)
df_simple_imputed = pd.DataFrame(
df_imputed_array,
columns=numerical_cols + categorical_cols
)
df_simple_imputed[numerical_cols] = df_simple_imputed[numerical_cols].astype(float)
print("Original data:")
print(df)
print("\nAfter simple imputation:")
print(df_simple_imputed)
# Constant value imputation for specific use cases
constant_imputer = SimpleImputer(strategy='constant', fill_value=-999)
df_constant = pd.DataFrame(
constant_imputer.fit_transform(df[numerical_cols]),
columns=numerical_cols
)
print("\nWith constant fill value:")
print(df_constant)
Simple imputation is fast and interpretable, but it reduces variance and ignores relationships between variables. Use it when missing data is sparse (under 5%) and MCAR.
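For one-off cleaning outside a modeling pipeline, pandas `fillna` covers the same strategies without scikit-learn; a minimal sketch (the frame here is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [25.0, 30.0, np.nan, 45.0],
    'education': ['Bachelor', None, 'PhD', 'Bachelor'],
})

# Median for the numeric column, mode for the categorical one.
# Caution: computing these statistics on the full frame leaks information
# if you later split into train/test, so use this for one-off cleaning only.
filled = df.assign(
    age=df['age'].fillna(df['age'].median()),
    education=df['education'].fillna(df['education'].mode()[0]),
)
print(filled)
```

The scikit-learn route is preferable the moment a train/test split is involved, because `SimpleImputer` stores the training statistics and reuses them at transform time.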
Advanced Imputation Methods
When variables correlate with each other, advanced methods produce more realistic imputations by leveraging these relationships.
KNN Imputation finds the k most similar complete observations and averages their values. It preserves local data structure but struggles with high-dimensional data.
Iterative Imputation (scikit-learn's IterativeImputer, inspired by the MICE algorithm) models each feature with missing values as a function of the other features. It cycles through the features, refining estimates until convergence or until max_iter is reached.
# IterativeImputer is experimental; the enabling import must come first
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.preprocessing import StandardScaler
# Prepare numerical data
df_numerical = df[numerical_cols].copy()
# KNN Imputation
# Scale data first - KNN is distance-based
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_numerical)
knn_imputer = KNNImputer(n_neighbors=3, weights='distance')
df_knn_scaled = knn_imputer.fit_transform(df_scaled)
# Inverse transform to original scale
df_knn_imputed = pd.DataFrame(
scaler.inverse_transform(df_knn_scaled),
columns=numerical_cols
)
print("KNN Imputed Data:")
print(df_knn_imputed.round(2))
# Iterative Imputation (MICE-style)
# IterativeImputer is experimental: this enabling import must run before
# IterativeImputer is imported from sklearn.impute
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
iterative_imputer = IterativeImputer(
max_iter=10,
random_state=42,
initial_strategy='median'
)
df_iterative_imputed = pd.DataFrame(
iterative_imputer.fit_transform(df_numerical),
columns=numerical_cols
)
print("\nIterative Imputed Data:")
print(df_iterative_imputed.round(2))
# Compare the methods
comparison = pd.DataFrame({
'Original': df['income'],
'Simple_Median': df_simple_imputed['income'].astype(float),
'KNN': df_knn_imputed['income'],
'Iterative': df_iterative_imputed['income']
})
print("\nIncome column comparison:")
print(comparison.round(2))
KNN imputation requires scaling because it uses distance metrics. The weights='distance' parameter gives closer neighbors more influence. Iterative imputation handles multicollinearity better and doesn’t require scaling, though scaling can improve convergence.
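IterativeImputer also accepts a custom estimator in place of the default BayesianRidge, which can help when feature relationships are nonlinear; a sketch using a random forest (the dataset and parameters here are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = pd.DataFrame({'a': rng.normal(size=200)})
X['b'] = X['a'] ** 2 + rng.normal(scale=0.1, size=200)  # nonlinear relationship
X.loc[rng.choice(200, 40, replace=False), 'b'] = np.nan  # 20% missing

# Swap the default linear per-feature model for a random forest
forest_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
X_imputed = pd.DataFrame(forest_imputer.fit_transform(X), columns=X.columns)
print(X_imputed.isna().sum().sum())  # no missing values remain
```

Tree-based estimators are slower per iteration but make no linearity assumption, a reasonable trade when the default model's imputations look biased.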
Time Series Imputation
Time series data demands methods that respect temporal ordering. Forward fill carries the last known value forward, backward fill does the reverse, and interpolation estimates values based on surrounding points.
# Create time series with gaps
dates = pd.date_range('2024-01-01', periods=15, freq='D')
ts_data = pd.DataFrame({
'date': dates,
'temperature': [20, 21, np.nan, np.nan, 24, 25, np.nan, 27, 28, np.nan, 30, 29, np.nan, 27, 26],
'humidity': [65, np.nan, 68, 70, np.nan, np.nan, 75, 73, np.nan, 70, 68, np.nan, 65, 63, 62]
})
ts_data.set_index('date', inplace=True)
# Forward fill - carry last observation forward
ts_ffill = ts_data.ffill()
# Backward fill - carry next observation backward
ts_bfill = ts_data.bfill()
# Linear interpolation - estimate based on surrounding values
ts_interpolated = ts_data.interpolate(method='linear')
# Time-based interpolation (useful for irregular time series)
ts_time_interp = ts_data.interpolate(method='time')
# Polynomial interpolation for smoother curves (requires SciPy)
ts_poly = ts_data.interpolate(method='polynomial', order=2)
# Compare methods
print("Original:")
print(ts_data['temperature'].values)
print("\nForward Fill:")
print(ts_ffill['temperature'].values)
print("\nLinear Interpolation:")
print(ts_interpolated['temperature'].values.round(2))
# Limit consecutive fills to avoid propagating stale data
ts_limited = ts_data.ffill(limit=1)
print("\nForward Fill (limit=1):")
print(ts_limited['temperature'].values)
# Visualize the differences
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
ts_data['temperature'].plot(ax=axes[0,0], marker='o', title='Original')
ts_ffill['temperature'].plot(ax=axes[0,1], marker='o', title='Forward Fill')
ts_interpolated['temperature'].plot(ax=axes[1,0], marker='o', title='Linear Interpolation')
ts_poly['temperature'].plot(ax=axes[1,1], marker='o', title='Polynomial Interpolation')
plt.tight_layout()
plt.savefig('timeseries_imputation.png', dpi=150)
plt.show()
Use forward fill for operational data where the last known state persists. Use interpolation when values change smoothly over time. Set limit parameters to avoid filling long gaps with stale data.
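The limit idea applies to interpolation as well: fill short gaps, and leave long ones visible so a better method (or domain review) can handle them. A small sketch with illustrative values:

```python
import numpy as np
import pandas as pd

s = pd.Series([20.0, np.nan, np.nan, np.nan, np.nan, 26.0, np.nan, 28.0])

# Interpolate, but fill at most 2 consecutive NaNs;
# the 4-NaN gap is only partially filled and the rest stays missing
short_gaps_only = s.interpolate(method='linear', limit=2,
                                limit_direction='forward')
print(short_gaps_only.values)
```

The remaining NaNs act as an honest marker that those stretches were never observed, rather than silently inventing a long smooth ramp.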
Evaluating Imputation Quality
Good imputation preserves the original distribution’s shape and maintains relationships between variables.
import seaborn as sns
from scipy import stats
# Generate larger dataset for meaningful comparison
np.random.seed(42)
n = 1000
full_data = pd.DataFrame({
'feature1': np.random.normal(50, 10, n),
'feature2': np.random.exponential(20, n)
})
# Introduce 20% missing values
mask = np.random.random(n) < 0.2
df_missing = full_data.copy()
df_missing.loc[mask, 'feature1'] = np.nan
# Apply different imputation methods
mean_imputer = SimpleImputer(strategy='mean')
knn_imputer = KNNImputer(n_neighbors=5)
df_mean_imputed = pd.DataFrame(
mean_imputer.fit_transform(df_missing),
columns=full_data.columns
)
df_knn_imputed = pd.DataFrame(
knn_imputer.fit_transform(df_missing),
columns=full_data.columns
)
# Compare distributions visually
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
axes[0].hist(full_data['feature1'], bins=30, alpha=0.7, label='Original', density=True)
axes[0].set_title('Original Distribution')
axes[0].legend()
axes[1].hist(df_mean_imputed['feature1'], bins=30, alpha=0.7, label='Mean Imputed', density=True)
axes[1].axvline(df_mean_imputed['feature1'].mean(), color='red', linestyle='--', label='Mean')
axes[1].set_title('Mean Imputation (note spike at mean)')
axes[1].legend()
axes[2].hist(df_knn_imputed['feature1'], bins=30, alpha=0.7, label='KNN Imputed', density=True)
axes[2].set_title('KNN Imputation')
axes[2].legend()
plt.tight_layout()
plt.savefig('imputation_comparison.png', dpi=150)
plt.show()
# Statistical comparison
print("Distribution Statistics:")
print(f"Original - Mean: {full_data['feature1'].mean():.2f}, Std: {full_data['feature1'].std():.2f}")
print(f"Mean Imputed - Mean: {df_mean_imputed['feature1'].mean():.2f}, Std: {df_mean_imputed['feature1'].std():.2f}")
print(f"KNN Imputed - Mean: {df_knn_imputed['feature1'].mean():.2f}, Std: {df_knn_imputed['feature1'].std():.2f}")
# KS test comparing imputed to original
ks_mean = stats.ks_2samp(full_data['feature1'], df_mean_imputed['feature1'])
ks_knn = stats.ks_2samp(full_data['feature1'], df_knn_imputed['feature1'])
print("\nKS Test p-values (higher = more similar to original):")
print(f"Mean Imputation: {ks_mean.pvalue:.4f}")
print(f"KNN Imputation: {ks_knn.pvalue:.4f}")
Notice how mean imputation creates a spike at the mean value, artificially reducing variance. KNN imputation better preserves the distribution shape.
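Another common quality check is to mask some known values, impute, and score against the ground truth you hid; a sketch on synthetic data (names, sizes, and the correlated helper feature are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

rng = np.random.default_rng(42)
truth = pd.DataFrame({'x': rng.normal(50, 10, 500)})
truth['y'] = 0.5 * truth['x'] + rng.normal(0, 3, 500)  # correlated feature

# Hide 100 known x values so imputations can be scored against ground truth
holdout = rng.choice(500, 100, replace=False)
masked = truth.copy()
masked.loc[holdout, 'x'] = np.nan

rmse = {}
for name, imputer in [('mean', SimpleImputer(strategy='mean')),
                      ('knn', KNNImputer(n_neighbors=5))]:
    filled = imputer.fit_transform(masked)
    err = filled[holdout, 0] - truth.loc[holdout, 'x'].to_numpy()
    rmse[name] = float(np.sqrt(np.mean(err ** 2)))

print(rmse)  # KNN should win here because y carries information about x
```

This per-value error metric complements the distributional checks above: a method can match the histogram yet still place individual values poorly.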
Best Practices and Pitfalls
Prevent data leakage. Always fit imputers on training data only:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
# Split before any preprocessing
X_train, X_test, y_train, y_test = train_test_split(
df_missing[['feature1', 'feature2']],
np.random.randn(len(df_missing)),  # synthetic target, just for demonstration
test_size=0.2,
random_state=42
)
# Build pipeline that handles imputation properly
pipeline = Pipeline([
('imputer', KNNImputer(n_neighbors=5)),
('model', LinearRegression())
])
# Fit on training data - imputer learns from train only
pipeline.fit(X_train, y_train)
# Predicting on the test set imputes it with statistics learned from training
predictions = pipeline.predict(X_test)
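The same pipeline idea extends to cross-validation: with the imputer inside the pipeline, each fold refits it on that fold's training portion only, so no fold ever sees statistics from its own held-out rows. A sketch on synthetic data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out 10% of cells

pipe = Pipeline([
    ('imputer', KNNImputer(n_neighbors=5)),
    ('model', LinearRegression()),
])

# Each CV fold fits the imputer on its own training split only
scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
print(scores.round(3))
```

Imputing the full matrix once and then cross-validating would leak held-out information into every fold and overstate the scores.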
Document your decisions. Record which columns had missing data, the percentage missing, the mechanism you assumed, and the method you chose. Future you will thank present you.
Consider multiple imputation for inferential statistics. Instead of creating one imputed dataset, create several with different random seeds, run your analysis on each, and pool the results. This properly accounts for imputation uncertainty.
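In scikit-learn, setting sample_posterior=True on IterativeImputer yields stochastic imputations suited to this; a rough sketch that pools only the point estimates (full Rubin's rules also combine within- and between-imputation variances, which is omitted here for brevity):

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(50, 10, size=(300, 2)), columns=['a', 'b'])
df.loc[rng.choice(300, 60, replace=False), 'a'] = np.nan

estimates = []
for seed in range(5):
    # sample_posterior draws imputations instead of using point predictions
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = imp.fit_transform(df)
    estimates.append(completed[:, 0].mean())  # per-dataset estimate of mean(a)

pooled = float(np.mean(estimates))
between_var = float(np.var(estimates, ddof=1))  # imputation uncertainty
print(f'pooled mean: {pooled:.2f}, between-imputation variance: {between_var:.4f}')
```

The spread across the five estimates is exactly the uncertainty that single imputation hides.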
Match method to mechanism. Simple imputation for MCAR with low missingness. KNN or iterative imputation for MAR when relationships between variables matter. For MNAR, consult domain experts or use sensitivity analysis.
Watch for impossible values. Imputation might produce negative ages or incomes exceeding reasonable bounds. Add post-processing constraints when necessary.
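A simple post-processing pass with clip handles this; the frame and bounds below are illustrative:

```python
import pandas as pd

# Hypothetical imputer output containing impossible values
imputed = pd.DataFrame({
    'age': [25.0, -3.2, 41.0],
    'income': [50000.0, 62000.0, -1500.0],
})

# Clip each column to domain-plausible bounds after imputation
bounds = {'age': (0, 120), 'income': (0, None)}
for col, (lo, hi) in bounds.items():
    imputed[col] = imputed[col].clip(lower=lo, upper=hi)

print(imputed)
```

Clipping is the bluntest fix; if many imputations land outside plausible bounds, that is itself a sign the imputation model is mis-specified.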
Missing data handling isn’t glamorous, but it’s foundational. Get it wrong and your analysis builds on a flawed foundation. Get it right and you extract maximum value from imperfect real-world data.