Pandas - Interpolate Missing Values
Key Insights
• Pandas supports many interpolation methods, including linear, time-based, nearest, polynomial, and spline (the last three require SciPy), so you can match the method to your data’s characteristics and requirements
• Time-aware interpolation using method='time' is critical for time-series data with irregular intervals, as it weights values by temporal distance rather than row position
• Combining the limit, limit_direction, and limit_area parameters gives precise control over interpolation scope, so long gaps in sparse datasets are not filled with unrealistic estimates
Understanding Interpolation vs Other Missing Value Strategies
Interpolation estimates missing values by fitting a function through existing data points. Unlike fillna(), which substitutes a static value, or ffill()/bfill(), which copy a neighboring value, interpolation considers the relationship between the surrounding values.
import pandas as pd
import numpy as np
# Create sample data with missing values
data = pd.Series([1, np.nan, np.nan, 4, 5, np.nan, 7])
# Compare different approaches
print("Original:", data.values)
print("fillna(0):", data.fillna(0).values)
print("ffill():", data.ffill().values)
print("interpolate():", data.interpolate().values)
Output:
Original: [ 1. nan nan 4. 5. nan 7.]
fillna(0): [1. 0. 0. 4. 5. 0. 7.]
ffill(): [1. 1. 1. 4. 5. 5. 7.]
interpolate(): [1. 2. 3. 4. 5. 6. 7.]
Linear interpolation creates a smooth progression between known values, which is appropriate when you expect gradual changes.
Linear Interpolation for Numeric Data
Linear interpolation is the default method. It ignores the index and treats rows as equally spaced, so it works well for numeric data sampled at consistent intervals.
df = pd.DataFrame({
'temperature': [20, np.nan, np.nan, 26, 28, np.nan, 32],
'humidity': [45, 48, np.nan, np.nan, 58, 60, np.nan]
})
# Apply linear interpolation
df_interpolated = df.interpolate(method='linear')
print(df_interpolated)
Output:
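temperature humidity
0 20.0 45.000000
1 22.0 48.000000
2 24.0 51.333333
3 26.0 54.666667
4 28.0 58.000000
5 30.0 60.000000
6 32.0 60.000000
Note that the last humidity value is 60.0: with the default limit_direction='forward', linear interpolation fills trailing NaNs by repeating the last known value rather than extrapolating the trend. To leave trailing NaNs unfilled, pass limit_area='inside'. To extrapolate, use a SciPy-backed method such as method='slinear' with fill_value='extrapolate'.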
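If the trend should continue past the last observation, one option is a SciPy-backed method. The sketch below assumes SciPy is installed and relies on pandas forwarding fill_value='extrapolate' to scipy.interpolate.interp1d. Treat it as an illustration rather than a recommendation, since extrapolated values are guesses.

```python
import numpy as np
import pandas as pd

humidity = pd.Series([45, 48, np.nan, np.nan, 58, 60, np.nan])

# Default linear interpolation repeats the last known value at the end
repeated = humidity.interpolate(method='linear')

# 'slinear' is handled by SciPy (required here); fill_value='extrapolate' is
# forwarded to scipy.interpolate.interp1d and continues the last segment's slope
extrapolated = humidity.interpolate(method='slinear', fill_value='extrapolate')

print(repeated.iloc[-1])      # last known value, repeated
print(extrapolated.iloc[-1])  # extended along the 58 -> 60 slope
```

Interior gaps are filled the same way by both calls; only the trailing value differs.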
Time-Based Interpolation for Irregular Time Series
When working with time-series data where observations aren’t evenly spaced, use method='time' to weight interpolated values by temporal distance.
# Create time series with irregular intervals
dates = pd.to_datetime(['2024-01-01', '2024-01-03', '2024-01-04',
'2024-01-10', '2024-01-15'])
values = [100, np.nan, 120, np.nan, 150]
ts = pd.Series(values, index=dates)
# Linear interpolation (ignores time gaps)
linear_interp = ts.interpolate(method='linear')
# Time-based interpolation (considers time gaps)
time_interp = ts.interpolate(method='time')
print("Linear interpolation:")
print(linear_interp)
print("\nTime-based interpolation:")
print(time_interp)
Output:
Linear interpolation:
2024-01-01 100.0
2024-01-03 110.0
2024-01-04 120.0
2024-01-10 135.0
2024-01-15 150.0
Time-based interpolation:
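2024-01-01 100.000000
2024-01-03 113.333333
2024-01-04 120.000000
2024-01-10 136.363636
2024-01-15 150.000000
The time-based method weights each estimate by elapsed time. Jan 3 lies two of the three days from Jan 1 to Jan 4, so its estimate is 100 + (2/3)(20) ≈ 113.33. Jan 10 lies 6 of the 11 days from Jan 4 to Jan 15, giving 120 + (6/11)(30) ≈ 136.36. Linear interpolation ignores the dates and places both estimates at the midpoint of their neighbors.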
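To make the weighting concrete, the following sketch reproduces the time-based result by hand with np.interp over elapsed days. The variable names (days, known, manual) are illustrative only, and the check assumes the same series as above.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(['2024-01-01', '2024-01-03', '2024-01-04',
                        '2024-01-10', '2024-01-15'])
ts = pd.Series([100, np.nan, 120, np.nan, 150], index=dates)

# Elapsed days since the first observation: 0, 2, 3, 9, 14
days = (ts.index - ts.index[0]).days.to_numpy(dtype=float)
known = ts.notna().to_numpy()

# Piecewise-linear interpolation over elapsed time instead of row position
manual = np.interp(days, days[known], ts.to_numpy()[known])
by_pandas = ts.interpolate(method='time').to_numpy()

print(np.round(manual, 2))
print(np.allclose(manual, by_pandas))
```

Swapping days for row positions (0, 1, 2, 3, 4) in the same call would give the method='linear' result instead.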
Polynomial and Spline Interpolation for Non-Linear Patterns
For data with curved patterns, polynomial or spline interpolation can fit better than a straight line. Both methods require SciPy and an explicit order argument.
# Simulate curved growth pattern
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([0, 1, 4, np.nan, np.nan, 25, np.nan, 49, 64, np.nan, 100])
series = pd.Series(y, index=x)
# Different interpolation methods
linear = series.interpolate(method='linear')
poly = series.interpolate(method='polynomial', order=2)
spline = series.interpolate(method='spline', order=2)
comparison = pd.DataFrame({
'original': series,
'linear': linear,
'polynomial': poly,
'spline': spline
})
print(comparison)
Output:
original linear polynomial spline
0 0.0 0.0 0.0 0.0
1 1.0 1.0 1.0 1.0
2 4.0 4.0 4.0 4.0
3 NaN 11.0 9.0 9.0
4 NaN 18.0 16.0 16.0
5 25.0 25.0 25.0 25.0
6 NaN 37.0 36.0 36.0
7 49.0 49.0 49.0 49.0
8 64.0 64.0 64.0 64.0
9 NaN 82.0 81.0 81.0
10 100.0 100.0 100.0 100.0
Because this data follows y = x² exactly, the order-2 polynomial and spline fits recover the missing values (9, 16, 36, 81), while linear interpolation overestimates every gap (for example, 11 instead of 9 at x = 3).
Controlling Interpolation Scope with Limit Parameters
The limit parameter restricts how many consecutive NaN values to fill, preventing excessive interpolation in sparse data.
data = pd.Series([10, np.nan, np.nan, np.nan, np.nan, 20,
np.nan, np.nan, 25, np.nan, 30])
# Limit consecutive interpolations
limited = data.interpolate(method='linear', limit=2)
# Control direction
forward_only = data.interpolate(method='linear', limit=2,
limit_direction='forward')
backward_only = data.interpolate(method='linear', limit=2,
limit_direction='backward')
comparison = pd.DataFrame({
'original': data,
'limit_2': limited,
'forward': forward_only,
'backward': backward_only
})
print(comparison)
Output:
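original limit_2 forward backward
0 10.0 10.000000 10.000000 10.000000
1 NaN 12.000000 12.000000 NaN
2 NaN 14.000000 14.000000 NaN
3 NaN NaN NaN 16.000000
4 NaN NaN NaN 18.000000
5 20.0 20.000000 20.000000 20.000000
6 NaN 21.666667 21.666667 21.666667
7 NaN 23.333333 23.333333 23.333333
8 25.0 25.000000 25.000000 25.000000
9 NaN 27.500000 27.500000 27.500000
10 30.0 30.000000 30.000000 30.000000
Because limit_direction defaults to 'forward', the limit_2 and forward columns are identical. Note that limit does not skip a long gap: it fills the first two NaNs of the four-value gap (or the last two when filling backward) and leaves the rest.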
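Since limit fills part of a long gap rather than skipping it, a common follow-up need is to fill only gaps that are short enough and leave longer ones entirely untouched. The helper below, interpolate_short_gaps, is a hypothetical name and not part of pandas; it is one possible sketch that interpolates everything and then masks the gaps that exceed max_gap.

```python
import numpy as np
import pandas as pd

def interpolate_short_gaps(s: pd.Series, max_gap: int) -> pd.Series:
    """Linearly fill interior gaps of at most max_gap NaNs; leave longer gaps as NaN."""
    is_nan = s.isna()
    # Give every run of consecutive NaN / non-NaN values its own id
    run_id = is_nan.astype(int).diff().ne(0).cumsum()
    # Number of NaNs in each run (0 for runs of valid values)
    run_len = is_nan.astype(int).groupby(run_id).transform('sum')
    too_long = is_nan & (run_len > max_gap)
    # Fill all interior gaps, then blank out the ones that were too long
    return s.interpolate(method='linear', limit_area='inside').mask(too_long)

data = pd.Series([10, np.nan, np.nan, np.nan, np.nan, 20,
                  np.nan, np.nan, 25, np.nan, 30])
result = interpolate_short_gaps(data, max_gap=2)
print(result)
```

With max_gap=2, the four-value gap between 10 and 20 stays empty, while the shorter gaps are filled completely.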
DataFrame Interpolation with Axis Control
Apply interpolation across rows or columns in DataFrames using the axis parameter.
df = pd.DataFrame({
'Jan': [100, 200, np.nan, 400],
'Feb': [110, np.nan, 310, np.nan],
'Mar': [np.nan, 230, 330, 450],
'Apr': [140, 250, np.nan, 480]
}, index=['Product_A', 'Product_B', 'Product_C', 'Product_D'])
# Interpolate across columns (time progression)
time_interp = df.interpolate(method='linear', axis=1)
# Interpolate across rows (product comparison)
product_interp = df.interpolate(method='linear', axis=0)
print("Time-based (across columns):")
print(time_interp)
print("\nProduct-based (across rows):")
print(product_interp)
Output:
Time-based (across columns):
Jan Feb Mar Apr
Product_A 100.0 110.0 125.0 140.0
Product_B 200.0 215.0 230.0 250.0
Product_C NaN 310.0 330.0 330.0
Product_D 400.0 425.0 450.0 480.0
Product-based (across rows):
Jan Feb Mar Apr
Product_A 100.0 110.0 NaN 140.0
Product_B 200.0 210.0 230.0 250.0
Product_C 300.0 310.0 330.0 365.0
Product_D 400.0 310.0 450.0 480.0
Along either axis, leading gaps stay NaN (Product_C in Jan, Product_A in Mar), while trailing gaps repeat the last valid value (Product_C in Apr, Product_D in Feb).
Handling Edge Cases and Missing Data Patterns
Different missing data patterns require different strategies.
# Leading NaNs
leading = pd.Series([np.nan, np.nan, 10, 15, 20])
print("Leading NaNs:", leading.interpolate().values)
print("With backfill:", leading.interpolate().bfill().values)
# Trailing NaNs
trailing = pd.Series([10, 15, 20, np.nan, np.nan])
print("Trailing NaNs:", trailing.interpolate().values)
print("With forward fill:", trailing.interpolate().ffill().values)
# All NaNs in column
all_nan = pd.Series([np.nan, np.nan, np.nan])
print("All NaNs:", all_nan.interpolate().values)
# Single valid value
single = pd.Series([np.nan, 10, np.nan])
print("Single value:", single.interpolate().values)
Output:
Leading NaNs: [nan nan 10. 15. 20.]
With backfill: [10. 10. 10. 15. 20.]
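Trailing NaNs: [10. 15. 20. 20. 20.]
With forward fill: [10. 15. 20. 20. 20.]
All NaNs: [nan nan nan]
Single value: [nan 10. 10.]
With the default limit_direction='forward', interpolation leaves leading NaNs untouched but fills trailing NaNs by repeating the last valid value, so the extra ffill() changes nothing here. Combine with bfill() or pass limit_direction='both' to cover leading NaNs, and pass limit_area='inside' if trailing NaNs should stay empty. A series with no valid values is returned unchanged.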
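The limit_direction and limit_area parameters control which edge gaps get filled. The following sketch uses a made-up series with one leading gap, one interior gap, and one trailing gap to show the three most common combinations.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 10, np.nan, 20, np.nan, np.nan])

# Default: interior gap interpolated, trailing gap repeats the last valid value
default = s.interpolate()

# limit_area='inside': only NaNs surrounded by valid values are filled
inside_only = s.interpolate(limit_area='inside')

# limit_direction='both': leading gap is also filled, using the first valid value
both_ends = s.interpolate(limit_direction='both')

print(pd.DataFrame({'original': s, 'default': default,
                    'inside_only': inside_only, 'both_ends': both_ends}))
```

Choosing limit_area='inside' is the conservative option when values outside the observed range should not be invented.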
Performance Considerations for Large Datasets
For large datasets, choose interpolation methods wisely based on computational cost.
import time
# Create large dataset
large_data = pd.Series(np.random.randn(1000000))
large_data[large_data < 0] = np.nan # ~50% missing
methods = ['linear', 'nearest', 'polynomial', 'spline']
timings = {}
for method in methods:
    # perf_counter() is a monotonic clock intended for measuring durations
    start = time.perf_counter()
    if method in ['polynomial', 'spline']:
        # These SciPy-backed methods require an explicit order
        result = large_data.interpolate(method=method, order=2)
    else:
        result = large_data.interpolate(method=method)
    timings[method] = time.perf_counter() - start
for method, duration in timings.items():
    print(f"{method}: {duration:.4f} seconds")
Linear and nearest methods are significantly faster than polynomial and spline for large datasets. Use complex methods only when data patterns justify the computational cost.