Pandas - Interpolate Missing Values | Application Architect

Key Insights

• Pandas supports several families of interpolation methods (linear, time- and index-aware, nearest, polynomial, and spline, among others), so you can match the method to your data’s characteristics. Nearest, polynomial, and spline rely on SciPy; plain forward and backward fills are better handled by ffill() and bfill().
• Time-aware interpolation using method='time' is critical for time-series data with irregular intervals, as it weights values by temporal distance rather than row position.
• Combining the limit and limit_direction parameters gives precise control over interpolation scope and keeps long gaps in sparse datasets from being filled with unrealistic values.

Understanding Interpolation vs Other Missing Value Strategies

Interpolation estimates missing values by fitting a function through existing data points. Unlike fillna(), which substitutes a static value, or ffill() and bfill(), which copy a neighbouring value, interpolation considers the relationship between surrounding values.

import pandas as pd
import numpy as np

# Create sample data with missing values
data = pd.Series([1, np.nan, np.nan, 4, 5, np.nan, 7])

# Compare different approaches
print("Original:", data.values)
print("fillna(0):", data.fillna(0).values)
print("ffill():", data.ffill().values)
print("interpolate():", data.interpolate().values)

Output:

Original: [ 1. nan nan  4.  5. nan  7.]
fillna(0): [1. 0. 0. 4. 5. 0. 7.]
ffill(): [1. 1. 1. 4. 5. 5. 7.]
interpolate(): [1.  2.  3.  4.  5.  6.  7.]

Linear interpolation creates a smooth progression between known values, which is appropriate when you expect gradual changes.
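To make the linear case concrete, the sketch below reproduces the same result with NumPy's np.interp, using row positions as x-coordinates. The variable names are illustrative, and the comparison assumes a default integer index.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, np.nan, 4, 5, np.nan, 7])

# Row positions serve as x-coordinates for the linear method
positions = np.arange(len(data))
known = data.notna().to_numpy()

# Evaluate the straight lines through the known points at every position
manual = np.interp(positions, positions[known], data[known].to_numpy())

print(manual)
print(data.interpolate().values)
```

Both lines print the same array, which shows that the default method is ordinary piecewise-linear interpolation over row positions.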

Linear Interpolation for Numeric Data

Linear interpolation is the default method and works well for most numeric datasets with consistent intervals. It ignores the index and treats the values as equally spaced.

df = pd.DataFrame({
    'temperature': [20, np.nan, np.nan, 26, 28, np.nan, 32],
    'humidity': [45, 48, np.nan, np.nan, 58, 60, np.nan]
})

# Apply linear interpolation
df_interpolated = df.interpolate(method='linear')

print(df_interpolated)

Output:

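   temperature   humidity
0         20.0  45.000000
1         22.0  48.000000
2         24.0  51.333333
3         26.0  54.666667
4         28.0  58.000000
5         30.0  60.000000
6         32.0  60.000000

The two missing humidity values split the gap between 48 and 58 into thirds (51.33 and 54.67). The trailing humidity value becomes 60.0 because, with the default limit_direction='forward', linear interpolation repeats the last valid value instead of extrapolating the trend. To leave it missing, pass limit_area='inside'; to extrapolate the trend, a SciPy-backed method such as method='slinear' can be given fill_value='extrapolate'.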
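A quick way to see how the trailing edge is treated is to compare the default call with limit_area='inside', which restricts filling to NaNs that have valid values on both sides. The sketch below reuses the humidity column; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

humidity = pd.Series([45, 48, np.nan, np.nan, 58, 60, np.nan])

# Default: interior gaps are interpolated and the trailing NaN
# repeats the last valid value
default = humidity.interpolate(method='linear')

# limit_area='inside': only NaNs with valid values on both sides are filled
inside_only = humidity.interpolate(method='linear', limit_area='inside')

print(pd.DataFrame({'default': default, 'inside_only': inside_only}))
```

Keeping the trailing value missing is often the safer choice when a repeated last observation would be mistaken for a real measurement.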

Time-Based Interpolation for Irregular Time Series

When working with time-series data where observations aren’t evenly spaced, use method='time' to weight interpolated values by temporal distance.

# Create time series with irregular intervals
dates = pd.to_datetime(['2024-01-01', '2024-01-03', '2024-01-04', 
                        '2024-01-10', '2024-01-15'])
values = [100, np.nan, 120, np.nan, 150]

ts = pd.Series(values, index=dates)

# Linear interpolation (ignores time gaps)
linear_interp = ts.interpolate(method='linear')

# Time-based interpolation (considers time gaps)
time_interp = ts.interpolate(method='time')

print("Linear interpolation:")
print(linear_interp)
print("\nTime-based interpolation:")
print(time_interp)

Output:

Linear interpolation:
2024-01-01    100.0
2024-01-03    110.0
2024-01-04    120.0
2024-01-10    135.0
2024-01-15    150.0

Time-based interpolation:
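2024-01-01    100.000000
2024-01-03    113.333333
2024-01-04    120.000000
2024-01-10    136.363636
2024-01-15    150.000000

The time-based method weights each estimate by elapsed time rather than row position. Jan 3 lies two-thirds of the way from Jan 1 to Jan 4, so it gets 113.33 rather than the midpoint 110. Jan 10 lies 6 days into the 11-day gap between Jan 4 and Jan 15, so it gets 136.36 rather than 135.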
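The time-weighted estimate can be checked by hand: take the fraction of the gap that has elapsed at the missing timestamp and apply it to the change in value. The sketch below does this for Jan 3, with illustrative variable names.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(['2024-01-01', '2024-01-03', '2024-01-04',
                        '2024-01-10', '2024-01-15'])
ts = pd.Series([100, np.nan, 120, np.nan, 150], index=dates)

# Known neighbours of the missing Jan 3 observation
t0, t1 = pd.Timestamp('2024-01-01'), pd.Timestamp('2024-01-04')
target = pd.Timestamp('2024-01-03')

# Fraction of the gap that has elapsed at the target date (2 of 3 days)
fraction = (target - t0) / (t1 - t0)
manual = ts[t0] + fraction * (ts[t1] - ts[t0])

print(round(manual, 4))                                  # 113.3333
print(round(ts.interpolate(method='time')[target], 4))   # 113.3333
```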

Polynomial and Spline Interpolation for Non-Linear Patterns

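For data with curved patterns, polynomial or spline interpolation provides better fits than linear methods. Both require SciPy and an explicit order argument, and both use the numeric index values as x-coordinates.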

# Simulate curved growth pattern
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([0, 1, 4, np.nan, np.nan, 25, np.nan, 49, 64, np.nan, 100])

series = pd.Series(y, index=x)

# Different interpolation methods
linear = series.interpolate(method='linear')
poly = series.interpolate(method='polynomial', order=2)
spline = series.interpolate(method='spline', order=2)

comparison = pd.DataFrame({
    'original': series,
    'linear': linear,
    'polynomial': poly,
    'spline': spline
})

print(comparison)

Output (rounded to one decimal place):

    original  linear  polynomial  spline
0        0.0     0.0         0.0     0.0
1        1.0     1.0         1.0     1.0
2        4.0     4.0         4.0     4.0
3        NaN    11.0         9.0     9.0
4        NaN    18.0        16.0    16.0
5       25.0    25.0        25.0    25.0
6        NaN    37.0        36.0    36.0
7       49.0    49.0        49.0    49.0
8       64.0    64.0        64.0    64.0
9        NaN    82.0        81.0    81.0
10     100.0   100.0       100.0   100.0

For quadratic data (y = x²), both polynomial and spline interpolation of order 2 recover the missing values (9, 16, 36, 81) up to floating-point rounding, while linear interpolation overshoots by 1 to 2 units because it cuts straight across the curve.
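As a cross-check that does not go through interpolate(), one can fit a single quadratic to the known points with np.polyfit and evaluate it at the gaps. This is a global least-squares fit, a different technique from the piecewise methods above, and it only makes sense under the assumption that one polynomial describes the whole series.

```python
import numpy as np
import pandas as pd

x = np.arange(11)
y = np.array([0, 1, 4, np.nan, np.nan, 25, np.nan, 49, 64, np.nan, 100])
series = pd.Series(y, index=x)

known = series.notna().to_numpy()

# Fit one degree-2 polynomial to the known points (global fit, not piecewise)
coeffs = np.polyfit(x[known], y[known], deg=2)
fitted = np.polyval(coeffs, x[~known])

# Compare with linear interpolation against the true curve y = x**2
linear = series.interpolate(method='linear').to_numpy()[~known]
truth = x[~known] ** 2

print("quadratic fit max error:", np.abs(fitted - truth).max())
print("linear max error:       ", np.abs(linear - truth).max())
```

The quadratic fit error is at the level of floating-point noise, while the linear error is 2, reached at x = 3 and x = 4.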

Controlling Interpolation Scope with Limit Parameters

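The limit parameter restricts how many consecutive NaN values to fill, preventing excessive interpolation in sparse data. The limit_direction parameter defaults to 'forward', which counts from the last valid value before a gap; 'backward' counts back from the next valid value after it.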

data = pd.Series([10, np.nan, np.nan, np.nan, np.nan, 20, 
                  np.nan, np.nan, 25, np.nan, 30])

# Limit consecutive interpolations
limited = data.interpolate(method='linear', limit=2)

# Control direction
forward_only = data.interpolate(method='linear', limit=2, 
                                 limit_direction='forward')
backward_only = data.interpolate(method='linear', limit=2, 
                                  limit_direction='backward')

comparison = pd.DataFrame({
    'original': data,
    'limit_2': limited,
    'forward': forward_only,
    'backward': backward_only
})

print(comparison)

Output:

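    original    limit_2    forward   backward
0       10.0  10.000000  10.000000  10.000000
1        NaN  12.000000  12.000000        NaN
2        NaN  14.000000  14.000000        NaN
3        NaN        NaN        NaN  16.000000
4        NaN        NaN        NaN  18.000000
5       20.0  20.000000  20.000000  20.000000
6        NaN  21.666667  21.666667  21.666667
7        NaN  23.333333  23.333333  23.333333
8       25.0  25.000000  25.000000  25.000000
9        NaN  27.500000  27.500000  27.500000
10      30.0  30.000000  30.000000  30.000000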
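Note that limit still fills the first values of a long gap. If the intent is to leave long gaps untouched altogether, one option is to interpolate everything and then mask the gaps that exceed a threshold. The helper below, interpolate_short_gaps, is a hypothetical name and not part of pandas; it is a sketch that assumes gap length is measured in rows.

```python
import numpy as np
import pandas as pd

def interpolate_short_gaps(s: pd.Series, max_gap: int) -> pd.Series:
    """Linearly fill interior gaps of at most max_gap rows; leave longer gaps as NaN."""
    is_nan = s.isna()
    # Give each run of consecutive NaN / non-NaN rows its own id
    run_id = is_nan.astype(int).diff().ne(0).cumsum()
    # Number of NaNs in each run (0 for runs of valid values)
    run_len = is_nan.groupby(run_id).transform('sum')
    filled = s.interpolate(method='linear', limit_area='inside')
    # Restore NaN wherever the original gap was longer than allowed
    return filled.mask(is_nan & (run_len > max_gap))

data = pd.Series([10, np.nan, np.nan, np.nan, np.nan, 20,
                  np.nan, np.nan, 25, np.nan, 30])
result = interpolate_short_gaps(data, max_gap=2)
print(result)
```

With max_gap=2, the four-row gap between 10 and 20 stays empty, while the shorter gaps are filled exactly as in the table above.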

DataFrame Interpolation with Axis Control

Apply interpolation across rows or columns in DataFrames using the axis parameter.

df = pd.DataFrame({
    'Jan': [100, 200, np.nan, 400],
    'Feb': [110, np.nan, 310, np.nan],
    'Mar': [np.nan, 230, 330, 450],
    'Apr': [140, 250, np.nan, 480]
}, index=['Product_A', 'Product_B', 'Product_C', 'Product_D'])

# Interpolate across columns (time progression)
time_interp = df.interpolate(method='linear', axis=1)

# Interpolate across rows (product comparison)
product_interp = df.interpolate(method='linear', axis=0)

print("Time-based (across columns):")
print(time_interp)
print("\nProduct-based (across rows):")
print(product_interp)

Output:

Time-based (across columns):
             Jan    Feb    Mar    Apr
Product_A  100.0  110.0  125.0  140.0
Product_B  200.0  215.0  230.0  250.0
Product_C    NaN  310.0  330.0  330.0
Product_D  400.0  425.0  450.0  480.0

Product-based (across rows):
             Jan    Feb    Mar    Apr
Product_A  100.0  110.0    NaN  140.0
Product_B  200.0  210.0  230.0  250.0
Product_C  300.0  310.0  330.0  365.0
Product_D  400.0  310.0  450.0  480.0

In both directions, leading NaNs stay missing (Product_C in Jan, Product_A in Mar), while trailing NaNs repeat the last valid value (Product_C in Apr, Product_D in Feb).

Handling Edge Cases and Missing Data Patterns

Different missing data patterns require different strategies.

# Leading NaNs
leading = pd.Series([np.nan, np.nan, 10, 15, 20])
print("Leading NaNs:", leading.interpolate().values)
print("With backfill:", leading.interpolate().bfill().values)

# Trailing NaNs
trailing = pd.Series([10, 15, 20, np.nan, np.nan])
print("Trailing NaNs:", trailing.interpolate().values)
print("With forward fill:", trailing.interpolate().ffill().values)

# All NaNs in column
all_nan = pd.Series([np.nan, np.nan, np.nan])
print("All NaNs:", all_nan.interpolate().values)

# Single valid value
single = pd.Series([np.nan, 10, np.nan])
print("Single value:", single.interpolate().values)

Output:

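Leading NaNs: [nan nan 10. 15. 20.]
With backfill: [10. 10. 10. 15. 20.]
Trailing NaNs: [10. 15. 20. 20. 20.]
With forward fill: [10. 15. 20. 20. 20.]
All NaNs: [nan nan nan]
Single value: [nan 10. 10.]

With the default limit_direction='forward', linear interpolation leaves leading NaNs untouched but fills trailing NaNs by repeating the last valid value, so the extra ffill() changes nothing here. Chain bfill() or pass limit_direction='both' to cover leading NaNs, and pass limit_area='inside' if trailing NaNs should stay missing. A series with no valid values is returned unchanged.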
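For the edges specifically, limit_direction='both' lets the linear method fill leading NaNs as well, by repeating the nearest valid value. The sketch below also guards against all-NaN input, which no direction setting can fill. The name fill_edges is a hypothetical helper, not a pandas function.

```python
import numpy as np
import pandas as pd

def fill_edges(s: pd.Series) -> pd.Series:
    """Interpolate interior gaps and pad both ends with the nearest valid value."""
    if s.isna().all():
        # Nothing to anchor the interpolation on; return the input unchanged
        return s
    return s.interpolate(method='linear', limit_direction='both')

leading = pd.Series([np.nan, np.nan, 10, 15, 20])
trailing = pd.Series([10, 15, 20, np.nan, np.nan])
all_nan = pd.Series([np.nan, np.nan, np.nan])

print(fill_edges(leading).values)    # [10. 10. 10. 15. 20.]
print(fill_edges(trailing).values)   # [10. 15. 20. 20. 20.]
print(fill_edges(all_nan).values)    # [nan nan nan]
```

Padding the edges with a constant is a design choice: it avoids extrapolating a trend past the observed range, at the cost of flat segments at both ends.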

Performance Considerations for Large Datasets

For large datasets, choose interpolation methods wisely based on computational cost.

import time

# Create a large dataset with a fixed seed so runs are comparable
rng = np.random.default_rng(0)
large_data = pd.Series(rng.standard_normal(1_000_000))
large_data[large_data < 0] = np.nan  # ~50% missing

# All methods except 'linear' require SciPy
methods = ['linear', 'nearest', 'polynomial', 'spline']
timings = {}

for method in methods:
    # polynomial and spline require an explicit order
    kwargs = {'order': 2} if method in ('polynomial', 'spline') else {}
    start = time.perf_counter()  # monotonic, high-resolution timer
    result = large_data.interpolate(method=method, **kwargs)
    timings[method] = time.perf_counter() - start

for method, duration in timings.items():
    print(f"{method}: {duration:.4f} seconds")

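Linear interpolation is typically the cheapest option because it runs on NumPy alone, and nearest is also inexpensive. Polynomial and spline fits go through SciPy and cost noticeably more on large datasets, although exact timings depend on hardware and library versions. Use complex methods only when data patterns justify the computational cost.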
