How to Interpolate Missing Values in Pandas

Key Insights

  • Linear interpolation works well for gradual trends, but polynomial and spline methods handle curved patterns in your data more accurately—choose based on your data’s underlying shape.
  • Time-based interpolation (method='time') respects irregular timestamp spacing, making it essential for sensor data, financial time series, and any dataset where samples aren’t evenly distributed.
  • Always limit interpolation scope with limit and limit_area parameters to avoid filling large gaps with unreliable estimates—validate results visually before trusting interpolated values.

Introduction to Missing Data and Interpolation

Missing values appear in datasets for countless reasons: sensor malfunctions, network timeouts, manual data entry errors, or simply gaps in data collection schedules. When you encounter NaN values in Pandas, you have three main options: drop the rows, fill with a static value (mean, median, zero), or interpolate.

Dropping rows wastes valuable data. Simple filling ignores the relationship between surrounding values. Interpolation estimates missing values based on existing data points, preserving trends and patterns in your dataset.

Interpolation shines when your data has temporal or sequential ordering—time series, sensor readings, or any dataset where adjacent values relate to each other. If you’re working with categorical data or values that don’t follow a predictable pattern, interpolation isn’t the right tool.
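As a quick contrast, here is a minimal sketch (with made-up readings) of how the three options behave on the same small Series:

```python
import pandas as pd
import numpy as np

# Hypothetical readings with a clear upward trend
s = pd.Series([10.0, np.nan, np.nan, 16.0, 18.0])

# Option 1: drop rows -- two data points are lost
dropped = s.dropna()

# Option 2: static fill -- both gaps get the overall mean (~14.67),
# which ignores that the series is rising
mean_filled = s.fillna(s.mean())

# Option 3: interpolate -- gaps follow the local trend: 12.0 and 14.0
interpolated = s.interpolate()

print(interpolated.tolist())  # [10.0, 12.0, 14.0, 16.0, 18.0]
```

Only the third option produces values consistent with the surrounding trend.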

Understanding Pandas Interpolation Basics

The interpolate() method in Pandas estimates missing values by drawing connections between known data points. By default, it uses linear interpolation, which assumes a straight line between adjacent non-null values.

import pandas as pd
import numpy as np

# Create a Series with missing values
data = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0, np.nan, 7.0])
print("Original data:")
print(data)

# Apply linear interpolation
interpolated = data.interpolate()
print("\nAfter interpolation:")
print(interpolated)

Output:

Original data:
0    1.0
1    NaN
2    NaN
3    4.0
4    5.0
5    NaN
6    7.0
dtype: float64

After interpolation:
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    7.0
dtype: float64

The method fills the gap between 1.0 and 4.0 with evenly spaced values (2.0 and 3.0). It treats the index as equally spaced positions, regardless of actual index values.
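To see that position-based behavior, a small sketch contrasting the default with method='index', which does use the actual index values:

```python
import pandas as pd
import numpy as np

# An unevenly spaced numeric index: 0, 1, then a jump to 10
s = pd.Series([0.0, np.nan, 20.0], index=[0, 1, 10])

# Default 'linear' ignores the index and treats positions as equidistant,
# so the single gap lands halfway between 0 and 20
print(s.interpolate(method='linear').tolist())  # [0.0, 10.0, 20.0]

# 'index' weights by actual index distance: label 1 sits 1/10 of the
# way from index 0 to index 10, giving 0 + 0.1 * 20
print(s.interpolate(method='index').tolist())   # [0.0, 2.0, 20.0]
```

If your numeric index carries real meaning (distance, depth, elapsed samples), prefer method='index'.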

Interpolation Methods Explained

Pandas supports multiple interpolation algorithms through the method parameter. Each handles data patterns differently.

Linear assumes constant rate of change between points. Fast and simple, but misses curves.

Polynomial fits a polynomial curve of specified order through all data points. Captures curves but can oscillate wildly with higher orders.

Spline uses piecewise polynomials that connect smoothly at data points. More stable than polynomial for complex patterns.

Time respects actual time differences in datetime indices. Essential for irregularly sampled time series.

Index uses numeric index values for spacing calculations instead of treating all gaps equally.

import pandas as pd
import numpy as np

# Data with a curved pattern (quadratic-ish)
index = [0, 1, 2, 3, 4, 5, 6, 7, 8]
values = [0, np.nan, np.nan, 9, np.nan, np.nan, 36, np.nan, 64]
data = pd.Series(values, index=index)

print("Original:", data.values)

# Linear interpolation
linear = data.interpolate(method='linear')
print("Linear:  ", linear.values)

# Polynomial interpolation (order 2 for quadratic)
poly = data.interpolate(method='polynomial', order=2)
print("Poly(2): ", poly.values.round(1))

# Spline interpolation (order 2)
spline = data.interpolate(method='spline', order=2)
print("Spline:  ", spline.values.round(1))

Output:

Original: [ 0. nan nan  9. nan nan 36. nan 64.]
Linear:   [ 0.  3.  6.  9. 18. 27. 36. 50. 64.]
Poly(2):  [ 0.  1.  4.  9. 16. 25. 36. 49. 64.]
Spline:   [ 0.  1.  4.  9. 16. 25. 36. 49. 64.]

The known values are the squares of their index positions (0² = 0, 3² = 9, 6² = 36, 8² = 64). Linear interpolation misses the curve entirely. Polynomial and spline methods with order 2 correctly identify the quadratic relationship and fill the values accurately.

Use linear for data with constant trends. Switch to polynomial or spline when you see curves (both delegate to SciPy, so it must be installed). Start with order 2 or 3; higher orders often overfit and produce unrealistic oscillations.
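One way to check for overfitting is to test whether interpolated values escape the range of the observed data. This sketch uses made-up step-like data; whether the higher orders overshoot depends on the data, which is exactly why the check is worth running:

```python
import pandas as pd
import numpy as np

# Step-like data: a low plateau jumping to a high plateau
s = pd.Series([1.0, 1.0, np.nan, 1.0, np.nan, 10.0, 10.0, np.nan, 10.0])

lo, hi = s.min(), s.max()
for order in (1, 3, 5):
    filled = s.interpolate(method='polynomial', order=order)
    # An interpolated value outside the observed range is a sign the
    # fitted curve is oscillating rather than tracking the data
    overshoot = bool(((filled < lo) | (filled > hi)).any())
    print(f"order={order}: overshoots observed range: {overshoot}")
```

Order 1 can never overshoot; if a higher order does, drop back down.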

Handling Time Series Data

When working with datetime indices, standard linear interpolation ignores the actual time gaps between observations. A 5-minute gap gets the same treatment as a 5-hour gap. The time method fixes this.

import pandas as pd
import numpy as np

# Sensor readings with irregular timestamps
timestamps = pd.to_datetime([
    '2024-01-01 00:00:00',
    '2024-01-01 00:05:00',  # 5 min gap
    '2024-01-01 00:10:00',
    '2024-01-01 00:40:00',  # 30 min gap
    '2024-01-01 01:00:00',
])

readings = pd.Series(
    [100.0, np.nan, 110.0, np.nan, 150.0],
    index=timestamps
)

print("Original readings:")
print(readings)

# Standard linear interpolation (ignores time gaps)
linear_fill = readings.interpolate(method='linear')
print("\nLinear interpolation:")
print(linear_fill)

# Time-aware interpolation (respects actual time differences)
time_fill = readings.interpolate(method='time')
print("\nTime-based interpolation:")
print(time_fill)

Output:

Original readings:
2024-01-01 00:00:00    100.0
2024-01-01 00:05:00      NaN
2024-01-01 00:10:00    110.0
2024-01-01 00:40:00      NaN
2024-01-01 01:00:00    150.0
dtype: float64

Linear interpolation:
2024-01-01 00:00:00    100.0
2024-01-01 00:05:00    105.0
2024-01-01 00:10:00    110.0
2024-01-01 00:40:00    130.0
2024-01-01 01:00:00    150.0
dtype: float64

Time-based interpolation:
2024-01-01 00:00:00    100.0
2024-01-01 00:05:00    105.0
2024-01-01 00:10:00    110.0
2024-01-01 00:40:00    134.0
2024-01-01 01:00:00    150.0
dtype: float64

Linear interpolation places 130.0 exactly halfway between 110.0 and 150.0. Time-based interpolation calculates that 00:40 is 30 minutes into a 50-minute gap, so it places the value at 60% of the way from 110 to 150, yielding 134.0.

For any time series with irregular sampling, use method='time' (it requires a datetime-like index).
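A common companion pattern is to first resample irregular readings onto a regular grid and then interpolate the empty slots time-proportionally. A minimal sketch, with hypothetical readings:

```python
import pandas as pd
import numpy as np

# Hypothetical irregular sensor readings
idx = pd.to_datetime(['2024-01-01 00:00', '2024-01-01 00:07', '2024-01-01 00:30'])
s = pd.Series([10.0, 12.0, 20.0], index=idx)

# Resample onto a regular 5-minute grid (empty slots become NaN),
# then fill them in proportion to elapsed time
regular = s.resample('5min').mean().interpolate(method='time')
print(regular)
```

The result is an evenly spaced series where, for example, the 00:10 slot sits 5 minutes into the 25-minute gap between the 00:05 and 00:30 bins and is filled accordingly.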

Controlling Interpolation Behavior

Pandas provides parameters to limit where and how much interpolation occurs. This prevents filling large gaps with unreliable estimates.

limit: Maximum number of consecutive NaNs to fill.

limit_direction: Which direction to fill: 'forward' (the default for most methods), 'backward', or 'both'.

limit_area: Restrict filling to 'inside' (between valid values) or 'outside' (before first or after last valid value).

import pandas as pd
import numpy as np

data = pd.Series([1.0, np.nan, np.nan, np.nan, np.nan, 6.0, np.nan, 8.0])
print("Original:")
print(data.values)

# Limit to filling only 2 consecutive NaNs
limited = data.interpolate(limit=2)
print("\nLimit=2:")
print(limited.values)

# Only fill forward from known values
forward = data.interpolate(limit=2, limit_direction='forward')
print("\nLimit=2, forward only:")
print(forward.values)

# Only fill gaps between valid values (not edges)
inside = data.interpolate(limit=2, limit_area='inside')
print("\nLimit=2, inside only:")
print(inside.values)

Output:

Original:
[ 1. nan nan nan nan  6. nan  8.]

Limit=2:
[ 1.  2.  3. nan nan  6.  7.  8.]

Limit=2, forward only:
[ 1.  2.  3. nan nan  6.  7.  8.]

Limit=2, inside only:
[ 1.  2.  3. nan nan  6.  7.  8.]

The 4-NaN gap between 1.0 and 6.0 only gets partially filled because we limited to 2 consecutive fills. The single NaN between 6.0 and 8.0 fills completely since it’s within the limit.

Set limit based on your domain knowledge. If sensor readings shouldn’t be more than 3 samples apart, set limit=3. Larger gaps indicate something went wrong, and interpolation would produce fiction.
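Note that limit still fills the first few positions of an oversized gap before stopping. If you would rather leave too-wide gaps completely empty, you need a small amount of custom logic. The helper below is illustrative, not a pandas API:

```python
import pandas as pd
import numpy as np

def interpolate_small_gaps(s, max_gap):
    """Hypothetical helper: fill only NaN runs of length <= max_gap,
    leaving longer gaps entirely empty."""
    na = s.isna()
    runs = (na != na.shift()).cumsum()                  # label consecutive runs
    run_len = na.groupby(runs).transform('size') * na   # run length; 0 on valid rows
    filled = s.interpolate(limit_area='inside')
    return filled.mask(run_len > max_gap)               # re-blank oversized gaps

data = pd.Series([1.0, np.nan, np.nan, np.nan, np.nan, 6.0, np.nan, 8.0])
print(interpolate_small_gaps(data, max_gap=2).values)
# [ 1. nan nan nan nan  6.  7.  8.]
```

Here the 4-NaN gap stays untouched instead of being half-filled, while the single NaN is still interpolated.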

Interpolating DataFrames (Multi-column)

When working with DataFrames, interpolate() applies to all numeric columns by default. You can control the axis and handle columns differently when needed.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'temperature': [20.0, np.nan, 22.0, np.nan, 24.0],
    'humidity': [45.0, 50.0, np.nan, np.nan, 60.0],
    'pressure': [1013.0, np.nan, np.nan, np.nan, 1015.0]
})

print("Original DataFrame:")
print(df)

# Interpolate all columns with same method
interpolated = df.interpolate(method='linear')
print("\nAll columns interpolated:")
print(interpolated)

# Different strategies per column
df_custom = df.copy()
df_custom['temperature'] = df['temperature'].interpolate(method='linear')
df_custom['humidity'] = df['humidity'].interpolate(method='linear', limit=1)
df_custom['pressure'] = df['pressure'].interpolate(method='polynomial', order=1)

print("\nCustom interpolation per column:")
print(df_custom)

Output:

Original DataFrame:
   temperature  humidity  pressure
0         20.0      45.0    1013.0
1          NaN      50.0       NaN
2         22.0       NaN       NaN
3          NaN       NaN       NaN
4         24.0      60.0    1015.0

All columns interpolated:
   temperature   humidity  pressure
0         20.0  45.000000    1013.0
1         21.0  50.000000    1013.5
2         22.0  53.333333    1014.0
3         23.0  56.666667    1014.5
4         24.0  60.000000    1015.0

Custom interpolation per column:
   temperature   humidity  pressure
0         20.0  45.000000    1013.0
1         21.0  50.000000    1013.5
2         22.0  53.333333    1014.0
3         23.0        NaN    1014.5
4         24.0  60.000000    1015.0

Notice humidity has a NaN remaining in the custom version because we limited interpolation to 1 consecutive value, and there was a 2-NaN gap.

Best Practices and Pitfalls

Never extrapolate blindly. With default settings, interpolate() leaves leading NaNs untouched (there is no earlier point to draw from), but it pads trailing NaNs with the last valid value because limit_direction defaults to 'forward'. Pass limit_area='inside' if you only want estimates between known points—anything beyond the edges of your data is a guess.

Watch for categorical data. Interpolation is meaningless for categories. A value between “red” and “blue” doesn’t exist. Filter to numeric columns before interpolating.
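A simple way to do that filtering is select_dtypes, sketched here with made-up columns:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'status': ['ok', None, 'ok', 'fail'],    # text column: leave as-is
    'reading': [1.0, np.nan, 3.0, np.nan],   # numeric: safe to interpolate
    'count': [10, 12, np.nan, 18],           # numeric: safe to interpolate
})

# Interpolate only the numeric columns; text columns stay untouched
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].interpolate()
print(df)
```

The missing 'status' entry remains None, while the numeric gaps are filled.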

Validate visually. Always plot original versus interpolated data to catch unrealistic estimates.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create data with intentional pattern
np.random.seed(42)
x = np.linspace(0, 10, 50)
y = np.sin(x) + np.random.normal(0, 0.1, 50)

# Introduce gaps
y_with_gaps = y.copy()
y_with_gaps[15:20] = np.nan
y_with_gaps[35:40] = np.nan

series = pd.Series(y_with_gaps, index=x)
interpolated = series.interpolate(method='spline', order=3)

# Plot comparison
fig, ax = plt.subplots(figsize=(10, 4))
ax.scatter(x, y_with_gaps, label='Original (with gaps)', alpha=0.7, s=30)
ax.plot(x, interpolated, label='Interpolated', color='red', linewidth=1.5)
ax.legend()
ax.set_xlabel('X')
ax.set_ylabel('Value')
ax.set_title('Interpolation Validation')
plt.tight_layout()
plt.savefig('interpolation_validation.png', dpi=100)
plt.show()

This visualization immediately reveals whether your chosen method captures the underlying pattern or produces artifacts.

Final recommendations: Start with linear interpolation. Move to spline or polynomial only when linear clearly fails. Always set a reasonable limit. Validate with plots before using interpolated data in analysis. Document which values are real and which are estimated—downstream users need to know.
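That last point is cheap to implement: capture the NaN mask before filling, so estimated values remain distinguishable downstream. A minimal sketch:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

# Capture the mask BEFORE filling, so estimates stay identifiable
was_estimated = s.isna()
filled = s.interpolate()

audit = pd.DataFrame({'value': filled, 'estimated': was_estimated})
print(audit)
```

The 'estimated' column travels with the data, so anyone analyzing it later knows which values are real observations and which are interpolated.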
