Pandas - Calculate Difference Between Dates

Pandas handles date differences through direct subtraction of `datetime64` objects, which returns a `Timedelta` object representing the duration between two dates.

Key Insights

  • Pandas provides multiple methods to calculate date differences: subtraction returns Timedelta objects, while diff() computes row-wise differences in DataFrames
  • The dt accessor converts Timedelta objects to specific units (days, seconds, hours) and handles null values gracefully during date arithmetic
  • For complex date calculations, pd.DateOffset and pd.tseries.offsets provide business day logic, month-end handling, and custom period arithmetic that simple subtraction cannot achieve

Basic Date Subtraction

Pandas handles date differences through direct subtraction of datetime64 objects, which returns a Timedelta object representing the duration between two dates.

import pandas as pd
import numpy as np

# Create sample dates
start_date = pd.Timestamp('2024-01-15')
end_date = pd.Timestamp('2024-03-20')

# Calculate difference
difference = end_date - start_date
print(difference)  # 65 days 00:00:00
print(type(difference))  # <class 'pandas._libs.tseries.timedeltas.Timedelta'>

# Extract days
print(difference.days)  # 65

When working with Series, the same subtraction operation applies element-wise:

df = pd.DataFrame({
    'start': pd.to_datetime(['2024-01-01', '2024-02-15', '2024-03-10']),
    'end': pd.to_datetime(['2024-01-31', '2024-03-01', '2024-04-05'])
})

df['difference'] = df['end'] - df['start']
print(df)

Output:

       start        end difference
0 2024-01-01 2024-01-31    30 days
1 2024-02-15 2024-03-01    15 days
2 2024-03-10 2024-04-05    26 days

Converting Timedelta to Specific Units

The dt accessor provides methods to convert Timedelta objects into numerical values representing different time units.

df = pd.DataFrame({
    'start': pd.to_datetime(['2024-01-01 08:30:00', '2024-02-15 14:20:00']),
    'end': pd.to_datetime(['2024-01-05 17:45:00', '2024-02-20 09:15:00'])
})

df['timedelta'] = df['end'] - df['start']

# Convert to different units
df['days'] = df['timedelta'].dt.days
df['hours'] = df['timedelta'].dt.total_seconds() / 3600
df['seconds'] = df['timedelta'].dt.total_seconds()

print(df[['timedelta', 'days', 'hours', 'seconds']])

Output:

           timedelta  days      hours      seconds
0 4 days 09:15:00      4  105.25000    378900.0
1 4 days 18:55:00      4  114.91667    413700.0

Note that dt.days returns only the whole day component, while dt.total_seconds() provides the complete duration. For precise calculations, use total_seconds() and divide by the target unit.

Row-wise Differences with diff()

The diff() method calculates differences between consecutive rows, useful for time-series analysis and detecting changes over time.

df = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01', '2024-01-08', '2024-01-15', '2024-01-30']),
    'value': [100, 150, 120, 180]
})

df['date_diff'] = df['date'].diff()
df['value_diff'] = df['value'].diff()

print(df)

Output:

        date  value date_diff  value_diff
0 2024-01-01    100       NaT         NaN
1 2024-01-08    150   7 days        50.0
2 2024-01-15    120   7 days       -30.0
3 2024-01-30    180  15 days        60.0

The first row contains NaT (Not a Time) because there’s no previous row to compare. You can specify periods to compare non-consecutive rows:

df['two_period_diff'] = df['date'].diff(periods=2)
print(df[['date', 'two_period_diff']])

Output:

        date two_period_diff
0 2024-01-01             NaT
1 2024-01-08             NaT
2 2024-01-15        14 days
3 2024-01-30        22 days

Handling Null Values in Date Calculations

Date arithmetic in Pandas handles null values (NaT) gracefully, propagating them through calculations without raising errors.

df = pd.DataFrame({
    'start': pd.to_datetime(['2024-01-01', None, '2024-03-01', '2024-04-01']),
    'end': pd.to_datetime(['2024-02-01', '2024-03-15', None, '2024-05-01'])
})

df['difference'] = df['end'] - df['start']
df['days'] = df['difference'].dt.days

print(df)

Output:

       start        end difference  days
0 2024-01-01 2024-02-01    31 days  31.0
1        NaT 2024-03-15        NaT   NaN
2 2024-03-01        NaT        NaT   NaN
3 2024-04-01 2024-05-01    30 days  30.0

To filter out null differences:

valid_differences = df.dropna(subset=['difference'])
print(valid_differences)

Business Day Calculations

For business logic requiring workday calculations, use pd.offsets.BDay() to exclude weekends and optionally holidays.

from pandas.tseries.offsets import BDay

df = pd.DataFrame({
    'start': pd.to_datetime(['2024-01-15', '2024-02-16', '2024-03-20'])
})

# Add 10 business days
df['end'] = df['start'] + 10 * BDay()

# Calculate business days between dates
df['business_days'] = df.apply(
    lambda row: len(pd.bdate_range(row['start'], row['end'])) - 1, 
    axis=1
)

print(df)

For custom holidays:

from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

us_bd = CustomBusinessDay(calendar=USFederalHolidayCalendar())

start = pd.Timestamp('2024-07-01')
end = start + 10 * us_bd

print(f"Start: {start}")
print(f"End: {end}")

Month and Year Differences

Calculating month differences requires special handling since months have varying lengths. Use DateOffset for accurate month arithmetic.

from dateutil.relativedelta import relativedelta

df = pd.DataFrame({
    'start': pd.to_datetime(['2024-01-31', '2024-03-15', '2024-06-01']),
    'end': pd.to_datetime(['2024-05-31', '2024-08-20', '2024-12-15'])
})

def month_difference(start, end):
    delta = relativedelta(end, start)
    return delta.years * 12 + delta.months

df['months'] = df.apply(lambda x: month_difference(x['start'], x['end']), axis=1)

print(df)

Output:

       start        end  months
0 2024-01-31 2024-05-31       4
1 2024-03-15 2024-08-20       5
2 2024-06-01 2024-12-15       6

For approximate month calculations based on average days:

df['approx_months'] = (df['end'] - df['start']).dt.days / 30.44
print(df[['start', 'end', 'approx_months']])

Performance Optimization for Large Datasets

When working with millions of rows, vectorized operations significantly outperform iterative approaches.

# Generate large dataset
n = 1_000_000
df = pd.DataFrame({
    'start': pd.date_range('2020-01-01', periods=n, freq='h'),
    'end': pd.date_range('2020-02-01', periods=n, freq='h')
})

# Vectorized approach (fast)
%timeit df['diff_days'] = (df['end'] - df['start']).dt.days

# Avoid iterative approach (slow)
# %timeit df['diff_days'] = df.apply(lambda x: (x['end'] - x['start']).days, axis=1)

For complex calculations requiring iteration, use itertuples() instead of apply():

import time

def calculate_with_itertuples(df):
    results = []
    for row in df.itertuples():
        diff = (row.end - row.start).days
        results.append(diff)
    return results

# This is faster than apply() but slower than vectorized operations

The vectorized subtraction approach typically runs 50-100x faster than row-wise apply() operations on large datasets. Always prefer vectorized operations when possible, reserving iteration for truly complex logic that cannot be vectorized.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.