Pandas - Calculate Difference Between Dates

Key Insights

Pandas provides multiple methods to calculate date differences: subtraction returns Timedelta objects, while diff() computes row-wise differences in DataFrames
The dt accessor converts Timedelta objects to specific units (days, seconds, hours) and handles null values gracefully during date arithmetic
For complex date calculations, pd.DateOffset and pd.tseries.offsets provide business day logic, month-end handling, and custom period arithmetic that simple subtraction cannot achieve

Basic Date Subtraction

Pandas handles date differences through direct subtraction of datetime64 objects, which returns a Timedelta object representing the duration between two dates.

import pandas as pd
import numpy as np

# Create sample dates
start_date = pd.Timestamp('2024-01-15')
end_date = pd.Timestamp('2024-03-20')

# Calculate difference
difference = end_date - start_date
print(difference)  # 65 days 00:00:00
print(type(difference))  # <class 'pandas._libs.tseries.timedeltas.Timedelta'>

# Extract days
print(difference.days)  # 65

When working with Series, the same subtraction operation applies element-wise:

df = pd.DataFrame({
    'start': pd.to_datetime(['2024-01-01', '2024-02-15', '2024-03-10']),
    'end': pd.to_datetime(['2024-01-31', '2024-03-01', '2024-04-05'])
})

df['difference'] = df['end'] - df['start']
print(df)

Output:

       start        end difference
0 2024-01-01 2024-01-31    30 days
1 2024-02-15 2024-03-01    15 days
2 2024-03-10 2024-04-05    26 days

Converting Timedelta to Specific Units

The dt accessor provides methods to convert Timedelta objects into numerical values representing different time units.

df = pd.DataFrame({
    'start': pd.to_datetime(['2024-01-01 08:30:00', '2024-02-15 14:20:00']),
    'end': pd.to_datetime(['2024-01-05 17:45:00', '2024-02-20 09:15:00'])
})

df['timedelta'] = df['end'] - df['start']

# Convert to different units
df['days'] = df['timedelta'].dt.days
df['hours'] = df['timedelta'].dt.total_seconds() / 3600
df['seconds'] = df['timedelta'].dt.total_seconds()

print(df[['timedelta', 'days', 'hours', 'seconds']])

Output:

           timedelta  days      hours      seconds
0 4 days 09:15:00      4  105.25000    378900.0
1 4 days 18:55:00      4  114.91667    413700.0

Note that dt.days returns only the whole day component, while dt.total_seconds() provides the complete duration. For precise calculations, use total_seconds() and divide by the target unit.

Row-wise Differences with diff()

The diff() method calculates differences between consecutive rows, useful for time-series analysis and detecting changes over time.

df = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01', '2024-01-08', '2024-01-15', '2024-01-30']),
    'value': [100, 150, 120, 180]
})

df['date_diff'] = df['date'].diff()
df['value_diff'] = df['value'].diff()

print(df)

Output:

        date  value date_diff  value_diff
0 2024-01-01    100       NaT         NaN
1 2024-01-08    150   7 days        50.0
2 2024-01-15    120   7 days       -30.0
3 2024-01-30    180  15 days        60.0

The first row contains NaT (Not a Time) because there’s no previous row to compare. You can specify periods to compare non-consecutive rows:

df['two_period_diff'] = df['date'].diff(periods=2)
print(df[['date', 'two_period_diff']])

Output:

        date two_period_diff
0 2024-01-01             NaT
1 2024-01-08             NaT
2 2024-01-15        14 days
3 2024-01-30        22 days

Handling Null Values in Date Calculations

Date arithmetic in Pandas handles null values (NaT) gracefully, propagating them through calculations without raising errors.

df = pd.DataFrame({
    'start': pd.to_datetime(['2024-01-01', None, '2024-03-01', '2024-04-01']),
    'end': pd.to_datetime(['2024-02-01', '2024-03-15', None, '2024-05-01'])
})

df['difference'] = df['end'] - df['start']
df['days'] = df['difference'].dt.days

print(df)

Output:

       start        end difference  days
0 2024-01-01 2024-02-01    31 days  31.0
1        NaT 2024-03-15        NaT   NaN
2 2024-03-01        NaT        NaT   NaN
3 2024-04-01 2024-05-01    30 days  30.0

To filter out null differences:

valid_differences = df.dropna(subset=['difference'])
print(valid_differences)

Business Day Calculations

For business logic requiring workday calculations, use pd.offsets.BDay() to exclude weekends and optionally holidays.

from pandas.tseries.offsets import BDay

df = pd.DataFrame({
    'start': pd.to_datetime(['2024-01-15', '2024-02-16', '2024-03-20'])
})

# Add 10 business days
df['end'] = df['start'] + 10 * BDay()

# Calculate business days between dates
df['business_days'] = df.apply(
    lambda row: len(pd.bdate_range(row['start'], row['end'])) - 1, 
    axis=1
)

print(df)

For custom holidays:

from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

us_bd = CustomBusinessDay(calendar=USFederalHolidayCalendar())

start = pd.Timestamp('2024-07-01')
end = start + 10 * us_bd

print(f"Start: {start}")
print(f"End: {end}")

Month and Year Differences

Calculating month differences requires special handling since months have varying lengths. Use DateOffset for accurate month arithmetic.

from dateutil.relativedelta import relativedelta

df = pd.DataFrame({
    'start': pd.to_datetime(['2024-01-31', '2024-03-15', '2024-06-01']),
    'end': pd.to_datetime(['2024-05-31', '2024-08-20', '2024-12-15'])
})

def month_difference(start, end):
    delta = relativedelta(end, start)
    return delta.years * 12 + delta.months

df['months'] = df.apply(lambda x: month_difference(x['start'], x['end']), axis=1)

print(df)

Output:

       start        end  months
0 2024-01-31 2024-05-31       4
1 2024-03-15 2024-08-20       5
2 2024-06-01 2024-12-15       6

For approximate month calculations based on average days:

df['approx_months'] = (df['end'] - df['start']).dt.days / 30.44
print(df[['start', 'end', 'approx_months']])

Performance Optimization for Large Datasets

When working with millions of rows, vectorized operations significantly outperform iterative approaches.

# Generate large dataset
n = 1_000_000
df = pd.DataFrame({
    'start': pd.date_range('2020-01-01', periods=n, freq='h'),
    'end': pd.date_range('2020-02-01', periods=n, freq='h')
})

# Vectorized approach (fast)
%timeit df['diff_days'] = (df['end'] - df['start']).dt.days

# Avoid iterative approach (slow)
# %timeit df['diff_days'] = df.apply(lambda x: (x['end'] - x['start']).days, axis=1)

For complex calculations requiring iteration, use itertuples() instead of apply():

import time

def calculate_with_itertuples(df):
    results = []
    for row in df.itertuples():
        diff = (row.end - row.start).days
        results.append(diff)
    return results

# This is faster than apply() but slower than vectorized operations

The vectorized subtraction approach typically runs 50-100x faster than row-wise apply() operations on large datasets. Always prefer vectorized operations when possible, reserving iteration for truly complex logic that cannot be vectorized.

Basic Date Subtraction

Converting Timedelta to Specific Units

Row-wise Differences with diff()

Handling Null Values in Date Calculations

Business Day Calculations

Month and Year Differences

Performance Optimization for Large Datasets

Liked this? There's more.