Pandas - Resample Time Series Data


Key Insights

  • Resampling converts time series data between different frequencies using aggregation (downsampling) or interpolation (upsampling), essential for aligning datasets and analyzing trends at different time scales
  • The resample() method provides a flexible GroupBy-like interface with offset aliases (D, W, ME, QE, YE) and supports custom aggregation functions, label positioning, and closed interval specifications
  • Production systems require careful handling of timezone-aware data, missing values, and business day conventions to avoid subtle bugs in financial and operational analytics

Understanding Resample Mechanics

Resampling reorganizes time series data into new time intervals. Downsampling reduces frequency (hourly to daily), requiring aggregation. Upsampling increases frequency (daily to hourly), requiring interpolation or forward-filling.

import pandas as pd
import numpy as np

# Create sample time series data
dates = pd.date_range('2024-01-01', periods=100, freq='h')  # 'h' replaces the deprecated 'H'
df = pd.DataFrame({
    'timestamp': dates,
    'temperature': np.random.randn(100).cumsum() + 20,
    'humidity': np.random.randint(40, 80, 100),
    'sensor_id': np.random.choice(['A', 'B'], 100)
})
df.set_index('timestamp', inplace=True)

# Downsample to daily averages (numeric_only skips the string sensor_id column,
# which would otherwise raise a TypeError in pandas 2.x)
daily_avg = df.resample('D').mean(numeric_only=True)
print(daily_avg.head())

# Downsample to 6-hour intervals with multiple aggregations
six_hour = df.resample('6h').agg({
    'temperature': ['mean', 'min', 'max'],
    'humidity': 'mean'
})
print(six_hour)

The resample() method returns a Resampler object that behaves like GroupBy. You must call an aggregation method (mean, sum, count) or use agg() for multiple operations.
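To see that GroupBy-like laziness in action, here is a minimal sketch with a toy 12-hour series: nothing is computed until an aggregation method is called, and the Resampler can be iterated bin by bin just like a GroupBy.

```python
import pandas as pd
import numpy as np

idx = pd.date_range('2024-01-01', periods=12, freq='h')
df = pd.DataFrame({'value': np.arange(12)}, index=idx)

# resample() alone does no work; it returns a lazy Resampler object
resampler = df.resample('6h')
print(type(resampler))

# Like a GroupBy, you can iterate over the bins before aggregating
for bin_start, chunk in resampler:
    print(bin_start, len(chunk))

# Calling an aggregation method materializes the result
result = resampler.mean()
print(result)
```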

Offset Aliases and Custom Periods

Pandas provides extensive offset aliases for common time periods. Understanding these prevents hardcoding interval calculations. Note that pandas 2.2 renamed several aliases ('H' to 'h', 'T' to 'min', 'M' to 'ME', and so on); the older spellings still work for now but emit deprecation warnings.

# Common offset aliases (pandas >= 2.2 spellings)
hourly = df.resample('h').mean()      # Hour ('H' before 2.2)
daily = df.resample('D').mean()       # Calendar day
weekly = df.resample('W').sum()       # Week ending Sunday
monthly = df.resample('ME').mean()    # Month end ('M' before 2.2)
quarterly = df.resample('QE').sum()   # Quarter end ('Q' before 2.2)
yearly = df.resample('YE').mean()     # Year end ('Y' before 2.2)

# Business-specific aliases
business_days = df.resample('B').mean()  # Business days (Mon-Fri)
week_start = df.resample('W-MON').sum()  # Week starting Monday

# Custom frequencies
every_90min = df.resample('90min').mean()  # 90 minutes ('T' is the deprecated minute alias)
every_2days = df.resample('2D').sum()      # Every 2 days
every_3hours = df.resample('3h').mean()    # Every 3 hours

# Anchored offsets for business calendars
month_start = df.resample('MS').mean()          # Month start
quarter_start = df.resample('QS').sum()         # Quarter start
business_month_end = df.resample('BME').mean()  # Business month end ('BM' before 2.2)

For financial applications, business day frequencies automatically exclude weekends and can integrate with custom holiday calendars.
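As a sketch of that integration, pandas ships a CustomBusinessDay offset that accepts a holiday calendar; here the built-in USFederalHolidayCalendar is assumed, so New Year's Day 2024 (a Monday) drops out alongside the weekend.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

# Business-day rule that skips weekends AND US federal holidays
us_bday = CustomBusinessDay(calendar=USFederalHolidayCalendar())

# Fri 2023-12-29 through Fri 2024-01-05: the weekend and Jan 1 are excluded
us_bdays = pd.date_range('2023-12-29', '2024-01-05', freq=us_bday)
print(us_bdays)

# Align daily data to the business calendar by reindexing
daily = pd.DataFrame({'value': range(8)},
                     index=pd.date_range('2023-12-29', periods=8, freq='D'))
business_only = daily.reindex(us_bdays)
```

Reindexing against the custom calendar is the piece resample('B') alone cannot do, since 'B' only knows about weekends.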

Label Positioning and Closed Intervals

Control how resampling assigns timestamps and includes boundary values using label and closed parameters.

# Create hourly data spanning exact boundaries
dates = pd.date_range('2024-01-01 00:00', periods=48, freq='h')
df = pd.DataFrame({
    'value': range(48)
}, index=dates)

# Label parameter: which bin edge gets the timestamp
left_label = df.resample('6h', label='left').sum()
right_label = df.resample('6h', label='right').sum()

print("Left label (default):")
print(left_label.head())
print("\nRight label:")
print(right_label.head())

# Closed parameter: which bin edge is inclusive
left_closed = df.resample('6h', closed='left').sum()
right_closed = df.resample('6h', closed='right').sum()

print("\nLeft closed (default):")
print(left_closed.head())
print("\nRight closed:")
print(right_closed.head())

# Combine both for precise control
custom = df.resample('6h', label='right', closed='right').sum()

The label parameter determines the timestamp displayed. The closed parameter determines whether intervals are [start, end) (left-closed) or (start, end] (right-closed). Financial tick data often uses right-closed intervals.
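A minimal sketch makes the boundary rule concrete: a reading stamped exactly at 06:00 lands in a different bin depending on closed.

```python
import pandas as pd

idx = pd.to_datetime(['2024-01-01 00:00', '2024-01-01 03:00', '2024-01-01 06:00'])
s = pd.Series([1, 2, 4], index=idx)

# closed='left' (default): 06:00 opens the new [06:00, 12:00) bin
left = s.resample('6h', closed='left').sum()

# closed='right': 06:00 still belongs to the (00:00, 06:00] bin
right = s.resample('6h', closed='right').sum()

print(left)   # bins sum to 3 and 4
print(right)  # bins sum to 1 and 6
```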

Upsampling with Interpolation

Upsampling creates new timestamps requiring fill strategies. Choose based on data characteristics.

# Daily data to upsample
daily_data = pd.DataFrame({
    'price': [100, 102, 98, 101, 103],
    'volume': [1000, 1200, 900, 1100, 1300]
}, index=pd.date_range('2024-01-01', periods=5, freq='D'))

# Forward fill - carry last value forward
hourly_ffill = daily_data.resample('h').ffill()

# Backward fill - use next value
hourly_bfill = daily_data.resample('h').bfill()

# Linear interpolation - smooth transition
hourly_interp = daily_data.resample('h').interpolate(method='linear')

# Polynomial interpolation for smoother curves (requires SciPy)
hourly_poly = daily_data.resample('h').interpolate(method='polynomial', order=2)

# Limit forward fill to avoid propagating stale data
hourly_limited = daily_data.resample('h').ffill(limit=6)  # Only 6 hours forward

print("Original daily data:")
print(daily_data)
print("\nInterpolated hourly data (first 25 hours):")
print(hourly_interp.head(25))

Forward fill works for categorical data or step-wise changes. Interpolation suits continuous measurements like temperature. Always set limit in production to prevent stale data propagation.
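Before committing to a fill strategy, it helps to see the gaps explicitly: asfreq() upsamples without filling at all. A small sketch (the 6-hour limit is an arbitrary choice for illustration):

```python
import pandas as pd

daily = pd.Series([100, 104], index=pd.date_range('2024-01-01', periods=2, freq='D'))

# asfreq() upsamples without filling, exposing exactly which slots are new
hourly = daily.resample('h').asfreq()
print(hourly.isna().sum())  # 23 empty hourly slots between the two daily points

# ffill with a limit: anything beyond 6 hours stays NaN instead of going stale
filled = daily.resample('h').ffill(limit=6)
print(filled.head(10))
```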

Handling Missing Data and Gaps

Real-world time series contain gaps. Handle them explicitly during resampling.

# Create data with gaps
dates = pd.date_range('2024-01-01', periods=100, freq='h')
# Randomly drop 20% of data
keep_idx = np.random.choice(dates, size=80, replace=False)
df_gaps = pd.DataFrame({
    'value': np.random.randn(80)
}, index=sorted(keep_idx))

# Resample with different strategies
daily_mean = df_gaps.resample('D').mean()  # NaN if no data in period
daily_sum = df_gaps.resample('D').sum()    # 0 if no data (misleading!)
daily_count = df_gaps.resample('D').count()  # Shows data availability

# Better approach: use agg with custom handling
daily_stats = df_gaps.resample('D').agg({
    'value': [
        ('mean', lambda x: x.mean() if len(x) >= 12 else np.nan),  # Require 12 hours
        ('count', 'count'),
        ('coverage', lambda x: len(x) / 24)  # Percentage of day covered
    ]
})

print(daily_stats)

# Fill missing periods explicitly
complete_range = pd.date_range(df_gaps.index.min(), df_gaps.index.max(), freq='h')
df_complete = df_gaps.reindex(complete_range)
daily_with_threshold = df_complete.resample('D').apply(
    lambda x: x.mean() if x.count() >= 18 else np.nan
)

Always track data availability when resampling. A daily mean computed from 2 hours of data is not comparable to one computed from all 24. Set minimum observation thresholds for valid aggregations.
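The threshold logic can also be written without lambdas by resampling once and masking on the per-bin count; a sketch assuming a minimum of 12 observations per day:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
idx = pd.date_range('2024-01-01', periods=48, freq='h')
full = pd.Series(rng.normal(size=48), index=idx)
# Day one keeps only 6 hours of readings; day two is complete
sparse = full.iloc[list(range(6)) + list(range(24, 48))]

r = sparse.resample('D')
mean = r.mean()
count = r.count()

# Keep a day's mean only when enough observations back it
valid_mean = mean.where(count >= 12)
print(pd.DataFrame({'mean': mean, 'count': count, 'valid': valid_mean}))
```

Because mean and count come from the same bins, where() lines them up exactly, and the whole computation stays vectorized.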

Timezone-Aware Resampling

Timezone handling prevents off-by-one errors in global systems.

# Create timezone-aware data
dates_utc = pd.date_range('2024-01-01', periods=72, freq='h', tz='UTC')
df_tz = pd.DataFrame({
    'transactions': np.random.randint(10, 100, 72)
}, index=dates_utc)

# Resample in UTC
daily_utc = df_tz.resample('D').sum()

# Convert to different timezone before resampling
df_ny = df_tz.tz_convert('America/New_York')
daily_ny = df_ny.resample('D').sum()  # Days aligned to NY midnight

print("UTC daily totals:")
print(daily_utc)
print("\nNew York daily totals:")
print(daily_ny)

# Handle daylight saving time transitions
dates_dst = pd.date_range('2024-03-09', periods=72, freq='h', tz='America/New_York')
df_dst = pd.DataFrame({'value': range(72)}, index=dates_dst)

# Resample across DST boundary
daily_dst = df_dst.resample('D').count()
print("\nCounts across DST transition:")
print(daily_dst)  # March 10 has 23 hours due to spring forward

Always specify timezone explicitly. Resampling naive timestamps can produce incorrect results across DST boundaries or when aggregating data from multiple regions.
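A sketch of making naive stamps explicit around the 2024 US spring-forward: tz_localize attaches the timezone, and its nonexistent parameter chooses a policy for wall-clock times the clock skipped.

```python
import pandas as pd

# Naive wall-clock stamps recorded around the 2024-03-10 US spring-forward
naive = pd.DatetimeIndex(['2024-03-10 01:30', '2024-03-10 03:30'])

# Make the timezone explicit; UTC offsets differ across the transition
aware = naive.tz_localize('America/New_York')
print(aware)

# 02:30 never existed on this date; pick a policy instead of crashing
skipped = pd.Timestamp('2024-03-10 02:30').tz_localize(
    'America/New_York', nonexistent='shift_forward')
print(skipped)  # shifted to 03:00 EDT
```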

Advanced Aggregations and OHLC

Financial and sensor data require specialized aggregations beyond mean and sum.

# Stock price data simulation
dates = pd.date_range('2024-01-01 09:30', periods=390, freq='1min')  # Trading day
df_stock = pd.DataFrame({
    'price': 100 + np.random.randn(390).cumsum() * 0.1,
    'volume': np.random.randint(100, 1000, 390)
}, index=dates)

# OHLC (Open, High, Low, Close) for candlestick charts
ohlc_5min = df_stock['price'].resample('5min').ohlc()
volume_5min = df_stock['volume'].resample('5min').sum()

candles = pd.concat([ohlc_5min, volume_5min], axis=1)
print(candles.head())

# Custom aggregations for sensor data
df_sensor = pd.DataFrame({
    'temperature': 20 + np.random.randn(1000).cumsum() * 0.1,
    'pressure': 1013 + np.random.randn(1000).cumsum() * 0.5
}, index=pd.date_range('2024-01-01', periods=1000, freq='min'))

# Complex aggregation with multiple metrics
hourly_stats = df_sensor.resample('h').agg({
    'temperature': [
        'mean',
        'std',
        ('range', lambda x: x.max() - x.min()),
        ('p95', lambda x: x.quantile(0.95))
    ],
    'pressure': ['mean', 'min', 'max']
})

# Multiple derived metrics per period: group with pd.Grouper so the
# function receives each hour's Series, then unstack into columns
def rolling_stats(series):
    return pd.Series({
        'mean': series.mean(),
        'trend': series.iloc[-1] - series.iloc[0] if len(series) > 1 else 0,
        'volatility': series.std()
    })

hourly_rolling = df_sensor['temperature'].groupby(pd.Grouper(freq='h')).apply(rolling_stats).unstack()
print(hourly_rolling.head())

OHLC is built-in for price data. For custom metrics, lambda functions in agg() provide flexibility. To produce multiple columns per group, return a Series from a function applied via groupby(pd.Grouper(...)) and unstack the result; resample().apply() is an alias for aggregate, so it applies callables per column and expects scalar results.
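One custom metric that genuinely needs two columns per bin is the volume-weighted average price (VWAP). A sketch using groupby with pd.Grouper, which hands the function whole DataFrame chunks rather than single columns (the data here is simulated):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(1)
idx = pd.date_range('2024-01-01 09:30', periods=10, freq='1min')
trades = pd.DataFrame({
    'price': 100 + rng.normal(size=10).cumsum() * 0.1,
    'volume': rng.integers(100, 1000, 10),
}, index=idx)

# Per-bin VWAP = sum(price * volume) / sum(volume) over each 5-minute chunk
def vwap(chunk):
    return (chunk['price'] * chunk['volume']).sum() / chunk['volume'].sum()

vwap_5min = trades.groupby(pd.Grouper(freq='5min')).apply(vwap)
print(vwap_5min)
```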

Performance Optimization

Large time series require optimization strategies.

# Generate large dataset
large_dates = pd.date_range('2020-01-01', periods=1000000, freq='min')
df_large = pd.DataFrame({
    'sensor1': np.random.randn(1000000),
    'sensor2': np.random.randn(1000000),
    'sensor3': np.random.randn(1000000)
}, index=large_dates)

# Inefficient: multiple resampling operations
import time
start = time.time()
result1 = df_large['sensor1'].resample('h').mean()
result2 = df_large['sensor2'].resample('h').mean()
result3 = df_large['sensor3'].resample('h').mean()
print(f"Separate resampling: {time.time() - start:.2f}s")

# Efficient: single operation
start = time.time()
result_all = df_large.resample('h').mean()
print(f"Combined resampling: {time.time() - start:.2f}s")

# Use categorical dtype for the grouping column; observed=True skips empty categories
df_large['category'] = pd.Categorical(np.random.choice(['A', 'B', 'C'], 1000000))
grouped_resample = df_large.groupby('category', observed=True)[['sensor1', 'sensor2', 'sensor3']].resample('h').mean()

Resample all columns simultaneously rather than individually. For grouped resampling, combine groupby() with resample() in a single chain. Use categorical dtypes for grouping columns to reduce memory usage.
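As a sketch of that combined chain, pd.Grouper lets you group by a category and a time frequency in a single pass (the column names here are illustrative):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(2)
idx = pd.date_range('2024-01-01', periods=1000, freq='min')
df = pd.DataFrame({
    'category': pd.Categorical(rng.choice(['A', 'B'], 1000)),
    'value': rng.normal(size=1000),
}, index=idx)

# Single pass over the data: group by category and hour together
hourly_by_cat = df.groupby(
    ['category', pd.Grouper(freq='h')], observed=True
)['value'].mean()
print(hourly_by_cat.head())
```

The result carries a two-level index (category, hour), which unstack() can pivot into one column per category if that layout suits downstream code better.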
