How to Resample Time Series in Python
Key Insights
- Resampling transforms time series data between frequencies using pandas' .resample() method, which acts like groupby() but for time-based intervals
- Downsampling aggregates high-frequency data to lower frequencies (hourly to daily) and requires choosing appropriate aggregation functions to avoid losing critical information
- Upsampling expands data to higher frequencies (daily to hourly) and requires interpolation strategies like forward fill, backward fill, or linear interpolation to handle the introduced gaps
Introduction to Time Series Resampling
Time series resampling is the process of converting data from one frequency to another. When you decrease the frequency (hourly to daily), you’re downsampling. When you increase it (daily to hourly), you’re upsampling. This operation is fundamental to time series analysis because real-world data rarely arrives at the exact frequency you need for analysis.
You’ll resample data for several reasons: aggregating high-frequency sensor readings to reduce noise, aligning datasets with different sampling rates before joining them, filling gaps in irregular data, or converting data to match reporting periods. Financial analysts resample tick data to daily prices, IoT engineers aggregate sensor readings to hourly averages, and data scientists resample to create features at consistent intervals for machine learning models.
Setting Up: pandas and Sample Data
pandas provides excellent datetime functionality through its DatetimeIndex. The library’s .resample() method is your primary tool for frequency conversion, functioning similarly to groupby() but specifically designed for time-based operations.
Let’s create sample data representing hourly temperature readings from a sensor:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# Create hourly data for 7 full days
date_range = pd.date_range(start='2024-01-01', end='2024-01-07 23:00', freq='H')
np.random.seed(42)
# Simulate temperature readings with some noise and daily pattern
hours = np.arange(len(date_range))
temperature = 20 + 5 * np.sin(hours * 2 * np.pi / 24) + np.random.normal(0, 1, len(date_range))
df = pd.DataFrame({
    'temperature': temperature,
    'humidity': np.random.uniform(40, 80, len(date_range))
}, index=date_range)
print(df.head(10))
This creates a DataFrame with a DatetimeIndex, which is essential for resampling operations. Without a datetime index, .resample() won’t work.
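If your timestamps arrive as an ordinary column instead, as they often do when reading from CSV, convert them with pd.to_datetime() and promote them to the index first. A minimal sketch; the column name 'timestamp' and the values are illustrative:

```python
import pandas as pd

# Hypothetical raw data with string timestamps in a plain column
raw = pd.DataFrame({
    'timestamp': ['2024-01-01 00:00', '2024-01-01 01:00', '2024-01-01 02:00'],
    'temperature': [19.8, 20.4, 21.1]
})

# Parse the strings, then promote the column to a DatetimeIndex
raw['timestamp'] = pd.to_datetime(raw['timestamp'])
df_indexed = raw.set_index('timestamp')

# .resample() now works
print(df_indexed.resample('2H').mean())
```

When loading with pd.read_csv, the parse_dates and index_col arguments accomplish the same thing in one step.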
Downsampling: Aggregating to Lower Frequencies
Downsampling reduces data frequency through aggregation. The key decision is choosing the right aggregation function—mean, sum, max, min, or custom logic—based on what the data represents.
Convert hourly temperature data to daily averages:
# Daily average temperature
daily_avg = df['temperature'].resample('D').mean()
print(daily_avg)
# Daily maximum temperature
daily_max = df['temperature'].resample('D').max()
print(daily_max)
# Daily sum (useful for cumulative metrics like rainfall)
daily_sum = df['temperature'].resample('D').sum()
print(daily_sum)
For multiple aggregations simultaneously, use .agg():
# Multiple statistics per day
daily_stats = df['temperature'].resample('D').agg(['mean', 'min', 'max', 'std'])
print(daily_stats)
# Different aggregations for different columns
multi_agg = df.resample('D').agg({
    'temperature': ['mean', 'max', 'min'],
    'humidity': ['mean', 'first', 'last']
})
print(multi_agg)
You can also apply custom functions:
# Custom aggregation: range (max - min)
def temp_range(x):
    return x.max() - x.min()
daily_range = df['temperature'].resample('D').apply(temp_range)
print(daily_range)
# Count non-null values per period
daily_count = df['temperature'].resample('D').count()
print(daily_count)
Upsampling: Interpolating to Higher Frequencies
Upsampling increases frequency, creating gaps that must be filled. pandas offers several strategies for handling these missing values.
Start with daily data and upsample to hourly:
# Create daily data
daily_data = pd.DataFrame({
    'value': [100, 105, 103, 108, 110]
}, index=pd.date_range('2024-01-01', periods=5, freq='D'))
# Upsample to hourly - creates NaN values
hourly_upsampled = daily_data.resample('H').asfreq()
print(hourly_upsampled.head(25))
Forward fill carries the last valid observation forward:
# Forward fill - repeat last known value
hourly_ffill = daily_data.resample('H').ffill()
print(hourly_ffill.head(25))
Backward fill uses the next valid observation:
# Backward fill - use next known value
hourly_bfill = daily_data.resample('H').bfill()
print(hourly_bfill.head(25))
Linear interpolation creates smooth transitions between known points:
# Linear interpolation
hourly_interp = daily_data.resample('H').asfreq()
hourly_interp = hourly_interp.interpolate(method='linear')
print(hourly_interp.head(25))
# Compare different interpolation methods
comparison = pd.DataFrame({
    'original': daily_data.resample('H').asfreq()['value'],
    'ffill': daily_data.resample('H').ffill()['value'],
    'linear': daily_data.resample('H').asfreq().interpolate(method='linear')['value'],
    'cubic': daily_data.resample('H').asfreq().interpolate(method='cubic')['value']
})
print(comparison.head(25))
Choose your interpolation method based on data characteristics. Linear works well for gradually changing metrics, forward fill suits discrete states, and cubic spline creates smoother curves for naturally smooth phenomena.
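One more distinction worth knowing: when the index is irregularly spaced, method='linear' treats rows as equally spaced, while method='time' weights by the actual elapsed time between observations. A small sketch with made-up values:

```python
import pandas as pd

# Irregularly spaced observations: the gaps are 1, 3, and 2 days
s = pd.Series(
    [10.0, None, None, 40.0],
    index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-05', '2024-01-07'])
)

# 'linear' interpolates by row position; 'time' by real elapsed time
print(s.interpolate(method='linear'))  # 10, 20, 30, 40
print(s.interpolate(method='time'))    # 10, 15, 30, 40
```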
Advanced Resampling Techniques
pandas supports numerous frequency aliases and parameters for fine-tuned control.
Business day frequency ('B') bins data by business days. Note that it does not discard weekend observations; they fall into a neighboring business-day bin:
# Create data including weekends
all_days = pd.date_range('2024-01-01', '2024-01-31', freq='D')
business_data = pd.DataFrame({
    'revenue': np.random.uniform(10000, 50000, len(all_days))
}, index=all_days)
# Resample to weekly totals
weekly_business = business_data.resample('W').sum()
print(weekly_business)
# Business day frequency (weekend rows are folded into a
# neighboring business-day bin rather than dropped)
business_daily = business_data.resample('B').sum()
print(business_daily)
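It's worth checking what actually happens to weekend rows under 'B': with the default left-closed bins they land in the preceding Friday's bin rather than being dropped. A quick sanity check with made-up values:

```python
import pandas as pd

# Fri 2024-01-05 through Mon 2024-01-08 (Sat/Sun in the middle)
idx = pd.date_range('2024-01-05', '2024-01-08', freq='D')
s = pd.Series([1.0, 1.0, 1.0, 1.0], index=idx)

# Friday's bin absorbs Saturday and Sunday
print(s.resample('B').sum())
```

If you genuinely want to exclude weekends, filter them out first (for example with `s[s.index.dayofweek < 5]`) and then resample.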
Custom frequency strings provide flexibility:
# 15-minute intervals
minute_data = df.resample('15min').mean()
# 2-hour intervals
two_hour = df.resample('2H').mean()
# Quarterly aggregation
quarterly = df.resample('Q').mean()
# Monthly, labeled by month end
monthly_end = df.resample('M').mean()
# Monthly, labeled by month start
monthly_start = df.resample('MS').mean()
print(f"15-min shape: {minute_data.shape}")
print(f"2-hour shape: {two_hour.shape}")
print(f"Quarterly shape: {quarterly.shape}")
Control bin labeling and edge handling with label and closed parameters:
# Label bins with the right edge
right_label = df['temperature'].resample('D', label='right').mean()
# Label bins with the left edge (the default for daily frequency)
left_label = df['temperature'].resample('D', label='left').mean()
# Control which bin edge is closed (inclusive)
closed_right = df['temperature'].resample('D', closed='right').mean()
closed_left = df['temperature'].resample('D', closed='left').mean()
print("Right label:\n", right_label.head())
print("\nLeft label:\n", left_label.head())
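Beyond label and closed, pandas 1.1+ also accepts origin and offset arguments that shift where the bin edges fall. A sketch that starts each daily bin at 09:00 instead of midnight; the 48-hour series is made up for illustration:

```python
import pandas as pd
import numpy as np

idx = pd.date_range('2024-01-01 00:00', periods=48, freq='H')
s = pd.Series(np.arange(48.0), index=idx)

# Shift the daily bin edges by 9 hours: each "day" runs 09:00-09:00
shifted = s.resample('D', offset='9h').mean()
print(shifted)
```

This is handy for business hours, trading sessions, or any day that does not start at midnight.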
Practical Example: Real-World Application
Here’s a complete workflow processing financial OHLC (Open, High, Low, Close) data:
# Simulate minute-level stock price data
minute_range = pd.date_range('2024-01-01 09:30', '2024-01-01 16:00', freq='1min')
base_price = 100
prices = []
current_price = base_price
for _ in range(len(minute_range)):
    current_price += np.random.normal(0, 0.5)
    prices.append(current_price)
minute_data = pd.DataFrame({
    'price': prices,
    'volume': np.random.randint(100, 10000, len(minute_range))
}, index=minute_range)
# Resample to 15-minute OHLC bars
ohlc_15min = minute_data['price'].resample('15min').ohlc()
volume_15min = minute_data['volume'].resample('15min').sum()
# Combine into single DataFrame
bars_15min = ohlc_15min.copy()
bars_15min['volume'] = volume_15min
print(bars_15min.head())
# Calculate VWAP (Volume Weighted Average Price) for each period.
# Note: DataFrame.resample().apply() aggregates column by column, so a
# custom function that needs both 'price' and 'volume' never sees the
# full frame. Compute VWAP with vectorized operations instead:
price_volume = minute_data['price'] * minute_data['volume']
vwap_15min = price_volume.resample('15min').sum() / minute_data['volume'].resample('15min').sum()
bars_15min['vwap'] = vwap_15min
print("\nFinal 15-minute bars with VWAP:")
print(bars_15min.head())
# Resample to hourly for broader analysis
hourly_bars = minute_data['price'].resample('H').ohlc()
hourly_bars['volume'] = minute_data['volume'].resample('H').sum()
hourly_bars['vwap'] = price_volume.resample('H').sum() / minute_data['volume'].resample('H').sum()
print("\nHourly bars:")
print(hourly_bars)
Best Practices and Common Pitfalls
Choose aggregation functions that preserve data meaning. Averaging boolean flags loses information—use max() or any() instead. Summing prices makes no sense—use mean, VWAP, or OHLC. For rate data (requests per second), sum the counts then divide by the period duration.
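The rate-data advice can be sketched as follows; the per-second request counts are made up for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical per-second request counts covering two minutes
idx = pd.date_range('2024-01-01', periods=120, freq='s')
requests = pd.Series(np.ones(120), index=idx)

# Average requests/second per minute: sum the counts, divide by bin length
rate_per_min = requests.resample('1min').sum() / 60
print(rate_per_min)  # 1.0 for both minutes
```

Taking mean() of the per-second counts happens to give the same answer here, but only because every second has a reading; with gaps, sum-then-divide and mean() diverge.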
Always verify your datetime index is properly sorted and has the correct dtype:
# Check index type
print(df.index.dtype) # Should be datetime64[ns]
# Sort if needed
df = df.sort_index()
# Set timezone if needed
df.index = df.index.tz_localize('UTC')
Handle timezone-aware data explicitly. Mixing timezone-naive and timezone-aware data causes errors:
# Make timezone-aware
df_utc = df.tz_localize('UTC')
# Convert between timezones
df_est = df_utc.tz_convert('America/New_York')
# Resample works normally with timezone-aware data
daily_est = df_est.resample('D').mean()
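Keep in mind that bin edges are computed in the index's local time, so converting timezones before resampling changes which daily bins the observations fall into. A small illustration; the 48-hour series is made up:

```python
import pandas as pd
import numpy as np

idx = pd.date_range('2024-01-01', periods=48, freq='H', tz='UTC')
s = pd.Series(np.arange(48.0), index=idx)

daily_utc = s.resample('D').mean()                                # 2 bins
daily_ny = s.tz_convert('America/New_York').resample('D').mean()  # 3 bins
print(daily_utc)
print(daily_ny)
```

The same 48 hours span two UTC calendar days but three New York calendar days, so the two results are not comparable row for row.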
For large datasets, consider performance implications. Resampling is generally efficient, but complex custom aggregations can be slow. Use vectorized operations when possible:
# Slower: complex custom function
def slow_agg(x):
    return x.apply(lambda y: y ** 2).mean()
# Faster: vectorized operations
def fast_agg(x):
    return (x ** 2).mean()
Watch for data loss during downsampling. If you resample from seconds to hours using mean(), you lose information about volatility. Consider storing multiple statistics (mean, std, min, max) or using a frequency that preserves important patterns.
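One way to catch this kind of silent loss is to keep a per-bin count alongside the aggregate and flag incomplete periods. A sketch with a simulated sensor outage:

```python
import pandas as pd
import numpy as np

idx = pd.date_range('2024-01-01', periods=30, freq='H')
s = pd.Series(np.random.randn(30), index=idx)
s = s.drop(s.index[5:10])  # simulate a 5-hour sensor outage

daily = s.resample('D').agg(['mean', 'count'])
# A full day should contain 24 hourly readings
daily['complete'] = daily['count'] == 24
print(daily)
```

Downstream code can then ignore or down-weight the incomplete days instead of treating their means as equally trustworthy.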
Finally, remember that resampling changes your data’s statistical properties. Downsampling reduces noise but can hide important short-term patterns. Upsampling with interpolation creates artificial data points that shouldn’t be treated as real observations. Always document your resampling choices and their implications for downstream analysis.