# How to Resample Time Series Data in Pandas
## Key Insights

- Resampling transforms time series data between frequencies—downsampling aggregates data (minutes to hours), while upsampling interpolates to fill gaps (monthly to daily)
- The `.resample()` method requires a DatetimeIndex and returns a Resampler object that works like groupby, letting you chain aggregation methods like `.mean()`, `.sum()`, or `.ohlc()`
- Use the `closed` and `label` parameters to control which edge of each time bin includes data points and how the resulting index is labeled—getting these wrong causes subtle off-by-one errors
## Introduction to Time Series Resampling
Resampling is the process of changing the frequency of your time series data. If you have stock prices recorded every minute and need daily summaries, that’s downsampling. If you have monthly revenue figures and need to plot them alongside daily metrics, that’s upsampling.
Pandas makes both operations straightforward with the .resample() method, but the details matter. Choosing the wrong aggregation function or misunderstanding bin edges can introduce bugs that are hard to catch.
This guide covers the practical mechanics of resampling: how to aggregate high-frequency data into meaningful summaries, how to fill gaps when increasing frequency, and how to handle the edge cases that trip up most developers.
Before diving in, your DataFrame needs a DatetimeIndex. If your dates are in a regular column, you’ll need to convert and set them as the index first.
## Setting Up Time Series Data
Let’s create realistic sample data to work with. We’ll generate minute-level stock price data that simulates a trading day:
```python
import pandas as pd
import numpy as np

# Create minute-level data for a trading week
np.random.seed(42)
dates = pd.date_range(
    start='2024-01-08 09:30',
    end='2024-01-12 16:00',
    freq='min'
)

# Filter to market hours only (9:30 AM - 4:00 PM)
market_hours = (
    ((dates.hour == 9) & (dates.minute >= 30)) |
    ((dates.hour >= 10) & (dates.hour < 16)) |
    ((dates.hour == 16) & (dates.minute == 0))
)
dates = dates[market_hours]

# Simulate price movements with a random walk
price_changes = np.random.randn(len(dates)) * 0.5
prices = 150 + np.cumsum(price_changes)

df = pd.DataFrame({
    'price': prices,
    'volume': np.random.randint(100, 10000, len(dates))
}, index=dates)

print(df.head(10))
```
If your data comes from a CSV or database with dates as strings, convert them first:
```python
# Converting string dates to a DatetimeIndex
raw_df = pd.read_csv('trades.csv')
raw_df['timestamp'] = pd.to_datetime(raw_df['timestamp'])
raw_df = raw_df.set_index('timestamp')
```
The index must be a DatetimeIndex for .resample() to work. Calling .resample() on a DataFrame with an integer or string index raises a TypeError.
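You can see that requirement in action with a minimal sketch; the DataFrame below is invented for illustration and deliberately uses the default integer index:

```python
import pandas as pd

# A DataFrame with a plain RangeIndex rather than a DatetimeIndex
df_no_dates = pd.DataFrame({'value': [1, 2, 3]})

try:
    df_no_dates.resample('D').sum()
except TypeError as err:
    print(f"TypeError: {err}")
```

The error message names the index types `.resample()` accepts, which makes this one of the easier bugs to diagnose.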
## Downsampling: Aggregating to Lower Frequency
Downsampling reduces data frequency by grouping multiple observations into single summary values. The .resample() method takes a frequency string and returns a Resampler object—similar to groupby—that you chain with an aggregation method.
Common frequency strings:
- `'min'` or `'T'` — minute
- `'H'` — hour
- `'D'` — calendar day
- `'W'` — week
- `'M'` — month end
- `'Q'` — quarter end
### Converting Minute Data to Daily OHLC Bars
Financial data often requires OHLC (Open, High, Low, Close) aggregation. Pandas has a built-in method for this:
```python
# Create daily OHLC bars from minute data
daily_ohlc = df['price'].resample('D').ohlc()
print(daily_ohlc)
```
Output:
```
                  open        high         low       close
2024-01-08  150.246291  155.097498  148.714171  153.841573
2024-01-09  154.012847  157.234521  151.892347  155.234891
2024-01-10  155.891234  158.234567  152.123456  156.789012
...
```
For volume, you typically want the sum:
```python
# Combine OHLC prices with total volume
daily_bars = df['price'].resample('D').ohlc()
daily_bars['volume'] = df['volume'].resample('D').sum()
print(daily_bars)
```
### Weekly Sales Totals from Daily Data
For business metrics like sales, sum is usually the right aggregation:
```python
# Sample daily sales data
sales_dates = pd.date_range('2024-01-01', periods=90, freq='D')
daily_sales = pd.DataFrame({
    'revenue': np.random.randint(1000, 5000, 90),
    'orders': np.random.randint(10, 100, 90)
}, index=sales_dates)

# Weekly totals
weekly_sales = daily_sales.resample('W').sum()
print(weekly_sales.head())

# Monthly averages
monthly_avg = daily_sales.resample('M').mean()
print(monthly_avg.head())
```
The aggregation method you choose depends entirely on what the data represents. Revenue gets summed; temperatures get averaged; inventory levels might use .last() to get end-of-period snapshots.
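As a sketch of the snapshot case, here is a hypothetical inventory series where `.last()` grabs each week's closing count (the numbers are invented for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical inventory levels: one reading per day for two weeks,
# declining by 10 units each day
inv_dates = pd.date_range('2024-01-01', periods=14, freq='D')
inventory = pd.DataFrame(
    {'units_on_hand': np.arange(500, 360, -10)},
    index=inv_dates
)

# .last() keeps the final observation in each weekly bin --
# an end-of-week snapshot, not a sum or an average
weekly_snapshot = inventory.resample('W').last()
print(weekly_snapshot)
```

Summing inventory levels here would be meaningless; the snapshot is the quantity the business actually cares about at each period boundary.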
## Upsampling: Increasing Data Frequency
Upsampling increases frequency, which creates gaps that need filling. If you have monthly data and resample to daily, Pandas doesn’t know what values should appear on the 2nd through 30th of each month.
```python
# Monthly revenue data
monthly_dates = pd.date_range('2024-01-31', periods=6, freq='M')
monthly_revenue = pd.DataFrame({
    'revenue': [50000, 52000, 48000, 55000, 58000, 54000]
}, index=monthly_dates)

# Upsample to daily - creates NaN values
daily_revenue = monthly_revenue.resample('D').asfreq()
print(daily_revenue.head(35))
```
The result has NaN for every day except month-ends. You have three main options for filling:
```python
# Forward fill: carry last known value forward
daily_ffill = monthly_revenue.resample('D').ffill()

# Backward fill: use next known value
daily_bfill = monthly_revenue.resample('D').bfill()

# Interpolate: linear interpolation between known points
daily_interp = monthly_revenue.resample('D').interpolate(method='linear')

print("Forward fill:")
print(daily_ffill.head(10))
print("\nInterpolated:")
print(daily_interp.head(10))
```
Forward fill is appropriate when values remain constant until updated (like subscription counts). Interpolation makes sense for continuous measurements where gradual change is expected.
## Custom Aggregations and Multiple Columns
Real datasets often need different aggregations for different columns. Use .agg() with a dictionary:
```python
# Different aggregations per column
daily_summary = df.resample('D').agg({
    'price': ['mean', 'std', 'first', 'last'],
    'volume': 'sum'
})
print(daily_summary)
```
This creates a MultiIndex column structure. To flatten it:
```python
daily_summary.columns = ['_'.join(col).strip() for col in daily_summary.columns]
print(daily_summary.columns)
# Index(['price_mean', 'price_std', 'price_first', 'price_last', 'volume_sum'])
```
For custom aggregation logic, pass a function:
```python
# Custom function: calculate intraday range as a percentage
def intraday_range_pct(series):
    return (series.max() - series.min()) / series.mean() * 100

daily_volatility = df['price'].resample('D').apply(intraday_range_pct)
print(daily_volatility)
```
## Handling Edge Cases
### The `closed` and `label` Parameters
These parameters control how time bins are constructed and labeled. They’re critical for getting correct results, especially with financial data.
```python
# Sample data: hourly observations
hourly_data = pd.DataFrame({
    'value': range(24)
}, index=pd.date_range('2024-01-15', periods=24, freq='H'))

# Default: closed='left', label='left'
# Bin [00:00, 06:00) labeled as 00:00
default_resample = hourly_data.resample('6H').sum()

# closed='right': bin (00:00, 06:00]
right_closed = hourly_data.resample('6H', closed='right').sum()

# label='right': label bin with right edge
right_label = hourly_data.resample('6H', label='right').sum()

print("Default (closed='left', label='left'):")
print(default_resample)
print("\nclosed='right':")
print(right_closed)
print("\nlabel='right':")
print(right_label)
```
For daily resampling of financial data, you often want closed='right' so that midnight belongs to the previous day’s bin.
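A minimal sketch of that midnight behavior, using a few made-up hourly points that straddle a day boundary:

```python
import pandas as pd

# Six hourly points straddling midnight: two on Jan 15, four on Jan 16
idx = pd.date_range('2024-01-15 22:00', periods=6, freq='h')
s = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Default (closed='left'): midnight opens a new bin, so the
# 00:00 observation counts toward Jan 16
print(s.resample('D').sum())

# closed='right': the bin is (Jan 15 00:00, Jan 16 00:00], so the
# 00:00 observation counts toward Jan 15 instead
print(s.resample('D', closed='right').sum())
```

With the default edges, Jan 15 sums to 3 (hours 22:00 and 23:00); with `closed='right'`, the midnight value moves into Jan 15's bin, so it sums to 6.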
### Dealing with Missing Periods
If your data has gaps, resampling still creates entries for missing periods:
```python
# Data with a gap (missing Jan 10)
sparse_dates = pd.to_datetime(['2024-01-08', '2024-01-09', '2024-01-11', '2024-01-12'])
sparse_df = pd.DataFrame({'value': [100, 110, 130, 125]}, index=sparse_dates)

# Resampling fills the gap with NaN
complete = sparse_df.resample('D').mean()
print(complete)

# Fill missing periods explicitly
complete_filled = sparse_df.resample('D').mean().ffill()
print(complete_filled)
```
### Timezone-Aware Resampling
Resampling respects timezone information, which matters when aggregating across DST boundaries:
```python
# Timezone-aware data
tz_dates = pd.date_range('2024-03-09', periods=72, freq='H', tz='US/Eastern')
tz_df = pd.DataFrame({'value': range(72)}, index=tz_dates)

# Daily resampling respects the timezone
daily_tz = tz_df.resample('D').sum()
print(daily_tz)
```
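To see why this matters, count how many hourly rows land in each local calendar day across the spring-forward transition; this sketch rebuilds the same US/Eastern data under its own names:

```python
import pandas as pd

# Hourly timestamps spanning the US spring-forward transition
# (2024-03-10, when clocks jump from 01:59 to 03:00)
dst_idx = pd.date_range('2024-03-09', periods=72, freq='h', tz='US/Eastern')
dst_df = pd.DataFrame({'value': range(72)}, index=dst_idx)

# Each bin covers one *local* calendar day, so the DST day
# contains only 23 hourly observations
print(dst_df.resample('D').count())
```

If you resample the same data after dropping the timezone, every day gets a uniform 24 rows and the DST effect silently disappears from your aggregates.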
## Conclusion
Here’s a quick reference for common frequency aliases:
| Alias | Description |
|---|---|
| `min` or `T` | Minute |
| `H` | Hour |
| `D` | Calendar day |
| `B` | Business day |
| `W` | Week (Sunday end) |
| `W-MON` | Week (Monday end) |
| `M` | Month end |
| `MS` | Month start |
| `Q` | Quarter end |
| `Y` | Year end |
When to use .resample() vs .groupby() with Grouper: Use .resample() when your index is already a DatetimeIndex and you’re doing straightforward time-based aggregation. Use pd.Grouper(key='date_column', freq='D') inside .groupby() when you need to group by time and other columns simultaneously, or when your dates aren’t the index.
```python
# Grouper example: aggregate by date AND category
df_with_category = df.copy()
df_with_category['category'] = np.random.choice(['A', 'B'], len(df))
df_with_category = df_with_category.reset_index()

grouped = df_with_category.groupby([
    pd.Grouper(key='index', freq='D'),
    'category'
]).agg({'price': 'mean', 'volume': 'sum'})
```
Resampling is one of those operations that seems simple until you hit edge cases. Test your aggregations on small samples where you can verify the results manually before running on production data.
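One way to follow that advice is a toy series small enough to check by hand, like this invented four-row example:

```python
import pandas as pd

# Four observations whose daily sums are easy to verify manually
s = pd.Series(
    [1, 2, 3, 4],
    index=pd.to_datetime(['2024-01-01 01:00', '2024-01-01 02:00',
                          '2024-01-02 01:00', '2024-01-02 02:00'])
)

daily = s.resample('D').sum()
print(daily)  # Jan 1 should be 1 + 2 = 3, Jan 2 should be 3 + 4 = 7
```

If a tiny case like this doesn't match your hand calculation, revisit the `closed` and `label` settings before blaming the data.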