How to Use Expanding Window in Pandas

Key Insights

  • Expanding windows calculate cumulative statistics from the first row to the current row, making them ideal for running totals, cumulative averages, and tracking metrics that should incorporate all historical data.
  • Unlike rolling windows that use a fixed-size lookback period, expanding windows grow with each row. Pandas computes the built-in aggregations in a single streaming pass, so memory stays constant; custom functions run through .apply(), by contrast, recompute every window and slow down quadratically on large datasets.
  • The min_periods parameter controls when calculations start producing valid results, letting you avoid misleading statistics from insufficient data points.

Introduction to Expanding Windows

Expanding windows are one of Pandas’ most underutilized features. While most developers reach for rolling windows when they need windowed calculations, expanding windows solve a fundamentally different problem: computing statistics that should include all historical data up to the current point.

The distinction matters. A rolling window with a size of 30 always looks at exactly 30 rows. An expanding window starts with 1 row, then 2, then 3, growing until it encompasses your entire dataset. This makes expanding windows perfect for cumulative metrics where you want the full historical context at each point.

import pandas as pd
import numpy as np

# Create sample data
data = pd.Series([10, 20, 30, 40, 50])

# Rolling window (fixed size of 3)
rolling_mean = data.rolling(window=3).mean()

# Expanding window (grows from start)
expanding_mean = data.expanding().mean()

comparison = pd.DataFrame({
    'value': data,
    'rolling_3': rolling_mean,
    'expanding': expanding_mean
})
print(comparison)

Output:

   value  rolling_3  expanding
0     10        NaN       10.0
1     20        NaN       15.0
2     30       20.0       20.0
3     40       30.0       25.0
4     50       40.0       30.0

Notice how the rolling mean only produces values once it has 3 data points, and each calculation uses exactly 3 values. The expanding mean starts immediately and incorporates all previous values at each step.

Basic Syntax and Setup

The .expanding() method attaches to a Series or DataFrame and returns an Expanding object. You then chain an aggregation function to compute your statistic.

import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10),
    'revenue': [100, 150, 120, 180, 200, 175, 220, 190, 250, 230],
    'orders': [10, 15, 12, 18, 20, 17, 22, 19, 25, 23]
})

# Basic expanding mean
df['cumulative_avg_revenue'] = df['revenue'].expanding().mean()

# The min_periods parameter controls when calculations start
# Default is 1, meaning results start from the first row
df['avg_with_min_3'] = df['revenue'].expanding(min_periods=3).mean()

print(df[['date', 'revenue', 'cumulative_avg_revenue', 'avg_with_min_3']])

The min_periods parameter is the main configuration option for expanding windows, and it’s crucial. It defaults to 1, so results start from the very first row. Setting min_periods=3 means the first two rows return NaN, and calculations only begin once you have at least 3 data points. This prevents misleading statistics from tiny sample sizes.
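
To make the effect concrete, here is a minimal sketch with illustrative values:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])

# With min_periods=3, the first two results are NaN;
# from row 3 onward the cumulative mean kicks in
print(s.expanding(min_periods=3).mean().tolist())
# [nan, nan, 20.0, 25.0]
```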

Common Aggregation Functions

Pandas provides built-in methods for the most common statistical operations. These are optimized C implementations, so always prefer them over custom functions when possible.

import pandas as pd
import numpy as np

# Simulated stock price data
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=20)
prices = 100 + np.cumsum(np.random.randn(20) * 2)

stock = pd.DataFrame({
    'date': dates,
    'price': prices,
    'volume': np.random.randint(1000, 5000, 20)
})

# Apply multiple expanding calculations
stock['cumulative_mean'] = stock['price'].expanding().mean()
stock['cumulative_std'] = stock['price'].expanding().std()
stock['cumulative_min'] = stock['price'].expanding().min()
stock['cumulative_max'] = stock['price'].expanding().max()
stock['cumulative_count'] = stock['price'].expanding().count()
stock['total_volume'] = stock['volume'].expanding().sum()

# Calculate expanding range (max - min)
stock['price_range'] = stock['cumulative_max'] - stock['cumulative_min']

print(stock[['date', 'price', 'cumulative_mean', 'cumulative_std', 'price_range']].head(10))

The var() method computes expanding variance, which is useful for risk analysis. Note that std() and var() use N-1 degrees of freedom by default (sample statistics). Pass ddof=0 if you need population statistics.
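
A quick sanity check of the ddof behavior, using a small made-up series:

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0])

# Sample variance (default, ddof=1): divides squared deviations by N-1
sample_var = s.expanding(min_periods=2).var()

# Population variance (ddof=0): divides by N
pop_var = s.expanding(min_periods=2).var(ddof=0)

# Squared deviations from the mean (25) sum to 500
print(sample_var.iloc[-1])  # 500 / 3, about 166.67
print(pop_var.iloc[-1])     # 500 / 4 = 125.0
```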

Cumulative Statistics Use Cases

Expanding windows shine in business analytics where you need running totals and cumulative performance metrics.

import pandas as pd
import numpy as np

# E-commerce order data
orders = pd.DataFrame({
    'order_date': pd.date_range('2024-01-01', periods=30),
    'order_value': np.random.uniform(25, 200, 30).round(2),
    'items_sold': np.random.randint(1, 10, 30)
})

# Cumulative sales metrics
orders['running_total_revenue'] = orders['order_value'].expanding().sum()
orders['running_avg_order_value'] = orders['order_value'].expanding().mean()
orders['total_items_sold'] = orders['items_sold'].expanding().sum()

# Calculate cumulative average items per order
orders['avg_items_per_order'] = orders['items_sold'].expanding().mean()

# Track volatility with expanding standard deviation
orders['order_value_volatility'] = orders['order_value'].expanding(min_periods=5).std()

# Running order count (with visitor data, you could divide by cumulative visitors for a conversion rate)
orders['orders_to_date'] = orders['order_value'].expanding().count()

print(orders.tail(10))

A practical application is tracking whether current performance is above or below historical average:

# Flag orders above cumulative average
orders['above_avg'] = orders['order_value'] > orders['running_avg_order_value']

# Calculate percentage difference from cumulative average
orders['pct_from_avg'] = (
    (orders['order_value'] - orders['running_avg_order_value']) 
    / orders['running_avg_order_value'] * 100
).round(2)

Custom Aggregations with .apply()

When built-in methods don’t cover your needs, .apply() lets you run custom functions on each expanding window.

import pandas as pd
import numpy as np
from scipy import stats

df = pd.DataFrame({
    'value': [10, 15, 12, 18, 25, 22, 30, 28, 35, 32]
})

# Custom function: expanding 75th percentile
def expanding_percentile_75(x):
    return np.percentile(x, 75)

df['p75'] = df['value'].expanding().apply(expanding_percentile_75, raw=True)

# Custom function: coefficient of variation (std / mean)
def coef_of_variation(x):
    if len(x) < 2:
        return np.nan
    return np.std(x, ddof=1) / np.mean(x)

df['cv'] = df['value'].expanding(min_periods=2).apply(coef_of_variation, raw=True)

# Custom function: skewness
def expanding_skew(x):
    if len(x) < 3:
        return np.nan
    return stats.skew(x)

df['skewness'] = df['value'].expanding(min_periods=3).apply(expanding_skew, raw=True)

print(df)

The raw=True parameter passes a NumPy array instead of a Series to your function, which is significantly faster. Always use raw=True unless you need Series functionality like index access.
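
You can verify what type your function actually receives; a small sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])

# With raw=True the callback sees a NumPy ndarray...
got_array = s.expanding().apply(lambda x: isinstance(x, np.ndarray), raw=True)

# ...with raw=False it sees a pandas Series (slower, but index is available)
got_series = s.expanding().apply(lambda x: isinstance(x, pd.Series), raw=False)

# The boolean results are coerced to floats: 1.0 means True for every window
print(got_array.tolist())   # [1.0, 1.0, 1.0]
print(got_series.tolist())  # [1.0, 1.0, 1.0]
```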

For weighted calculations:

# Expanding weighted average where recent values get higher weights
def weighted_mean(values):
    weights = np.arange(1, len(values) + 1)
    return np.average(values, weights=weights)

df['weighted_expanding_mean'] = df['value'].expanding().apply(weighted_mean, raw=True)

Handling Missing Data

Missing values require careful handling with expanding windows. By default, NaN values are excluded from calculations.

import pandas as pd
import numpy as np

# Data with missing values
df = pd.DataFrame({
    'value': [10, np.nan, 30, 40, np.nan, 60, 70, np.nan, 90, 100]
})

# Default behavior: NaN excluded
df['expanding_mean'] = df['value'].expanding().mean()
df['expanding_count'] = df['value'].expanding().count()

# min_periods affects when results start
df['mean_min3'] = df['value'].expanding(min_periods=3).mean()

print(df)

Output:

   value  expanding_mean  expanding_count  mean_min3
0   10.0       10.000000              1.0        NaN
1    NaN       10.000000              1.0        NaN
2   30.0       20.000000              2.0        NaN
3   40.0       26.666667              3.0  26.666667
4    NaN       26.666667              3.0  26.666667
5   60.0       35.000000              4.0  35.000000
6   70.0       42.000000              5.0  42.000000
7    NaN       42.000000              5.0  42.000000
8   90.0       50.000000              6.0  50.000000
9  100.0       57.142857              7.0  57.142857

Notice that min_periods counts valid values, not total rows. With min_periods=3, you need 3 non-NaN values before getting results.

For sparse datasets, consider filling missing values before applying expanding calculations:

# Forward fill before expanding
df['value_filled'] = df['value'].ffill()
df['expanding_filled'] = df['value_filled'].expanding().mean()
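
Be aware that filling changes the statistic itself: forward-filled values count as real observations. A minimal illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0])

# NaN skipped: means are [10, 10, (10+30)/2]
skipped = s.expanding().mean()

# NaN forward-filled to 10: means are [10, 10, (10+10+30)/3]
filled = s.ffill().expanding().mean()

print(skipped.tolist())  # [10.0, 10.0, 20.0]
print(filled.tolist())   # [10.0, 10.0, ~16.67]
```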

Performance Tips and Alternatives

Expanding windows are convenient, but they’re not always the fastest option. Pandas provides dedicated cumulative functions that outperform expanding equivalents.

import pandas as pd
import numpy as np
import time

# Large dataset for benchmarking
n = 1_000_000
large_df = pd.DataFrame({
    'value': np.random.randn(n)
})

# Method 1: expanding().sum()
start = time.time()
result1 = large_df['value'].expanding().sum()
expanding_time = time.time() - start

# Method 2: cumsum() - dedicated cumulative function
start = time.time()
result2 = large_df['value'].cumsum()
cumsum_time = time.time() - start

print(f"expanding().sum(): {expanding_time:.4f}s")
print(f"cumsum(): {cumsum_time:.4f}s")
print(f"Speedup: {expanding_time / cumsum_time:.1f}x")

On my machine, cumsum() runs about 3-5x faster than expanding().sum(). Use these dedicated functions when available:

Expanding Method     Faster Alternative
expanding().sum()    cumsum()
expanding().max()    cummax()
expanding().min()    cummin()
expanding().prod()   cumprod()

However, stick with expanding windows when you need min_periods control or when using mean(), std(), or other statistics without dedicated cumulative equivalents.
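
One caveat when swapping: the two families treat NaN differently. expanding().sum() skips NaN and carries the running total forward, while cumsum() leaves NaN in place at the missing positions:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])

# expanding().sum() skips the NaN and keeps the running total
print(s.expanding().sum().tolist())  # [1.0, 1.0, 3.0]

# cumsum() leaves NaN at the missing position but continues afterward
print(s.cumsum().tolist())           # [1.0, nan, 3.0]
```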

For memory, expanding windows never materialize each growing window: the built-in aggregations stream through the data in a single pass, so memory stays constant and runtime scales linearly. Custom functions passed to .apply() are the exception; they recompute every window from scratch, which scales quadratically with dataset size. If you need the same expanding calculation in several places, compute it once and keep the resulting Series.

# Reuse the Expanding object for multiple statistics
expanding_stats = df['value'].expanding()
df['exp_mean'] = expanding_stats.mean()
df['exp_std'] = expanding_stats.std()  # Cheap view; each aggregation still makes its own pass

Expanding windows are a powerful tool for cumulative analytics. Use them when you need the full historical context at each data point, and reach for dedicated cumulative functions when raw performance matters.
