How to Use Window Functions in Polars
Key Insights
- Polars' over() method is the gateway to window functions, letting you compute aggregations across partitions without collapsing your DataFrame—essential for analytics that need row-level detail alongside group-level metrics.
- Unlike pandas, Polars window functions integrate seamlessly with lazy evaluation, meaning you can chain complex window operations and let the query optimizer figure out the most efficient execution plan.
- Mastering rank(), cumulative operations, and shift() in Polars will handle 90% of your analytical needs—from leaderboards to time-series analysis—with significantly less code than SQL equivalents.
Introduction to Window Functions
Window functions solve a specific problem: you need to compute something across groups of rows, but you don’t want to lose your row-level granularity. Think calculating each employee’s salary as a percentage of their department’s total, or ranking products within categories, or computing running totals over time.
If you’ve used SQL, you know window functions through the OVER() clause. Polars takes the same concept and makes it feel native to Python’s expression syntax. The mental model is identical: define what you want to calculate, then specify the “window” of rows to consider.
Where Polars shines is composability. You can chain window operations, combine them with filters, and build complex analytics in a single expression chain. The query optimizer handles the rest.
Let’s build up from basics to advanced patterns.
Basic Syntax: over() in Polars
The over() method is your entry point to window functions. It takes partition columns and applies the preceding expression within each partition.
Here’s the fundamental pattern:
import polars as pl
# Sample sales data
df = pl.DataFrame({
"region": ["East", "East", "West", "West", "East", "West"],
"product": ["A", "B", "A", "B", "C", "C"],
"sales": [100, 150, 200, 120, 80, 90]
})
# Add total sales per region without collapsing rows
result = df.with_columns(
pl.col("sales").sum().over("region").alias("region_total")
)
print(result)
Output:
shape: (6, 4)
┌────────┬─────────┬───────┬──────────────┐
│ region ┆ product ┆ sales ┆ region_total │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 │
╞════════╪═════════╪═══════╪══════════════╡
│ East ┆ A ┆ 100 ┆ 330 │
│ East ┆ B ┆ 150 ┆ 330 │
│ West ┆ A ┆ 200 ┆ 410 │
│ West ┆ B ┆ 120 ┆ 410 │
│ East ┆ C ┆ 80 ┆ 330 │
│ West ┆ C ┆ 90 ┆ 410 │
└────────┴─────────┴───────┴──────────────┘
Every row keeps its original data, but now includes the aggregated value for its partition. This is fundamentally different from group_by().agg(), which would collapse to one row per region.
You can partition by multiple columns:
# Sales total per region AND product combination
df.with_columns(
pl.col("sales").sum().over(["region", "product"]).alias("region_product_total")
)
Ranking and Row Numbering
Ranking functions assign positions to rows within partitions. Unlike SQL, which has separate RANK, DENSE_RANK, and ROW_NUMBER functions, Polars exposes the tie-handling variants through the method parameter of a single rank() expression (the default, method="average", averages tied ranks into fractional values):
- rank(method="min"): tied values get the same rank, with gaps after ties (1, 2, 2, 4)
- rank(method="dense"): tied values get the same rank, no gaps (1, 2, 2, 3)
- rank(method="ordinal"): every row gets a unique number, ties broken by row order
# Product performance data
products = pl.DataFrame({
"region": ["North", "North", "North", "South", "South", "South"],
"product": ["Widget", "Gadget", "Gizmo", "Widget", "Gadget", "Gizmo"],
"revenue": [5000, 5000, 3000, 8000, 6000, 6000]
})
# Rank products by revenue within each region
ranked = products.with_columns([
# method="min" gives SQL-style ranks; the default "average" would yield 1.5 for ties
pl.col("revenue").rank(method="min", descending=True).over("region").alias("rank"),
pl.col("revenue").rank(method="dense", descending=True).over("region").alias("dense_rank"),
])
print(ranked)
Output:
shape: (6, 5)
┌────────┬─────────┬─────────┬──────┬────────────┐
│ region ┆ product ┆ revenue ┆ rank ┆ dense_rank │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ u32 ┆ u32 │
╞════════╪═════════╪═════════╪══════╪════════════╡
│ North ┆ Widget ┆ 5000 ┆ 1 ┆ 1 │
│ North ┆ Gadget ┆ 5000 ┆ 1 ┆ 1 │
│ North ┆ Gizmo ┆ 3000 ┆ 3 ┆ 2 │
│ South ┆ Widget ┆ 8000 ┆ 1 ┆ 1 │
│ South ┆ Gadget ┆ 6000 ┆ 2 ┆ 2 │
│ South ┆ Gizmo ┆ 6000 ┆ 2 ┆ 2 │
└────────┴─────────┴─────────┴──────┴────────────┘
A practical use case: finding the top N performers per category:
# Get top 2 products per region
top_products = (
ranked
.filter(pl.col("dense_rank") <= 2)
)
Running Aggregations
Cumulative and rolling calculations are essential for time-series analysis. Polars handles these through cum_* functions and rolling_* methods combined with over().
# Customer transaction history
transactions = pl.DataFrame({
"customer_id": ["C1", "C1", "C1", "C2", "C2", "C2"],
"date": ["2024-01-01", "2024-01-05", "2024-01-10",
"2024-01-02", "2024-01-06", "2024-01-11"],
"amount": [100, 250, 75, 500, 200, 300]
}).with_columns(pl.col("date").str.to_date())
# Running total per customer
running_totals = transactions.sort("customer_id", "date").with_columns(
pl.col("amount").cum_sum().over("customer_id").alias("running_total")
)
print(running_totals)
Output:
shape: (6, 4)
┌─────────────┬────────────┬────────┬───────────────┐
│ customer_id ┆ date ┆ amount ┆ running_total │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ date ┆ i64 ┆ i64 │
╞═════════════╪════════════╪════════╪═══════════════╡
│ C1 ┆ 2024-01-01 ┆ 100 ┆ 100 │
│ C1 ┆ 2024-01-05 ┆ 250 ┆ 350 │
│ C1 ┆ 2024-01-10 ┆ 75 ┆ 425 │
│ C2 ┆ 2024-01-02 ┆ 500 ┆ 500 │
│ C2 ┆ 2024-01-06 ┆ 200 ┆ 700 │
│ C2 ┆ 2024-01-11 ┆ 300 ┆ 1000 │
└─────────────┴────────────┴────────┴───────────────┘
For rolling windows with a fixed size, use rolling_mean() or similar:
# Daily metrics with 7-day rolling average
daily_metrics = pl.DataFrame({
"date": pl.date_range(pl.date(2024, 1, 1), pl.date(2024, 1, 14), eager=True),
"value": [10, 12, 8, 15, 20, 18, 22, 25, 19, 30, 28, 35, 32, 40]
})
rolling = daily_metrics.with_columns(
pl.col("value").rolling_mean(window_size=7).alias("rolling_7d_avg")
)
print(rolling)
The first 6 rows will have null values for the rolling average since there aren’t enough preceding values.
Lead/Lag Operations
Comparing current values to previous or next values is fundamental to time-series analysis. Polars uses shift() for this:
# Stock price data
stocks = pl.DataFrame({
"symbol": ["AAPL", "AAPL", "AAPL", "GOOG", "GOOG", "GOOG"],
"date": ["2024-01-01", "2024-01-02", "2024-01-03"] * 2,
"price": [185.0, 187.5, 186.0, 140.0, 142.0, 141.5]
}).with_columns(pl.col("date").str.to_date())
# Calculate day-over-day change
price_changes = stocks.sort("symbol", "date").with_columns([
pl.col("price").shift(1).over("symbol").alias("prev_price"),
(pl.col("price") - pl.col("price").shift(1).over("symbol")).alias("price_change"),
((pl.col("price") / pl.col("price").shift(1).over("symbol") - 1) * 100)
.round(2)
.alias("pct_change")
])
print(price_changes)
Output:
shape: (6, 6)
┌────────┬────────────┬───────┬────────────┬──────────────┬────────────┐
│ symbol ┆ date ┆ price ┆ prev_price ┆ price_change ┆ pct_change │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ date ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════╪════════════╪═══════╪════════════╪══════════════╪════════════╡
│ AAPL ┆ 2024-01-01 ┆ 185.0 ┆ null ┆ null ┆ null │
│ AAPL ┆ 2024-01-02 ┆ 187.5 ┆ 185.0 ┆ 2.5 ┆ 1.35 │
│ AAPL ┆ 2024-01-03 ┆ 186.0 ┆ 187.5 ┆ -1.5 ┆ -0.8 │
│ GOOG ┆ 2024-01-01 ┆ 140.0 ┆ null ┆ null ┆ null │
│ GOOG ┆ 2024-01-02 ┆ 142.0 ┆ 140.0 ┆ 2.0 ┆ 1.43 │
│ GOOG ┆ 2024-01-03 ┆ 141.5 ┆ 142.0 ┆ -0.5 ┆ -0.35 │
└────────┴────────────┴───────┴────────────┴──────────────┴────────────┘
Use shift(-1) for lead operations (looking forward instead of backward).
Combining Multiple Window Expressions
Real analytics often require multiple window calculations in a single query. Polars handles this elegantly:
# Sales data for complex analysis
sales = pl.DataFrame({
"region": ["East", "East", "East", "West", "West", "West"],
"rep": ["Alice", "Bob", "Carol", "Dave", "Eve", "Frank"],
"sales": [50000, 75000, 60000, 80000, 45000, 70000]
})
# Multiple window calculations in one pass
analysis = sales.with_columns([
# Total for the region
pl.col("sales").sum().over("region").alias("region_total"),
# Percent of region total
(pl.col("sales") / pl.col("sales").sum().over("region") * 100)
.round(1)
.alias("pct_of_region"),
# Rank within region
pl.col("sales").rank(descending=True).over("region").alias("region_rank"),
# Difference from region average
(pl.col("sales") - pl.col("sales").mean().over("region"))
.round(0)
.alias("vs_region_avg")
])
print(analysis)
This produces a comprehensive view with row-level detail plus contextual metrics—exactly what dashboards and reports need.
Performance Tips
Polars’ lazy evaluation is your friend for complex window operations. The query optimizer can reorder and combine operations for efficiency:
# Eager execution - each operation runs immediately
eager_result = (
df
.with_columns(pl.col("sales").sum().over("region").alias("total"))
.with_columns(pl.col("sales").rank().over("region").alias("rank"))
)
# Lazy execution - operations are optimized before running
lazy_result = (
df.lazy()
.with_columns([
pl.col("sales").sum().over("region").alias("total"),
pl.col("sales").rank().over("region").alias("rank")
])
.collect()
)
The lazy version allows Polars to potentially parallelize the window calculations and avoid redundant partition operations.
Key performance considerations:
- Sort before window operations when using cumulative or shift functions. Polars doesn't assume order, so explicit sorting ensures correct results.
- Combine window expressions in a single with_columns() call when they share the same partition. This hints to the optimizer that partitioning work can be reused.
- Use lazy mode for complex pipelines. The overhead is negligible, and the optimization benefits compound with complexity.
- Avoid mixing eager and lazy in the same pipeline. Pick one and stick with it.
Window functions in Polars strike a balance between SQL’s declarative power and Python’s expressiveness. Once you internalize the over() pattern, you’ll find yourself reaching for window functions constantly—they’re that useful for real-world analytics.