How to Use Over Expression in Polars

Key Insights

  • Polars’ over() expression is the equivalent of SQL window functions, letting you perform grouped calculations while preserving every row in your DataFrame—unlike group_by() which collapses rows.
  • You can partition by multiple columns, combine with sorting for cumulative operations, and chain multiple over() expressions in a single select() or with_columns() call for complex analytics.
  • The over() expression integrates seamlessly with Polars’ lazy evaluation, meaning you get query optimization benefits when building complex window-based transformations.

Introduction to Window Functions in Polars

Window functions solve a specific problem: you need to calculate something based on groups of rows, but you want to keep every original row intact. Think calculating each employee’s salary as a percentage of their department’s total, or finding the running sum of sales within each region.

In SQL, you’d write SUM(salary) OVER (PARTITION BY department). In pandas, you’d use groupby().transform(). In Polars, you use the over() expression.

Polars’ approach is more composable than both alternatives. Instead of a separate method, over() is just another expression modifier. You can chain it with any aggregation, filter, or transformation. This fits Polars’ philosophy of building complex queries from simple, composable pieces.

Let’s dig into how it works.

Basic Syntax and How over() Works

The over() expression takes an aggregation and broadcasts the result back to every row in the partition. The partition is defined by one or more columns—all rows with the same value in those columns belong to the same partition.

Here’s the fundamental pattern:

import polars as pl

df = pl.DataFrame({
    "group": ["A", "A", "A", "B", "B"],
    "value": [10, 20, 30, 40, 50]
})

result = df.with_columns(
    pl.col("value").sum().over("group").alias("group_total")
)

print(result)

Output:

shape: (5, 3)
┌───────┬───────┬─────────────┐
│ group ┆ value ┆ group_total │
│ ---   ┆ ---   ┆ ---         │
│ str   ┆ i64   ┆ i64         │
╞═══════╪═══════╪═════════════╡
│ A     ┆ 10    ┆ 60          │
│ A     ┆ 20    ┆ 60          │
│ A     ┆ 30    ┆ 60          │
│ B     ┆ 40    ┆ 90          │
│ B     ┆ 50    ┆ 90          │
└───────┴───────┴─────────────┘

Notice that every row keeps its original value, but gains a new column with the group’s total. Group A rows all show 60, group B rows all show 90. The row count stays at 5.

Compare this to group_by():

grouped = df.group_by("group", maintain_order=True).agg(pl.col("value").sum().alias("group_total"))
print(grouped)
shape: (2, 2)
┌───────┬─────────────┐
│ group ┆ group_total │
│ ---   ┆ ---         │
│ str   ┆ i64         │
╞═══════╪═════════════╡
│ A     ┆ 60          │
│ B     ┆ 90          │
└───────┴─────────────┘

With group_by(), you get 2 rows. With over(), you get 5. That’s the core difference.

Common Use Cases with Single Partition Column

Let’s look at practical applications you’ll encounter regularly.

Percentage of Group Total

This is the classic use case. You want to know what fraction each row contributes to its group:

df = pl.DataFrame({
    "department": ["Sales", "Sales", "Engineering", "Engineering", "Engineering"],
    "employee": ["Alice", "Bob", "Carol", "Dave", "Eve"],
    "salary": [70000, 80000, 90000, 85000, 95000]
})

result = df.with_columns(
    (pl.col("salary") / pl.col("salary").sum().over("department") * 100)
    .round(2)
    .alias("pct_of_dept")
)

print(result)
shape: (5, 4)
┌─────────────┬──────────┬────────┬─────────────┐
│ department  ┆ employee ┆ salary ┆ pct_of_dept │
│ ---         ┆ ---      ┆ ---    ┆ ---         │
│ str         ┆ str      ┆ i64    ┆ f64         │
╞═════════════╪══════════╪════════╪═════════════╡
│ Sales       ┆ Alice    ┆ 70000  ┆ 46.67       │
│ Sales       ┆ Bob      ┆ 80000  ┆ 53.33       │
│ Engineering ┆ Carol    ┆ 90000  ┆ 33.33       │
│ Engineering ┆ Dave     ┆ 85000  ┆ 31.48       │
│ Engineering ┆ Eve      ┆ 95000  ┆ 35.19       │
└─────────────┴──────────┴────────┴─────────────┘

Deviation from Group Mean

Useful for identifying outliers or normalizing values within groups:

result = df.with_columns([
    pl.col("salary").mean().over("department").alias("dept_avg"),
    (pl.col("salary") - pl.col("salary").mean().over("department")).alias("deviation")
])

print(result)
shape: (5, 5)
┌─────────────┬──────────┬────────┬──────────────┬───────────┐
│ department  ┆ employee ┆ salary ┆ dept_avg     ┆ deviation │
│ ---         ┆ ---      ┆ ---    ┆ ---          ┆ ---       │
│ str         ┆ str      ┆ i64    ┆ f64          ┆ f64       │
╞═════════════╪══════════╪════════╪══════════════╪═══════════╡
│ Sales       ┆ Alice    ┆ 70000  ┆ 75000.0      ┆ -5000.0   │
│ Sales       ┆ Bob      ┆ 80000  ┆ 75000.0      ┆ 5000.0    │
│ Engineering ┆ Carol    ┆ 90000  ┆ 90000.0      ┆ 0.0       │
│ Engineering ┆ Dave     ┆ 85000  ┆ 90000.0      ┆ -5000.0   │
│ Engineering ┆ Eve      ┆ 95000  ┆ 90000.0      ┆ 5000.0    │
└─────────────┴──────────┴────────┴──────────────┴───────────┘

Multiple Partition Columns

Real data often has hierarchical groupings. Pass a list of column names to partition by multiple dimensions:

df = pl.DataFrame({
    "region": ["East", "East", "East", "West", "West", "West"],
    "category": ["Electronics", "Electronics", "Furniture", "Electronics", "Furniture", "Furniture"],
    "sales": [100, 150, 200, 300, 250, 175]
})

result = df.with_columns(
    pl.col("sales").sum().over(["region", "category"]).alias("region_category_total")
)

print(result)
shape: (6, 4)
┌────────┬─────────────┬───────┬───────────────────────┐
│ region ┆ category    ┆ sales ┆ region_category_total │
│ ---    ┆ ---         ┆ ---   ┆ ---                   │
│ str    ┆ str         ┆ i64   ┆ i64                   │
╞════════╪═════════════╪═══════╪═══════════════════════╡
│ East   ┆ Electronics ┆ 100   ┆ 250                   │
│ East   ┆ Electronics ┆ 150   ┆ 250                   │
│ East   ┆ Furniture   ┆ 200   ┆ 200                   │
│ West   ┆ Electronics ┆ 300   ┆ 300                   │
│ West   ┆ Furniture   ┆ 250   ┆ 425                   │
│ West   ┆ Furniture   ┆ 175   ┆ 425                   │
└────────┴─────────────┴───────┴───────────────────────┘

You can also combine different partition levels in the same query:

result = df.with_columns([
    pl.col("sales").sum().over("region").alias("region_total"),
    pl.col("sales").sum().over(["region", "category"]).alias("region_cat_total"),
    (pl.col("sales") / pl.col("sales").sum().over("region") * 100)
    .round(2)
    .alias("pct_of_region")
])

Combining over() with Sorting and Ranking

Window functions really shine when you need ordered operations within groups. Polars handles this with the order_by parameter of over().

Cumulative Sum Within Groups

df = pl.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-01", "2024-01-02", "2024-01-03"],
    "amount": [10, 20, 15, 30, 25, 35]
})

result = df.with_columns(
    pl.col("amount")
    .cum_sum()
    .over("group", order_by="date")
    .alias("running_total")
)

print(result)
shape: (6, 4)
┌───────┬────────────┬────────┬───────────────┐
│ group ┆ date       ┆ amount ┆ running_total │
│ ---   ┆ ---        ┆ ---    ┆ ---           │
│ str   ┆ str        ┆ i64    ┆ i64           │
╞═══════╪════════════╪════════╪═══════════════╡
│ A     ┆ 2024-01-01 ┆ 10     ┆ 10            │
│ A     ┆ 2024-01-02 ┆ 20     ┆ 30            │
│ A     ┆ 2024-01-03 ┆ 15     ┆ 45            │
│ B     ┆ 2024-01-01 ┆ 30     ┆ 30            │
│ B     ┆ 2024-01-02 ┆ 25     ┆ 55            │
│ B     ┆ 2024-01-03 ┆ 35     ┆ 90            │
└───────┴────────────┴────────┴───────────────┘

Row Numbering and Ranking Within Groups

df = pl.DataFrame({
    "department": ["Sales", "Sales", "Sales", "Engineering", "Engineering"],
    "employee": ["Alice", "Bob", "Carol", "Dave", "Eve"],
    "performance_score": [85, 92, 78, 95, 88]
})

result = df.with_columns(
    pl.col("performance_score")
    .rank(method="ordinal", descending=True)
    .over("department")
    .alias("dept_rank")
)

print(result)
shape: (5, 4)
┌─────────────┬──────────┬───────────────────┬───────────┐
│ department  ┆ employee ┆ performance_score ┆ dept_rank │
│ ---         ┆ ---      ┆ ---               ┆ ---       │
│ str         ┆ str      ┆ i64               ┆ u32       │
╞═════════════╪══════════╪═══════════════════╪═══════════╡
│ Sales       ┆ Alice    ┆ 85                ┆ 2         │
│ Sales       ┆ Bob      ┆ 92                ┆ 1         │
│ Sales       ┆ Carol    ┆ 78                ┆ 3         │
│ Engineering ┆ Dave     ┆ 95                ┆ 1         │
│ Engineering ┆ Eve      ┆ 88                ┆ 2         │
└─────────────┴──────────┴───────────────────┴───────────┘

Performance Considerations and Best Practices

When to Use over() vs group_by()

Use over() when you need the result broadcast back to every row. Use group_by() when you want aggregated output with fewer rows.

If you find yourself doing group_by().agg() followed by a join() back to the original DataFrame, that’s a sign you should use over() instead.

Lazy Evaluation Benefits

Always prefer lazy mode for complex queries with multiple over() expressions:

# Preferred approach
result = (
    df.lazy()
    .with_columns([
        pl.col("value").sum().over("group").alias("group_sum"),
        pl.col("value").mean().over("group").alias("group_mean"),
        pl.col("value").std().over("group").alias("group_std")
    ])
    .collect()
)

Polars’ query optimizer can often combine multiple over() calls on the same partition into a single pass over the data.

Common Pitfalls

Forgetting to alias: Without .alias(), the result keeps the name of its root column, so with_columns() will silently overwrite that column instead of adding a new one.

Misunderstanding row order: Without explicit order_by, cumulative functions operate on rows in their current order, which may not be what you expect.

Over-partitioning: Partitioning by too many columns creates many small groups, which can hurt performance. If your partitions have only 1-2 rows each, reconsider your approach.

Conclusion

The over() expression is Polars’ answer to SQL window functions, and it’s more flexible than most alternatives. The key patterns to remember:

  • Basic aggregations: col("x").sum().over("group")
  • Multiple partitions: over(["col1", "col2"])
  • Ordered operations: over("group", order_by="date")

Start with these patterns, and you’ll handle most window function needs. For advanced operations like sliding windows or custom frame boundaries, check the Polars documentation on window functions.
