How to Use GroupBy in Polars

Key Insights

  • Polars’ groupby operations run significantly faster than pandas by leveraging Rust’s performance and parallel execution—speedups of 5-10x are common on multi-core machines for aggregation-heavy workloads.
  • The expression-based API lets you chain complex aggregations, filters, and transformations in a single agg() call, eliminating the need for multiple passes over your data.
  • Using lazy evaluation with lazy().group_by().collect() enables Polars’ query optimizer to reorder operations and minimize memory usage on large datasets.

Introduction to GroupBy in Polars

GroupBy operations are fundamental to data analysis. You split data into groups based on one or more columns, apply aggregations to each group, and combine the results. It’s how you answer questions like “what’s the average order value per customer?” or “how many transactions occurred per day per region?”

Polars implements groupby differently than pandas. Where pandas builds on NumPy and processes data sequentially, Polars uses Rust underneath and parallelizes aggregations across CPU cores automatically. This architectural difference translates to real performance gains—often 5-10x faster on the same hardware.

Polars also offers two execution modes: eager and lazy. Eager execution runs immediately (like pandas), while lazy execution builds a query plan that Polars optimizes before running. For groupby operations on large datasets, lazy mode can dramatically reduce memory usage and execution time.

Let’s dig into the syntax and patterns you’ll use daily.

Basic GroupBy Syntax

The core pattern is straightforward: call group_by() with your grouping columns, then chain agg() with your aggregations.

import polars as pl

# Sample data
df = pl.DataFrame({
    "category": ["electronics", "clothing", "electronics", "clothing", "electronics"],
    "product": ["laptop", "shirt", "phone", "pants", "tablet"],
    "price": [1200, 45, 800, 60, 500],
    "quantity": [2, 10, 5, 8, 3]
})

# Group by single column
result = df.group_by("category").agg(
    pl.col("price").sum()
)
print(result)

This produces:

shape: (2, 2)
┌─────────────┬───────┐
│ category    ┆ price │
│ ---         ┆ ---   │
│ str         ┆ i64   │
╞═════════════╪═══════╡
│ clothing    ┆ 105   │
│ electronics ┆ 2500  │
└─────────────┴───────┘

Grouping by multiple columns works the same way—pass them as a list:

df = pl.DataFrame({
    "region": ["north", "north", "south", "south", "north"],
    "category": ["electronics", "clothing", "electronics", "clothing", "electronics"],
    "sales": [1000, 500, 800, 300, 1200]
})

result = df.group_by(["region", "category"]).agg(
    pl.col("sales").sum()
)
print(result)

Note that Polars doesn’t guarantee row order in groupby results by default. This is intentional—it allows for faster parallel processing. If you need deterministic ordering, we’ll cover that shortly.

Common Aggregation Functions

Polars provides all the standard aggregations you’d expect. Here’s a reference of the most common ones:

| Function | Description |
| --- | --- |
| sum() | Sum of values |
| mean() | Arithmetic mean |
| median() | Median value |
| min() | Minimum value |
| max() | Maximum value |
| count() | Number of non-null values |
| first() | First value in group |
| last() | Last value in group |
| n_unique() | Count of unique values |
| std() | Standard deviation |
| var() | Variance |

The real power comes from combining multiple aggregations in a single agg() call. Use .alias() to name your output columns:

df = pl.DataFrame({
    "store": ["A", "A", "A", "B", "B", "B"],
    "product": ["x", "y", "x", "x", "y", "z"],
    "revenue": [100, 150, 120, 200, 180, 90],
    "units": [10, 15, 12, 20, 18, 9]
})

result = df.group_by("store").agg(
    pl.col("revenue").sum().alias("total_revenue"),
    pl.col("revenue").mean().alias("avg_revenue"),
    pl.col("units").sum().alias("total_units"),
    pl.col("product").n_unique().alias("unique_products"),
    pl.len().alias("transaction_count")
)
print(result)

Output:

shape: (2, 6)
┌───────┬───────────────┬─────────────┬─────────────┬─────────────────┬───────────────────┐
│ store ┆ total_revenue ┆ avg_revenue ┆ total_units ┆ unique_products ┆ transaction_count │
│ ---   ┆ ---           ┆ ---         ┆ ---         ┆ ---             ┆ ---               │
│ str   ┆ i64           ┆ f64         ┆ i64         ┆ u32             ┆ u32               │
╞═══════╪═══════════════╪═════════════╪═════════════╪═════════════════╪═══════════════════╡
│ B     ┆ 470           ┆ 156.666667  ┆ 47          ┆ 3               ┆ 3                 │
│ A     ┆ 370           ┆ 123.333333  ┆ 37          ┆ 2               ┆ 3                 │
└───────┴───────────────┴─────────────┴─────────────┴─────────────────┴───────────────────┘

Notice pl.len() for counting rows—it counts every row in the group, whereas pl.col("any_column").count() counts only non-null values, so pl.len() is both cleaner and safer when you just need the group size.

Advanced Aggregations with Expressions

Polars expressions let you build complex aggregations that would require multiple steps in pandas. You can filter within groups, apply conditions, and combine operations.

Here’s conditional aggregation—counting or summing only rows that match a condition:

df = pl.DataFrame({
    "department": ["sales", "sales", "sales", "engineering", "engineering"],
    "employee": ["alice", "bob", "carol", "dave", "eve"],
    "status": ["active", "inactive", "active", "active", "active"],
    "salary": [75000, 65000, 80000, 95000, 90000]
})

result = df.group_by("department").agg(
    pl.len().alias("total_employees"),
    pl.col("status").filter(pl.col("status") == "active").count().alias("active_employees"),
    pl.col("salary").filter(pl.col("status") == "active").mean().alias("avg_active_salary"),
    pl.col("salary").sum().alias("total_payroll")
)
print(result)

You can also use when().then().otherwise() for more complex conditional logic:

df = pl.DataFrame({
    "category": ["A", "A", "A", "B", "B"],
    "value": [10, 25, 15, 30, 5],
    "threshold": [20, 20, 20, 20, 20]
})

result = df.group_by("category").agg(
    pl.when(pl.col("value") > pl.col("threshold"))
      .then(pl.col("value"))
      .otherwise(0)
      .sum()
      .alias("sum_above_threshold"),
    
    (pl.col("value") > pl.col("threshold")).sum().alias("count_above_threshold")
)
print(result)

This pattern eliminates the need for intermediate columns or separate filtering steps.

GroupBy with Sorting and Ordering

By default, group_by() doesn’t preserve or guarantee any particular row order. For deterministic results, use maintain_order=True:

result = df.group_by("category", maintain_order=True).agg(
    pl.col("value").sum()
)

This keeps groups in the order they first appear in the data. There’s a small performance cost, so only use it when you need it.

A common task is getting the top N rows per group. Combine sort() with head() inside your aggregation:

df = pl.DataFrame({
    "category": ["A", "A", "A", "B", "B", "B"],
    "product": ["p1", "p2", "p3", "p4", "p5", "p6"],
    "sales": [100, 300, 200, 150, 400, 250]
})

# Get top 2 products by sales per category
result = df.sort("sales", descending=True).group_by("category").agg(
    pl.col("product").head(2).alias("top_products"),
    pl.col("sales").head(2).alias("top_sales")
)
print(result)

For more control, use over() with window functions to rank within groups, then filter:

df_ranked = df.with_columns(
    # the default "average" method gives ties fractional ranks;
    # pass rank("ordinal", ...) if you need exactly N rows per group
    pl.col("sales").rank(descending=True).over("category").alias("rank")
).filter(pl.col("rank") <= 2)
print(df_ranked)

Lazy Evaluation with GroupBy

For large datasets, lazy evaluation is essential. Instead of executing immediately, Polars builds a query plan and optimizes it before running.

# Eager execution (immediate)
result_eager = df.group_by("category").agg(
    pl.col("value").sum()
)

# Lazy execution (optimized)
result_lazy = (
    df.lazy()
    .group_by("category")
    .agg(pl.col("value").sum())
    .collect()
)

The lazy version looks almost identical, but the benefits compound with complex queries. Polars can:

  • Push filters before groupby operations to reduce data size early
  • Combine multiple operations into single passes
  • Optimize column selection to read only what’s needed
  • Parallelize more effectively

Here’s a more realistic example showing the pattern:

result = (
    pl.scan_parquet("sales_data/*.parquet")  # Lazy by default
    .filter(pl.col("date") >= pl.date(2024, 1, 1))  # pl.date() avoids relying on string-to-date coercion
    .group_by(["region", "product_category"])
    .agg(
        pl.col("revenue").sum().alias("total_revenue"),
        pl.col("order_id").n_unique().alias("unique_orders"),
        pl.col("quantity").mean().alias("avg_quantity")
    )
    .sort("total_revenue", descending=True)
    .collect()
)

The optimizer will push the date filter down to the Parquet reader, only read the columns needed for the aggregation, and execute the sort after reducing data through groupby.

Comparison with Pandas

If you’re coming from pandas, here’s a quick syntax mapping:

import pandas as pd
import polars as pl

# Sample data (same for both)
data = {
    "category": ["A", "A", "B", "B"],
    "value": [10, 20, 30, 40]
}

# Pandas
df_pd = pd.DataFrame(data)
result_pd = df_pd.groupby("category").agg(
    total=("value", "sum"),
    average=("value", "mean")
).reset_index()

# Polars
df_pl = pl.DataFrame(data)
result_pl = df_pl.group_by("category").agg(
    pl.col("value").sum().alias("total"),
    pl.col("value").mean().alias("average")
)

Key differences to remember:

| Aspect | Pandas | Polars |
| --- | --- | --- |
| Method name | groupby() | group_by() |
| Result type | GroupBy object | GroupBy that must be consumed by .agg() |
| Index handling | creates a MultiIndex | no index concept |
| Column selection | ["col"] or .col | pl.col("col") |
| Order preservation | preserved by default | not guaranteed by default |
| Null handling | excludes null group keys by default | includes null group keys by default |

Performance-wise, expect Polars to be 3-10x faster on datasets that fit in memory, with even larger gains on multi-core machines or when using lazy evaluation.

The mental model shift is from pandas’ method-chaining on columns to Polars’ expression-based approach. Once you internalize pl.col() as the entry point for column operations, the rest follows naturally.
