How to GroupBy Multiple Columns in Polars


Key Insights

  • Polars uses group_by() with a list of column names for multi-column grouping, combined with .agg() for applying one or more aggregation expressions in a single, readable chain.
  • The maintain_order=True parameter preserves row order at the cost of performance—omit it when order doesn’t matter for significant speed gains on large datasets.
  • Polars’ expression system allows dynamic column selection with wildcards and regex, making it trivial to aggregate dozens of columns without listing each one explicitly.

Introduction

Polars has rapidly become the go-to DataFrame library for Python developers who need speed. Built in Rust with a lazy execution engine, it routinely outperforms Pandas by 10-100x on real workloads. But raw performance means nothing if you can’t express your data transformations clearly.

Grouping by multiple columns is one of the most common operations in data analysis. Whether you’re calculating sales by region and product category, summarizing user behavior by date and cohort, or aggregating metrics across multiple dimensions, multi-column groupby is foundational.

This article covers everything you need to know about grouping by multiple columns in Polars—from basic syntax to advanced expression-based aggregations.

Basic Syntax for Multi-Column GroupBy

The group_by() method accepts column names as positional arguments or as a list. For multi-column grouping, simply pass multiple column names:

import polars as pl

# Sample sales data
df = pl.DataFrame({
    "region": ["North", "North", "South", "South", "North", "South"],
    "category": ["Electronics", "Clothing", "Electronics", "Clothing", "Electronics", "Electronics"],
    "sales": [1200, 450, 890, 320, 1100, 760],
    "units": [10, 15, 8, 12, 9, 6]
})

# GroupBy multiple columns
result = df.group_by("region", "category").agg(
    pl.col("sales").sum()
)

print(result)

Output:

shape: (4, 3)
┌────────┬─────────────┬───────┐
│ region ┆ category    ┆ sales │
│ ---    ┆ ---         ┆ ---   │
│ str    ┆ str         ┆ i64   │
╞════════╪═════════════╪═══════╡
│ South  ┆ Clothing    ┆ 320   │
│ North  ┆ Electronics ┆ 2300  │
│ South  ┆ Electronics ┆ 1650  │
│ North  ┆ Clothing    ┆ 450   │
└────────┴─────────────┴───────┘

Notice that the output order is not deterministic. Polars optimizes for speed by default, which means group order may vary between runs. We’ll address this later with maintain_order.

You can also pass columns as a list, which is useful when column names are stored in a variable:

group_cols = ["region", "category"]
result = df.group_by(group_cols).agg(pl.col("sales").sum())

Multiple Aggregations on Grouped Data

Real analysis rarely involves a single aggregation. Polars lets you chain multiple aggregation expressions inside .agg():

result = df.group_by("region", "category").agg(
    pl.col("sales").sum().alias("total_sales"),
    pl.col("sales").mean().alias("avg_sale"),
    pl.col("units").sum().alias("total_units"),
    pl.col("units").count().alias("transaction_count"),
    pl.col("sales").first().alias("first_sale"),
    pl.col("sales").last().alias("last_sale")
)

print(result)

Output:

shape: (4, 8)
┌────────┬─────────────┬─────────────┬──────────┬─────────────┬───────────────────┬────────────┬───────────┐
│ region ┆ category    ┆ total_sales ┆ avg_sale ┆ total_units ┆ transaction_count ┆ first_sale ┆ last_sale │
│ ---    ┆ ---         ┆ ---         ┆ ---      ┆ ---         ┆ ---               ┆ ---        ┆ ---       │
│ str    ┆ str         ┆ i64         ┆ f64      ┆ i64         ┆ u32               ┆ i64        ┆ i64       │
╞════════╪═════════════╪═════════════╪══════════╪═════════════╪═══════════════════╪════════════╪═══════════╡
│ South  ┆ Electronics ┆ 1650        ┆ 825.0    ┆ 14          ┆ 2                 ┆ 890        ┆ 760       │
│ North  ┆ Electronics ┆ 2300        ┆ 1150.0   ┆ 19          ┆ 2                 ┆ 1200       ┆ 1100      │
│ South  ┆ Clothing    ┆ 320         ┆ 320.0    ┆ 12          ┆ 1                 ┆ 320        ┆ 320       │
│ North  ┆ Clothing    ┆ 450         ┆ 450.0    ┆ 15          ┆ 1                 ┆ 450        ┆ 450       │
└────────┴─────────────┴─────────────┴──────────┴─────────────┴───────────────────┴────────────┴───────────┘

Each expression in .agg() is evaluated independently, and Polars parallelizes these computations automatically. Use .alias() to give output columns meaningful names—without it, the column name defaults to the input column name, which causes conflicts when you aggregate the same column multiple ways.

Using Expressions for Advanced Grouping

Polars’ expression system is where it truly shines. Instead of listing columns explicitly, you can use wildcards, regex patterns, and programmatic column selection.

Aggregating Multiple Columns with Wildcards

df = pl.DataFrame({
    "region": ["North", "North", "South", "South"],
    "category": ["A", "B", "A", "B"],
    "sales_q1": [100, 200, 150, 180],
    "sales_q2": [120, 210, 160, 190],
    "sales_q3": [130, 220, 170, 200],
    "sales_q4": [140, 230, 180, 210]
})

# Sum all columns starting with "sales_"
result = df.group_by("region", "category").agg(
    pl.col("^sales_.*$").sum()
)

print(result)

The regex pattern ^sales_.*$ matches all columns starting with “sales_”. This approach scales effortlessly—whether you have 4 quarterly columns or 400 metric columns.

Applying Different Aggregations to Different Columns

result = df.group_by("region", "category").agg(
    pl.col("^sales_q[12]$").sum().name.suffix("_total"),
    pl.col("^sales_q[34]$").mean().name.suffix("_avg")
)

print(result)

The .name.suffix() method appends a string to each matched column name, keeping outputs distinguishable. You can also use .name.prefix() or .name.map() for more complex renaming logic.

Dynamic Column Selection

When column names come from external sources, build expressions programmatically:

numeric_cols = ["sales_q1", "sales_q2", "sales_q3", "sales_q4"]

result = df.group_by("region", "category").agg(
    [pl.col(c).sum().alias(f"{c}_sum") for c in numeric_cols],
    [pl.col(c).mean().alias(f"{c}_mean") for c in numeric_cols]
)

This pattern is invaluable when working with dynamic schemas or configuration-driven pipelines.

Maintaining Sort Order with maintain_order

By default, group_by() makes no guarantees about output order. This enables parallel hashing optimizations. If you need deterministic ordering, use maintain_order=True:

df = pl.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "product": ["A", "B", "A", "B"],
    "revenue": [100, 200, 150, 250]
})

# Without maintain_order (faster, non-deterministic order)
result_fast = df.group_by("date", "product").agg(
    pl.col("revenue").sum()
)

# With maintain_order (preserves first-appearance order)
result_ordered = df.group_by("date", "product", maintain_order=True).agg(
    pl.col("revenue").sum()
)

print("Without maintain_order:")
print(result_fast)
print("\nWith maintain_order:")
print(result_ordered)

With maintain_order=True, groups appear in the order their first row was encountered in the original DataFrame. This is crucial for time-series data or when downstream code depends on consistent ordering.

The performance cost is real but often acceptable. On a 10-million-row dataset, maintain_order=True might add 20-50% overhead. Benchmark with your actual data to decide.

Lazy vs Eager Execution

Polars’ lazy API enables query optimization before execution. For complex pipelines, lazy execution can dramatically improve performance by pushing down filters, eliminating unused columns, and optimizing join order.

# Eager execution (immediate)
df_eager = pl.read_csv("sales_data.csv")
result_eager = df_eager.group_by("region", "category").agg(
    pl.col("sales").sum().alias("total_sales")
)

# Lazy execution (deferred until .collect())
result_lazy = (
    pl.scan_csv("sales_data.csv")
    .group_by("region", "category")
    .agg(pl.col("sales").sum().alias("total_sales"))
    .collect()
)

With scan_csv(), Polars builds a query plan. If you only need certain columns, Polars reads only those from disk. If you filter before grouping, the filter is pushed down. The .collect() call triggers execution.

For exploratory work, eager mode is fine. For production pipelines processing large files, always prefer lazy mode:

# Full lazy pipeline with optimization
result = (
    pl.scan_csv("sales_data.csv")
    .filter(pl.col("year") >= 2023)
    .group_by("region", "category")
    .agg(
        pl.col("sales").sum().alias("total_sales"),
        pl.col("sales").mean().alias("avg_sale")
    )
    .sort("total_sales", descending=True)
    .collect()
)

Comparison with Pandas

If you’re migrating from Pandas, the syntax differences are minimal but important:

import pandas as pd
import polars as pl

# Sample data
data = {
    "region": ["North", "North", "South", "South"],
    "category": ["A", "B", "A", "B"],
    "sales": [100, 200, 150, 250]
}

# Pandas
df_pandas = pd.DataFrame(data)
result_pandas = df_pandas.groupby(["region", "category"]).agg(
    total_sales=("sales", "sum"),
    avg_sales=("sales", "mean")
).reset_index()

# Polars
df_polars = pl.DataFrame(data)
result_polars = df_polars.group_by("region", "category").agg(
    pl.col("sales").sum().alias("total_sales"),
    pl.col("sales").mean().alias("avg_sales")
)

Key differences to note:

  1. Method name: Polars uses group_by() (with underscore), Pandas uses groupby().
  2. Column specification: Polars uses expressions (pl.col("sales").sum()), Pandas uses tuples or dictionaries.
  3. Index handling: Polars never uses indexes. No .reset_index() needed.
  4. Aggregation syntax: Polars’ expression-based approach is more verbose but far more flexible.

The Polars syntax may feel unfamiliar initially, but the expression system’s power becomes apparent as your queries grow more complex. Operations that require awkward workarounds in Pandas—like aggregating with different functions per column, or applying the same function to columns matching a pattern—are natural in Polars.

Conclusion

Multi-column groupby in Polars is straightforward once you internalize the expression-based API. Pass column names to group_by(), chain aggregation expressions in .agg(), and use .alias() to name your outputs clearly.

For production code, prefer lazy execution with scan_csv() and .collect(). Use maintain_order=True only when ordering matters. Leverage regex and wildcard patterns to keep your code maintainable as schemas evolve.

The initial learning curve pays dividends quickly. Polars’ performance advantages compound as your datasets grow, and its expressive API makes complex aggregations readable rather than cryptic.
