How to Use GroupBy in Polars
Key Insights
- Polars’ groupby operations run significantly faster than pandas by leveraging Rust’s performance and parallel execution; expect 5-10x speedups on multi-core machines without changing your code.
- The expression-based API lets you chain complex aggregations, filters, and transformations in a single agg() call, eliminating the need for multiple passes over your data.
- Using lazy evaluation with lazy().group_by().collect() enables Polars’ query optimizer to reorder operations and minimize memory usage on large datasets.
Introduction to GroupBy in Polars
GroupBy operations are fundamental to data analysis. You split data into groups based on one or more columns, apply aggregations to each group, and combine the results. It’s how you answer questions like “what’s the average order value per customer?” or “how many transactions occurred per day per region?”
Polars implements groupby differently than pandas. Where pandas builds on NumPy and processes data sequentially, Polars uses Rust underneath and parallelizes aggregations across CPU cores automatically. This architectural difference translates to real performance gains—often 5-10x faster on the same hardware.
Polars also offers two execution modes: eager and lazy. Eager execution runs immediately (like pandas), while lazy execution builds a query plan that Polars optimizes before running. For groupby operations on large datasets, lazy mode can dramatically reduce memory usage and execution time.
Let’s dig into the syntax and patterns you’ll use daily.
Basic GroupBy Syntax
The core pattern is straightforward: call group_by() with your grouping columns, then chain agg() with your aggregations.
import polars as pl
# Sample data
df = pl.DataFrame({
    "category": ["electronics", "clothing", "electronics", "clothing", "electronics"],
    "product": ["laptop", "shirt", "phone", "pants", "tablet"],
    "price": [1200, 45, 800, 60, 500],
    "quantity": [2, 10, 5, 8, 3]
})

# Group by single column
result = df.group_by("category").agg(
    pl.col("price").sum()
)
print(result)
This produces:
shape: (2, 2)
┌─────────────┬───────┐
│ category ┆ price │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════════════╪═══════╡
│ clothing ┆ 105 │
│ electronics ┆ 2500 │
└─────────────┴───────┘
Grouping by multiple columns works the same way—pass them as a list:
df = pl.DataFrame({
    "region": ["north", "north", "south", "south", "north"],
    "category": ["electronics", "clothing", "electronics", "clothing", "electronics"],
    "sales": [1000, 500, 800, 300, 1200]
})

result = df.group_by(["region", "category"]).agg(
    pl.col("sales").sum()
)
print(result)
Note that Polars doesn’t guarantee row order in groupby results by default. This is intentional—it allows for faster parallel processing. If you need deterministic ordering, we’ll cover that shortly.
Common Aggregation Functions
Polars provides all the standard aggregations you’d expect. Here’s a reference of the most common ones:
| Function | Description |
|---|---|
| sum() | Sum of values |
| mean() | Arithmetic mean |
| median() | Median value |
| min() | Minimum value |
| max() | Maximum value |
| count() | Count of non-null values |
| first() | First value in group |
| last() | Last value in group |
| n_unique() | Count of unique values |
| std() | Standard deviation |
| var() | Variance |
The real power comes from combining multiple aggregations in a single agg() call. Use .alias() to name your output columns:
df = pl.DataFrame({
    "store": ["A", "A", "A", "B", "B", "B"],
    "product": ["x", "y", "x", "x", "y", "z"],
    "revenue": [100, 150, 120, 200, 180, 90],
    "units": [10, 15, 12, 20, 18, 9]
})

result = df.group_by("store").agg(
    pl.col("revenue").sum().alias("total_revenue"),
    pl.col("revenue").mean().alias("avg_revenue"),
    pl.col("units").sum().alias("total_units"),
    pl.col("product").n_unique().alias("unique_products"),
    pl.len().alias("transaction_count")
)
print(result)
Output:
shape: (2, 6)
┌───────┬───────────────┬─────────────┬─────────────┬─────────────────┬───────────────────┐
│ store ┆ total_revenue ┆ avg_revenue ┆ total_units ┆ unique_products ┆ transaction_count │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 ┆ i64 ┆ u32 ┆ u32 │
╞═══════╪═══════════════╪═════════════╪═════════════╪═════════════════╪═══════════════════╡
│ B ┆ 470 ┆ 156.666667 ┆ 47 ┆ 3 ┆ 3 │
│ A ┆ 370 ┆ 123.333333 ┆ 37 ┆ 2 ┆ 3 │
└───────┴───────────────┴─────────────┴─────────────┴─────────────────┴───────────────────┘
Notice pl.len() for counting rows: it’s cleaner than pl.col("any_column").count() when you just need the group size, and unlike count() it doesn’t skip null values.
Advanced Aggregations with Expressions
Polars expressions let you build complex aggregations that would require multiple steps in pandas. You can filter within groups, apply conditions, and combine operations.
Here’s conditional aggregation—counting or summing only rows that match a condition:
df = pl.DataFrame({
    "department": ["sales", "sales", "sales", "engineering", "engineering"],
    "employee": ["alice", "bob", "carol", "dave", "eve"],
    "status": ["active", "inactive", "active", "active", "active"],
    "salary": [75000, 65000, 80000, 95000, 90000]
})

result = df.group_by("department").agg(
    pl.len().alias("total_employees"),
    pl.col("status").filter(pl.col("status") == "active").count().alias("active_employees"),
    pl.col("salary").filter(pl.col("status") == "active").mean().alias("avg_active_salary"),
    pl.col("salary").sum().alias("total_payroll")
)
print(result)
You can also use when().then().otherwise() for more complex conditional logic:
df = pl.DataFrame({
    "category": ["A", "A", "A", "B", "B"],
    "value": [10, 25, 15, 30, 5],
    "threshold": [20, 20, 20, 20, 20]
})

result = df.group_by("category").agg(
    pl.when(pl.col("value") > pl.col("threshold"))
    .then(pl.col("value"))
    .otherwise(0)
    .sum()
    .alias("sum_above_threshold"),
    (pl.col("value") > pl.col("threshold")).sum().alias("count_above_threshold")
)
print(result)
This pattern eliminates the need for intermediate columns or separate filtering steps.
GroupBy with Sorting and Ordering
By default, group_by() doesn’t preserve or guarantee any particular row order. For deterministic results, use maintain_order=True:
result = df.group_by("category", maintain_order=True).agg(
    pl.col("value").sum()
)
This keeps groups in the order they first appear in the data. There’s a small performance cost, so only use it when you need it.
A common task is getting the top N rows per group. Combine sort() with head() inside your aggregation:
df = pl.DataFrame({
    "category": ["A", "A", "A", "B", "B", "B"],
    "product": ["p1", "p2", "p3", "p4", "p5", "p6"],
    "sales": [100, 300, 200, 150, 400, 250]
})

# Get top 2 products by sales per category
result = df.sort("sales", descending=True).group_by("category").agg(
    pl.col("product").head(2).alias("top_products"),
    pl.col("sales").head(2).alias("top_sales")
)
print(result)
For more control, use over() with window functions to rank within groups, then filter:
df_ranked = df.with_columns(
    pl.col("sales").rank(descending=True).over("category").alias("rank")
).filter(pl.col("rank") <= 2)
print(df_ranked)
Lazy Evaluation with GroupBy
For large datasets, lazy evaluation is essential. Instead of executing immediately, Polars builds a query plan and optimizes it before running.
# Eager execution (immediate)
result_eager = df.group_by("category").agg(
    pl.col("value").sum()
)

# Lazy execution (optimized)
result_lazy = (
    df.lazy()
    .group_by("category")
    .agg(pl.col("value").sum())
    .collect()
)
The lazy version looks almost identical, but the benefits compound with complex queries. Polars can:
- Push filters before groupby operations to reduce data size early
- Combine multiple operations into single passes
- Optimize column selection to read only what’s needed
- Parallelize more effectively
Here’s a more realistic example showing the pattern:
result = (
    pl.scan_parquet("sales_data/*.parquet")  # Lazy by default
    .filter(pl.col("date") >= "2024-01-01")
    .group_by(["region", "product_category"])
    .agg(
        pl.col("revenue").sum().alias("total_revenue"),
        pl.col("order_id").n_unique().alias("unique_orders"),
        pl.col("quantity").mean().alias("avg_quantity")
    )
    .sort("total_revenue", descending=True)
    .collect()
)
The optimizer will push the date filter down to the Parquet reader, only read the columns needed for the aggregation, and execute the sort after reducing data through groupby.
Comparison with Pandas
If you’re coming from pandas, here’s a quick syntax mapping:
import pandas as pd
import polars as pl
# Sample data (same for both)
data = {
    "category": ["A", "A", "B", "B"],
    "value": [10, 20, 30, 40]
}

# Pandas
df_pd = pd.DataFrame(data)
result_pd = df_pd.groupby("category").agg(
    total=("value", "sum"),
    average=("value", "mean")
).reset_index()

# Polars
df_pl = pl.DataFrame(data)
result_pl = df_pl.group_by("category").agg(
    pl.col("value").sum().alias("total"),
    pl.col("value").mean().alias("average")
)
Key differences to remember:
| Aspect | Pandas | Polars |
|---|---|---|
| Method name | groupby() | group_by() |
| Result type | GroupBy object | Requires .agg() |
| Index handling | Creates MultiIndex | No index concept |
| Column selection | ["col"] or .col | pl.col("col") |
| Order preservation | Preserved by default | Not preserved by default |
| Null handling | Excludes by default | Includes by default |
Performance-wise, expect Polars to be 3-10x faster on datasets that fit in memory, with even larger gains on multi-core machines or when using lazy evaluation.
The mental model shift is from pandas’ method-chaining on columns to Polars’ expression-based approach. Once you internalize pl.col() as the entry point for column operations, the rest follows naturally.