How to GroupBy and Aggregate in Polars
Key Insights
- Polars’ group_by().agg() pattern lets you compute multiple aggregations across different columns in a single, highly optimized pass through your data.
- Expression-based aggregations give you fine-grained control over complex calculations like conditional sums, weighted averages, and filtered counts without writing custom functions.
- Lazy evaluation with scan_* functions and collect() can dramatically improve performance on large datasets by optimizing the entire query plan before execution.
Introduction
Polars has rapidly become the go-to DataFrame library for Python developers who need speed. Built in Rust with a query optimizer, it routinely outperforms pandas on common operations, often by an order of magnitude or more. But raw speed means nothing if you can’t express your analysis clearly.
GroupBy and aggregation operations sit at the heart of data analysis. Whether you’re calculating regional sales totals, computing user engagement metrics, or building feature engineering pipelines, you’ll use these operations constantly. Polars handles them exceptionally well, but the syntax differs enough from pandas that it trips up newcomers.
This article covers everything you need to know about grouping and aggregating data in Polars, from basic operations to performance-optimized patterns for production workloads.
Basic GroupBy Syntax
The fundamental pattern in Polars is group_by().agg(). Unlike pandas, where you chain aggregation methods directly, Polars separates the grouping operation from the aggregation specification. This design enables powerful optimizations and clearer code.
Let’s start with a simple sales dataset:
import polars as pl
# Create sample sales data
sales = pl.DataFrame({
"region": ["North", "South", "North", "East", "South", "East", "North", "South"],
"product": ["Widget", "Widget", "Gadget", "Widget", "Gadget", "Gadget", "Widget", "Widget"],
"revenue": [1200, 800, 1500, 950, 1100, 1300, 1400, 750],
"quantity": [10, 8, 12, 9, 11, 10, 14, 7]
})
# Basic groupby with single aggregation
regional_totals = sales.group_by("region").agg(
pl.col("revenue").sum()
)
print(regional_totals)
Output:
shape: (3, 2)
┌────────┬─────────┐
│ region ┆ revenue │
│ --- ┆ --- │
│ str ┆ i64 │
╞════════╪═════════╡
│ South ┆ 2650 │
│ North ┆ 4100 │
│ East ┆ 2250 │
└────────┴─────────┘
Notice that the output order isn’t guaranteed. Polars prioritizes speed over determinism by default. We’ll address this in the performance section.
Common single aggregations work as you’d expect:
# Count non-null values per group (pl.len() counts all rows, nulls included)
sales.group_by("region").agg(pl.col("revenue").count())
# Calculate mean
sales.group_by("region").agg(pl.col("revenue").mean())
# Find maximum
sales.group_by("region").agg(pl.col("quantity").max())
# Get first value in each group
sales.group_by("region").agg(pl.col("product").first())
Multiple Aggregations in One Pass
Real analysis rarely needs just one metric. Polars shines when computing multiple aggregations simultaneously. Pass a list of expressions to agg(), and Polars optimizes the entire computation:
# Multiple aggregations on different columns
summary = sales.group_by("product").agg([
pl.col("revenue").sum().alias("total_revenue"),
pl.col("revenue").mean().alias("avg_revenue"),
pl.col("revenue").min().alias("min_sale"),
pl.col("revenue").max().alias("max_sale"),
pl.col("quantity").sum().alias("units_sold"),
pl.len().alias("transaction_count")
])
print(summary)
Output:
shape: (2, 7)
┌────────┬───────────────┬─────────────┬──────────┬──────────┬────────────┬───────────────────┐
│ product┆ total_revenue ┆ avg_revenue ┆ min_sale ┆ max_sale ┆ units_sold ┆ transaction_count │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 ┆ i64 ┆ i64 ┆ i64 ┆ u32 │
╞════════╪═══════════════╪═════════════╪══════════╪══════════╪════════════╪═══════════════════╡
│ Widget ┆ 5100 ┆ 1020.0 ┆ 750 ┆ 1400 ┆ 48 ┆ 5 │
│ Gadget ┆ 3900 ┆ 1300.0 ┆ 1100 ┆ 1500 ┆ 33 ┆ 3 │
└────────┴───────────────┴─────────────┴──────────┴──────────┴────────────┴───────────────────┘
The alias() method renames output columns. Without it, each aggregation keeps its source column’s name (revenue here), so two aggregations of the same column would collide with a duplicate-name error.
You can also apply the same aggregation to multiple columns using selectors:
# Apply sum to all numeric columns
sales.group_by("region").agg(
pl.col("revenue", "quantity").sum().name.suffix("_total")
)
# Or use column selectors for more dynamic selection
sales.group_by("region").agg(
pl.all().exclude("region", "product").mean().name.suffix("_avg")
)
Grouping by Multiple Columns
Hierarchical groupings require multiple columns. Pass them as a list to group_by():
# Extended dataset with time dimension
sales_extended = pl.DataFrame({
"year": [2023, 2023, 2023, 2023, 2024, 2024, 2024, 2024],
"quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
"product_type": ["Hardware", "Hardware", "Software", "Software",
"Hardware", "Hardware", "Software", "Software"],
"revenue": [50000, 55000, 30000, 35000, 60000, 65000, 40000, 45000],
"units": [500, 520, 1000, 1100, 580, 620, 1200, 1350]
})
# Group by multiple columns
yearly_product_summary = sales_extended.group_by(["year", "product_type"]).agg([
pl.col("revenue").sum().alias("total_revenue"),
pl.col("units").sum().alias("total_units"),
pl.col("revenue").mean().alias("avg_quarterly_revenue")
])
print(yearly_product_summary)
Output:
shape: (4, 5)
┌──────┬──────────────┬───────────────┬─────────────┬───────────────────────┐
│ year ┆ product_type ┆ total_revenue ┆ total_units ┆ avg_quarterly_revenue │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ i64 ┆ f64 │
╞══════╪══════════════╪═══════════════╪═════════════╪═══════════════════════╡
│ 2023 ┆ Hardware ┆ 105000 ┆ 1020 ┆ 52500.0 │
│ 2023 ┆ Software ┆ 65000 ┆ 2100 ┆ 32500.0 │
│ 2024 ┆ Hardware ┆ 125000 ┆ 1200 ┆ 62500.0 │
│ 2024 ┆ Software ┆ 85000 ┆ 2550 ┆ 42500.0 │
└──────┴──────────────┴───────────────┴─────────────┴───────────────────────┘
Custom Aggregations with Expressions
Polars expressions unlock sophisticated aggregations that would require custom functions in pandas. The expression API is composable—you build complex calculations by chaining simple operations.
Conditional Aggregations
Use filter() within expressions to aggregate subsets of each group:
# Sales data with a threshold consideration
sales_detailed = pl.DataFrame({
"region": ["North", "North", "North", "South", "South", "South"],
"salesperson": ["Alice", "Bob", "Alice", "Carol", "Dave", "Carol"],
"deal_size": [5000, 3000, 8000, 4500, 2000, 7000],
"is_new_customer": [True, False, True, False, True, False]
})
# Conditional aggregations
regional_analysis = sales_detailed.group_by("region").agg([
pl.col("deal_size").sum().alias("total_sales"),
# Sum only deals above threshold
pl.col("deal_size").filter(pl.col("deal_size") > 4000).sum().alias("large_deals_total"),
# Count new customer acquisitions
pl.col("is_new_customer").filter(pl.col("is_new_customer")).count().alias("new_customers"),
# Calculate percentage of sales from new customers
(pl.col("deal_size").filter(pl.col("is_new_customer")).sum() /
pl.col("deal_size").sum() * 100).alias("new_customer_revenue_pct")
])
print(regional_analysis)
Weighted Averages
Weighted calculations are straightforward with expressions:
# Product ratings with review counts
products = pl.DataFrame({
"category": ["Electronics", "Electronics", "Clothing", "Clothing", "Clothing"],
"product_id": [1, 2, 3, 4, 5],
"rating": [4.5, 3.8, 4.2, 4.8, 3.5],
"review_count": [1200, 300, 800, 150, 500]
})
# Weighted average rating by category
category_ratings = products.group_by("category").agg([
# Simple average (not weighted)
pl.col("rating").mean().alias("simple_avg_rating"),
# Weighted average by review count (note the outer parentheses:
# alias() must apply to the whole ratio, not just the denominator)
((pl.col("rating") * pl.col("review_count")).sum() /
pl.col("review_count").sum()).alias("weighted_avg_rating"),
# Total reviews
pl.col("review_count").sum().alias("total_reviews")
])
print(category_ratings)
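To sanity-check the expression, the Electronics weighted average can be worked by hand from the two ratings above (4.5 with 1200 reviews, 3.8 with 300):

```python
# (4.5 * 1200 + 3.8 * 300) / (1200 + 300) = 6540 / 1500
weighted = (4.5 * 1200 + 3.8 * 300) / (1200 + 300)
print(weighted)  # 4.36
```

The high-volume 4.5 rating dominates, pulling the result well above the simple average of 4.15.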
List Aggregations
Sometimes you need to collect values rather than reduce them:
# Collect all values into a list
sales_detailed.group_by("region").agg([
pl.col("salesperson").unique().alias("salespeople"),
pl.col("deal_size").sort(descending=True).alias("deals_ranked")
])
Performance Tips
Polars is fast by default, but understanding a few patterns helps you squeeze out maximum performance.
The maintain_order Parameter
By default, group_by() doesn’t preserve input order—this allows parallel processing optimizations. If you need deterministic output order, set maintain_order=True:
# Deterministic but slightly slower
sales.group_by("region", maintain_order=True).agg(
pl.col("revenue").sum()
)
Use this sparingly. The performance cost is real, especially on large datasets. Only enable it when your downstream code actually depends on row order.
Lazy Evaluation for Large Datasets
For datasets that don’t fit comfortably in memory, or when you’re chaining many operations, lazy evaluation is transformative. Polars builds a query plan and optimizes it before execution:
# Lazy evaluation example
# In production, use scan_parquet, scan_csv, etc.
lazy_sales = sales.lazy()
result = (
lazy_sales
.filter(pl.col("revenue") > 900)
.group_by("region")
.agg([
pl.col("revenue").sum().alias("total_revenue"),
pl.col("quantity").mean().alias("avg_quantity")
])
.sort("total_revenue", descending=True)
.collect() # Execute the optimized plan
)
print(result)
For file-based workflows, scan_parquet() and scan_csv() read data lazily:
# Production pattern for large files
result = (
pl.scan_parquet("sales_data/*.parquet")
.group_by("region", "year")
.agg([
pl.col("revenue").sum(),
pl.col("customer_id").n_unique().alias("unique_customers")
])
.collect()
)
The query optimizer pushes filters down, eliminates unused columns early, and parallelizes operations automatically. On multi-gigabyte datasets, this can mean the difference between minutes and seconds.
Comparison to Pandas
In benchmarks, Polars groupby operations typically run 5-50x faster than equivalent pandas code. The gap widens with:
- More groups
- More aggregation functions
- Larger datasets
- String group keys
If you’re migrating from pandas, the main syntax change is explicit agg() calls instead of chained methods. The mental model shift pays dividends in both performance and code clarity.
Conclusion
Polars’ groupby and aggregation capabilities combine performance with expressiveness. The key patterns to remember:
- Use group_by().agg() with lists of expressions for multiple aggregations
- Leverage filter() within expressions for conditional aggregations
- Apply lazy evaluation with scan_* functions for large datasets
- Only use maintain_order=True when you actually need deterministic ordering
For advanced use cases, explore Polars’ group_by_dynamic() for time-based window aggregations and group_by().map_groups() when you truly need custom Python functions (though expressions should cover 95% of cases).
The official Polars documentation covers these topics in depth, and the library’s error messages are genuinely helpful when you hit edge cases.