How to GroupBy and Aggregate in Polars

Key Insights

  • Polars’ group_by().agg() pattern lets you compute multiple aggregations across different columns in a single, highly optimized pass through your data.
  • Expression-based aggregations give you fine-grained control over complex calculations like conditional sums, weighted averages, and filtered counts without writing custom functions.
  • Lazy evaluation with scan_* functions and collect() can dramatically improve performance on large datasets by optimizing the entire query plan before execution.

Introduction

Polars has rapidly become the go-to DataFrame library for Python developers who need speed. Built in Rust with a query optimizer, it consistently outperforms pandas by 10-100x on common operations. But raw speed means nothing if you can’t express your analysis clearly.

GroupBy and aggregation operations sit at the heart of data analysis. Whether you’re calculating regional sales totals, computing user engagement metrics, or building feature engineering pipelines, you’ll use these operations constantly. Polars handles them exceptionally well, but the syntax differs enough from pandas that it trips up newcomers.

This article covers everything you need to know about grouping and aggregating data in Polars, from basic operations to performance-optimized patterns for production workloads.

Basic GroupBy Syntax

The fundamental pattern in Polars is group_by().agg(). Unlike pandas, where you chain aggregation methods directly, Polars separates the grouping operation from the aggregation specification. This design enables powerful optimizations and clearer code.

Let’s start with a simple sales dataset:

import polars as pl

# Create sample sales data
sales = pl.DataFrame({
    "region": ["North", "South", "North", "East", "South", "East", "North", "South"],
    "product": ["Widget", "Widget", "Gadget", "Widget", "Gadget", "Gadget", "Widget", "Widget"],
    "revenue": [1200, 800, 1500, 950, 1100, 1300, 1400, 750],
    "quantity": [10, 8, 12, 9, 11, 10, 14, 7]
})

# Basic groupby with single aggregation
regional_totals = sales.group_by("region").agg(
    pl.col("revenue").sum()
)

print(regional_totals)

Output:

shape: (3, 2)
┌────────┬─────────┐
│ region ┆ revenue │
│ ---    ┆ ---     │
│ str    ┆ i64     │
╞════════╪═════════╡
│ South  ┆ 2650    │
│ North  ┆ 4100    │
│ East   ┆ 2250    │
└────────┴─────────┘

Notice that the output order isn’t guaranteed. Polars prioritizes speed over determinism by default. We’ll address this in the performance section.

Common single aggregations work as you’d expect:

# Count non-null values per group (use pl.len() to count rows)
sales.group_by("region").agg(pl.col("revenue").count())

# Calculate mean
sales.group_by("region").agg(pl.col("revenue").mean())

# Find maximum
sales.group_by("region").agg(pl.col("quantity").max())

# Get first value in each group
sales.group_by("region").agg(pl.col("product").first())

Multiple Aggregations in One Pass

Real analysis rarely needs just one metric. Polars shines when computing multiple aggregations simultaneously. Pass a list of expressions to agg(), and Polars optimizes the entire computation:

# Multiple aggregations on different columns
summary = sales.group_by("product").agg([
    pl.col("revenue").sum().alias("total_revenue"),
    pl.col("revenue").mean().alias("avg_revenue"),
    pl.col("revenue").min().alias("min_sale"),
    pl.col("revenue").max().alias("max_sale"),
    pl.col("quantity").sum().alias("units_sold"),
    pl.len().alias("transaction_count")
])

print(summary)

Output:

shape: (2, 7)
┌─────────┬───────────────┬─────────────┬──────────┬──────────┬────────────┬───────────────────┐
│ product ┆ total_revenue ┆ avg_revenue ┆ min_sale ┆ max_sale ┆ units_sold ┆ transaction_count │
│ ---     ┆ ---           ┆ ---         ┆ ---      ┆ ---      ┆ ---        ┆ ---               │
│ str     ┆ i64           ┆ f64         ┆ i64      ┆ i64      ┆ i64        ┆ u32               │
╞═════════╪═══════════════╪═════════════╪══════════╪══════════╪════════════╪═══════════════════╡
│ Widget  ┆ 5100          ┆ 1020.0      ┆ 750      ┆ 1400     ┆ 48         ┆ 5                 │
│ Gadget  ┆ 3900          ┆ 1300.0      ┆ 1100     ┆ 1500     ┆ 33         ┆ 3                 │
└─────────┴───────────────┴─────────────┴──────────┴──────────┴────────────┴───────────────────┘

The alias() method renames output columns. Without it, the aggregated column keeps its source name (here, revenue), and two unaliased aggregations on the same column collide with a duplicate-name error.

You can also apply the same aggregation to multiple columns using selectors:

# Apply sum to all numeric columns
sales.group_by("region").agg(
    pl.col("revenue", "quantity").sum().name.suffix("_total")
)

# Or use column selectors for more dynamic selection
sales.group_by("region").agg(
    pl.all().exclude("region", "product").mean().name.suffix("_avg")
)

Grouping by Multiple Columns

Hierarchical groupings require multiple columns. Pass them as a list to group_by():

# Extended dataset with time dimension
sales_extended = pl.DataFrame({
    "year": [2023, 2023, 2023, 2023, 2024, 2024, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "product_type": ["Hardware", "Hardware", "Software", "Software", 
                     "Hardware", "Hardware", "Software", "Software"],
    "revenue": [50000, 55000, 30000, 35000, 60000, 65000, 40000, 45000],
    "units": [500, 520, 1000, 1100, 580, 620, 1200, 1350]
})

# Group by multiple columns
yearly_product_summary = sales_extended.group_by(["year", "product_type"]).agg([
    pl.col("revenue").sum().alias("total_revenue"),
    pl.col("units").sum().alias("total_units"),
    pl.col("revenue").mean().alias("avg_quarterly_revenue")
])

print(yearly_product_summary)

Output:

shape: (4, 5)
┌──────┬──────────────┬───────────────┬─────────────┬───────────────────────┐
│ year ┆ product_type ┆ total_revenue ┆ total_units ┆ avg_quarterly_revenue │
│ ---  ┆ ---          ┆ ---           ┆ ---         ┆ ---                   │
│ i64  ┆ str          ┆ i64           ┆ i64         ┆ f64                   │
╞══════╪══════════════╪═══════════════╪═════════════╪═══════════════════════╡
│ 2023 ┆ Hardware     ┆ 105000        ┆ 1020        ┆ 52500.0               │
│ 2023 ┆ Software     ┆ 65000         ┆ 2100        ┆ 32500.0               │
│ 2024 ┆ Hardware     ┆ 125000        ┆ 1200        ┆ 62500.0               │
│ 2024 ┆ Software     ┆ 85000         ┆ 2550        ┆ 42500.0               │
└──────┴──────────────┴───────────────┴─────────────┴───────────────────────┘

Custom Aggregations with Expressions

Polars expressions unlock sophisticated aggregations that would require custom functions in pandas. The expression API is composable—you build complex calculations by chaining simple operations.

Conditional Aggregations

Use filter() within expressions to aggregate subsets of each group:

# Sales data with a threshold consideration
sales_detailed = pl.DataFrame({
    "region": ["North", "North", "North", "South", "South", "South"],
    "salesperson": ["Alice", "Bob", "Alice", "Carol", "Dave", "Carol"],
    "deal_size": [5000, 3000, 8000, 4500, 2000, 7000],
    "is_new_customer": [True, False, True, False, True, False]
})

# Conditional aggregations
regional_analysis = sales_detailed.group_by("region").agg([
    pl.col("deal_size").sum().alias("total_sales"),
    
    # Sum only deals above threshold
    pl.col("deal_size").filter(pl.col("deal_size") > 4000).sum().alias("large_deals_total"),
    
    # Count new customer acquisitions
    pl.col("is_new_customer").filter(pl.col("is_new_customer")).count().alias("new_customers"),
    
    # Calculate percentage of sales from new customers
    (pl.col("deal_size").filter(pl.col("is_new_customer")).sum() / 
     pl.col("deal_size").sum() * 100).alias("new_customer_revenue_pct")
])

print(regional_analysis)

Weighted Averages

Weighted calculations are straightforward with expressions:

# Product ratings with review counts
products = pl.DataFrame({
    "category": ["Electronics", "Electronics", "Clothing", "Clothing", "Clothing"],
    "product_id": [1, 2, 3, 4, 5],
    "rating": [4.5, 3.8, 4.2, 4.8, 3.5],
    "review_count": [1200, 300, 800, 150, 500]
})

# Weighted average rating by category
category_ratings = products.group_by("category").agg([
    # Simple average (not weighted)
    pl.col("rating").mean().alias("simple_avg_rating"),
    
    # Weighted average by review count
    ((pl.col("rating") * pl.col("review_count")).sum() /
     pl.col("review_count").sum()).alias("weighted_avg_rating"),
    
    # Total reviews
    pl.col("review_count").sum().alias("total_reviews")
])

print(category_ratings)

List Aggregations

Sometimes you need to collect values rather than reduce them:

# Collect all values into a list
sales_detailed.group_by("region").agg([
    pl.col("salesperson").unique().alias("salespeople"),
    pl.col("deal_size").sort(descending=True).alias("deals_ranked")
])

Performance Tips

Polars is fast by default, but understanding a few patterns helps you squeeze out maximum performance.

The maintain_order Parameter

By default, group_by() doesn’t preserve input order—this allows parallel processing optimizations. If you need deterministic output order, set maintain_order=True:

# Deterministic but slightly slower
sales.group_by("region", maintain_order=True).agg(
    pl.col("revenue").sum()
)

Use this sparingly. The performance cost is real, especially on large datasets. Only enable it when your downstream code actually depends on row order.

Lazy Evaluation for Large Datasets

For datasets that don’t fit comfortably in memory, or when you’re chaining many operations, lazy evaluation is transformative. Polars builds a query plan and optimizes it before execution:

# Lazy evaluation example
# In production, use scan_parquet, scan_csv, etc.
lazy_sales = sales.lazy()

result = (
    lazy_sales
    .filter(pl.col("revenue") > 900)
    .group_by("region")
    .agg([
        pl.col("revenue").sum().alias("total_revenue"),
        pl.col("quantity").mean().alias("avg_quantity")
    ])
    .sort("total_revenue", descending=True)
    .collect()  # Execute the optimized plan
)

print(result)

For file-based workflows, scan_parquet() and scan_csv() read data lazily:

# Production pattern for large files
result = (
    pl.scan_parquet("sales_data/*.parquet")
    .group_by("region", "year")
    .agg([
        pl.col("revenue").sum(),
        pl.col("customer_id").n_unique().alias("unique_customers")
    ])
    .collect()
)

The query optimizer pushes filters down, eliminates unused columns early, and parallelizes operations automatically. On multi-gigabyte datasets, this can mean the difference between minutes and seconds.

Comparison to Pandas

In public benchmarks, Polars groupby operations typically run one to two orders of magnitude faster than equivalent pandas code. The gap widens with:

  • More groups
  • More aggregation functions
  • Larger datasets
  • String group keys

If you’re migrating from pandas, the main syntax change is explicit agg() calls instead of chained methods. The mental model shift pays dividends in both performance and code clarity.

Conclusion

Polars’ groupby and aggregation capabilities combine performance with expressiveness. The key patterns to remember:

  • Use group_by().agg() with lists of expressions for multiple aggregations
  • Leverage filter() within expressions for conditional aggregations
  • Apply lazy evaluation with scan_* functions for large datasets
  • Only use maintain_order=True when you actually need deterministic ordering

For advanced use cases, explore Polars’ group_by_dynamic() for time-based window aggregations and group_by().map_groups() when you truly need custom Python functions (though expressions should cover 95% of cases).

The official Polars documentation covers these topics in depth, and the library’s error messages are genuinely helpful when you hit edge cases.
