Polars: Lazy vs Eager Evaluation Guide


Key Insights

  • Lazy evaluation enables Polars to optimize your entire query plan before execution, often delivering 2-10x performance improvements through predicate pushdown, projection pruning, and parallel execution strategies
  • Use eager mode for interactive exploration and small datasets; switch to lazy mode for production pipelines, large files, and any scenario where you’re chaining multiple operations
  • The explain() method is your best friend for understanding what optimizations Polars applies—use it to verify your queries are being optimized as expected

Introduction to Polars Evaluation Modes

Polars has emerged as the high-performance alternative to pandas, and one of its most powerful features is the choice between eager and lazy evaluation. This isn’t just an academic distinction—it fundamentally changes how your data processing code executes and performs.

Eager evaluation executes operations immediately, just like pandas. You call a method, it runs, you get results. Simple and familiar.

Lazy evaluation defers all computation until you explicitly request results. Polars builds a query plan, optimizes it, then executes everything in one optimized pass.

Understanding when to use each mode is the difference between code that processes gigabytes in seconds versus code that crashes or crawls. Let’s dig in.

Eager Evaluation Explained

Eager mode is Polars’ default behavior. Every operation executes immediately and returns a concrete DataFrame with actual data in memory.

import polars as pl

# Eager: operations execute immediately
df = pl.read_csv("sales_data.csv")  # File read NOW

# Each operation runs and returns results immediately
filtered = df.filter(pl.col("amount") > 100)  # Executes NOW
selected = filtered.select(["customer_id", "amount", "date"])  # Executes NOW
grouped = selected.group_by("customer_id").agg(
    pl.col("amount").sum().alias("total_amount"),
    pl.col("amount").count().alias("transaction_count")
)  # Executes NOW

print(grouped.head())

Each line creates a new DataFrame in memory. If your CSV has 50 columns but you only need 3, eager mode still reads all 50 into memory first, then discards 47. If you filter out 90% of rows, you’ve already loaded 100% into memory.

This is fine for:

  • Interactive exploration in notebooks
  • Small datasets (under a few hundred MB)
  • Quick one-off analyses
  • Debugging and understanding your data

The mental model is straightforward: what you write is what executes, in that order.

Lazy Evaluation Explained

Lazy mode flips the script. Nothing executes until you call collect(). Instead, Polars builds a query plan—a directed acyclic graph of operations.

import polars as pl

# Lazy: nothing executes yet
lf = pl.scan_csv("sales_data.csv")  # No file read, just a plan

# Building the query plan, no execution
lf = lf.filter(pl.col("amount") > 100)  # Plan updated
lf = lf.select(["customer_id", "amount", "date"])  # Plan updated
lf = lf.group_by("customer_id").agg(
    pl.col("amount").sum().alias("total_amount"),
    pl.col("amount").count().alias("transaction_count")
)  # Plan updated

# NOW everything executes in one optimized pass
result = lf.collect()
print(result.head())

The variable lf is a LazyFrame, not a DataFrame. It contains no data—only a description of what operations to perform. When you call collect(), Polars analyzes the entire plan, optimizes it, and executes everything.

You can convert between modes freely:

# Eager DataFrame to LazyFrame
df = pl.read_csv("data.csv")
lf = df.lazy()

# LazyFrame to eager DataFrame
lf = pl.scan_csv("data.csv")
df = lf.collect()

Query Optimization in Lazy Mode

Here’s where lazy evaluation earns its keep. Polars applies several optimizations automatically:

Predicate Pushdown: Filters move as early as possible in the plan. If you filter rows after a join, Polars pushes that filter before the join when possible, reducing work.

Projection Pushdown: Only columns actually needed are read from disk. Select 3 columns from a 50-column CSV? Only those 3 are parsed.

Slice Pushdown: If you only need the first 100 rows, Polars stops reading after 100 rows.

Common Subexpression Elimination: Repeated calculations are computed once and reused.

Use explain() to see the optimized plan:

import polars as pl

lf = (
    pl.scan_csv("sales_data.csv")
    .filter(pl.col("amount") > 100)
    .filter(pl.col("region") == "WEST")
    .select(["customer_id", "amount"])
    .group_by("customer_id")
    .agg(pl.col("amount").sum())
)

# View the optimized query plan
print(lf.explain())

Output shows something like:

AGGREGATE
    [col("amount").sum()] BY [col("customer_id")] FROM
    CSV SCAN sales_data.csv
    PROJECT 2/12 COLUMNS
    SELECTION: [(col("amount")) > (100)] & [(col("region")) == (String(WEST))]

Notice PROJECT 2/12 COLUMNS: only 2 of the 12 columns end up in the scan's output (region is still read to evaluate the pushed-down filter, but is never materialized downstream). The two filters are combined and pushed down to the scan level. This happens automatically.

Compare with the unoptimized plan:

print(lf.explain(optimized=False))

You’ll see the operations in your original order, without pushdowns. The difference in execution time can be dramatic.

Performance Comparison

Let’s benchmark both modes on a realistic workload:

import polars as pl
import time

# Generate a larger dataset for meaningful comparison
def create_test_data(n_rows: int = 5_000_000):
    import random
    return pl.DataFrame({
        "id": range(n_rows),
        "category": [f"cat_{i % 100}" for i in range(n_rows)],
        "region": [["NORTH", "SOUTH", "EAST", "WEST"][i % 4] for i in range(n_rows)],
        "amount": [random.uniform(10, 1000) for _ in range(n_rows)],
        "quantity": [random.randint(1, 100) for _ in range(n_rows)],
        # Add extra columns that won't be used
        **{f"unused_{i}": range(n_rows) for i in range(20)}
    })

df = create_test_data()
df.write_csv("benchmark_data.csv")

# Eager benchmark
start = time.perf_counter()
result_eager = (
    pl.read_csv("benchmark_data.csv")
    .filter(pl.col("amount") > 500)
    .filter(pl.col("region") == "WEST")
    .select(["category", "amount", "quantity"])
    .group_by("category")
    .agg([
        pl.col("amount").sum().alias("total_amount"),
        pl.col("quantity").mean().alias("avg_quantity")
    ])
)
eager_time = time.perf_counter() - start

# Lazy benchmark
start = time.perf_counter()
result_lazy = (
    pl.scan_csv("benchmark_data.csv")
    .filter(pl.col("amount") > 500)
    .filter(pl.col("region") == "WEST")
    .select(["category", "amount", "quantity"])
    .group_by("category")
    .agg([
        pl.col("amount").sum().alias("total_amount"),
        pl.col("quantity").mean().alias("avg_quantity")
    ])
    .collect()
)
lazy_time = time.perf_counter() - start

print(f"Eager: {eager_time:.3f}s")
print(f"Lazy:  {lazy_time:.3f}s")
print(f"Speedup: {eager_time / lazy_time:.2f}x")

On a 5-million-row dataset with 25 columns where you only need 3, lazy mode typically runs 2-4x faster. The gains come from:

  • Reading only 3 columns instead of 25
  • Applying filters during the scan, not after loading everything

Memory usage differs too. Eager mode peaks at full dataset size; lazy mode peaks at filtered subset size.

Choosing the Right Mode

Here’s my decision framework:

Use eager when:

  • Exploring data interactively
  • Dataset fits comfortably in memory
  • You need to inspect intermediate results frequently
  • Operations are simple (single filter, quick aggregation)

Use lazy when:

  • Building production data pipelines
  • Working with files larger than available RAM
  • Chaining multiple operations together
  • Reading from Parquet, CSV, or other file formats
  • Performance matters

Hybrid approach for exploration:

import polars as pl

# Start lazy for efficient loading
lf = pl.scan_parquet("huge_dataset.parquet")

# Apply filters and selections lazily
lf = lf.filter(pl.col("date") >= "2024-01-01")
lf = lf.select(["user_id", "event_type", "value"])

# Collect a sample for exploration
sample = lf.head(10000).collect()

# Explore eagerly
print(sample.describe())
print(sample.group_by("event_type").len())

# Once you understand the data, build your full lazy pipeline
final_result = (
    lf
    .group_by("user_id")
    .agg(pl.col("value").sum())
    .filter(pl.col("value") > 1000)
    .collect()
)

Common Pitfalls and Best Practices

Pitfall 1: Collecting too early

# Bad: defeats the purpose of lazy evaluation
lf = pl.scan_csv("data.csv")
df = lf.collect()  # Collected everything!
result = df.filter(...).select(...).group_by(...)

# Good: stay lazy until the end
result = (
    pl.scan_csv("data.csv")
    .filter(...)
    .select(...)
    .group_by(...)
    .collect()  # Single collection at the end
)

Pitfall 2: Forgetting to collect

# Bug: this returns a LazyFrame, not results
def get_summary(path: str):
    return (
        pl.scan_csv(path)
        .group_by("category")
        .agg(pl.col("amount").sum())
    )  # Missing .collect()!

# The caller gets a LazyFrame, not data
summary = get_summary("data.csv")
print(summary)  # Prints query plan, not data

Pitfall 3: Breaking the optimization chain

# Bad: intermediate collect breaks optimization
lf = pl.scan_csv("data.csv")
df = lf.filter(pl.col("x") > 0).collect()  # Optimization boundary
result = df.lazy().select(["x", "y"]).collect()  # Projection can't push down

# Good: single lazy chain
result = (
    pl.scan_csv("data.csv")
    .filter(pl.col("x") > 0)
    .select(["x", "y"])
    .collect()
)

Best practices for production:

  1. Always use scan_* functions for file reading in pipelines
  2. Chain operations fluently without intermediate variables
  3. Call collect() exactly once, at the end
  4. Use explain() to verify optimizations are applied
  5. Consider the streaming engine for datasets larger than RAM: collect(streaming=True) in older releases; newer Polars versions spell this collect(engine="streaming")

# Production-ready pattern
def process_sales_data(input_path: str, output_path: str) -> None:
    (
        pl.scan_parquet(input_path)
        .filter(pl.col("status") == "completed")
        .select(["order_id", "customer_id", "amount", "date"])
        .with_columns(pl.col("date").dt.month().alias("month"))
        .group_by(["customer_id", "month"])
        .agg(pl.col("amount").sum().alias("monthly_total"))
        .collect(streaming=True)  # newer Polars: .collect(engine="streaming")
        .write_parquet(output_path)
    )

Lazy evaluation is Polars’ superpower. Use it by default for anything beyond quick exploration, and your data pipelines will be faster, use less memory, and scale to datasets that would choke pandas.
