How to Use Lazy Evaluation in Polars

Key Insights

  • Lazy evaluation in Polars defers computation until you call collect(), enabling the query optimizer to reorder operations, eliminate redundant work, and push filters closer to the data source for dramatic performance gains.
  • Use scan_csv(), scan_parquet(), and df.lazy() to enter lazy mode, then chain transformations freely—Polars builds a query plan rather than executing immediately.
  • Streaming mode (collect(streaming=True)) extends lazy evaluation to datasets larger than RAM by processing data in batches, making Polars viable for big data workloads without Spark-like infrastructure.

Introduction to Lazy Evaluation

Polars offers two distinct execution modes: eager and lazy. Eager evaluation executes operations immediately, returning results after each step. Lazy evaluation defers all computation, building a query plan that executes only when you explicitly request results.

This distinction matters because lazy evaluation unlocks optimizations impossible in eager mode. When Polars sees your entire query upfront, it can reorder operations, eliminate unnecessary columns early, and push filters down to the data source. The result is faster execution and lower memory consumption.

Most Polars tutorials start with eager mode because it feels familiar—you call a method, you get a result. But production workloads demand lazy evaluation. Once you understand how to think in query plans rather than step-by-step transformations, you’ll rarely go back to eager mode for anything beyond quick exploration.

Lazy vs Eager API Comparison

The API difference between eager and lazy is minimal, which makes adoption straightforward. Here’s the same filtering operation in both modes:

import polars as pl

# Eager evaluation - executes immediately
df = pl.read_csv("sales.csv")
result = df.filter(pl.col("amount") > 1000)
print(result)

# Lazy evaluation - builds query plan, executes on collect()
lf = pl.scan_csv("sales.csv")
result = lf.filter(pl.col("amount") > 1000).collect()
print(result)

The eager version reads the entire CSV into memory, then filters. The lazy version does something smarter: it knows about the filter before reading any data, potentially skipping rows during the scan itself.

You can also convert between modes:

# Eager DataFrame to LazyFrame
df = pl.read_csv("sales.csv")
lf = df.lazy()

# LazyFrame to DataFrame
result = lf.filter(pl.col("amount") > 1000).collect()

The performance implications compound with query complexity. Consider a pipeline that filters, joins, groups, and selects specific columns. In eager mode, each operation materializes intermediate results. In lazy mode, Polars analyzes the entire pipeline and optimizes holistically.

Memory efficiency follows the same pattern. Eager evaluation keeps intermediate DataFrames in memory until garbage collection. Lazy evaluation never creates intermediates—it streams data through the optimized query plan.

Building Lazy Query Plans

LazyFrames represent query plans, not data. When you chain operations on a LazyFrame, you’re constructing a directed acyclic graph (DAG) of transformations. Nothing executes until you call collect().

import polars as pl

# Build a complex query plan
lf = (
    pl.scan_csv("orders.csv")
    .filter(pl.col("status") == "completed")
    .with_columns(
        pl.col("order_date").str.to_datetime(),
        (pl.col("quantity") * pl.col("unit_price")).alias("total")
    )
    .filter(pl.col("total") > 500)
    .group_by("customer_id")
    .agg(
        pl.col("total").sum().alias("customer_total"),
        pl.col("order_date").max().alias("last_order")
    )
    .sort("customer_total", descending=True)
    .head(100)
)

# Nothing has executed yet - lf is just a plan
print(type(lf))  # <class 'polars.LazyFrame'>

# Execute the plan
result = lf.collect()

This query plan includes filtering, column creation, aggregation, sorting, and limiting. Polars sees all of it before touching the CSV file. It knows you only need certain columns, so it won’t read others. It knows about both filters, so it can apply them optimally.

The scan_* family of functions creates LazyFrames directly from files:

# Different data sources, same lazy interface
lf_csv = pl.scan_csv("data.csv")
lf_parquet = pl.scan_parquet("data.parquet")
lf_json = pl.scan_ndjson("data.jsonl")
lf_ipc = pl.scan_ipc("data.arrow")

Parquet files benefit most from lazy evaluation because the format supports predicate pushdown natively. Polars can skip entire row groups that don’t match your filters.

Query Optimization Under the Hood

Polars applies several optimization passes to your query plan. Understanding these helps you write queries that optimize well.

Predicate pushdown moves filters as close to the data source as possible. If you filter after a join, Polars may push that filter before the join if it only references columns from one side.

Projection pushdown eliminates columns you never use. If your query only selects three columns from a 50-column dataset, Polars won’t read the other 47.

Common subexpression elimination identifies repeated calculations and computes them once.

Inspect what Polars actually plans to do with explain():

import polars as pl

lf = (
    pl.scan_csv("products.csv")
    .filter(pl.col("category") == "electronics")
    .select("product_id", "name", "price")
    .filter(pl.col("price") > 100)
)

# View the optimized query plan
print(lf.explain())

The output shows the optimized plan in a readable format. You’ll see that both filters are combined and pushed down, and only the three selected columns are read.

For complex queries, visualize the plan as a graph:

# Generate a visualization (requires graphviz)
lf.show_graph(optimized=True)

# Compare optimized vs unoptimized
print("=== Unoptimized ===")
print(lf.explain(optimized=False))
print("\n=== Optimized ===")
print(lf.explain(optimized=True))

The unoptimized plan shows your operations in the order you wrote them. The optimized plan shows what Polars actually executes. Comparing these reveals how much work the optimizer saves you.

Streaming Large Datasets

Lazy evaluation enables streaming execution for datasets that exceed available RAM. Instead of loading everything into memory, Polars processes data in batches.

import polars as pl

# Process a 50GB file on a machine with 16GB RAM
result = (
    pl.scan_csv("massive_logs.csv")
    .filter(pl.col("level") == "ERROR")
    .group_by("service")
    .agg(pl.len().alias("error_count"))
    .sort("error_count", descending=True)
    .collect(streaming=True)
)

Streaming mode works by breaking the query into chunks that fit in memory. Not all operations support streaming—anything requiring a global view of the data (like certain joins or sorts on unsorted data) may force materialization.

Check which parts of your query can stream before running it:

lf = pl.scan_parquet("huge_dataset/*.parquet")

# Sections of the plan that can run in the streaming engine are
# marked STREAMING in the output
print(lf.explain(streaming=True))

# Operations that can't stream fall back to the in-memory engine
result = lf.filter(...).group_by(...).agg(...).collect(streaming=True)

For truly massive datasets, combine streaming with sink operations that write results directly to disk:

# Stream results directly to Parquet without holding in memory
(
    pl.scan_csv("input/*.csv")
    .filter(pl.col("valid"))
    .with_columns(pl.col("timestamp").str.to_datetime())
    .sink_parquet("output/processed.parquet")
)

The sink_* methods extend streaming to output, enabling end-to-end processing of datasets limited only by disk space.

Common Patterns and Best Practices

Delay collect() as long as possible. Every collect() call materializes results and breaks the optimization chain. Build your entire pipeline lazily, then collect once at the end.

# Bad: Multiple collects break optimization
df1 = pl.scan_csv("data.csv").filter(...).collect()
df2 = df1.lazy().with_columns(...).collect()
df3 = df2.lazy().group_by(...).agg(...).collect()

# Good: Single collect after full pipeline
result = (
    pl.scan_csv("data.csv")
    .filter(...)
    .with_columns(...)
    .group_by(...)
    .agg(...)
    .collect()
)

Use scan_* instead of read_* for files. Starting lazy means you never accidentally materialize data you’ll filter away.

Refactor eager pipelines incrementally. Here’s a real-world example:

# Original eager code
df = pl.read_parquet("events.parquet")
df = df.filter(pl.col("event_type") == "purchase")
df = df.with_columns(pl.col("amount").cast(pl.Float64))
df = df.group_by("user_id").agg(pl.col("amount").sum())
df = df.filter(pl.col("amount") > 1000)
df = df.sort("amount", descending=True).head(100)

# Refactored lazy code
result = (
    pl.scan_parquet("events.parquet")
    .filter(pl.col("event_type") == "purchase")
    .with_columns(pl.col("amount").cast(pl.Float64))
    .group_by("user_id")
    .agg(pl.col("amount").sum())
    .filter(pl.col("amount") > 1000)
    .sort("amount", descending=True)
    .head(100)
    .collect()
)

The refactored version enables predicate pushdown for the event_type filter, projection pushdown to skip unused columns, and optimized execution order.

Profile before and after. Polars provides timing information:

# Profile the query - returns the result plus per-node timings
result, timings = lf.profile()
print(timings)

# Or use Python's timing
import time
start = time.perf_counter()
result = lf.collect()
print(f"Execution time: {time.perf_counter() - start:.2f}s")

Conclusion

Lazy evaluation transforms Polars from a fast DataFrame library into a query optimization engine. By deferring execution, you give Polars the context it needs to eliminate waste, reorder operations, and process data efficiently.

Start by replacing read_csv() with scan_csv() in your existing code. Chain your transformations without intermediate variables. Call collect() once at the end. For large datasets, add streaming=True and watch Polars handle files that would crash pandas.

The mental shift from “execute this operation” to “add this to my query plan” takes practice. But once you internalize it, you’ll write faster code with less effort. The optimizer handles the details you used to manage manually.
