How to Read Parquet Files in Polars
Key Insights
- Polars’ `scan_parquet()` with lazy evaluation enables predicate pushdown and column pruning, reading only the data you actually need from disk
- Reading Parquet files in Polars is typically 2-5x faster than pandas while using significantly less memory due to its Rust-based engine and zero-copy architecture
- Glob patterns and Hive-style partitioning support make Polars excellent for working with large, distributed datasets across multiple files
Why Polars for Parquet?
Parquet has become the de facto standard for analytical data storage. Its columnar format, efficient compression, and schema preservation make it ideal for data engineering workflows. But the tool you use to read Parquet matters enormously for performance.
Polars, built in Rust with Python bindings, treats Parquet as a first-class citizen. Unlike pandas, which relies on PyArrow or fastparquet as intermediaries, Polars has native Parquet support that exploits the format’s strengths. This means faster reads, lower memory usage, and query optimization that pushes filters directly into the file scan.
Let’s explore how to read Parquet files effectively with Polars.
Basic Parquet Reading
The simplest approach uses `pl.read_parquet()` for eager loading. This reads the entire file into memory immediately.
```python
import polars as pl

# Read a single Parquet file
df = pl.read_parquet("data/sales_2024.parquet")

# Inspect the schema
print(df.schema)
# {'order_id': Int64, 'customer_id': Int64, 'amount': Float64, 'order_date': Date}

# Preview the data
print(df.head())
# shape: (5, 4)
# ┌──────────┬─────────────┬────────┬────────────┐
# │ order_id ┆ customer_id ┆ amount ┆ order_date │
# │ ---      ┆ ---         ┆ ---    ┆ ---        │
# │ i64      ┆ i64         ┆ f64    ┆ date       │
# ╞══════════╪═════════════╪════════╪════════════╡
# │ 1        ┆ 101         ┆ 250.5  ┆ 2024-01-15 │
# │ 2        ┆ 102         ┆ 175.25 ┆ 2024-01-16 │
# │ ...      ┆ ...         ┆ ...    ┆ ...        │
# └──────────┴─────────────┴────────┴────────────┘

# Get basic statistics
print(df.describe())
```
This works well for files that fit comfortably in memory. Polars automatically infers types from Parquet’s embedded schema, preserving the exact data types without guessing.
Lazy Reading with scan_parquet()
For larger files or when you only need a subset of data, `scan_parquet()` is the better choice. It creates a lazy frame that doesn’t read any data until you explicitly call `.collect()`.
```python
import polars as pl

# Create a lazy frame - no data read yet
lf = pl.scan_parquet("data/large_transactions.parquet")

# Build a query
query = (
    lf
    .filter(pl.col("amount") > 1000)
    .select(["order_id", "customer_id", "amount"])
    .sort("amount", descending=True)
    .head(100)
)

# See the query plan before execution
print(query.explain())
# SLICE[first=100]
#   SORT BY [col("amount")]
#     FILTER [(col("amount")) > (1000)] FROM
#       Parquet SCAN data/large_transactions.parquet
#       PROJECT 3/4 COLUMNS

# Execute the query
result = query.collect()
```
The query plan reveals something important: Polars will only read 3 of the 4 columns and apply the filter during the scan. This is predicate and projection pushdown in action. For a 10GB Parquet file from which you need 100 filtered rows, this can reduce I/O by orders of magnitude.
Use `scan_parquet()` when:
- Files are larger than available RAM
- You’re filtering or selecting subsets of data
- You’re chaining multiple operations before materializing results
- You want to inspect the query plan before execution
Use `read_parquet()` when:
- Files are small and you need all the data
- You’re doing exploratory analysis and want immediate results
Reading Multiple Files and Directories
Real-world data pipelines rarely involve single files. Polars handles multiple files elegantly with glob patterns.
```python
import polars as pl

# Read all Parquet files in a directory
df = pl.read_parquet("data/sales/*.parquet")

# Read files matching a pattern
df = pl.read_parquet("data/sales/sales_2024_*.parquet")

# Lazy scan with glob pattern
lf = pl.scan_parquet("data/sales/**/*.parquet")  # Recursive
```
For Hive-style partitioned datasets (common in data lakes), Polars can extract partition values as columns:
```python
# Directory structure:
# data/events/
#   year=2023/month=01/data.parquet
#   year=2023/month=02/data.parquet
#   year=2024/month=01/data.parquet

# Read with Hive partitioning
lf = pl.scan_parquet(
    "data/events/**/*.parquet",
    hive_partitioning=True
)

# The 'year' and 'month' columns are automatically extracted
result = (
    lf
    .filter((pl.col("year") == 2024) & (pl.col("month") == 1))
    .collect()
)
```
Polars is smart enough to skip reading partition directories that don’t match your filter, making queries on partitioned data extremely fast.
Column Selection and Predicate Pushdown
Selecting only the columns you need is one of the easiest performance wins with columnar formats. Polars supports this at read time.
```python
import polars as pl

# Select specific columns during eager read
df = pl.read_parquet(
    "data/wide_table.parquet",
    columns=["user_id", "event_type", "timestamp"]
)

# With lazy evaluation, column pruning happens automatically
lf = pl.scan_parquet("data/wide_table.parquet")
result = (
    lf
    .select(["user_id", "event_type", "timestamp"])
    .filter(pl.col("event_type") == "purchase")
    .collect()
)
```
The lazy approach is preferable because Polars optimizes the entire query. If you later add more operations, the optimizer can make better decisions about what to read.
For complex filtering, predicate pushdown becomes even more valuable:
```python
import polars as pl
from datetime import date

lf = pl.scan_parquet("data/orders.parquet")

# Complex filter - pushed down to the Parquet scan
result = (
    lf
    .filter(
        (pl.col("order_date") >= date(2024, 1, 1)) &
        (pl.col("order_date") < date(2024, 4, 1)) &
        (pl.col("status").is_in(["completed", "shipped"])) &
        (pl.col("amount") > 100)
    )
    .group_by("customer_id")
    .agg(pl.col("amount").sum().alias("total_spent"))
    .collect()
)
```
Advanced Options
Polars provides fine-grained control over Parquet reading for specialized use cases.
```python
import polars as pl
import pyarrow.parquet as pq

# Read only the first N rows (useful for sampling)
sample = pl.read_parquet("data/large_file.parquet", n_rows=1000)

# Add a row index column
df = pl.read_parquet(
    "data/file.parquet",
    row_index_name="idx",
    row_index_offset=0
)

# Control the parallelism strategy
# "auto" (default), "columns", "row_groups", "none"
df = pl.read_parquet(
    "data/file.parquet",
    parallel="row_groups"  # Best for files with many row groups
)

# Read specific row groups - read_parquet has no row_groups argument,
# so drop down to PyArrow and convert the result
df = pl.from_arrow(
    pq.ParquetFile("data/file.parquet").read_row_groups([0, 2, 4])
)
```
For cloud storage, Polars integrates with object stores directly:
```python
import polars as pl

# Read from S3 (requires credentials in environment or AWS config)
df = pl.read_parquet("s3://my-bucket/data/file.parquet")

# With explicit credentials
storage_options = {
    "aws_access_key_id": "YOUR_KEY",
    "aws_secret_access_key": "YOUR_SECRET",
    "aws_region": "us-east-1"
}
lf = pl.scan_parquet(
    "s3://my-bucket/data/*.parquet",
    storage_options=storage_options
)

# GCS works similarly
df = pl.read_parquet("gs://my-bucket/data/file.parquet")
```
Performance Tips and Comparison
Let’s quantify the performance difference between Polars and pandas:
```python
import polars as pl
import pandas as pd
import time

# Generate test file (run once)
# pl.DataFrame({
#     "id": range(10_000_000),
#     "value": [float(i) for i in range(10_000_000)],
#     "category": ["A", "B", "C", "D"] * 2_500_000
# }).write_parquet("benchmark.parquet")

# Pandas timing
start = time.perf_counter()
pdf = pd.read_parquet("benchmark.parquet")
pandas_time = time.perf_counter() - start
print(f"Pandas: {pandas_time:.2f}s")

# Polars eager timing
start = time.perf_counter()
df = pl.read_parquet("benchmark.parquet")
polars_eager_time = time.perf_counter() - start
print(f"Polars eager: {polars_eager_time:.2f}s")

# Polars lazy with filter (the real advantage)
start = time.perf_counter()
result = (
    pl.scan_parquet("benchmark.parquet")
    .filter(pl.col("category") == "A")
    .select(["id", "value"])
    .collect()
)
polars_lazy_time = time.perf_counter() - start
print(f"Polars lazy+filter: {polars_lazy_time:.2f}s")

# Typical results on a 10M row file:
# Pandas: 1.85s
# Polars eager: 0.42s
# Polars lazy+filter: 0.15s
```
Best practices for maximum performance:
- Always use `scan_parquet()` for large files - let the optimizer do its job
- Filter early - put filters as early as possible in your query chain
- Select only needed columns - avoid `select("*")` patterns
- Use appropriate parallelism - `columns` for wide files, `row_groups` for files with many row groups
- Partition your data - use Hive-style partitioning for datasets you filter by date or category
- Compress wisely - Snappy for speed, Zstd for size; Polars handles both transparently
Polars’ Parquet support isn’t just faster—it’s architecturally superior for analytical workloads. The combination of lazy evaluation, predicate pushdown, and native Rust performance makes it the right choice for modern data engineering.