How to Read Parquet Files in Polars
Key Insights
- Polars’ `scan_parquet()` with lazy evaluation enables predicate pushdown and column pruning, reading only the data you actually need from disk
- Reading Parquet files in Polars is typically 2-5x faster than pandas while using significantly less memory due to its Rust-based engine and zero-copy architecture
- Glob patterns and Hive-style partitioning support make Polars excellent for working with large, distributed datasets across multiple files
Why Polars for Parquet?
Parquet has become the de facto standard for analytical data storage. Its columnar format, efficient compression, and schema preservation make it ideal for data engineering workflows. But the tool you use to read Parquet matters enormously for performance.
Polars, built in Rust with Python bindings, treats Parquet as a first-class citizen. Unlike pandas, which relies on PyArrow or fastparquet as intermediaries, Polars has native Parquet support that exploits the format’s strengths. This means faster reads, lower memory usage, and query optimization that pushes filters directly into the file scan.
Let’s explore how to read Parquet files effectively with Polars.
Basic Parquet Reading
The simplest approach uses `pl.read_parquet()` for eager loading. This reads the entire file into memory immediately.
```python
import polars as pl

# Read a single Parquet file
df = pl.read_parquet("data/sales_2024.parquet")

# Inspect the schema
print(df.schema)
# {'order_id': Int64, 'customer_id': Int64, 'amount': Float64, 'order_date': Date}

# Preview the data
print(df.head())
# shape: (5, 4)
# ┌──────────┬─────────────┬────────┬────────────┐
# │ order_id ┆ customer_id ┆ amount ┆ order_date │
# │ ---      ┆ ---         ┆ ---    ┆ ---        │
# │ i64      ┆ i64         ┆ f64    ┆ date       │
# ╞══════════╪═════════════╪════════╪════════════╡
# │ 1        ┆ 101         ┆ 250.5  ┆ 2024-01-15 │
# │ 2        ┆ 102         ┆ 175.25 ┆ 2024-01-16 │
# │ ...      ┆ ...         ┆ ...    ┆ ...        │
# └──────────┴─────────────┴────────┴────────────┘

# Get basic statistics
print(df.describe())
```
This works well for files that fit comfortably in memory. Polars automatically infers types from Parquet’s embedded schema, preserving the exact data types without guessing.
Lazy Reading with scan_parquet()
For larger files or when you only need a subset of data, `scan_parquet()` is the better choice. It creates a lazy frame that doesn’t read any data until you explicitly call `.collect()`.
```python
import polars as pl

# Create a lazy frame - no data read yet
lf = pl.scan_parquet("data/large_transactions.parquet")

# Build a query
query = (
    lf
    .filter(pl.col("amount") > 1000)
    .select(["order_id", "customer_id", "amount"])
    .sort("amount", descending=True)
    .head(100)
)

# See the query plan before execution
print(query.explain())
# SLICE[first=100]
#   SORT BY [col("amount")]
#     FILTER [(col("amount")) > (1000)] FROM
#       Parquet SCAN data/large_transactions.parquet
#       PROJECT 3/4 COLUMNS

# Execute the query
result = query.collect()
```
The query plan reveals something important: Polars will only read 3 of the 4 columns and apply the filter during the scan. This is predicate and projection pushdown in action. For a 10GB Parquet file from which you need 100 filtered rows, this can reduce I/O by orders of magnitude.
Use `scan_parquet()` when:
- Files are larger than available RAM
- You’re filtering or selecting subsets of data
- You’re chaining multiple operations before materializing results
- You want to inspect the query plan before execution
Use `read_parquet()` when:
- Files are small and you need all the data
- You’re doing exploratory analysis and want immediate results
Reading Multiple Files and Directories
Real-world data pipelines rarely involve single files. Polars handles multiple files elegantly with glob patterns.
```python
import polars as pl

# Read all Parquet files in a directory
df = pl.read_parquet("data/sales/*.parquet")

# Read files matching a pattern
df = pl.read_parquet("data/sales/sales_2024_*.parquet")

# Lazy scan with glob pattern
lf = pl.scan_parquet("data/sales/**/*.parquet")  # Recursive
```
For Hive-style partitioned datasets (common in data lakes), Polars can extract partition values as columns:
```python
# Directory structure:
# data/events/
#   year=2023/month=01/data.parquet
#   year=2023/month=02/data.parquet
#   year=2024/month=01/data.parquet

# Read with Hive partitioning
lf = pl.scan_parquet(
    "data/events/**/*.parquet",
    hive_partitioning=True
)

# The 'year' and 'month' columns are automatically extracted
result = (
    lf
    .filter((pl.col("year") == 2024) & (pl.col("month") == 1))
    .collect()
)
```
Polars is smart enough to skip reading partition directories that don’t match your filter, making queries on partitioned data extremely fast.
Column Selection and Predicate Pushdown
Selecting only the columns you need is one of the easiest performance wins with columnar formats. Polars supports this at read time.
```python
import polars as pl

# Select specific columns during eager read
df = pl.read_parquet(
    "data/wide_table.parquet",
    columns=["user_id", "event_type", "timestamp"]
)

# With lazy evaluation, column pruning happens automatically
lf = pl.scan_parquet("data/wide_table.parquet")
result = (
    lf
    .select(["user_id", "event_type", "timestamp"])
    .filter(pl.col("event_type") == "purchase")
    .collect()
)
```
The lazy approach is preferable because Polars optimizes the entire query. If you later add more operations, the optimizer can make better decisions about what to read.
For complex filtering, predicate pushdown becomes even more valuable:
```python
import polars as pl
from datetime import date

lf = pl.scan_parquet("data/orders.parquet")

# Complex filter - pushed down to the Parquet scan
result = (
    lf
    .filter(
        (pl.col("order_date") >= date(2024, 1, 1)) &
        (pl.col("order_date") < date(2024, 4, 1)) &
        (pl.col("status").is_in(["completed", "shipped"])) &
        (pl.col("amount") > 100)
    )
    .group_by("customer_id")
    .agg(pl.col("amount").sum().alias("total_spent"))
    .collect()
)
```
Advanced Options
Polars provides fine-grained control over Parquet reading for specialized use cases.
```python
import polars as pl
import pyarrow.parquet as pq

# Read only the first N rows (useful for sampling)
sample = pl.read_parquet("data/large_file.parquet", n_rows=1000)

# Add a row index column
df = pl.read_parquet(
    "data/file.parquet",
    row_index_name="idx",
    row_index_offset=0
)

# Control the parallelism strategy
# "auto" (default), "columns", "row_groups", "none"
df = pl.read_parquet(
    "data/file.parquet",
    parallel="row_groups"  # Best for files with many row groups
)

# Read specific row groups - read_parquet has no row_groups argument,
# so drop down to PyArrow and convert the result
df = pl.from_arrow(
    pq.ParquetFile("data/file.parquet").read_row_groups([0, 2, 4])
)
```
For cloud storage, Polars integrates with object stores directly:
```python
import polars as pl

# Read from S3 (requires credentials in environment or AWS config)
df = pl.read_parquet("s3://my-bucket/data/file.parquet")

# With explicit credentials
storage_options = {
    "aws_access_key_id": "YOUR_KEY",
    "aws_secret_access_key": "YOUR_SECRET",
    "aws_region": "us-east-1"
}
lf = pl.scan_parquet(
    "s3://my-bucket/data/*.parquet",
    storage_options=storage_options
)

# GCS works similarly
df = pl.read_parquet("gs://my-bucket/data/file.parquet")
```
Performance Tips and Comparison
Let’s quantify the performance difference between Polars and pandas:
```python
import polars as pl
import pandas as pd
import time

# Generate test file (run once)
# pl.DataFrame({
#     "id": range(10_000_000),
#     "value": [float(i) for i in range(10_000_000)],
#     "category": ["A", "B", "C", "D"] * 2_500_000
# }).write_parquet("benchmark.parquet")

# Pandas timing
start = time.perf_counter()
pdf = pd.read_parquet("benchmark.parquet")
pandas_time = time.perf_counter() - start
print(f"Pandas: {pandas_time:.2f}s")

# Polars eager timing
start = time.perf_counter()
df = pl.read_parquet("benchmark.parquet")
polars_eager_time = time.perf_counter() - start
print(f"Polars eager: {polars_eager_time:.2f}s")

# Polars lazy with filter (the real advantage)
start = time.perf_counter()
result = (
    pl.scan_parquet("benchmark.parquet")
    .filter(pl.col("category") == "A")
    .select(["id", "value"])
    .collect()
)
polars_lazy_time = time.perf_counter() - start
print(f"Polars lazy+filter: {polars_lazy_time:.2f}s")

# Typical results on a 10M row file:
# Pandas: 1.85s
# Polars eager: 0.42s
# Polars lazy+filter: 0.15s
```
Best practices for maximum performance:
- Always use `scan_parquet()` for large files - let the optimizer do its job
- Filter early - put filters as early as possible in your query chain
- Select only needed columns - avoid `select("*")` patterns
- Use appropriate parallelism - `columns` for wide files, `row_groups` for files with many row groups
- Partition your data - use Hive-style partitioning for datasets you filter by date or category
- Compress wisely - Snappy for speed, Zstd for size; Polars handles both transparently
Polars’ Parquet support isn’t just faster—it’s architecturally superior for analytical workloads. The combination of lazy evaluation, predicate pushdown, and native Rust performance makes it the right choice for modern data engineering.