How to Write to Parquet in Polars

Key Insights

  • Polars offers both eager (write_parquet()) and lazy (sink_parquet()) methods for writing Parquet files, with the lazy approach enabling memory-efficient processing of datasets larger than RAM.
  • Compression choice matters: zstd provides the best compression ratio for most workloads, while snappy offers faster write speeds when storage isn’t constrained.
  • Partitioned writes using write_parquet() with the partition_by parameter dramatically improve query performance when you consistently filter on specific columns like dates or categories.

Introduction

Parquet has become the de facto standard for analytical data storage, and for good reason. Its columnar format enables efficient compression, predicate pushdown, and column pruning—features that translate directly into faster queries and lower storage costs. When you pair Parquet with Polars, you get a combination that’s hard to beat for data engineering workflows.

Polars handles Parquet I/O through its own native Rust implementation, which means you get excellent performance without the JVM overhead that comes with Spark or the memory inefficiencies of pandas. Whether you’re writing a few megabytes or processing terabytes through streaming operations, Polars provides the tools you need.

This guide covers everything from basic writes to advanced partitioning strategies, with practical code you can adapt for your own pipelines.

Basic Parquet Writing

The simplest way to write a Parquet file in Polars uses the write_parquet() method on a DataFrame. This eager approach loads your data into memory and writes it out in one operation.

import polars as pl

# Create a sample DataFrame
df = pl.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "signup_date": ["2024-01-15", "2024-02-20", "2024-03-10", "2024-03-15", "2024-04-01"],
    "revenue": [150.50, 230.00, 89.99, 445.25, 120.00],
})

# Convert string dates to proper date type
df = df.with_columns(pl.col("signup_date").str.to_date())

# Write to Parquet with default settings
df.write_parquet("users.parquet")

With default settings, Polars uses zstd compression and writes statistics for all columns. The resulting file preserves your schema exactly—including the date type we created—so when you read it back, you get the same types without any parsing overhead.

# Verify the round-trip
df_read = pl.read_parquet("users.parquet")
print(df_read.schema)
# {'user_id': Int64, 'name': String, 'signup_date': Date, 'revenue': Float64}

Compression Options

Polars supports several compression algorithms for Parquet files, each with different trade-offs between file size, write speed, and read speed. The five you’ll reach for most often:

Algorithm      Compression Ratio   Write Speed   Read Speed   Best For
zstd           Excellent           Good          Good        General purpose, cold storage
snappy         Good                Excellent     Excellent   Hot data, frequent reads
gzip           Very good           Slow          Moderate    Maximum compression needed
lz4            Moderate            Excellent     Excellent   Speed-critical pipelines
uncompressed   None                Fastest       Fastest     Debugging, SSDs with space

Here’s how to compare these options with your own data:

import polars as pl
from pathlib import Path

# Generate a larger dataset for meaningful comparison
df = pl.DataFrame({
    "id": range(1_000_000),
    "category": ["A", "B", "C", "D"] * 250_000,
    "value": pl.Series(range(1_000_000)).cast(pl.Float64),
    "description": ["This is a sample text field"] * 1_000_000,
})

compression_options = ["zstd", "snappy", "gzip", "lz4", "uncompressed"]

for compression in compression_options:
    filename = f"output_{compression}.parquet"
    df.write_parquet(filename, compression=compression)
    size_mb = Path(filename).stat().st_size / (1024 * 1024)
    print(f"{compression:15} -> {size_mb:.2f} MB")

Running this on my machine produces:

zstd            -> 4.12 MB
snappy          -> 6.89 MB
gzip            -> 3.98 MB
lz4             -> 7.45 MB
uncompressed    -> 23.84 MB

For most workloads, stick with zstd. It’s the default for a reason. Switch to snappy or lz4 when write latency matters more than storage costs, such as in streaming pipelines where you’re writing many small files.

You can also control the compression level for algorithms that support it:

# Higher compression level = smaller files, slower writes
df.write_parquet("output_zstd_high.parquet", compression="zstd", compression_level=19)

# Lower compression level = larger files, faster writes
df.write_parquet("output_zstd_low.parquet", compression="zstd", compression_level=1)

Advanced Write Options

Beyond compression, several parameters let you tune Parquet files for specific read patterns.

Row group size determines how many rows are stored together. Smaller row groups enable more granular predicate pushdown but increase metadata overhead. The default works well for most cases, but you might adjust it based on your query patterns.

Statistics enable readers to skip row groups entirely when filtering. Polars writes statistics by default, but you can control this behavior.

Data page size affects memory usage during reads. Larger pages mean fewer I/O operations but higher memory consumption.

import polars as pl

# Build hourly readings; derive the other column lengths from the timestamps
timestamps = pl.datetime_range(
    pl.datetime(2024, 1, 1),
    pl.datetime(2024, 12, 31),
    interval="1h",
    eager=True,
)
n = len(timestamps)

df = pl.DataFrame({
    "timestamp": timestamps,
    "sensor_id": [i % 100 for i in range(n)],
    "reading": pl.Series(range(n)).cast(pl.Float64),
})

# Optimize for time-range queries with smaller row groups
df.write_parquet(
    "sensor_data.parquet",
    row_group_size=50_000,  # Smaller groups for better predicate pushdown
    statistics=True,        # Enable min/max statistics (default)
    data_page_size=1024 * 1024,  # 1MB pages
)

When readers filter on timestamp, they can skip entire row groups where the min/max statistics show no matching data. This optimization becomes significant with larger datasets.

Lazy Frame Writing with sink_parquet()

When your data exceeds available memory, sink_parquet() processes data in streaming fashion. This method works with LazyFrames and never materializes the entire dataset at once.

import polars as pl

# Process a large CSV without loading it entirely into memory
(
    pl.scan_csv("large_input.csv")
    .filter(pl.col("status") == "active")
    .with_columns(
        pl.col("amount").cast(pl.Float64),
        pl.col("date").str.to_date(),
    )
    .group_by("region")
    .agg(
        pl.col("amount").sum().alias("total_amount"),
        pl.col("id").count().alias("record_count"),
    )
    .sink_parquet("aggregated_output.parquet")
)

The key difference from write_parquet() is that sink_parquet() operates on a LazyFrame and triggers execution of the entire lazy pipeline. Polars optimizes the query plan and streams data through in batches.

You can also use sink_parquet() to transform between file formats efficiently:

# Convert CSV to Parquet with transformations
(
    pl.scan_csv("raw_data/*.csv")
    .with_columns(
        pl.col("created_at").str.to_datetime("%Y-%m-%d %H:%M:%S"),
        pl.col("price").fill_null(0.0),
    )
    .sink_parquet(
        "processed_data.parquet",
        compression="zstd",
        row_group_size=100_000,
    )
)

Partitioned Parquet Writing

Partitioned datasets organize files into directory structures based on column values. This layout enables partition pruning—readers only scan directories matching their filter criteria.

Polars handles this through the partition_by parameter of write_parquet():

import polars as pl

# Create sample e-commerce data
df = pl.DataFrame({
    "order_id": range(10000),
    "order_date": pl.date_range(
        pl.date(2024, 1, 1),
        pl.date(2024, 12, 31),
        eager=True,
    ).sample(10000, with_replacement=True),
    "category": ["Electronics", "Clothing", "Books", "Home"] * 2500,
    "amount": pl.Series(range(10000)).cast(pl.Float64) % 500 + 10,
})

# Add year and month columns for partitioning
df = df.with_columns(
    pl.col("order_date").dt.year().alias("year"),
    pl.col("order_date").dt.month().alias("month"),
)

# Write partitioned by year and month
df.write_parquet(
    "orders_partitioned",
    partition_by=["year", "month"],
)

This creates a directory structure like:

orders_partitioned/
├── year=2024/
│   ├── month=1/
│   │   └── 00000000.parquet
│   ├── month=2/
│   │   └── 00000000.parquet
│   └── ...

Reading partitioned data back uses the standard scan_parquet() with glob patterns:

# Read all partitions (hive_partitioning derives year/month from the paths)
df_all = pl.scan_parquet(
    "orders_partitioned/**/*.parquet", hive_partitioning=True
).collect()

# Read specific partitions; Polars prunes directories via the hive columns
df_jan = (
    pl.scan_parquet("orders_partitioned/**/*.parquet", hive_partitioning=True)
    .filter((pl.col("year") == 2024) & (pl.col("month") == 1))
    .collect()
)

Common Pitfalls and Best Practices

Schema consistency across files. When writing multiple Parquet files that will be read together, ensure schemas match exactly. A column that’s Int64 in one file and Float64 in another will cause errors or unexpected behavior.

# Enforce schema before writing
schema = {
    "id": pl.Int64,
    "value": pl.Float64,
    "category": pl.String,
}
df = df.cast(schema)
df.write_parquet("output.parquet")

Handling null values. Parquet handles nulls natively, but be aware of how they interact with your downstream systems. Some tools treat empty strings and nulls differently.

# Normalize nulls before writing
df = df.with_columns(
    pl.when(pl.col("name") == "")
    .then(None)
    .otherwise(pl.col("name"))
    .alias("name")
)

Datetime timezone handling. Polars distinguishes between naive and timezone-aware datetimes. Parquet preserves this distinction, but mixing them causes problems.

# Attach a timezone to naive datetimes
# (use dt.convert_time_zone to convert already-aware ones)
df = df.with_columns(
    pl.col("timestamp").dt.replace_time_zone("UTC")
)

Choose row group size based on query patterns. If you frequently filter on a specific column, smaller row groups enable better predicate pushdown. If you typically read entire files, larger row groups reduce overhead.

Don’t over-partition. Creating too many small files (the “small files problem”) hurts read performance. Aim for partition files of at least 100MB when possible.

Polars makes Parquet I/O straightforward while exposing the knobs you need for optimization. Start with the defaults, measure your specific workload, and adjust compression and row group settings based on actual performance data rather than assumptions.
