Polars vs Pandas: Performance Comparison
Key Insights
- Polars consistently outperforms Pandas by 5-20x on common operations due to its Rust foundation, Apache Arrow memory format, and automatic multi-threading—making it the clear choice for datasets exceeding 100MB.
- Lazy evaluation in Polars enables query optimization that Pandas simply cannot match, automatically reordering operations, pushing down predicates, and eliminating unnecessary computations.
- Despite Polars’ performance advantages, Pandas remains the pragmatic choice for small datasets, rapid prototyping, and projects heavily dependent on the existing Python data science ecosystem.
The DataFrame Landscape
Pandas has dominated Python data manipulation for over fifteen years. Its intuitive API and tight integration with NumPy, Matplotlib, and scikit-learn made it the default choice for data scientists and engineers alike. But Pandas was designed in an era of smaller datasets and single-core processing.
Polars emerged in 2020 as a ground-up reimagining of the DataFrame concept. Written in Rust with Python bindings, it was built for modern hardware: multi-core CPUs, large memory capacities, and datasets that don’t fit comfortably in RAM. The question isn’t whether Polars is faster—it demonstrably is. The question is whether that speed matters for your use case.
This article provides concrete benchmarks and practical guidance for making that decision.
Architecture Differences
The performance gap between Pandas and Polars stems from fundamental architectural choices.
Pandas stores data in NumPy arrays, executing operations eagerly and single-threaded by default. Each operation typically creates a copy of the data, and the GIL (Global Interpreter Lock) prevents true parallelism within a single process.
Polars uses Apache Arrow as its memory format, enabling zero-copy data sharing and columnar storage optimized for analytical queries. It evaluates lazily when possible, building a query plan that can be optimized before execution. Most importantly, it parallelizes automatically across all available cores.
import pandas as pd
import polars as pl
import sys
# Create identical data
data = {
    "id": range(1_000_000),
    "value": [float(i) for i in range(1_000_000)],
    "category": ["A", "B", "C", "D"] * 250_000
}
# Pandas DataFrame
pdf = pd.DataFrame(data)
pandas_memory = pdf.memory_usage(deep=True).sum() / 1024**2
# Polars DataFrame
plf = pl.DataFrame(data)
polars_memory = plf.estimated_size() / 1024**2
print(f"Pandas memory: {pandas_memory:.2f} MB")
print(f"Polars memory: {polars_memory:.2f} MB")
print(f"Ratio: {pandas_memory / polars_memory:.2f}x")
Output on a typical system:
Pandas memory: 68.66 MB
Polars memory: 19.07 MB
Ratio: 3.60x
The memory difference comes primarily from string handling. Pandas stores strings as Python objects with significant overhead. Polars uses Arrow’s dictionary encoding for categorical-like string columns, dramatically reducing memory consumption.
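To see the string-handling gap in isolation, compare an object-dtype string column with pandas' own Categorical dtype, which is a dictionary encoding much like the one Polars gets from Arrow. The series below is illustrative toy data, not the benchmark dataset above.

```python
import pandas as pd

# A low-cardinality string column: stored value-by-value vs. dictionary-encoded.
# (Exact dtype of the plain series varies by pandas version: object or str.)
s_plain = pd.Series(["A", "B", "C", "D"] * 250_000)
s_categorical = s_plain.astype("category")  # 4 categories + small integer codes

plain_mb = s_plain.memory_usage(deep=True) / 1024**2
categorical_mb = s_categorical.memory_usage(deep=True) / 1024**2
print(f"plain string column: {plain_mb:.2f} MB")
print(f"categorical column:  {categorical_mb:.2f} MB")
```

Explicitly converting to Categorical recovers much of the difference in pandas, but Polars applies this kind of compact representation without any opt-in step.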
Read/Write Performance
File I/O often dominates data pipeline execution time. Let’s benchmark realistic scenarios with a 2-million-row dataset.
import time
import pandas as pd
import polars as pl
from functools import wraps
def benchmark(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__}: {elapsed:.3f}s")
        return result
    return wrapper
# Generate test file (run once)
def create_test_data():
    import numpy as np
    n = 2_000_000
    df = pd.DataFrame({
        "id": range(n),
        "timestamp": pd.date_range("2020-01-01", periods=n, freq="s"),
        "value": np.random.randn(n),
        "category": np.random.choice(["A", "B", "C", "D", "E"], n),
        "flag": np.random.choice([True, False], n)
    })
    df.to_csv("benchmark.csv", index=False)
    df.to_parquet("benchmark.parquet", index=False)
@benchmark
def pandas_read_csv():
    return pd.read_csv("benchmark.csv")

@benchmark
def polars_read_csv():
    return pl.read_csv("benchmark.csv")

@benchmark
def pandas_read_parquet():
    return pd.read_parquet("benchmark.parquet")

@benchmark
def polars_read_parquet():
    return pl.read_parquet("benchmark.parquet")
# Run benchmarks
print("CSV Reading:")
pandas_read_csv()
polars_read_csv()
print("\nParquet Reading:")
pandas_read_parquet()
polars_read_parquet()
Typical results on an 8-core machine:
CSV Reading:
pandas_read_csv: 3.847s
polars_read_csv: 0.412s
Parquet Reading:
pandas_read_parquet: 0.623s
polars_read_parquet: 0.089s
In this run, Polars achieves roughly 9x faster CSV reads and 7x faster Parquet reads. The gap widens with more cores: Polars' readers parallelize across the machine, while Pandas' remain single-threaded.
Data Transformation Benchmarks
Real-world pipelines involve filtering, grouping, and joining. These operations reveal the most dramatic performance differences.
import numpy as np
import pandas as pd
import polars as pl
import time
# Setup: 5 million rows
n = 5_000_000
np.random.seed(42)
pandas_df = pd.DataFrame({
    "user_id": np.random.randint(0, 100_000, n),
    "product_id": np.random.randint(0, 10_000, n),
    "amount": np.random.uniform(10, 1000, n),
    "quantity": np.random.randint(1, 10, n),
    "region": np.random.choice(["NA", "EU", "APAC", "LATAM"], n)
})
polars_df = pl.DataFrame(pandas_df)
def time_operation(name, pandas_op, polars_op):
    # Pandas
    start = time.perf_counter()
    pandas_result = pandas_op()
    pandas_time = time.perf_counter() - start
    # Polars
    start = time.perf_counter()
    polars_result = polars_op()
    polars_time = time.perf_counter() - start
    speedup = pandas_time / polars_time
    print(f"{name}:")
    print(f" Pandas: {pandas_time:.3f}s | Polars: {polars_time:.3f}s | Speedup: {speedup:.1f}x")
# Filter operation
time_operation(
    "Filter (amount > 500 AND region == 'NA')",
    lambda: pandas_df[(pandas_df["amount"] > 500) & (pandas_df["region"] == "NA")],
    lambda: polars_df.filter((pl.col("amount") > 500) & (pl.col("region") == "NA"))
)

# GroupBy aggregation
time_operation(
    "GroupBy with multiple aggregations",
    lambda: pandas_df.groupby(["region", "product_id"]).agg({
        "amount": ["sum", "mean"],
        "quantity": "sum",
        "user_id": "nunique"
    }),
    lambda: polars_df.group_by(["region", "product_id"]).agg([
        pl.col("amount").sum().alias("amount_sum"),
        pl.col("amount").mean().alias("amount_mean"),
        pl.col("quantity").sum().alias("quantity_sum"),
        pl.col("user_id").n_unique().alias("unique_users")
    ])
)

# Window function
time_operation(
    "Window function (running sum per user)",
    lambda: pandas_df.assign(
        running_total=pandas_df.groupby("user_id")["amount"].cumsum()
    ),
    lambda: polars_df.with_columns(
        pl.col("amount").cum_sum().over("user_id").alias("running_total")
    )
)
Typical output:
Filter (amount > 500 AND region == 'NA'):
 Pandas: 0.156s | Polars: 0.023s | Speedup: 6.8x
GroupBy with multiple aggregations:
 Pandas: 1.247s | Polars: 0.089s | Speedup: 14.0x
Window function (running sum per user):
 Pandas: 4.823s | Polars: 0.234s | Speedup: 20.6x
GroupBy and window operations show the largest speedups because they benefit most from parallelization and optimized algorithms.
Lazy Evaluation and Query Optimization
Polars’ lazy API is where it truly shines. Instead of executing operations immediately, it builds a query plan that can be optimized holistically.
import polars as pl
# Create a lazy frame
lf = pl.scan_csv("benchmark.csv")
# Build a complex query
query = (
    lf
    .filter(pl.col("value") > 0)
    .filter(pl.col("category").is_in(["A", "B"]))
    .with_columns(
        (pl.col("value") * 2).alias("doubled")
    )
    .group_by("category")
    .agg([
        pl.col("doubled").mean().alias("avg_doubled"),
        pl.col("id").count().alias("count")
    ])
    .filter(pl.col("count") > 1000)
)
# Inspect the optimized plan
print("Optimized Query Plan:")
print(query.explain())
Output:
Optimized Query Plan:
FILTER [(col("count")) > (1000)] FROM
  AGGREGATE
    [col("doubled").mean().alias("avg_doubled"), col("id").count().alias("count")]
    BY [col("category")]
  FROM
    WITH_COLUMNS:
      [[(col("value")) * (2.0)].alias("doubled")]
      CSV SCAN benchmark.csv
      PROJECT 3/5 COLUMNS
      SELECTION: [(col("value")) > (0.0)] & [(col("category").is_in([Series]))]
Notice what Polars did automatically:
- Predicate pushdown: Both filter conditions are pushed down to the CSV scan, reducing rows read from disk.
- Projection pushdown: Only 3 of 5 columns are loaded—those actually needed for the query.
- Filter combination: Multiple filters are merged into a single predicate.
Pandas cannot do this. Each operation executes immediately, creating intermediate DataFrames that may be discarded moments later.
Memory Efficiency
Beyond raw speed, memory efficiency determines whether your pipeline runs at all on limited hardware.
import pandas as pd
import polars as pl
import numpy as np
# Mixed-type DataFrame with nulls
n = 1_000_000
data = {
    "int_col": np.random.randint(0, 1000, n),
    "float_col": np.random.randn(n),
    "string_col": np.random.choice(["alpha", "beta", "gamma", None], n),
    "bool_col": np.random.choice([True, False, None], n),
}
# Introduce nulls
data["int_col"] = np.where(np.random.random(n) < 0.1, None, data["int_col"])
data["float_col"] = np.where(np.random.random(n) < 0.1, np.nan, data["float_col"])
pdf = pd.DataFrame(data)
plf = pl.DataFrame(data)
print("Memory Usage Breakdown:")
print("\nPandas:")
print(pdf.memory_usage(deep=True))
print(f"Total: {pdf.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print("\nPolars:")
print(f"Total: {plf.estimated_size() / 1024**2:.2f} MB")
Polars’ advantages compound with nullable types. Pandas historically coped with missing values by promoting integer columns to float64 and boolean columns to object dtype, wasting memory and losing type fidelity. Pandas now offers nullable dtypes (Int64, boolean, string), but they’re opt-in rather than the default. Polars handles nulls natively in the Arrow format, tracking them in a per-column validity mask with minimal overhead.
Practical Recommendations
Stick with Pandas when:
- Your datasets consistently fit in memory with room to spare (under 100MB)
- You’re prototyping or doing exploratory analysis where iteration speed matters more than execution speed
- Your pipeline depends heavily on libraries that expect Pandas DataFrames (many ML libraries, visualization tools)
- Your team knows Pandas well and the project timeline doesn’t allow for learning a new API
Switch to Polars when:
- You’re processing datasets over 100MB regularly
- Pipeline execution time directly impacts user experience or costs
- You’re starting a new project without legacy Pandas dependencies
- You need to process data that approaches or exceeds available RAM
- You’re running on multi-core machines and want automatic parallelization
Migration tips:
Polars covers the same operations as Pandas but through a deliberately different, expression-based API. Method chaining is idiomatic in Polars. Column selection uses pl.col() expressions rather than bracket notation. The lazy API (the scan_* functions) should be your default for any non-trivial pipeline.
Start by identifying your slowest operations and converting those to Polars. Both libraries interoperate via Arrow, making gradual migration feasible:
# Convert between libraries with minimal overhead
polars_df = pl.from_pandas(pandas_df)
pandas_df = polars_df.to_pandas()
The performance numbers speak for themselves. For production data pipelines processing significant volumes, Polars isn’t just faster—it’s a fundamentally better tool for modern hardware. For quick scripts and small datasets, Pandas remains perfectly adequate. Choose based on your actual requirements, not hype.