How to Convert Pandas to Polars
Key Insights
- Polars isn’t a drop-in replacement for pandas—it requires rethinking how you approach data manipulation, particularly around indexes and lazy evaluation
- Start migration with read-heavy analytical workloads where Polars’ parallel execution provides immediate performance gains, not complex data cleaning pipelines
- The expression API is Polars’ killer feature; learning to think in expressions rather than method chains unlocks the real performance benefits
Why Switch to Polars?
Pandas has been the backbone of Python data analysis for over a decade, but it’s showing its age. Built on NumPy with single-threaded execution and eager evaluation, pandas struggles with datasets that exceed available RAM and can’t leverage modern multi-core processors effectively.
Polars addresses these limitations directly. Written in Rust with a focus on performance, it offers lazy evaluation (build a query plan, optimize it, then execute), automatic parallelization across CPU cores, and memory efficiency through Apache Arrow's columnar format. Published benchmarks often show Polars outperforming pandas by 5-20x on common operations, depending on the workload.
But performance alone isn’t a reason to migrate. Consider switching when you’re hitting memory limits with pandas, spending significant time optimizing slow transformations, or starting new projects where you can design around Polars’ paradigms from the start.
Key Syntax Differences at a Glance
The most jarring difference for pandas users is Polars’ expression-based API. Where pandas encourages chained method calls on DataFrames, Polars uses a context-based system with select, filter, and with_columns.
Here’s the same analytical query in both libraries:
# Pandas approach
import pandas as pd
result = (
    df[df["status"] == "active"]
    .groupby("category")
    .agg(
        total_revenue=("revenue", "sum"),
        avg_quantity=("quantity", "mean"),
        order_count=("order_id", "count")
    )
    .sort_values("total_revenue", ascending=False)
    .head(10)
)
# Polars approach
import polars as pl
result = (
    df.filter(pl.col("status") == "active")
    .group_by("category")
    .agg(
        pl.col("revenue").sum().alias("total_revenue"),
        pl.col("quantity").mean().alias("avg_quantity"),
        pl.col("order_id").count().alias("order_count")
    )
    .sort("total_revenue", descending=True)
    .head(10)
)
Notice the key differences: pl.col() expressions replace string column references, filter replaces boolean indexing, and aggregations are explicit expressions rather than tuples. This verbosity pays off—Polars can optimize and parallelize these expressions automatically.
Converting Data Structures
When you need interoperability between libraries, Polars provides straightforward conversion methods:
import pandas as pd
import polars as pl
import numpy as np
# Pandas DataFrame to Polars
pandas_df = pd.DataFrame({
    "id": [1, 2, 3],
    "value": [10.5, 20.3, 30.1],
    "category": ["A", "B", "A"]
})
polars_df = pl.from_pandas(pandas_df)
# Polars DataFrame to Pandas
back_to_pandas = polars_df.to_pandas()
# NumPy array to Polars (via dictionary)
arr = np.random.randn(1000, 3)
polars_from_numpy = pl.DataFrame({
    "col_a": arr[:, 0],
    "col_b": arr[:, 1],
    "col_c": arr[:, 2]
})
# Polars Series from pandas Series
pandas_series = pd.Series([1, 2, 3], name="numbers")
polars_series = pl.from_pandas(pandas_series)
# Direct NumPy extraction from Polars
numpy_array = polars_df["value"].to_numpy()
One gotcha: pandas’ nullable integer types (Int64) convert cleanly, but pandas’ object dtype columns with mixed types will cause issues. Clean your data types before conversion.
Translating Common Operations
Here’s a practical reference for the operations you’ll translate most frequently:
import polars as pl
import pandas as pd
# Assume df is a Polars DataFrame
# 1. Column selection
# Pandas: df[["col1", "col2"]] or df.loc[:, ["col1", "col2"]]
# Polars:
df.select("col1", "col2")
df.select(pl.col("col1", "col2"))
# 2. Filtering rows
# Pandas: df[df["age"] > 30]
# Polars:
df.filter(pl.col("age") > 30)
# 3. Creating new columns
# Pandas: df["new_col"] = df["a"] + df["b"]
# Polars:
df.with_columns((pl.col("a") + pl.col("b")).alias("new_col"))
# 4. Apply/map functions
# Pandas: df["col"].apply(lambda x: x.upper())
# Polars (prefer native expressions when possible):
df.with_columns(pl.col("col").str.to_uppercase())
# For custom functions (slower, avoid if possible):
df.with_columns(pl.col("col").map_elements(lambda x: custom_func(x), return_dtype=pl.Utf8))
# 5. Merging/joining
# Pandas: pd.merge(df1, df2, on="key", how="left")
# Polars:
df1.join(df2, on="key", how="left")
# 6. Pivot tables
# Pandas: df.pivot_table(values="revenue", index="region", columns="product", aggfunc="sum")
# Polars:
df.pivot(on="product", index="region", values="revenue", aggregate_function="sum")
# 7. String operations
# Pandas: df["name"].str.contains("test")
# Polars:
df.filter(pl.col("name").str.contains("test"))
# 8. DateTime operations
# Pandas: df["date"].dt.year
# Polars:
df.with_columns(pl.col("date").dt.year().alias("year"))
The pattern to internalize: Polars separates contexts (select, filter, with_columns, group_by) from expressions (pl.col(), pl.lit(), pl.when()). Expressions describe what to compute; contexts describe where to apply them.
Handling Pandas-Specific Patterns
The biggest conceptual shift is that Polars has no index. This is intentional—indexes add complexity and prevent certain optimizations. Here’s how to handle common index-dependent patterns:
import polars as pl
# Pattern 1: Using index for lookups
# Pandas: df.loc["row_label"] or df.iloc[5]
# Polars: Use filter or row index
df.filter(pl.col("id") == "row_label") # if you have an ID column
df.row(5) # get single row by position (returns tuple)
df.slice(5, 1) # get single row as DataFrame
# Pattern 2: Index-based alignment in operations
# Pandas automatically aligns on index during operations
# Polars: Use explicit joins
df1.join(df2, on="key_column", how="inner")
# Pattern 3: Reset index / set index
# Pandas: df.reset_index() / df.set_index("col")
# Polars: Just use the column directly, or add row numbers
df.with_row_index("index") # adds 0-based row numbers
# Pattern 4: Multi-index groupby results
# Pandas groupby returns MultiIndex by default
# Polars: Results are flat DataFrames (cleaner for further processing)
result = df.group_by("category", "subcategory").agg(pl.col("value").sum())
# Access directly: result.filter(pl.col("category") == "A")
# Pattern 5: In-place modifications
# Pandas: df["col"] = df["col"] * 2 (modifies in place)
# Polars: Always returns new DataFrame (immutable)
df = df.with_columns(pl.col("col") * 2)
Null handling differs too. Pandas uses NaN (a float) for missing values, which causes type coercion issues. Polars uses proper null values that work with any dtype:
# Checking for nulls
# Pandas: df["col"].isna()
# Polars:
df.filter(pl.col("col").is_null())
df.filter(pl.col("col").is_not_null())
# Filling nulls
# Pandas: df["col"].fillna(0)
# Polars:
df.with_columns(pl.col("col").fill_null(0))
df.with_columns(pl.col("col").fill_null(strategy="forward"))
Leveraging Polars-Specific Features
Once you’re comfortable with basic translation, embrace Polars’ unique capabilities. Lazy evaluation is the most impactful:
import polars as pl
# Eager execution (like pandas) - processes immediately
eager_result = (
    pl.read_csv("large_file.csv")
    .filter(pl.col("date") > "2023-01-01")
    .group_by("region")
    .agg(pl.col("sales").sum())
)
# Lazy execution - builds query plan, optimizes, then executes
lazy_result = (
    pl.scan_csv("large_file.csv")  # scan instead of read
    .filter(pl.col("date") > "2023-01-01")
    .group_by("region")
    .agg(pl.col("sales").sum())
    .collect()  # triggers execution
)
The lazy version can push filters down to the scan phase (predicate pushdown), project only needed columns (projection pushdown), and optimize the entire query plan before executing. For large datasets, this often means 10x+ performance improvements.
The expression API also enables powerful patterns:
# Apply same operation to multiple columns
df.with_columns(pl.col(pl.Float64).round(2)) # round all float columns
df.with_columns(pl.col("^sales_.*$").fill_null(0)) # regex column selection
# Window functions without groupby
df.with_columns(
    pl.col("value").sum().over("category").alias("category_total"),
    pl.col("value").rank().over("category").alias("rank_in_category")
)
# Conditional expressions
df.with_columns(
    pl.when(pl.col("score") > 90)
    .then(pl.lit("A"))
    .when(pl.col("score") > 80)
    .then(pl.lit("B"))
    .otherwise(pl.lit("C"))
    .alias("grade")
)
Migration Strategy and Gotchas
Don’t attempt a big-bang migration. Instead:
- Start with isolated analytical scripts where you can validate results against pandas
- Use conversion functions at boundaries — keep pandas for I/O with libraries that require it, convert to Polars for heavy computation
- Write comparison tests that verify your Polars code produces identical results to the pandas original
Common pitfalls to watch for:
- Column ordering isn't guaranteed in some operations. Don't rely on positional column access.
- Type strictness: Polars won't silently convert types. A column defined as Int64 won't accept strings.
- No automatic broadcasting of scalars in all contexts. Use pl.lit() explicitly.
- Different sort stability: Polars' sort is not stable by default. Use maintain_order=True if needed.
Keep pandas when you need tight integration with scikit-learn, matplotlib, or other libraries that expect pandas DataFrames. The conversion overhead is minimal for reasonably-sized data, so hybrid workflows are perfectly valid.
The investment in learning Polars pays compound returns. Once you internalize the expression API, you’ll write cleaner, faster data transformations—and wonder why you tolerated pandas’ quirks for so long.