How to Apply a Function in Polars
Key Insights
- Polars expressions should be your first choice—they’re vectorized, parallelized, and orders of magnitude faster than Python UDFs
- Use map_elements for row-wise operations when you genuinely need Python logic, but accept the performance penalty
- Use map_batches when you need to apply NumPy or other vectorized libraries to entire columns at once
Polars has rapidly become the go-to DataFrame library for Python developers who need speed. Built on Rust with a lazy execution engine, it outperforms pandas in most benchmarks by significant margins. But eventually, you’ll hit a wall: you need custom logic that doesn’t fit neatly into Polars’ built-in functions.
This is where applying custom functions comes in. The catch? Doing it wrong can obliterate the performance gains that drew you to Polars in the first place. Let’s walk through the right way to apply functions in Polars, starting with the approach you should use 90% of the time.
Understanding Polars Expressions vs. Apply
Polars’ expression API is its superpower. Expressions are lazy, composable operations that Polars can optimize and parallelize. When you chain expressions together, Polars builds a query plan and executes it efficiently across all available CPU cores.
The moment you drop into Python with map_elements or map_batches, you’re telling Polars to stop optimizing and just run your Python code row by row or batch by batch. This is slow. Really slow.
Here’s a concrete example. Say you want to calculate a 10% bonus on a salary column:
import polars as pl
import time

# Create a DataFrame with 1 million rows
df = pl.DataFrame({
    "salary": list(range(50000, 1050000))
})

# The WRONG way: using map_elements
start = time.perf_counter()
result_apply = df.with_columns(
    pl.col("salary").map_elements(lambda x: x * 1.1, return_dtype=pl.Float64).alias("with_bonus")
)
apply_time = time.perf_counter() - start

# The RIGHT way: using expressions
start = time.perf_counter()
result_expr = df.with_columns(
    (pl.col("salary") * 1.1).alias("with_bonus")
)
expr_time = time.perf_counter() - start

print(f"map_elements: {apply_time:.4f}s")
print(f"Expression: {expr_time:.4f}s")
print(f"Expression is {apply_time / expr_time:.1f}x faster")
On my machine, this outputs something like:
map_elements: 0.8234s
Expression: 0.0012s
Expression is 686.2x faster
That’s not a typo. The expression approach is hundreds of times faster. Before reaching for map_elements, always ask: can I express this with Polars’ built-in functions? The answer is usually yes.
Using map_elements for Row-wise Operations
Sometimes you genuinely need custom Python logic. Maybe you’re calling an external API, applying a complex regex, or using a domain-specific library. This is where map_elements earns its place.
map_elements applies a Python function to each element in a column individually. Here’s a practical example of a custom string transformation:
import polars as pl
import re

df = pl.DataFrame({
    "raw_phone": [
        "555-123-4567",
        "(555) 987 6543",
        "555.456.7890",
        "5551234567"
    ]
})

def normalize_phone(phone: str) -> str:
    """Strip all non-digits and format as XXX-XXX-XXXX."""
    digits = re.sub(r'\D', '', phone)
    if len(digits) == 10:
        return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
    return phone  # Return original if not 10 digits

result = df.with_columns(
    pl.col("raw_phone")
    .map_elements(normalize_phone, return_dtype=pl.String)
    .alias("normalized_phone")
)
print(result)
Output:
shape: (4, 2)
┌────────────────┬──────────────────┐
│ raw_phone ┆ normalized_phone │
│ --- ┆ --- │
│ str ┆ str │
╞════════════════╪══════════════════╡
│ 555-123-4567 ┆ 555-123-4567 │
│ (555) 987 6543 ┆ 555-987-6543 │
│ 555.456.7890 ┆ 555-456-7890 │
│ 5551234567 ┆ 555-123-4567 │
└────────────────┴──────────────────┘
The return_dtype parameter is critical. Polars needs to know what type your function returns to properly construct the resulting Series. If you omit it, Polars will try to infer the type, which adds overhead and can cause errors.
For functions that might return different types or None, handle it explicitly:
def parse_score(value: str) -> int | None:
    """Parse a score string, returning None for invalid values."""
    try:
        score = int(value.strip())
        return score if 0 <= score <= 100 else None
    except (ValueError, AttributeError):
        return None

df = pl.DataFrame({
    "score_raw": ["85", "92", "invalid", "105", " 78 ", None]
})

result = df.with_columns(
    pl.col("score_raw")
    .map_elements(parse_score, return_dtype=pl.Int64)
    .alias("score_clean")
)
Using map_batches for Column-wise Operations
When you need to apply a vectorized operation from NumPy or another library, map_batches is your tool. Instead of processing elements one at a time, it passes the entire column (as a Polars Series) to your function.
import polars as pl
import numpy as np

df = pl.DataFrame({
    "values": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
})

def zscore(s: pl.Series) -> pl.Series:
    """Calculate z-scores for the whole column using NumPy."""
    arr = s.to_numpy()
    mean = np.mean(arr)
    std = np.std(arr)
    z_scores = (arr - mean) / std
    return pl.Series(z_scores)

result = df.with_columns(
    pl.col("values")
    .map_batches(zscore)
    .alias("z_score")
)
print(result)
The key difference: map_batches receives the entire Series at once, letting you use NumPy’s vectorized operations. This is significantly faster than map_elements when working with numerical computations.
Here’s a more practical example using SciPy for statistical calculations:
import polars as pl
from scipy import stats
import numpy as np

df = pl.DataFrame({
    "measurements": np.random.normal(100, 15, 1000).tolist()
})

def winsorize_column(s: pl.Series) -> pl.Series:
    """Winsorize outliers at 5th and 95th percentiles."""
    arr = s.to_numpy()
    winsorized = stats.mstats.winsorize(arr, limits=[0.05, 0.05])
    return pl.Series(np.array(winsorized))

result = df.with_columns(
    pl.col("measurements")
    .map_batches(winsorize_column)
    .alias("measurements_winsorized")
)
Applying Functions Across Multiple Columns
Real-world transformations often need data from multiple columns. Polars handles this elegantly with struct, which bundles columns together into a single structured column that your function can unpack.
import polars as pl

df = pl.DataFrame({
    "base_price": [100.0, 250.0, 75.0, 500.0],
    "quantity": [2, 1, 5, 1],
    "customer_tier": ["gold", "silver", "bronze", "gold"]
})

def calculate_total(row: dict) -> float:
    """Calculate total with tier-based discount."""
    discounts = {"gold": 0.20, "silver": 0.10, "bronze": 0.05}
    base = row["base_price"] * row["quantity"]
    discount = discounts.get(row["customer_tier"], 0)
    return base * (1 - discount)

result = df.with_columns(
    pl.struct(["base_price", "quantity", "customer_tier"])
    .map_elements(calculate_total, return_dtype=pl.Float64)
    .alias("total_price")
)
print(result)
Output:
shape: (4, 4)
┌────────────┬──────────┬───────────────┬─────────────┐
│ base_price ┆ quantity ┆ customer_tier ┆ total_price │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ i64 ┆ str ┆ f64 │
╞════════════╪══════════╪═══════════════╪═════════════╡
│ 100.0 ┆ 2 ┆ gold ┆ 160.0 │
│ 250.0 ┆ 1 ┆ silver ┆ 225.0 │
│ 75.0 ┆ 5 ┆ bronze ┆ 356.25 │
│ 500.0 ┆ 1 ┆ gold ┆ 400.0 │
└────────────┴──────────┴───────────────┴─────────────┘
The struct approach passes each row as a dictionary to your function. This is intuitive but slow. When possible, rewrite using expressions:
# Faster: Expression-based approach
result = df.with_columns(
    (
        pl.col("base_price") * pl.col("quantity") *
        pl.when(pl.col("customer_tier") == "gold").then(0.80)
        .when(pl.col("customer_tier") == "silver").then(0.90)
        .when(pl.col("customer_tier") == "bronze").then(0.95)
        .otherwise(1.0)
    ).alias("total_price")
)
Performance Considerations and Best Practices
Let’s quantify the performance differences with a proper benchmark:
import polars as pl
import numpy as np
import time

# Generate test data
n_rows = 1_000_000
df = pl.DataFrame({
    "value": np.random.randn(n_rows).tolist()
})

def benchmark(name: str, func):
    start = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.4f}s")
    return result

# Method 1: Pure expression
benchmark("Expression", lambda: df.with_columns(
    (pl.col("value").abs() ** 2).alias("result")
))

# Method 2: map_batches with NumPy
def numpy_transform(s: pl.Series) -> pl.Series:
    arr = s.to_numpy()
    return pl.Series(np.abs(arr) ** 2)

benchmark("map_batches", lambda: df.with_columns(
    pl.col("value").map_batches(numpy_transform).alias("result")
))

# Method 3: map_elements (slowest)
benchmark("map_elements", lambda: df.with_columns(
    pl.col("value").map_elements(
        lambda x: abs(x) ** 2,
        return_dtype=pl.Float64
    ).alias("result")
))
Typical output:
Expression: 0.0089s
map_batches: 0.0156s
map_elements: 2.1847s
Follow these rules:
- Expressions first: Always try to solve the problem with Polars expressions
- map_batches second: When you need external libraries, use vectorized batch operations
- map_elements last: Only when you truly need row-by-row Python logic
- Always specify return_dtype: It prevents type inference overhead and catches errors early
- Use lazy mode: Wrap operations in lazy() and collect() to enable query optimization
Conclusion
Polars gives you escape hatches to Python when you need them, but those hatches come with a performance cost. The expression API should handle the vast majority of your data transformations—it’s fast, readable, and optimizable.
When you do need custom functions, choose wisely: map_batches for vectorized operations on entire columns, map_elements for row-wise logic that can’t be expressed any other way. Always benchmark your choices, and remember that the fastest code is often the code that stays in Polars’ native expression system.