How to Apply a Function in Polars

Key Insights

  • Polars expressions should be your first choice—they’re vectorized, parallelized, and orders of magnitude faster than Python UDFs
  • Use map_elements for row-wise operations when you genuinely need Python logic, but accept the performance penalty
  • Use map_batches when you need to apply NumPy or other vectorized libraries to entire columns at once

Polars has rapidly become the go-to DataFrame library for Python developers who need speed. Built on Rust with a lazy execution engine, it outperforms pandas in most benchmarks by significant margins. But eventually, you’ll hit a wall: you need custom logic that doesn’t fit neatly into Polars’ built-in functions.

This is where applying custom functions comes in. The catch? Doing it wrong can obliterate the performance gains that drew you to Polars in the first place. Let’s walk through the right way to apply functions in Polars, starting with the approach you should use 90% of the time.

Understanding Polars Expressions vs. Apply

Polars’ expression API is its superpower. Expressions are lazy, composable operations that Polars can optimize and parallelize. When you chain expressions together, Polars builds a query plan and executes it efficiently across all available CPU cores.

The moment you drop into Python with map_elements, you’re telling Polars to stop optimizing and run your Python code one element at a time, paying interpreter overhead on every call. This is slow. Really slow. map_batches is a gentler escape hatch, since your function still receives whole columns and can stay vectorized, but it too opts out of Polars’ query optimization.

Here’s a concrete example. Say you want to calculate a 10% bonus on a salary column:

import polars as pl
import time

# Create a DataFrame with 1 million rows
df = pl.DataFrame({
    "salary": list(range(50000, 1050000))
})

# The WRONG way: using map_elements
start = time.perf_counter()
result_apply = df.with_columns(
    pl.col("salary").map_elements(lambda x: x * 1.1, return_dtype=pl.Float64).alias("with_bonus")
)
apply_time = time.perf_counter() - start

# The RIGHT way: using expressions
start = time.perf_counter()
result_expr = df.with_columns(
    (pl.col("salary") * 1.1).alias("with_bonus")
)
expr_time = time.perf_counter() - start

print(f"map_elements: {apply_time:.4f}s")
print(f"Expression: {expr_time:.4f}s")
print(f"Expression is {apply_time / expr_time:.1f}x faster")

On my machine, this outputs something like:

map_elements: 0.8234s
Expression: 0.0012s
Expression is 686.2x faster

That’s not a typo. The expression approach is hundreds of times faster. Before reaching for map_elements, always ask: can I express this with Polars’ built-in functions? The answer is usually yes.

Using map_elements for Row-wise Operations

Sometimes you genuinely need custom Python logic. Maybe you’re calling an external API, applying a complex regex, or using a domain-specific library. This is where map_elements earns its place.

map_elements applies a Python function to each element in a column individually. Here’s a practical example with a custom string transformation:

import polars as pl
import re

df = pl.DataFrame({
    "raw_phone": [
        "555-123-4567",
        "(555) 987 6543",
        "555.456.7890",
        "5551234567"
    ]
})

def normalize_phone(phone: str) -> str:
    """Strip all non-digits and format as XXX-XXX-XXXX."""
    digits = re.sub(r'\D', '', phone)
    if len(digits) == 10:
        return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
    return phone  # Return original if not 10 digits

result = df.with_columns(
    pl.col("raw_phone")
    .map_elements(normalize_phone, return_dtype=pl.String)
    .alias("normalized_phone")
)

print(result)

Output:

shape: (4, 2)
┌────────────────┬──────────────────┐
│ raw_phone      ┆ normalized_phone │
│ ---            ┆ ---              │
│ str            ┆ str              │
╞════════════════╪══════════════════╡
│ 555-123-4567   ┆ 555-123-4567     │
│ (555) 987 6543 ┆ 555-987-6543     │
│ 555.456.7890   ┆ 555-456-7890     │
│ 5551234567     ┆ 555-123-4567     │
└────────────────┴──────────────────┘

The return_dtype parameter is critical. Polars needs to know what type your function returns to properly construct the resulting Series. If you omit it, Polars will try to infer the type, which adds overhead and can cause errors.

For functions that might return different types or None, handle it explicitly:

def parse_score(value: str) -> int | None:
    """Parse a score string, returning None for invalid values."""
    try:
        score = int(value.strip())
        return score if 0 <= score <= 100 else None
    except (ValueError, AttributeError):
        return None

df = pl.DataFrame({
    "score_raw": ["85", "92", "invalid", "105", "  78  ", None]
})

result = df.with_columns(
    pl.col("score_raw")
    .map_elements(parse_score, return_dtype=pl.Int64)
    .alias("score_clean")
)

Using map_batches for Column-wise Operations

When you need to apply a vectorized operation from NumPy or another library, map_batches is your tool. Instead of processing elements one at a time, it passes the entire column (as a Polars Series) to your function.

import polars as pl
import numpy as np

df = pl.DataFrame({
    "values": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
})

def zscore(s: pl.Series) -> pl.Series:
    """Calculate the z-score of the entire column using NumPy."""
    arr = s.to_numpy()
    mean = np.mean(arr)
    std = np.std(arr)
    z_scores = (arr - mean) / std
    return pl.Series(z_scores)

result = df.with_columns(
    pl.col("values")
    .map_batches(zscore)
    .alias("z_score")
)

print(result)

The key difference: map_batches receives the entire Series at once, letting you use NumPy’s vectorized operations. This is significantly faster than map_elements when working with numerical computations.

Here’s a more practical example using SciPy for statistical calculations:

import polars as pl
from scipy import stats
import numpy as np

df = pl.DataFrame({
    "measurements": np.random.normal(100, 15, 1000).tolist()
})

def winsorize_column(s: pl.Series) -> pl.Series:
    """Winsorize outliers at 5th and 95th percentiles."""
    arr = s.to_numpy()
    winsorized = stats.mstats.winsorize(arr, limits=[0.05, 0.05])
    return pl.Series(np.array(winsorized))

result = df.with_columns(
    pl.col("measurements")
    .map_batches(winsorize_column)
    .alias("measurements_winsorized")
)

Applying Functions Across Multiple Columns

Real-world transformations often need data from multiple columns. Polars handles this elegantly with struct, which bundles columns together into a single structured column that your function can unpack.

import polars as pl

df = pl.DataFrame({
    "base_price": [100.0, 250.0, 75.0, 500.0],
    "quantity": [2, 1, 5, 1],
    "customer_tier": ["gold", "silver", "bronze", "gold"]
})

def calculate_total(row: dict) -> float:
    """Calculate total with tier-based discount."""
    discounts = {"gold": 0.20, "silver": 0.10, "bronze": 0.05}
    
    base = row["base_price"] * row["quantity"]
    discount = discounts.get(row["customer_tier"], 0)
    return base * (1 - discount)

result = df.with_columns(
    pl.struct(["base_price", "quantity", "customer_tier"])
    .map_elements(calculate_total, return_dtype=pl.Float64)
    .alias("total_price")
)

print(result)

Output:

shape: (4, 4)
┌────────────┬──────────┬───────────────┬─────────────┐
│ base_price ┆ quantity ┆ customer_tier ┆ total_price │
│ ---        ┆ ---      ┆ ---           ┆ ---         │
│ f64        ┆ i64      ┆ str           ┆ f64         │
╞════════════╪══════════╪═══════════════╪═════════════╡
│ 100.0      ┆ 2        ┆ gold          ┆ 160.0       │
│ 250.0      ┆ 1        ┆ silver        ┆ 225.0       │
│ 75.0       ┆ 5        ┆ bronze        ┆ 356.25      │
│ 500.0      ┆ 1        ┆ gold          ┆ 400.0       │
└────────────┴──────────┴───────────────┴─────────────┘

The struct approach passes each row as a dictionary to your function. This is intuitive but slow. When possible, rewrite using expressions:

# Faster: Expression-based approach
result = df.with_columns(
    (
        pl.col("base_price") * pl.col("quantity") * 
        pl.when(pl.col("customer_tier") == "gold").then(0.80)
        .when(pl.col("customer_tier") == "silver").then(0.90)
        .when(pl.col("customer_tier") == "bronze").then(0.95)
        .otherwise(1.0)
    ).alias("total_price")
)

Performance Considerations and Best Practices

Let’s quantify the performance differences with a proper benchmark:

import polars as pl
import numpy as np
import time

# Generate test data
n_rows = 1_000_000
df = pl.DataFrame({
    "value": np.random.randn(n_rows).tolist()
})

def benchmark(name: str, func):
    start = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.4f}s")
    return result

# Method 1: Pure expression
benchmark("Expression", lambda: df.with_columns(
    (pl.col("value").abs() ** 2).alias("result")
))

# Method 2: map_batches with NumPy
def numpy_transform(s: pl.Series) -> pl.Series:
    arr = s.to_numpy()
    return pl.Series(np.abs(arr) ** 2)

benchmark("map_batches", lambda: df.with_columns(
    pl.col("value").map_batches(numpy_transform).alias("result")
))

# Method 3: map_elements (slowest)
benchmark("map_elements", lambda: df.with_columns(
    pl.col("value").map_elements(
        lambda x: abs(x) ** 2, 
        return_dtype=pl.Float64
    ).alias("result")
))

Typical output:

Expression: 0.0089s
map_batches: 0.0156s
map_elements: 2.1847s

Follow these rules:

  1. Expressions first: Always try to solve the problem with Polars expressions
  2. map_batches second: When you need external libraries, use vectorized batch operations
  3. map_elements last: Only when you truly need row-by-row Python logic
  4. Always specify return_dtype: It prevents type inference overhead and catches errors early
  5. Use lazy mode: Wrap operations in lazy() and collect() to enable query optimization

Conclusion

Polars gives you escape hatches to Python when you need them, but those hatches come with a performance cost. The expression API should handle the vast majority of your data transformations—it’s fast, readable, and optimizable.

When you do need custom functions, choose wisely: map_batches for vectorized operations on entire columns, map_elements for row-wise logic that can’t be expressed any other way. Always benchmark your choices, and remember that the fastest code is often the code that stays in Polars’ native expression system.
