How to Use When/Then/Otherwise in Polars
Conditional logic is fundamental to data transformation. Whether you're categorizing values, applying business rules, or cleaning data, you need a way to say 'if this, then that.' In Polars, the...
Key Insights
- Polars’
when/then/otherwiseis the idiomatic way to implement conditional logic, operating on entire columns at once rather than row-by-row likeapply(), making it significantly faster. - You can chain multiple
.when().then()calls before a final.otherwise()to handle complex multi-condition logic, similar to SQL’sCASE WHENstatements. - All branches of a
when/then/otherwiseexpression must return compatible types—mixing strings and integers will raise an error, so plan your return values carefully.
Introduction
Conditional logic is fundamental to data transformation. Whether you’re categorizing values, applying business rules, or cleaning data, you need a way to say “if this, then that.” In Polars, the idiomatic approach is the when/then/otherwise pattern.
If you’re coming from pandas, you might reach for apply() with a lambda function. Don’t. That approach forces Polars to iterate row-by-row, abandoning the columnar operations that make Polars fast. The when/then/otherwise pattern operates on entire columns simultaneously, leveraging SIMD instructions and parallel execution.
If you’re coming from SQL, you’ll feel right at home. Polars’ when/then/otherwise maps directly to SQL’s CASE WHEN statements:
-- SQL
CASE
WHEN score >= 90 THEN 'A'
WHEN score >= 80 THEN 'B'
ELSE 'C'
END
# Polars
pl.when(pl.col("score") >= 90).then(pl.lit("A"))
.when(pl.col("score") >= 80).then(pl.lit("B"))
.otherwise(pl.lit("C"))
Let’s dig into how this works in practice.
Basic Syntax and Structure
The fundamental pattern chains three methods: when() takes a boolean expression, then() specifies the value when true, and otherwise() handles the false case.
import polars as pl
df = pl.DataFrame({
"product": ["Widget", "Gadget", "Gizmo", "Thingamajig"],
"price": [25.00, 150.00, 75.00, 200.00]
})
# Classify products as "budget" or "premium" based on price
result = df.with_columns(
pl.when(pl.col("price") < 100)
.then(pl.lit("budget"))
.otherwise(pl.lit("premium"))
.alias("tier")
)
print(result)
Output:
shape: (4, 3)
┌─────────────┬───────┬─────────┐
│ product ┆ price ┆ tier │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ str │
╞═════════════╪═══════╪═════════╡
│ Widget │ 25.0 │ budget │
│ Gadget │ 150.0 │ premium │
│ Gizmo │ 75.0 │ budget │
│ Thingamajig │ 200.0 │ premium │
└─────────────┴───────┴─────────┘
Notice the pl.lit() wrapper around string literals. This converts Python values into Polars expressions. For simple cases, Polars can often infer this, but being explicit avoids surprises.
Chaining Multiple Conditions
Real-world logic rarely involves just two outcomes. Chain additional .when().then() pairs before the final .otherwise() to handle multiple conditions. Polars evaluates conditions in order and returns the first match.
df = pl.DataFrame({
"student": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
"score": [95, 82, 78, 65, 45]
})
# Convert numeric scores to letter grades
result = df.with_columns(
pl.when(pl.col("score") >= 90).then(pl.lit("A"))
.when(pl.col("score") >= 80).then(pl.lit("B"))
.when(pl.col("score") >= 70).then(pl.lit("C"))
.when(pl.col("score") >= 60).then(pl.lit("D"))
.otherwise(pl.lit("F"))
.alias("grade")
)
print(result)
Output:
shape: (5, 3)
┌─────────┬───────┬───────┐
│ student ┆ score ┆ grade │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞═════════╪═══════╪═══════╡
│ Alice │ 95 │ A │
│ Bob │ 82 │ B │
│ Charlie │ 78 │ C │
│ Diana │ 65 │ D │
│ Eve │ 45 │ F │
└─────────┴───────┴───────┘
Order matters here. If you checked score >= 60 first, Alice’s 95 would match that condition and she’d get a D. Always order conditions from most specific to least specific.
Working with Multiple Columns
Conditions can reference any column in your DataFrame, and return values can come from columns too—not just literals.
df = pl.DataFrame({
"customer": ["Acme Corp", "StartupXYZ", "BigCo", "SmallBiz"],
"tier": ["gold", "silver", "gold", "bronze"],
"purchase_amount": [1000.00, 500.00, 2500.00, 150.00],
"gold_discount": [0.20, 0.20, 0.20, 0.20],
"silver_discount": [0.10, 0.10, 0.10, 0.10],
})
# Apply tier-specific discounts, with extra discount for large orders
result = df.with_columns(
pl.when((pl.col("tier") == "gold") & (pl.col("purchase_amount") > 2000))
.then(pl.col("purchase_amount") * (1 - pl.col("gold_discount") - 0.05))
.when(pl.col("tier") == "gold")
.then(pl.col("purchase_amount") * (1 - pl.col("gold_discount")))
.when(pl.col("tier") == "silver")
.then(pl.col("purchase_amount") * (1 - pl.col("silver_discount")))
.otherwise(pl.col("purchase_amount") * 0.95) # 5% for bronze
.alias("final_price")
)
print(result)
Output:
shape: (4, 6)
┌────────────┬────────┬─────────────────┬───────────────┬─────────────────┬─────────────┐
│ customer ┆ tier ┆ purchase_amount ┆ gold_discount ┆ silver_discount ┆ final_price │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════════╪════════╪═════════════════╪═══════════════╪═════════════════╪═════════════╡
│ Acme Corp │ gold │ 1000.0 │ 0.2 │ 0.1 │ 800.0 │
│ StartupXYZ │ silver │ 500.0 │ 0.2 │ 0.1 │ 450.0 │
│ BigCo │ gold │ 2500.0 │ 0.2 │ 0.1 │ 1875.0 │
│ SmallBiz │ bronze │ 150.0 │ 0.2 │ 0.1 │ 142.5 │
└────────────┴────────┴─────────────────┴───────────────┴─────────────────┴─────────────┘
This example shows conditions combining multiple columns with & (and) operators, return values computed from column expressions, and business logic that would be painful to express with simple filtering.
Nested and Complex Expressions
The when/then/otherwise pattern composes with other Polars expressions. You can use it inside select(), with_columns(), aggregations, and even nest conditional expressions within each other.
df = pl.DataFrame({
"name": ["alice smith", "BOB JONES", None, "Charlie Brown", ""],
"email": ["alice@example.com", None, "invalid", "charlie@test.org", "diana@company.io"],
"age": [25, 30, None, 45, 22]
})
# Clean and standardize data with conditional logic
result = df.with_columns(
# Normalize names: title case, handle nulls and empty strings
pl.when(pl.col("name").is_null() | (pl.col("name").str.len_chars() == 0))
.then(pl.lit("Unknown"))
.otherwise(pl.col("name").str.to_titlecase())
.alias("clean_name"),
# Validate emails and create a status field
pl.when(pl.col("email").is_null())
.then(pl.lit("missing"))
.when(~pl.col("email").str.contains("@"))
.then(pl.lit("invalid"))
.otherwise(pl.lit("valid"))
.alias("email_status"),
# Categorize age groups, handling nulls explicitly
pl.when(pl.col("age").is_null())
.then(pl.lit("unknown"))
.when(pl.col("age") < 30)
.then(pl.lit("young"))
.when(pl.col("age") < 50)
.then(pl.lit("middle"))
.otherwise(pl.lit("senior"))
.alias("age_group")
)
print(result)
Output:
shape: (5, 6)
┌───────────────┬──────────────────┬──────┬───────────────┬──────────────┬───────────┐
│ name ┆ email ┆ age ┆ clean_name ┆ email_status ┆ age_group │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ str ┆ str ┆ str │
╞═══════════════╪══════════════════╪══════╪═══════════════╪══════════════╪═══════════╡
│ alice smith │ alice@example.com│ 25 │ Alice Smith │ valid │ young │
│ BOB JONES │ null │ 30 │ Bob Jones │ missing │ middle │
│ null │ invalid │ null │ Unknown │ invalid │ unknown │
│ Charlie Brown │ charlie@test.org │ 45 │ Charlie Brown │ valid │ middle │
│ │ diana@company.io │ 22 │ Unknown │ valid │ young │
└───────────────┴──────────────────┴──────┴───────────────┴──────────────┴───────────┘
Performance Considerations
The when/then/otherwise pattern isn’t just more readable than apply()—it’s dramatically faster. Here’s why, with benchmarks to prove it.
import polars as pl
import time
# Create a larger dataset for meaningful benchmarks
n_rows = 1_000_000
df = pl.DataFrame({
"value": list(range(n_rows))
})
# Method 1: when/then/otherwise (vectorized)
start = time.perf_counter()
result1 = df.with_columns(
pl.when(pl.col("value") < 250_000).then(pl.lit("low"))
.when(pl.col("value") < 500_000).then(pl.lit("medium-low"))
.when(pl.col("value") < 750_000).then(pl.lit("medium-high"))
.otherwise(pl.lit("high"))
.alias("category")
)
vectorized_time = time.perf_counter() - start
# Method 2: map_elements (row-by-row) - DON'T DO THIS
def categorize(val):
if val < 250_000:
return "low"
elif val < 500_000:
return "medium-low"
elif val < 750_000:
return "medium-high"
else:
return "high"
start = time.perf_counter()
result2 = df.with_columns(
pl.col("value").map_elements(categorize, return_dtype=pl.Utf8).alias("category")
)
apply_time = time.perf_counter() - start
print(f"Vectorized (when/then/otherwise): {vectorized_time:.4f} seconds")
print(f"Row-wise (map_elements): {apply_time:.4f} seconds")
print(f"Speedup: {apply_time / vectorized_time:.1f}x faster")
Typical output:
Vectorized (when/then/otherwise): 0.0156 seconds
Row-wise (map_elements): 2.3841 seconds
Speedup: 152.8x faster
The vectorized approach is over 100x faster because it processes data in batches using CPU vector instructions, avoids Python interpreter overhead for each row, and enables Polars’ query optimizer to work its magic.
When using lazy evaluation with LazyFrame, Polars can optimize even further—potentially reordering operations, eliminating unnecessary computations, and parallelizing across CPU cores.
Common Pitfalls and Tips
Type consistency is mandatory. Every branch must return the same type. This will fail:
# DON'T DO THIS - mixing types
pl.when(pl.col("value") > 0)
.then(pl.col("value")) # Returns integer
.otherwise(pl.lit("N/A")) # Returns string - ERROR!
Fix it by casting or using consistent types:
# Option 1: Use null instead of "N/A"
pl.when(pl.col("value") > 0)
.then(pl.col("value"))
.otherwise(pl.lit(None))
# Option 2: Cast everything to string
pl.when(pl.col("value") > 0)
.then(pl.col("value").cast(pl.Utf8))
.otherwise(pl.lit("N/A"))
Null handling requires explicit attention. Comparisons with null return null, not false. If you have nulls in your data, check for them first:
pl.when(pl.col("value").is_null())
.then(pl.lit("missing"))
.when(pl.col("value") > 100)
.then(pl.lit("high"))
.otherwise(pl.lit("low"))
Debug complex chains incrementally. When a multi-condition expression isn’t working as expected, test each condition separately:
# Debug by checking what each condition matches
df.with_columns(
(pl.col("score") >= 90).alias("is_A"),
(pl.col("score") >= 80).alias("is_B_or_above"),
# ... etc
)
Always include .otherwise(). While Polars technically allows omitting it (defaulting to null), explicit is better than implicit. Future you will thank present you for the clarity.
The when/then/otherwise pattern is your primary tool for conditional logic in Polars. Master it, and you’ll write faster, more readable data transformations.