How to Use When/Otherwise in PySpark
Key Insights
- The when().otherwise() pattern in PySpark provides SQL-like CASE WHEN functionality for conditional column transformations, returning null when otherwise() is omitted
- Chain multiple when() clauses to handle complex if/elif/else logic, but always wrap compound conditions in parentheses to avoid operator precedence bugs
- Prefer when().otherwise() over UDFs for conditional logic—it executes natively in Spark's Catalyst optimizer and avoids the serialization overhead of Python functions
Introduction
Conditional logic sits at the heart of most data transformations. Whether you’re categorizing customers, flagging anomalies, or deriving new features, you need a reliable way to apply different logic based on data values. In PySpark, the when().otherwise() pattern from pyspark.sql.functions serves this purpose.
Think of it as PySpark’s equivalent to SQL’s CASE WHEN statement or Python’s if/elif/else chains. Unlike Python’s native conditionals, when().otherwise() operates on entire DataFrame columns, enabling vectorized transformations that Spark can optimize and distribute across your cluster.
This article covers everything you need to use when().otherwise() effectively: basic syntax, chaining conditions, integration with other functions, and the pitfalls that trip up even experienced engineers.
Understanding the Syntax
The when() function takes two arguments: a condition and a value to return when that condition evaluates to true. The otherwise() method specifies the default value when no conditions match.
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col
spark = SparkSession.builder.appName("when_otherwise_demo").getOrCreate()
# Sample data
data = [("Alice", 85), ("Bob", 42), ("Charlie", 78), ("Diana", 91)]
df = spark.createDataFrame(data, ["name", "score"])
# Simple condition: classify scores as "high" or "low"
df_classified = df.withColumn(
    "performance",
    when(col("score") >= 70, "high").otherwise("low")
)
df_classified.show()
Output:
+-------+-----+-----------+
| name|score|performance|
+-------+-----+-----------+
| Alice| 85| high|
| Bob| 42| low|
|Charlie| 78| high|
| Diana| 91| high|
+-------+-----+-----------+
The return type of when().otherwise() is a Column object, which means you can use it anywhere PySpark expects a column expression: inside select(), withColumn(), filter(), or even nested within other functions.
One critical detail: when() and otherwise() must return compatible types. Mixing strings and integers in the same conditional chain will cause schema issues or runtime errors.
Chaining Multiple Conditions
Real-world transformations rarely involve binary choices. The when() function supports chaining for multi-way branching, evaluated top-to-bottom like Python’s elif statements.
from pyspark.sql.functions import when, col
# Sample data with ages
data = [
    ("Alice", 8), ("Bob", 15), ("Charlie", 32),
    ("Diana", 67), ("Eve", 19), ("Frank", 45)
]
df = spark.createDataFrame(data, ["name", "age"])
# Multi-tier age classification
df_age_groups = df.withColumn(
    "age_group",
    when(col("age") < 13, "child")
    .when(col("age") < 20, "teen")
    .when(col("age") < 65, "adult")
    .otherwise("senior")
)
df_age_groups.show()
Output:
+-------+---+---------+
| name|age|age_group|
+-------+---+---------+
| Alice| 8| child|
| Bob| 15| teen|
|Charlie| 32| adult|
| Diana| 67| senior|
| Eve| 19| teen|
| Frank| 45| adult|
+-------+---+---------+
The order of conditions matters. Spark evaluates them sequentially and returns the value for the first matching condition. If you placed col("age") < 65 before col("age") < 13, everyone under 65 would be classified as “adult,” including children.
For readability with many conditions, I recommend breaking each .when() onto its own line. This makes the logic scannable and diffs cleaner in version control.
Using with Column Operations
The when().otherwise() pattern integrates seamlessly with PySpark’s column operations. You’ll most commonly use it within withColumn() to create new columns or select() to transform existing ones.
Here’s a more complex example that derives a column from multiple source columns:
from pyspark.sql.functions import when, col
# E-commerce order data
data = [
    ("ORD001", 150.00, "gold", True),
    ("ORD002", 45.00, "silver", False),
    ("ORD003", 500.00, "gold", True),
    ("ORD004", 75.00, "bronze", False),
    ("ORD005", 200.00, "silver", True),
]
df = spark.createDataFrame(
    data,
    ["order_id", "amount", "customer_tier", "is_prime"]
)
# Calculate discount based on multiple factors
df_with_discount = df.withColumn(
    "discount_pct",
    when(
        (col("customer_tier") == "gold") & (col("is_prime") == True),
        0.20
    )
    .when(col("customer_tier") == "gold", 0.15)
    .when(
        (col("customer_tier") == "silver") & (col("is_prime") == True),
        0.12
    )
    .when(col("customer_tier") == "silver", 0.08)
    .when(col("is_prime") == True, 0.05)
    .otherwise(0.0)
)
# Calculate final price
df_final = df_with_discount.withColumn(
    "final_amount",
    col("amount") * (1 - col("discount_pct"))
)
df_final.show()
Output:
+--------+------+-------------+--------+------------+------------+
|order_id|amount|customer_tier|is_prime|discount_pct|final_amount|
+--------+------+-------------+--------+------------+------------+
| ORD001| 150.0| gold| true| 0.2| 120.0|
| ORD002| 45.0| silver| false| 0.08| 41.4|
| ORD003| 500.0| gold| true| 0.2| 400.0|
| ORD004| 75.0| bronze| false| 0.0| 75.0|
| ORD005| 200.0| silver| true| 0.12| 176.0|
+--------+------+-------------+--------+------------+------------+
You can also use when().otherwise() directly in select() statements, which is useful when you’re reshaping the DataFrame anyway:
df.select(
    col("order_id"),
    col("amount"),
    when(col("amount") > 100, "large").otherwise("small").alias("order_size")
).show()
Combining with Other Functions
The real power of when().otherwise() emerges when you combine it with PySpark’s rich function library. Null handling, membership testing, and string operations all integrate cleanly.
from pyspark.sql.functions import when, col, isnull, trim
# Messy customer data
data = [
    ("Alice", "alice@email.com", 28),
    ("Bob", None, 35),
    ("Charlie", " ", 42),
    (None, "diana@email.com", 55),
    ("Eve", "eve@email.com", None),
]
df = spark.createDataFrame(data, ["name", "email", "age"])
# Flag records with data quality issues
df_flagged = df.withColumn(
    "data_quality_flag",
    when(isnull(col("name")), "missing_name")
    .when(isnull(col("email")) | (trim(col("email")) == ""), "missing_email")
    .when(isnull(col("age")), "missing_age")
    .otherwise("valid")
)
df_flagged.show()
Output:
+-------+---------------+----+-----------------+
| name| email| age|data_quality_flag|
+-------+---------------+----+-----------------+
| Alice|alice@email.com| 28| valid|
| Bob| null| 35| missing_email|
|Charlie| | 42| missing_email|
| null|diana@email.com| 55| missing_name|
| Eve| eve@email.com|null| missing_age|
+-------+---------------+----+-----------------+
For membership testing, isin() pairs naturally with when():
from pyspark.sql.functions import when, col
priority_customers = ["Alice", "Charlie", "Eve"]
df_priority = df.withColumn(
    "is_priority",
    when(col("name").isin(priority_customers), True).otherwise(False)
)
# Note: col("name").isin(priority_customers) alone already yields a boolean
# column; the when() wrapper only changes how null names are handled
You can also nest when().otherwise() expressions, though I’d caution against deep nesting—it quickly becomes unreadable.
Common Pitfalls and Best Practices
Missing otherwise() Returns Null
The most common mistake: forgetting otherwise(). When no conditions match and there’s no default, PySpark returns null—not an error.
# Problematic: missing otherwise()
df_bad = df.withColumn(
    "category",
    when(col("score") >= 90, "excellent")
    .when(col("score") >= 70, "good")
    # Scores below 70 become null!
)
# Fixed: always include otherwise()
df_good = df.withColumn(
    "category",
    when(col("score") >= 90, "excellent")
    .when(col("score") >= 70, "good")
    .otherwise("needs_improvement")
)
Operator Precedence with Compound Conditions
This one bites everyone at least once. PySpark’s bitwise operators (&, |) have higher precedence than comparison operators. Without parentheses, your conditions won’t evaluate as expected.
# WRONG: operator precedence issue
df.withColumn(
    "flag",
    # Parses as col("age") > (18 & col("score")) > 50; the chained
    # comparison then tries to convert a Column to bool and raises an error
    when(col("age") > 18 & col("score") > 50, "pass")
    .otherwise("fail")
)
# CORRECT: wrap each condition in parentheses
df.withColumn(
    "flag",
    when((col("age") > 18) & (col("score") > 50), "pass")
    .otherwise("fail")
)
Performance Considerations
Long chains of when() clauses execute efficiently because Spark’s Catalyst optimizer handles them natively. However, if you find yourself with 20+ conditions, consider whether a join against a lookup table might be clearer:
# Instead of 50 when() clauses for state abbreviations,
# use a lookup join (assuming df has a "state" column)
state_lookup = spark.createDataFrame([
    ("CA", "California"), ("NY", "New York"),  # ...
], ["abbrev", "full_name"])
df.join(state_lookup, df.state == state_lookup.abbrev, "left")
Conclusion
The when().otherwise() pattern handles the vast majority of conditional transformation needs in PySpark. Master the basics—proper parentheses, always including otherwise(), and chaining conditions in the right order—and you'll avoid the most common debugging sessions.
For straightforward conditionals, when().otherwise() should be your default choice. It’s readable, performant, and integrates with Spark’s optimizer. Reserve UDFs for truly complex logic that can’t be expressed with built-in functions, and consider lookup joins when your conditions are really just mapping one set of values to another.