PySpark - Case When (Multiple Conditions)
Key Insights
- PySpark's `when()` function enables SQL-like CASE WHEN logic for DataFrames, allowing you to chain multiple conditions that evaluate sequentially until a match is found
- Complex boolean logic requires bitwise operators (`&`, `|`, `~`) instead of Python's standard `and`, `or`, `not`, with proper parentheses to control evaluation order
- Chained `when()` statements significantly outperform user-defined functions (UDFs) for conditional logic, as they execute natively in the Spark engine without Python serialization overhead
Introduction to PySpark’s when() Function
When working with PySpark DataFrames, you can’t use standard Python conditionals like if-elif-else directly on DataFrame columns. These constructs work with single values, not distributed column data. Instead, PySpark provides the when() function from pyspark.sql.functions to apply conditional logic across entire columns.
The when() function creates column expressions that Spark can optimize and execute in parallel across your cluster. Think of it as the DataFrame equivalent of SQL’s CASE WHEN statement—it evaluates conditions sequentially and returns the first matching result.
Here’s a simple example:
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col
spark = SparkSession.builder.appName("WhenExample").getOrCreate()
df = spark.createDataFrame([
    (1, 85),
    (2, 92),
    (3, 78),
    (4, 65)
], ["student_id", "score"])

df_with_grade = df.withColumn(
    "grade",
    when(col("score") >= 90, "A").otherwise("B")
)
df_with_grade.show()
This creates a new column based on a single condition, but real-world scenarios demand much more complexity.
Basic Syntax and Structure
The when() function follows this pattern:
when(condition, value)
You chain multiple when() calls to handle multiple conditions, and finish with otherwise() to define the default value when no conditions match. The structure looks like this:
when(condition1, value1)
.when(condition2, value2)
.when(condition3, value3)
.otherwise(default_value)
Conditions are evaluated in order from top to bottom. Once a condition matches, Spark returns that value and skips remaining conditions—just like SQL CASE WHEN or a series of if-elif statements.
Here’s a practical example with multiple conditions:
from pyspark.sql.functions import when, col
df = spark.createDataFrame([
    (1, 95),
    (2, 87),
    (3, 78),
    (4, 65),
    (5, 42)
], ["student_id", "score"])

df_graded = df.withColumn(
    "letter_grade",
    when(col("score") >= 90, "A")
    .when(col("score") >= 80, "B")
    .when(col("score") >= 70, "C")
    .when(col("score") >= 60, "D")
    .otherwise("F")
)
df_graded.show()
The otherwise() clause is optional, but omitting it results in null for non-matching rows—usually not what you want.
Multiple Conditions with Chained when()
For complex business logic, you’ll often need extensive condition chains. Let’s build a realistic employee categorization system based on salary ranges:
from pyspark.sql.functions import when, col
employees = spark.createDataFrame([
    ("E001", "John", 145000),
    ("E002", "Sarah", 95000),
    ("E003", "Mike", 72000),
    ("E004", "Emily", 58000),
    ("E005", "David", 35000),
    ("E006", "Lisa", 180000)
], ["emp_id", "name", "salary"])

employees_categorized = employees.withColumn(
    "salary_band",
    when(col("salary") >= 150000, "Executive")
    .when(col("salary") >= 100000, "Senior")
    .when(col("salary") >= 75000, "Mid-Level")
    .when(col("salary") >= 50000, "Junior")
    .when(col("salary") >= 30000, "Entry-Level")
    .otherwise("Contractor")
).withColumn(
    "bonus_percentage",
    when(col("salary") >= 150000, 0.20)
    .when(col("salary") >= 100000, 0.15)
    .when(col("salary") >= 75000, 0.10)
    .when(col("salary") >= 50000, 0.05)
    .otherwise(0.02)
)
employees_categorized.show(truncate=False)
You can also create conditions based on multiple columns simultaneously:
sales_data = spark.createDataFrame([
    ("Q1", "Electronics", 125000),
    ("Q1", "Clothing", 45000),
    ("Q2", "Electronics", 98000),
    ("Q2", "Clothing", 52000)
], ["quarter", "category", "revenue"])

sales_with_status = sales_data.withColumn(
    "performance",
    when((col("quarter") == "Q1") & (col("revenue") > 100000), "Q1 Excellent")
    .when((col("quarter") == "Q1") & (col("revenue") > 50000), "Q1 Good")
    .when((col("quarter") == "Q2") & (col("revenue") > 100000), "Q2 Excellent")
    .when((col("quarter") == "Q2") & (col("revenue") > 50000), "Q2 Good")
    .otherwise("Needs Improvement")
)
sales_with_status.show(truncate=False)
Using Complex Boolean Logic
PySpark uses bitwise operators for combining conditions because standard Python logical operators (and, or, not) don’t work with column expressions. Use these instead:
- `&` for AND
- `|` for OR
- `~` for NOT
Critical: Always wrap individual conditions in parentheses when using these operators. Python’s operator precedence can cause unexpected behavior otherwise.
employees = spark.createDataFrame([
    ("Alice", 28, "Engineering", 85000),
    ("Bob", 35, "Engineering", 105000),
    ("Carol", 42, "Sales", 75000),
    ("Dave", 29, "Sales", 68000),
    ("Eve", 38, "Engineering", 95000)
], ["name", "age", "department", "salary"])

# Complex AND conditions
employees_filtered = employees.withColumn(
    "promotion_eligible",
    when(
        (col("age") >= 30) &
        (col("department") == "Engineering") &
        (col("salary") < 100000),
        "Yes"
    ).otherwise("No")
)

# Complex OR conditions
employees_with_bonus = employees.withColumn(
    "special_bonus",
    when(
        (col("salary") > 100000) |
        ((col("department") == "Sales") & (col("age") < 30)),
        5000
    ).otherwise(1000)
)

# NOT conditions
employees_with_flag = employees.withColumn(
    "review_needed",
    when(
        ~((col("department") == "Engineering") & (col("salary") > 90000)),
        "Required"
    ).otherwise("Optional")
)
employees_with_flag.show()
You can nest conditions for even more complex logic:
complex_conditions = employees.withColumn(
    "category",
    when(
        ((col("age") < 30) & (col("salary") > 80000)) |
        ((col("age") >= 30) & (col("salary") > 100000)),
        "High Performer"
    ).when(
        (col("department") == "Sales") &
        ((col("age") >= 35) | (col("salary") < 70000)),
        "Review Required"
    ).otherwise("Standard")
)
complex_conditions.show()
Column Expressions in when() Conditions
Beyond simple comparisons, you can use various column methods within when() conditions:
from pyspark.sql.functions import when, col

# Sample data with nulls
data_with_nulls = spark.createDataFrame([
    (1, "Alice", 75000, "Engineering"),
    (2, "Bob", None, "Sales"),
    (3, None, 85000, "Engineering"),
    (4, "Carol", 65000, None),
    (5, "Dave", 95000, "Marketing")
], ["id", "name", "salary", "department"])

# Null handling with the isNull()/isNotNull() Column methods
cleaned_data = data_with_nulls.withColumn(
    "salary_status",
    when(col("salary").isNull(), "No Salary Data")
    .when(col("salary") < 70000, "Below Average")
    .when(col("salary") >= 90000, "Above Average")
    .otherwise("Average")
).withColumn(
    "employee_complete",
    when(
        col("name").isNotNull() &
        col("salary").isNotNull() &
        col("department").isNotNull(),
        "Complete"
    ).otherwise("Incomplete")
)
cleaned_data.show()
String operations work seamlessly in conditions:
from pyspark.sql.functions import when, col, lower

products = spark.createDataFrame([
    ("LAPTOP-PRO-001", 1299.99),
    ("laptop-basic-002", 599.99),
    ("PHONE-X-003", 899.99),
    ("tablet-mini-004", 399.99)
], ["product_code", "price"])

products_categorized = products.withColumn(
    "product_type",
    when(lower(col("product_code")).contains("laptop"), "Computer")
    .when(lower(col("product_code")).contains("phone"), "Mobile")
    .when(lower(col("product_code")).contains("tablet"), "Tablet")
    .otherwise("Other")
).withColumn(
    "price_tier",
    when(col("price") > 1000, "Premium")
    .when(col("price") > 500, "Mid-Range")
    .otherwise("Budget")
)
products_categorized.show(truncate=False)
Performance Considerations and Best Practices
Chained when() statements execute efficiently in Spark’s Catalyst optimizer. However, you can still optimize your code:
Avoid this (redundant evaluations):
# Poor approach - recalculates same condition
df_bad = df.withColumn(
    "status",
    when(col("price") * col("quantity") > 10000, "High Value")
    .otherwise("Low Value")
).withColumn(
    "discount",
    when(col("price") * col("quantity") > 10000, 0.15)
    .otherwise(0.05)
)
Do this instead (calculate once, reuse):
# Better approach - calculate once
from pyspark.sql.functions import when, col
df_good = df.withColumn("total_value", col("price") * col("quantity"))
df_good = df_good.withColumn(
    "status",
    when(col("total_value") > 10000, "High Value")
    .otherwise("Low Value")
).withColumn(
    "discount",
    when(col("total_value") > 10000, 0.15)
    .otherwise(0.05)
)
Never use UDFs for simple conditional logic. They’re dramatically slower:
# SLOW - Don't do this
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
@udf(returnType=StringType())
def categorize_salary(salary):
    if salary >= 100000:
        return "Senior"
    elif salary >= 75000:
        return "Mid"
    else:
        return "Junior"
# FAST - Do this instead
df_fast = df.withColumn(
    "category",
    when(col("salary") >= 100000, "Senior")
    .when(col("salary") >= 75000, "Mid")
    .otherwise("Junior")
)
Common Pitfalls and Troubleshooting
Pitfall 1: Missing parentheses with boolean operators
# WRONG - will cause unexpected behavior
when(col("age") > 30 & col("salary") > 80000, "Yes")
# CORRECT
when((col("age") > 30) & (col("salary") > 80000), "Yes")
Pitfall 2: Using Python operators instead of bitwise
# WRONG - won't work
when(col("age") > 30 and col("salary") > 80000, "Yes")
# CORRECT
when((col("age") > 30) & (col("salary") > 80000), "Yes")
Pitfall 3: Type mismatches
# WRONG - comparing string to integer
when(col("age") > "30", "Yes")
# CORRECT - use proper types
when(col("age") > 30, "Yes")
Pitfall 4: Forgetting otherwise() and getting unexpected nulls
# RISKY - returns null for non-matches
df.withColumn("status", when(col("score") > 90, "Pass"))
# SAFER - explicit default
df.withColumn("status", when(col("score") > 90, "Pass").otherwise("Fail"))
Master these patterns and you’ll handle any conditional logic PySpark throws at you. The when() function is your go-to tool for transforming data based on complex business rules—use it wisely and your Spark jobs will run efficiently at scale.