PySpark - Map Column Values Using when/otherwise
Key Insights
- PySpark’s when/otherwise provides a distributed-friendly alternative to pandas’ apply() or dictionary mapping, executing transformations in parallel across cluster nodes without serialization overhead
- Chain multiple when() clauses to build complex conditional logic that remains readable and maintainable, with otherwise() serving as the final fallback case
- Explicit null handling in your conditions prevents unexpected behavior: PySpark doesn’t treat null comparisons the same way as Python’s None
Introduction to Column Mapping in PySpark
When working with large-scale data in PySpark, you’ll frequently need to transform column values based on conditional logic. Whether you’re categorizing continuous variables, cleaning data inconsistencies, or engineering features for machine learning, conditional mapping is a fundamental operation.
Unlike pandas where you might reach for apply() with a lambda function or dictionary mapping with map(), PySpark requires a different approach. The when() and otherwise() functions from pyspark.sql.functions are specifically designed for distributed environments. They generate optimized execution plans that Spark can parallelize across your cluster, avoiding the serialization costs and performance penalties of user-defined functions (UDFs).
Let’s start with a sample dataset we’ll use throughout this article:
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col
spark = SparkSession.builder.appName("ColumnMapping").getOrCreate()
data = [
(1, "John", 25, "premium", 1500),
(2, "Sarah", 17, "standard", 300),
(3, "Mike", 34, "premium", 2500),
(4, "Emma", 45, "standard", 800),
(5, "Alex", 16, None, 150),
(6, "Lisa", 29, "premium", 500)
]
df = spark.createDataFrame(data, ["id", "name", "age", "customer_type", "purchase_amount"])
df.show()
Basic when/otherwise Syntax
The when() function evaluates a boolean condition and returns a specified value when true. The otherwise() function provides the default value when none of the conditions match. Think of it as PySpark’s equivalent to SQL’s CASE WHEN statement or Python’s ternary operator, but designed for DataFrame columns.
The basic structure is:
when(condition, value).otherwise(default_value)
Here’s a simple example that categorizes people as adults or minors:
from pyspark.sql.functions import when
df_with_category = df.withColumn(
"age_category",
when(col("age") >= 18, "adult")
.otherwise("minor")
)
df_with_category.select("name", "age", "age_category").show()
Output:
+-----+---+------------+
| name|age|age_category|
+-----+---+------------+
| John| 25| adult|
|Sarah| 17| minor|
| Mike| 34| adult|
| Emma| 45| adult|
| Alex| 16| minor|
| Lisa| 29| adult|
+-----+---+------------+
Notice we use col("age") to reference the column. You can also use the DataFrame’s column accessor with df.age or df["age"], but col() is more explicit and works consistently across different contexts.
Chaining Multiple Conditions
Real-world scenarios rarely involve simple binary logic. You’ll often need to evaluate multiple conditions in sequence. Chain multiple when() calls together, with each subsequent condition evaluated only if the previous ones were false.
Let’s map numeric scores to letter grades:
# First, let's add a score column to our data
scores_data = [
(1, "John", 92),
(2, "Sarah", 87),
(3, "Mike", 76),
(4, "Emma", 64),
(5, "Alex", 45),
(6, "Lisa", 98)
]
scores_df = spark.createDataFrame(scores_data, ["id", "name", "score"])
graded_df = scores_df.withColumn(
"grade",
when(col("score") >= 90, "A")
.when(col("score") >= 80, "B")
.when(col("score") >= 70, "C")
.when(col("score") >= 60, "D")
.otherwise("F")
)
graded_df.show()
Output:
+---+-----+-----+-----+
| id| name|score|grade|
+---+-----+-----+-----+
| 1| John| 92| A|
| 2|Sarah| 87| B|
| 3| Mike| 76| C|
| 4| Emma| 64| D|
| 5| Alex| 45| F|
| 6| Lisa| 98| A|
+---+-----+-----+-----+
The order matters here. Conditions are evaluated top-to-bottom, and the first matching condition wins. If we reversed the order, every score above 60 would get a “D” grade because that condition would match first.
Working with Multiple Columns
Most business logic involves evaluating multiple columns together. PySpark’s column expressions let you combine conditions using standard boolean operators: & (and), | (or), and ~ (not).
Important: Always wrap individual conditions in parentheses when using boolean operators. Python’s operator precedence can cause unexpected behavior otherwise.
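The precedence trap is ordinary Python behavior, not something Spark adds: & binds more tightly than comparison operators, so the same expression parses differently with and without parentheses. A plain-Python sketch of the grouping:

```python
# & binds tighter than ==, so this parses as 1 == (1 & 0), i.e. 1 == 0
without_parens = 1 == 1 & 0
print(without_parens)   # False -- probably not what was intended

# With parentheses, each comparison is evaluated first
with_parens = (1 == 1) & (0 == 0)
print(with_parens)      # True
```

In PySpark the unparenthesized form usually fails even earlier, typically with an error along the lines of "Cannot convert column into bool", because Python’s chained-comparison rules end up calling bool() on a Column.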
Let’s determine discount eligibility based on customer type and purchase amount:
df_with_discount = df.withColumn(
"discount_eligible",
when(
(col("customer_type") == "premium") & (col("purchase_amount") >= 1000),
"20% discount"
)
.when(
(col("customer_type") == "premium") & (col("purchase_amount") >= 500),
"10% discount"
)
.when(
(col("customer_type") == "standard") & (col("purchase_amount") >= 1000),
"5% discount"
)
.otherwise("no discount")
)
df_with_discount.select("name", "customer_type", "purchase_amount", "discount_eligible").show()
Output:
+-----+-------------+---------------+-----------------+
| name|customer_type|purchase_amount|discount_eligible|
+-----+-------------+---------------+-----------------+
| John| premium| 1500| 20% discount|
|Sarah| standard| 300| no discount|
| Mike| premium| 2500| 20% discount|
| Emma| standard| 800| no discount|
| Alex| null| 150| no discount|
| Lisa| premium| 500| 10% discount|
+-----+-------------+---------------+-----------------+
You can also use more complex column expressions, including mathematical operations:
df_with_value_score = df.withColumn(
"value_segment",
when(col("purchase_amount") / col("age") > 50, "high value")
.when(col("purchase_amount") / col("age") > 20, "medium value")
.otherwise("low value")
)
Common Patterns and Best Practices
Handling Null Values
Null handling is critical in PySpark. Unlike Python where None == None is True, SQL-style null comparisons always return null, not true or false. This means col("column") == None won’t work as expected.
Use isNull() and isNotNull() methods explicitly:
df_with_null_handling = df.withColumn(
"customer_status",
when(col("customer_type").isNull(), "unknown")
.when(col("customer_type") == "premium", "VIP")
.otherwise("regular")
)
df_with_null_handling.select("name", "customer_type", "customer_status").show()
Output:
+-----+-------------+---------------+
| name|customer_type|customer_status|
+-----+-------------+---------------+
| John| premium| VIP|
|Sarah| standard| regular|
| Mike| premium| VIP|
| Emma| standard| regular|
| Alex| null| unknown|
| Lisa| premium| VIP|
+-----+-------------+---------------+
Using isin() for Multiple Values
When checking if a column matches any value in a list, use isin() instead of chaining multiple OR conditions:
# Less readable
when((col("status") == "active") | (col("status") == "pending") | (col("status") == "trial"), "engaged")
# Better
when(col("status").isin(["active", "pending", "trial"]), "engaged")
Performance Considerations
when/otherwise compiles to native Spark SQL expressions that the Catalyst optimizer can analyze and optimize, so the work executes entirely inside the JVM. This makes it significantly faster than Python UDFs, which must serialize every row between the JVM and Python worker processes.
Use when/otherwise when:
- Your logic can be expressed with column operations
- You’re working with standard data types
- Performance is critical
Consider UDFs only when:
- You need functionality from Python-only libraries that has no native Spark SQL equivalent
- The logic is too complex for column expressions
- You’re willing to accept the performance trade-off
Real-World Use Case: Customer Segmentation
Let’s build a complete customer segmentation system that combines multiple business rules:
from pyspark.sql.functions import when, col
# Extended customer data
customer_data = [
(1, "John Doe", "2020-01-15", 15, 12500, "premium", True),
(2, "Sarah Smith", "2023-06-20", 2, 450, "standard", True),
(3, "Mike Johnson", "2019-03-10", 24, 28000, "premium", True),
(4, "Emma Wilson", "2022-11-05", 8, 3200, "standard", True),
(5, "Alex Brown", None, 0, 0, None, False),
(6, "Lisa Davis", "2021-08-22", 18, 8900, "premium", True)
]
customer_df = spark.createDataFrame(
customer_data,
["customer_id", "name", "signup_date", "months_active", "lifetime_value", "tier", "email_verified"]
)
# Build comprehensive segmentation
segmented_df = customer_df.withColumn(
"segment",
when(
col("signup_date").isNull() | ~col("email_verified"),
"inactive"
)
.when(
(col("tier") == "premium") &
(col("lifetime_value") >= 20000) &
(col("months_active") >= 12),
"champion"
)
.when(
(col("tier") == "premium") &
(col("lifetime_value") >= 10000),
"loyal"
)
.when(
(col("months_active") >= 12) &
(col("lifetime_value") >= 5000),
"potential_loyalist"
)
.when(
col("months_active") <= 3,
"new"
)
.otherwise("standard")
).withColumn(
"priority",
when(col("segment").isin(["champion", "loyal"]), "high")
.when(col("segment").isin(["potential_loyalist", "new"]), "medium")
.otherwise("low")
)
segmented_df.select("name", "tier", "months_active", "lifetime_value", "segment", "priority").show(truncate=False)
Output:
+------------+--------+-------------+--------------+-------------------+--------+
|name |tier |months_active|lifetime_value|segment |priority|
+------------+--------+-------------+--------------+-------------------+--------+
|John Doe |premium |15 |12500 |loyal |high |
|Sarah Smith |standard|2 |450 |new |medium |
|Mike Johnson|premium |24 |28000 |champion |high |
|Emma Wilson |standard|8 |3200 |standard |low |
|Alex Brown |null |0 |0 |inactive |low |
|Lisa Davis |premium |18 |8900 |potential_loyalist |medium |
+------------+--------+-------------+--------------+-------------------+--------+
This segmentation system demonstrates several best practices: explicit null handling at the top of the condition chain, combining multiple column conditions with clear business logic, and building derived columns based on previously calculated segments.
The when/otherwise pattern scales efficiently to billions of rows while keeping your transformation logic readable and maintainable. Master this approach, and you’ll handle the majority of conditional transformations you’ll encounter in production PySpark applications.