PySpark - Map Column Values Using when/otherwise

Key Insights

  • PySpark’s when/otherwise provides a distributed-friendly alternative to pandas’ apply() or dictionary mapping, executing transformations in parallel across cluster nodes without serialization overhead
  • Chain multiple when() clauses to build complex conditional logic that remains readable and maintainable, with otherwise() serving as the final fallback case
  • Explicit null handling in your conditions prevents unexpected behavior—PySpark doesn’t treat null comparisons the same way as Python’s None

Introduction to Column Mapping in PySpark

When working with large-scale data in PySpark, you’ll frequently need to transform column values based on conditional logic. Whether you’re categorizing continuous variables, cleaning data inconsistencies, or engineering features for machine learning, conditional mapping is a fundamental operation.

Unlike pandas where you might reach for apply() with a lambda function or dictionary mapping with map(), PySpark requires a different approach. The when() and otherwise() functions from pyspark.sql.functions are specifically designed for distributed environments. They generate optimized execution plans that Spark can parallelize across your cluster, avoiding the serialization costs and performance penalties of user-defined functions (UDFs).

Let’s start with a sample dataset we’ll use throughout this article:

from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession.builder.appName("ColumnMapping").getOrCreate()

data = [
    (1, "John", 25, "premium", 1500),
    (2, "Sarah", 17, "standard", 300),
    (3, "Mike", 34, "premium", 2500),
    (4, "Emma", 45, "standard", 800),
    (5, "Alex", 16, None, 150),
    (6, "Lisa", 29, "premium", 500)
]

df = spark.createDataFrame(data, ["id", "name", "age", "customer_type", "purchase_amount"])
df.show()

Basic when/otherwise Syntax

The when() function evaluates a boolean condition and returns a specified value when true. The otherwise() function provides the default value when none of the conditions match. Think of it as PySpark’s equivalent to SQL’s CASE WHEN statement or Python’s ternary operator, but designed for DataFrame columns.

The basic structure is:

when(condition, value).otherwise(default_value)

Here’s a simple example that categorizes people as adults or minors:

from pyspark.sql.functions import when

df_with_category = df.withColumn(
    "age_category",
    when(col("age") >= 18, "adult")
    .otherwise("minor")
)

df_with_category.select("name", "age", "age_category").show()

Output:

+-----+---+------------+
| name|age|age_category|
+-----+---+------------+
| John| 25|       adult|
|Sarah| 17|       minor|
| Mike| 34|       adult|
| Emma| 45|       adult|
| Alex| 16|       minor|
| Lisa| 29|       adult|
+-----+---+------------+

Notice we use col("age") to reference the column. You can also use the DataFrame’s column accessor with df.age or df["age"], but col() is more explicit and works consistently across different contexts.

Chaining Multiple Conditions

Real-world scenarios rarely involve simple binary logic. You’ll often need to evaluate multiple conditions in sequence. Chain multiple when() calls together, with each subsequent condition evaluated only if the previous ones were false.

Let’s map numeric scores to letter grades:

# First, let's add a score column to our data
scores_data = [
    (1, "John", 92),
    (2, "Sarah", 87),
    (3, "Mike", 76),
    (4, "Emma", 64),
    (5, "Alex", 45),
    (6, "Lisa", 98)
]

scores_df = spark.createDataFrame(scores_data, ["id", "name", "score"])

graded_df = scores_df.withColumn(
    "grade",
    when(col("score") >= 90, "A")
    .when(col("score") >= 80, "B")
    .when(col("score") >= 70, "C")
    .when(col("score") >= 60, "D")
    .otherwise("F")
)

graded_df.show()

Output:

+---+-----+-----+-----+
| id| name|score|grade|
+---+-----+-----+-----+
|  1| John|   92|    A|
|  2|Sarah|   87|    B|
|  3| Mike|   76|    C|
|  4| Emma|   64|    D|
|  5| Alex|   45|    F|
|  6| Lisa|   98|    A|
+---+-----+-----+-----+

The order matters here. Conditions are evaluated top-to-bottom, and the first matching condition wins. If we reversed the order, every score above 60 would get a “D” grade because that condition would match first.

Working with Multiple Columns

Most business logic involves evaluating multiple columns together. PySpark’s column expressions let you combine conditions using standard boolean operators: & (and), | (or), and ~ (not).

Important: Always wrap individual conditions in parentheses when using boolean operators. Python’s operator precedence can cause unexpected behavior otherwise.

Let’s determine discount eligibility based on customer type and purchase amount:

df_with_discount = df.withColumn(
    "discount_eligible",
    when(
        (col("customer_type") == "premium") & (col("purchase_amount") >= 1000),
        "20% discount"
    )
    .when(
        (col("customer_type") == "premium") & (col("purchase_amount") >= 500),
        "10% discount"
    )
    .when(
        (col("customer_type") == "standard") & (col("purchase_amount") >= 1000),
        "5% discount"
    )
    .otherwise("no discount")
)

df_with_discount.select("name", "customer_type", "purchase_amount", "discount_eligible").show()

Output:

+-----+-------------+---------------+-----------------+
| name|customer_type|purchase_amount|discount_eligible|
+-----+-------------+---------------+-----------------+
| John|      premium|           1500|     20% discount|
|Sarah|     standard|            300|      no discount|
| Mike|      premium|           2500|     20% discount|
| Emma|     standard|            800|      no discount|
| Alex|         null|            150|      no discount|
| Lisa|      premium|            500|     10% discount|
+-----+-------------+---------------+-----------------+

You can also use more complex column expressions, including mathematical operations:

df_with_value_score = df.withColumn(
    "value_segment",
    when(col("purchase_amount") / col("age") > 50, "high value")
    .when(col("purchase_amount") / col("age") > 20, "medium value")
    .otherwise("low value")
)

Common Patterns and Best Practices

Handling Null Values

Null handling is critical in PySpark. Unlike Python where None == None is True, SQL-style null comparisons always return null, not true or false. This means col("column") == None won’t work as expected.

Use isNull() and isNotNull() methods explicitly:

df_with_null_handling = df.withColumn(
    "customer_status",
    when(col("customer_type").isNull(), "unknown")
    .when(col("customer_type") == "premium", "VIP")
    .otherwise("regular")
)

df_with_null_handling.select("name", "customer_type", "customer_status").show()

Output:

+-----+-------------+---------------+
| name|customer_type|customer_status|
+-----+-------------+---------------+
| John|      premium|            VIP|
|Sarah|     standard|        regular|
| Mike|      premium|            VIP|
| Emma|     standard|        regular|
| Alex|         null|        unknown|
| Lisa|      premium|            VIP|
+-----+-------------+---------------+

Using isin() for Multiple Values

When checking if a column matches any value in a list, use isin() instead of chaining multiple OR conditions:

# Less readable
when((col("status") == "active") | (col("status") == "pending") | (col("status") == "trial"), "engaged")

# Better
when(col("status").isin(["active", "pending", "trial"]), "engaged")

Performance Considerations

when/otherwise compiles to Catalyst expressions that Spark's optimizer can analyze and execute as native JVM operations. This makes it significantly faster than Python UDFs, which require serializing every row between the JVM and Python worker processes.

Use when/otherwise when:

  • Your logic can be expressed with column operations
  • You’re working with standard data types
  • Performance is critical

Consider UDFs only when:

  • You need Python libraries with no built-in Spark equivalent (note that common tasks like regex matching and date parsing are already covered by built-in functions such as regexp_extract and to_date)
  • The logic is too complex for column expressions
  • You’re willing to accept the performance trade-off

Real-World Use Case: Customer Segmentation

Let’s build a complete customer segmentation system that combines multiple business rules:

from pyspark.sql.functions import when, col

# Extended customer data
customer_data = [
    (1, "John Doe", "2020-01-15", 15, 12500, "premium", True),
    (2, "Sarah Smith", "2023-06-20", 2, 450, "standard", False),
    (3, "Mike Johnson", "2019-03-10", 24, 28000, "premium", True),
    (4, "Emma Wilson", "2022-11-05", 8, 3200, "standard", True),
    (5, "Alex Brown", None, 0, 0, None, False),
    (6, "Lisa Davis", "2021-08-22", 18, 8900, "premium", False)
]

customer_df = spark.createDataFrame(
    customer_data,
    ["customer_id", "name", "signup_date", "months_active", "lifetime_value", "tier", "email_verified"]
)

# Build comprehensive segmentation
segmented_df = customer_df.withColumn(
    "segment",
    when(
        col("signup_date").isNull() | ~col("email_verified"),
        "inactive"
    )
    .when(
        (col("tier") == "premium") & 
        (col("lifetime_value") >= 20000) & 
        (col("months_active") >= 12),
        "champion"
    )
    .when(
        (col("tier") == "premium") & 
        (col("lifetime_value") >= 10000),
        "loyal"
    )
    .when(
        (col("months_active") >= 12) & 
        (col("lifetime_value") >= 5000),
        "potential_loyalist"
    )
    .when(
        col("months_active") <= 3,
        "new"
    )
    .otherwise("standard")
).withColumn(
    "priority",
    when(col("segment").isin(["champion", "loyal"]), "high")
    .when(col("segment").isin(["potential_loyalist", "new"]), "medium")
    .otherwise("low")
)

segmented_df.select("name", "tier", "months_active", "lifetime_value", "segment", "priority").show(truncate=False)

Output:

+------------+--------+-------------+--------------+-------------------+--------+
|name        |tier    |months_active|lifetime_value|segment            |priority|
+------------+--------+-------------+--------------+-------------------+--------+
|John Doe    |premium |15           |12500         |loyal              |high    |
|Sarah Smith |standard|2            |450           |new                |medium  |
|Mike Johnson|premium |24           |28000         |champion           |high    |
|Emma Wilson |standard|8            |3200          |standard           |low     |
|Alex Brown  |null    |0            |0             |inactive           |low     |
|Lisa Davis  |premium |18           |8900          |potential_loyalist |medium  |
+------------+--------+-------------+--------------+-------------------+--------+

This segmentation system demonstrates several best practices: explicit null handling at the top of the condition chain, combining multiple column conditions with clear business logic, and building derived columns based on previously calculated segments.

The when/otherwise pattern scales efficiently to billions of rows while keeping your transformation logic readable and maintainable. Master this approach, and you’ll handle the majority of conditional transformations you’ll encounter in production PySpark applications.
