R dplyr - case_when() Examples | Application Architect

Key Insights

case_when() provides vectorized if-else logic in dplyr, evaluating conditions sequentially and returning values when the first TRUE condition is met
Unlike nested ifelse() statements, case_when() maintains readability with multiple conditions and automatically handles type consistency across all branches
The function integrates seamlessly with mutate() for creating new columns and supports complex logical operations including NA handling with .default parameter

Basic Syntax and Simple Conditions

The case_when() function evaluates conditions from top to bottom, returning the right-hand side value when a condition evaluates to TRUE. Each condition follows the formula syntax: condition ~ result.

library(dplyr)

# Create sample dataset
employees <- tibble(
  name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  salary = c(45000, 65000, 85000, 120000, 95000)
)

# Categorize salaries
employees %>%
  mutate(salary_band = case_when(
    salary < 50000 ~ "Entry Level",
    salary < 80000 ~ "Mid Level",
    salary < 100000 ~ "Senior",
    salary >= 100000 ~ "Executive"
  ))

# A tibble: 5 × 3
  name    salary salary_band
  <chr>    <dbl> <chr>      
1 Alice    45000 Entry Level
2 Bob      65000 Mid Level  
3 Charlie  85000 Senior     
4 Diana   120000 Executive  
5 Eve      95000 Senior

The order matters. Once a condition evaluates to TRUE, case_when() returns that result and stops checking subsequent conditions for that row.

Multiple Conditions with Logical Operators

Combine multiple criteria using & (AND), | (OR), and ! (NOT) operators within each condition.

sales_data <- tibble(
  product = c("Laptop", "Mouse", "Monitor", "Keyboard", "Tablet"),
  price = c(1200, 25, 350, 75, 600),
  units_sold = c(50, 500, 120, 300, 80),
  category = c("Electronics", "Accessories", "Electronics", "Accessories", "Electronics")
)

sales_data %>%
  mutate(
    revenue = price * units_sold,
    performance = case_when(
      revenue > 50000 & category == "Electronics" ~ "High Performer",
      revenue > 20000 & category == "Electronics" ~ "Good Performer",
      revenue > 10000 & category == "Accessories" ~ "Strong Accessory",
      revenue > 5000 ~ "Average",
      TRUE ~ "Needs Improvement"
    )
  ) %>%
  select(product, revenue, performance)

# A tibble: 5 × 3
  product  revenue performance      
  <chr>      <dbl> <chr>            
1 Laptop     60000 High Performer   
2 Mouse      12500 Strong Accessory 
3 Monitor    42000 Good Performer   
4 Keyboard   22500 Strong Accessory 
5 Tablet     48000 Good Performer

The TRUE ~ "value" pattern serves as a catch-all else clause, matching any rows that didn’t meet previous conditions.

Handling Missing Values

case_when() requires explicit handling of NA values. By default, unmatched conditions return NA. Use is.na() checks or the .default parameter.

customer_data <- tibble(
  customer_id = 1:6,
  age = c(25, NA, 45, 67, 33, NA),
  purchase_amount = c(150, 200, NA, 300, 75, 50)
)

# Explicit NA handling
customer_data %>%
  mutate(
    age_group = case_when(
      is.na(age) ~ "Unknown",
      age < 30 ~ "Young Adult",
      age < 50 ~ "Middle Aged",
      age >= 50 ~ "Senior"
    ),
    purchase_category = case_when(
      is.na(purchase_amount) ~ "No Purchase Data",
      purchase_amount < 100 ~ "Low",
      purchase_amount < 250 ~ "Medium",
      .default = "High"
    )
  )

# A tibble: 6 × 5
  customer_id   age purchase_amount age_group    purchase_category
        <int> <dbl>           <dbl> <chr>        <chr>            
1           1    25             150 Young Adult  Medium           
2           2    NA             200 Unknown      Medium           
3           3    45              NA Middle Aged  No Purchase Data 
4           4    67             300 Senior       High             
5           5    33              75 Middle Aged  Low              
6           6    NA              50 Unknown      Low

The .default parameter (introduced in dplyr 1.1.0) provides a cleaner alternative to TRUE ~ for default values.

String Pattern Matching

Combine case_when() with string functions like str_detect(), str_starts(), or grepl() for pattern-based categorization.

library(stringr)

products <- tibble(
  sku = c("ELEC-LAP-001", "FURN-DESK-045", "ELEC-PHONE-123", 
          "CLOTH-SHIRT-789", "FURN-CHAIR-056"),
  description = c("Gaming Laptop", "Standing Desk", "Smartphone Pro",
                  "Cotton T-Shirt", "Ergonomic Chair")
)

products %>%
  mutate(
    department = case_when(
      str_starts(sku, "ELEC") ~ "Electronics",
      str_starts(sku, "FURN") ~ "Furniture",
      str_starts(sku, "CLOTH") ~ "Clothing",
      .default = "Other"
    ),
    priority = case_when(
      str_detect(description, "Pro|Gaming|Ergonomic") ~ "Premium",
      str_detect(description, "Standing|Smart") ~ "High",
      .default = "Standard"
    )
  )

# A tibble: 5 × 4
  sku              description      department  priority
  <chr>            <chr>            <chr>       <chr>   
1 ELEC-LAP-001     Gaming Laptop    Electronics Premium 
2 FURN-DESK-045    Standing Desk    Furniture   High    
3 ELEC-PHONE-123   Smartphone Pro   Electronics Premium 
4 CLOTH-SHIRT-789  Cotton T-Shirt   Clothing    Standard
5 FURN-CHAIR-056   Ergonomic Chair  Furniture   Premium

Date-Based Conditions

Apply temporal logic using date comparison operators and functions from the lubridate package.

library(lubridate)

transactions <- tibble(
  transaction_id = 1:5,
  date = as.Date(c("2024-01-15", "2024-03-22", "2024-06-10", 
                   "2024-09-05", "2024-11-30")),
  amount = c(500, 750, 1200, 300, 950)
)

transactions %>%
  mutate(
    quarter = case_when(
      month(date) <= 3 ~ "Q1",
      month(date) <= 6 ~ "Q2",
      month(date) <= 9 ~ "Q3",
      month(date) <= 12 ~ "Q4"
    ),
    season = case_when(
      month(date) %in% c(12, 1, 2) ~ "Winter",
      month(date) %in% c(3, 4, 5) ~ "Spring",
      month(date) %in% c(6, 7, 8) ~ "Summer",
      month(date) %in% c(9, 10, 11) ~ "Fall"
    ),
    days_ago = as.numeric(Sys.Date() - date),
    recency = case_when(
      days_ago <= 30 ~ "Recent",
      days_ago <= 90 ~ "Moderate",
      days_ago <= 180 ~ "Old",
      .default = "Very Old"
    )
  )

Nested Calculations and Complex Logic

Reference other mutated columns within the same mutate() call or create intermediate calculations.

store_metrics <- tibble(
  store_id = 1:5,
  revenue = c(150000, 95000, 200000, 75000, 180000),
  costs = c(120000, 85000, 150000, 70000, 140000),
  customer_count = c(1200, 800, 1500, 600, 1400)
)

store_metrics %>%
  mutate(
    profit = revenue - costs,
    profit_margin = (profit / revenue) * 100,
    avg_customer_value = revenue / customer_count,
    
    performance_tier = case_when(
      profit_margin > 25 & avg_customer_value > 120 ~ "Tier 1: Excellent",
      profit_margin > 20 & avg_customer_value > 100 ~ "Tier 2: Very Good",
      profit_margin > 15 | avg_customer_value > 80 ~ "Tier 3: Good",
      profit > 0 ~ "Tier 4: Profitable",
      .default = "Tier 5: Needs Attention"
    ),
    
    action_required = case_when(
      performance_tier == "Tier 5: Needs Attention" ~ "Immediate Review",
      performance_tier == "Tier 4: Profitable" ~ "Monitor Closely",
      profit_margin < 20 ~ "Optimize Costs",
      .default = "Continue Current Strategy"
    )
  ) %>%
  select(store_id, profit_margin, avg_customer_value, performance_tier, action_required)

# A tibble: 5 × 5
  store_id profit_margin avg_customer_value performance_tier      action_required        
     <int>         <dbl>              <dbl> <chr>                 <chr>                  
1        1          20                 125  Tier 2: Very Good     Continue Current Strat…
2        2          10.5                119  Tier 3: Good          Optimize Costs         
3        3          25                 133. Tier 1: Excellent     Continue Current Strat…
4        4           6.67               125  Tier 3: Good          Optimize Costs         
5        5          22.2                129. Tier 1: Excellent     Continue Current Strat…

Performance Considerations

case_when() evaluates all conditions for every row, but stops at the first match. Order conditions from most specific to most general, and place frequently matched conditions first when possible.

# Less efficient: generic conditions first
mutate(category = case_when(
  value > 0 ~ "Positive",
  value > 100 ~ "High",      # Never reached
  value > 1000 ~ "Very High" # Never reached
))

# More efficient: specific conditions first
mutate(category = case_when(
  value > 1000 ~ "Very High",
  value > 100 ~ "High",
  value > 0 ~ "Positive",
  .default = "Non-positive"
))

For large datasets with many conditions, consider benchmarking against alternatives like lookup tables with left_join() when dealing with exact matches rather than conditional logic.