R dplyr - case_when() Examples

The `case_when()` function evaluates conditions from top to bottom, returning the right-hand side value when a condition evaluates to TRUE. Each condition follows the formula syntax: `condition ~...

Key Insights

  • case_when() provides vectorized if-else logic in dplyr, evaluating conditions sequentially and returning values when the first TRUE condition is met
  • Unlike nested ifelse() statements, case_when() maintains readability with multiple conditions and automatically handles type consistency across all branches
  • The function integrates seamlessly with mutate() for creating new columns and supports complex logical operations including NA handling with .default parameter

Basic Syntax and Simple Conditions

The case_when() function evaluates conditions from top to bottom, returning the right-hand side value when a condition evaluates to TRUE. Each condition follows the formula syntax: condition ~ result.

library(dplyr)

# Create sample dataset
employees <- tibble(
  name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  salary = c(45000, 65000, 85000, 120000, 95000)
)

# Categorize salaries
employees %>%
  mutate(salary_band = case_when(
    salary < 50000 ~ "Entry Level",
    salary < 80000 ~ "Mid Level",
    salary < 100000 ~ "Senior",
    salary >= 100000 ~ "Executive"
  ))
# A tibble: 5 × 3
  name    salary salary_band
  <chr>    <dbl> <chr>      
1 Alice    45000 Entry Level
2 Bob      65000 Mid Level  
3 Charlie  85000 Senior     
4 Diana   120000 Executive  
5 Eve      95000 Senior     

The order matters. Once a condition evaluates to TRUE, case_when() returns that result and stops checking subsequent conditions for that row.

Multiple Conditions with Logical Operators

Combine multiple criteria using & (AND), | (OR), and ! (NOT) operators within each condition.

sales_data <- tibble(
  product = c("Laptop", "Mouse", "Monitor", "Keyboard", "Tablet"),
  price = c(1200, 25, 350, 75, 600),
  units_sold = c(50, 500, 120, 300, 80),
  category = c("Electronics", "Accessories", "Electronics", "Accessories", "Electronics")
)

sales_data %>%
  mutate(
    revenue = price * units_sold,
    performance = case_when(
      revenue > 50000 & category == "Electronics" ~ "High Performer",
      revenue > 20000 & category == "Electronics" ~ "Good Performer",
      revenue > 10000 & category == "Accessories" ~ "Strong Accessory",
      revenue > 5000 ~ "Average",
      TRUE ~ "Needs Improvement"
    )
  ) %>%
  select(product, revenue, performance)
# A tibble: 5 × 3
  product  revenue performance      
  <chr>      <dbl> <chr>            
1 Laptop     60000 High Performer   
2 Mouse      12500 Strong Accessory 
3 Monitor    42000 Good Performer   
4 Keyboard   22500 Strong Accessory 
5 Tablet     48000 Good Performer   

The TRUE ~ "value" pattern serves as a catch-all else clause, matching any rows that didn’t meet previous conditions.

Handling Missing Values

case_when() requires explicit handling of NA values. By default, unmatched conditions return NA. Use is.na() checks or the .default parameter.

customer_data <- tibble(
  customer_id = 1:6,
  age = c(25, NA, 45, 67, 33, NA),
  purchase_amount = c(150, 200, NA, 300, 75, 50)
)

# Explicit NA handling
customer_data %>%
  mutate(
    age_group = case_when(
      is.na(age) ~ "Unknown",
      age < 30 ~ "Young Adult",
      age < 50 ~ "Middle Aged",
      age >= 50 ~ "Senior"
    ),
    purchase_category = case_when(
      is.na(purchase_amount) ~ "No Purchase Data",
      purchase_amount < 100 ~ "Low",
      purchase_amount < 250 ~ "Medium",
      .default = "High"
    )
  )
# A tibble: 6 × 5
  customer_id   age purchase_amount age_group    purchase_category
        <int> <dbl>           <dbl> <chr>        <chr>            
1           1    25             150 Young Adult  Medium           
2           2    NA             200 Unknown      Medium           
3           3    45              NA Middle Aged  No Purchase Data 
4           4    67             300 Senior       High             
5           5    33              75 Middle Aged  Low              
6           6    NA              50 Unknown      Low              

The .default parameter (introduced in dplyr 1.1.0) provides a cleaner alternative to TRUE ~ for default values.

String Pattern Matching

Combine case_when() with string functions like str_detect(), str_starts(), or grepl() for pattern-based categorization.

library(stringr)

products <- tibble(
  sku = c("ELEC-LAP-001", "FURN-DESK-045", "ELEC-PHONE-123", 
          "CLOTH-SHIRT-789", "FURN-CHAIR-056"),
  description = c("Gaming Laptop", "Standing Desk", "Smartphone Pro",
                  "Cotton T-Shirt", "Ergonomic Chair")
)

products %>%
  mutate(
    department = case_when(
      str_starts(sku, "ELEC") ~ "Electronics",
      str_starts(sku, "FURN") ~ "Furniture",
      str_starts(sku, "CLOTH") ~ "Clothing",
      .default = "Other"
    ),
    priority = case_when(
      str_detect(description, "Pro|Gaming|Ergonomic") ~ "Premium",
      str_detect(description, "Standing|Smart") ~ "High",
      .default = "Standard"
    )
  )
# A tibble: 5 × 4
  sku              description      department  priority
  <chr>            <chr>            <chr>       <chr>   
1 ELEC-LAP-001     Gaming Laptop    Electronics Premium 
2 FURN-DESK-045    Standing Desk    Furniture   High    
3 ELEC-PHONE-123   Smartphone Pro   Electronics Premium 
4 CLOTH-SHIRT-789  Cotton T-Shirt   Clothing    Standard
5 FURN-CHAIR-056   Ergonomic Chair  Furniture   Premium 

Date-Based Conditions

Apply temporal logic using date comparison operators and functions from the lubridate package.

library(lubridate)

transactions <- tibble(
  transaction_id = 1:5,
  date = as.Date(c("2024-01-15", "2024-03-22", "2024-06-10", 
                   "2024-09-05", "2024-11-30")),
  amount = c(500, 750, 1200, 300, 950)
)

transactions %>%
  mutate(
    quarter = case_when(
      month(date) <= 3 ~ "Q1",
      month(date) <= 6 ~ "Q2",
      month(date) <= 9 ~ "Q3",
      month(date) <= 12 ~ "Q4"
    ),
    season = case_when(
      month(date) %in% c(12, 1, 2) ~ "Winter",
      month(date) %in% c(3, 4, 5) ~ "Spring",
      month(date) %in% c(6, 7, 8) ~ "Summer",
      month(date) %in% c(9, 10, 11) ~ "Fall"
    ),
    days_ago = as.numeric(Sys.Date() - date),
    recency = case_when(
      days_ago <= 30 ~ "Recent",
      days_ago <= 90 ~ "Moderate",
      days_ago <= 180 ~ "Old",
      .default = "Very Old"
    )
  )

Nested Calculations and Complex Logic

Reference other mutated columns within the same mutate() call or create intermediate calculations.

store_metrics <- tibble(
  store_id = 1:5,
  revenue = c(150000, 95000, 200000, 75000, 180000),
  costs = c(120000, 85000, 150000, 70000, 140000),
  customer_count = c(1200, 800, 1500, 600, 1400)
)

store_metrics %>%
  mutate(
    profit = revenue - costs,
    profit_margin = (profit / revenue) * 100,
    avg_customer_value = revenue / customer_count,
    
    performance_tier = case_when(
      profit_margin > 25 & avg_customer_value > 120 ~ "Tier 1: Excellent",
      profit_margin > 20 & avg_customer_value > 100 ~ "Tier 2: Very Good",
      profit_margin > 15 | avg_customer_value > 80 ~ "Tier 3: Good",
      profit > 0 ~ "Tier 4: Profitable",
      .default = "Tier 5: Needs Attention"
    ),
    
    action_required = case_when(
      performance_tier == "Tier 5: Needs Attention" ~ "Immediate Review",
      performance_tier == "Tier 4: Profitable" ~ "Monitor Closely",
      profit_margin < 20 ~ "Optimize Costs",
      .default = "Continue Current Strategy"
    )
  ) %>%
  select(store_id, profit_margin, avg_customer_value, performance_tier, action_required)
# A tibble: 5 × 5
  store_id profit_margin avg_customer_value performance_tier      action_required        
     <int>         <dbl>              <dbl> <chr>                 <chr>                  
1        1          20                 125  Tier 2: Very Good     Continue Current Strat…
2        2          10.5                119  Tier 3: Good          Optimize Costs         
3        3          25                 133. Tier 1: Excellent     Continue Current Strat…
4        4           6.67               125  Tier 3: Good          Optimize Costs         
5        5          22.2                129. Tier 1: Excellent     Continue Current Strat…

Performance Considerations

case_when() evaluates all conditions for every row, but stops at the first match. Order conditions from most specific to most general, and place frequently matched conditions first when possible.

# Less efficient: generic conditions first
mutate(category = case_when(
  value > 0 ~ "Positive",
  value > 100 ~ "High",      # Never reached
  value > 1000 ~ "Very High" # Never reached
))

# More efficient: specific conditions first
mutate(category = case_when(
  value > 1000 ~ "Very High",
  value > 100 ~ "High",
  value > 0 ~ "Positive",
  .default = "Non-positive"
))

For large datasets with many conditions, consider benchmarking against alternatives like lookup tables with left_join() when dealing with exact matches rather than conditional logic.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.