R dplyr - case_when() Examples
The `case_when()` function evaluates conditions from top to bottom, returning the right-hand side value when a condition evaluates to TRUE. Each condition follows the formula syntax: `condition ~...
Key Insights
case_when()provides vectorized if-else logic in dplyr, evaluating conditions sequentially and returning values when the first TRUE condition is met- Unlike nested
ifelse()statements,case_when()maintains readability with multiple conditions and automatically handles type consistency across all branches - The function integrates seamlessly with
mutate()for creating new columns and supports complex logical operations including NA handling with.defaultparameter
Basic Syntax and Simple Conditions
The case_when() function evaluates conditions from top to bottom, returning the right-hand side value when a condition evaluates to TRUE. Each condition follows the formula syntax: condition ~ result.
library(dplyr)
# Create sample dataset
employees <- tibble(
name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
salary = c(45000, 65000, 85000, 120000, 95000)
)
# Categorize salaries
employees %>%
mutate(salary_band = case_when(
salary < 50000 ~ "Entry Level",
salary < 80000 ~ "Mid Level",
salary < 100000 ~ "Senior",
salary >= 100000 ~ "Executive"
))
# A tibble: 5 × 3
name salary salary_band
<chr> <dbl> <chr>
1 Alice 45000 Entry Level
2 Bob 65000 Mid Level
3 Charlie 85000 Senior
4 Diana 120000 Executive
5 Eve 95000 Senior
The order matters. Once a condition evaluates to TRUE, case_when() returns that result and stops checking subsequent conditions for that row.
Multiple Conditions with Logical Operators
Combine multiple criteria using & (AND), | (OR), and ! (NOT) operators within each condition.
sales_data <- tibble(
product = c("Laptop", "Mouse", "Monitor", "Keyboard", "Tablet"),
price = c(1200, 25, 350, 75, 600),
units_sold = c(50, 500, 120, 300, 80),
category = c("Electronics", "Accessories", "Electronics", "Accessories", "Electronics")
)
sales_data %>%
mutate(
revenue = price * units_sold,
performance = case_when(
revenue > 50000 & category == "Electronics" ~ "High Performer",
revenue > 20000 & category == "Electronics" ~ "Good Performer",
revenue > 10000 & category == "Accessories" ~ "Strong Accessory",
revenue > 5000 ~ "Average",
TRUE ~ "Needs Improvement"
)
) %>%
select(product, revenue, performance)
# A tibble: 5 × 3
product revenue performance
<chr> <dbl> <chr>
1 Laptop 60000 High Performer
2 Mouse 12500 Strong Accessory
3 Monitor 42000 Good Performer
4 Keyboard 22500 Strong Accessory
5 Tablet 48000 Good Performer
The TRUE ~ "value" pattern serves as a catch-all else clause, matching any rows that didn’t meet previous conditions.
Handling Missing Values
case_when() requires explicit handling of NA values. By default, unmatched conditions return NA. Use is.na() checks or the .default parameter.
customer_data <- tibble(
customer_id = 1:6,
age = c(25, NA, 45, 67, 33, NA),
purchase_amount = c(150, 200, NA, 300, 75, 50)
)
# Explicit NA handling
customer_data %>%
mutate(
age_group = case_when(
is.na(age) ~ "Unknown",
age < 30 ~ "Young Adult",
age < 50 ~ "Middle Aged",
age >= 50 ~ "Senior"
),
purchase_category = case_when(
is.na(purchase_amount) ~ "No Purchase Data",
purchase_amount < 100 ~ "Low",
purchase_amount < 250 ~ "Medium",
.default = "High"
)
)
# A tibble: 6 × 5
customer_id age purchase_amount age_group purchase_category
<int> <dbl> <dbl> <chr> <chr>
1 1 25 150 Young Adult Medium
2 2 NA 200 Unknown Medium
3 3 45 NA Middle Aged No Purchase Data
4 4 67 300 Senior High
5 5 33 75 Middle Aged Low
6 6 NA 50 Unknown Low
The .default parameter (introduced in dplyr 1.1.0) provides a cleaner alternative to TRUE ~ for default values.
String Pattern Matching
Combine case_when() with string functions like str_detect(), str_starts(), or grepl() for pattern-based categorization.
library(stringr)
products <- tibble(
sku = c("ELEC-LAP-001", "FURN-DESK-045", "ELEC-PHONE-123",
"CLOTH-SHIRT-789", "FURN-CHAIR-056"),
description = c("Gaming Laptop", "Standing Desk", "Smartphone Pro",
"Cotton T-Shirt", "Ergonomic Chair")
)
products %>%
mutate(
department = case_when(
str_starts(sku, "ELEC") ~ "Electronics",
str_starts(sku, "FURN") ~ "Furniture",
str_starts(sku, "CLOTH") ~ "Clothing",
.default = "Other"
),
priority = case_when(
str_detect(description, "Pro|Gaming|Ergonomic") ~ "Premium",
str_detect(description, "Standing|Smart") ~ "High",
.default = "Standard"
)
)
# A tibble: 5 × 4
sku description department priority
<chr> <chr> <chr> <chr>
1 ELEC-LAP-001 Gaming Laptop Electronics Premium
2 FURN-DESK-045 Standing Desk Furniture High
3 ELEC-PHONE-123 Smartphone Pro Electronics Premium
4 CLOTH-SHIRT-789 Cotton T-Shirt Clothing Standard
5 FURN-CHAIR-056 Ergonomic Chair Furniture Premium
Date-Based Conditions
Apply temporal logic using date comparison operators and functions from the lubridate package.
library(lubridate)
transactions <- tibble(
transaction_id = 1:5,
date = as.Date(c("2024-01-15", "2024-03-22", "2024-06-10",
"2024-09-05", "2024-11-30")),
amount = c(500, 750, 1200, 300, 950)
)
transactions %>%
mutate(
quarter = case_when(
month(date) <= 3 ~ "Q1",
month(date) <= 6 ~ "Q2",
month(date) <= 9 ~ "Q3",
month(date) <= 12 ~ "Q4"
),
season = case_when(
month(date) %in% c(12, 1, 2) ~ "Winter",
month(date) %in% c(3, 4, 5) ~ "Spring",
month(date) %in% c(6, 7, 8) ~ "Summer",
month(date) %in% c(9, 10, 11) ~ "Fall"
),
days_ago = as.numeric(Sys.Date() - date),
recency = case_when(
days_ago <= 30 ~ "Recent",
days_ago <= 90 ~ "Moderate",
days_ago <= 180 ~ "Old",
.default = "Very Old"
)
)
Nested Calculations and Complex Logic
Reference other mutated columns within the same mutate() call or create intermediate calculations.
store_metrics <- tibble(
store_id = 1:5,
revenue = c(150000, 95000, 200000, 75000, 180000),
costs = c(120000, 85000, 150000, 70000, 140000),
customer_count = c(1200, 800, 1500, 600, 1400)
)
store_metrics %>%
mutate(
profit = revenue - costs,
profit_margin = (profit / revenue) * 100,
avg_customer_value = revenue / customer_count,
performance_tier = case_when(
profit_margin > 25 & avg_customer_value > 120 ~ "Tier 1: Excellent",
profit_margin > 20 & avg_customer_value > 100 ~ "Tier 2: Very Good",
profit_margin > 15 | avg_customer_value > 80 ~ "Tier 3: Good",
profit > 0 ~ "Tier 4: Profitable",
.default = "Tier 5: Needs Attention"
),
action_required = case_when(
performance_tier == "Tier 5: Needs Attention" ~ "Immediate Review",
performance_tier == "Tier 4: Profitable" ~ "Monitor Closely",
profit_margin < 20 ~ "Optimize Costs",
.default = "Continue Current Strategy"
)
) %>%
select(store_id, profit_margin, avg_customer_value, performance_tier, action_required)
# A tibble: 5 × 5
store_id profit_margin avg_customer_value performance_tier action_required
<int> <dbl> <dbl> <chr> <chr>
1 1 20 125 Tier 2: Very Good Continue Current Strat…
2 2 10.5 119 Tier 3: Good Optimize Costs
3 3 25 133. Tier 1: Excellent Continue Current Strat…
4 4 6.67 125 Tier 3: Good Optimize Costs
5 5 22.2 129. Tier 1: Excellent Continue Current Strat…
Performance Considerations
case_when() evaluates all conditions for every row, but stops at the first match. Order conditions from most specific to most general, and place frequently matched conditions first when possible.
# Less efficient: generic conditions first
mutate(category = case_when(
value > 0 ~ "Positive",
value > 100 ~ "High", # Never reached
value > 1000 ~ "Very High" # Never reached
))
# More efficient: specific conditions first
mutate(category = case_when(
value > 1000 ~ "Very High",
value > 100 ~ "High",
value > 0 ~ "Positive",
.default = "Non-positive"
))
For large datasets with many conditions, consider benchmarking against alternatives like lookup tables with left_join() when dealing with exact matches rather than conditional logic.