R dplyr - mutate() - Add/Modify Columns

Key Insights

mutate() creates new columns or modifies existing ones while preserving all rows, making it essential for feature engineering and data transformation pipelines
Supports vectorized operations, window functions, and conditional logic through integration with case_when(), if_else(), and across() for complex transformations
Combines seamlessly with group_by() to perform group-wise calculations, enabling aggregations that maintain row-level granularity

Basic Column Creation and Modification

The mutate() function from dplyr adds new variables or transforms existing ones in your data frame. Unlike base R’s approach of modifying columns with $ or [], mutate() keeps your data pipeline clean and readable.

library(dplyr)

# Sample sales data
sales <- data.frame(
  product = c("Laptop", "Mouse", "Keyboard", "Monitor"),
  price = c(1200, 25, 75, 350),
  quantity = c(5, 50, 30, 10),
  discount_pct = c(10, 5, 0, 15)
)

# Create new columns
sales_enhanced <- sales %>%
  mutate(
    revenue = price * quantity,
    discount_amount = price * (discount_pct / 100),
    final_price = price - discount_amount,
    total_revenue = final_price * quantity
  )

print(sales_enhanced)

Each expression in mutate() can reference columns created earlier in the same call. The total_revenue calculation uses final_price, which was just created in the previous line.

Modifying Existing Columns

You can overwrite existing columns by using the same column name. This is useful for data cleaning and standardization.

# Transform existing columns
sales_cleaned <- sales %>%
  mutate(
    product = toupper(product),           # Convert to uppercase
    price = round(price * 1.08, 2),       # Add 8% tax
    discount_pct = discount_pct / 100     # Convert to decimal
  )

# Type conversion
sales_typed <- sales %>%
  mutate(
    product = as.factor(product),
    quantity = as.integer(quantity),
    price = as.numeric(price)
  )

Conditional Mutations with if_else and case_when

For logic-based transformations, combine mutate() with if_else() for binary conditions or case_when() for multiple conditions.

# Binary condition with if_else
sales_categorized <- sales %>%
  mutate(
    price_tier = if_else(price > 100, "Premium", "Standard"),
    bulk_order = if_else(quantity >= 30, TRUE, FALSE)
  )

# Multiple conditions with case_when
sales_complex <- sales %>%
  mutate(
    product_category = case_when(
      price > 1000 ~ "High-End Electronics",
      price > 100 ~ "Mid-Range Electronics",
      price > 50 ~ "Accessories",
      TRUE ~ "Budget Items"
    ),
    discount_tier = case_when(
      discount_pct >= 15 ~ "Excellent Deal",
      discount_pct >= 10 ~ "Good Deal",
      discount_pct > 0 ~ "Fair Deal",
      TRUE ~ "No Discount"
    ),
    priority = case_when(
      quantity > 40 & price < 50 ~ "High Volume Low Value",
      quantity > 20 & price > 500 ~ "High Volume High Value",
      quantity <= 10 ~ "Low Volume",
      TRUE ~ "Standard"
    )
  )

The case_when() function evaluates conditions sequentially and returns the first match. Always include TRUE ~ default_value as the final condition to handle unmatched cases.

Group-Wise Mutations

Combining mutate() with group_by() performs calculations within groups while maintaining all rows. This differs from summarize(), which collapses groups.

# Group-wise calculations
sales_grouped <- sales %>%
  group_by(product_category = if_else(price > 100, "Electronics", "Accessories")) %>%
  mutate(
    avg_category_price = mean(price),
    price_vs_avg = price - avg_category_price,
    pct_of_category_total = quantity / sum(quantity) * 100,
    rank_in_category = rank(desc(price))
  ) %>%
  ungroup()

# Multiple grouping variables
monthly_sales <- data.frame(
  month = rep(c("Jan", "Feb", "Mar"), each = 4),
  region = rep(c("North", "South", "East", "West"), 3),
  revenue = runif(12, 1000, 5000)
)

monthly_analysis <- monthly_sales %>%
  group_by(month, region) %>%
  mutate(
    total_month_revenue = sum(revenue),
    region_contribution = revenue / total_month_revenue * 100
  ) %>%
  group_by(region) %>%
  mutate(
    avg_region_revenue = mean(revenue),
    vs_region_avg = revenue - avg_region_revenue
  ) %>%
  ungroup()

Window Functions and Ranking

Window functions operate on a set of rows related to the current row, perfect for time-series analysis and ranking.

# Time series data
time_series <- data.frame(
  date = seq.Date(as.Date("2024-01-01"), by = "day", length.out = 10),
  value = c(100, 105, 103, 110, 115, 112, 120, 125, 122, 130)
)

# Window function transformations
time_series_enhanced <- time_series %>%
  mutate(
    prev_value = lag(value, 1),
    next_value = lead(value, 1),
    change = value - prev_value,
    pct_change = (value - prev_value) / prev_value * 100,
    moving_avg_3 = (lag(value, 1) + value + lead(value, 1)) / 3,
    cumulative_sum = cumsum(value),
    row_number = row_number(),
    value_rank = min_rank(desc(value))
  )

Operating Across Multiple Columns

The across() function applies transformations to multiple columns simultaneously, reducing code repetition.

# Sample dataset with multiple numeric columns
metrics <- data.frame(
  product = c("A", "B", "C"),
  sales_q1 = c(1000, 1500, 1200),
  sales_q2 = c(1100, 1400, 1300),
  sales_q3 = c(1200, 1600, 1250),
  sales_q4 = c(1300, 1700, 1400)
)

# Apply function across columns
metrics_normalized <- metrics %>%
  mutate(
    across(starts_with("sales"), ~ . / 1000, .names = "{.col}_k"),
    across(starts_with("sales"), ~ round(., -2), .names = "{.col}_rounded")
  )

# Multiple transformations with across
metrics_summary <- metrics %>%
  rowwise() %>%
  mutate(
    total_sales = sum(c_across(starts_with("sales"))),
    avg_quarterly = mean(c_across(starts_with("sales"))),
    max_quarter = max(c_across(starts_with("sales")))
  ) %>%
  ungroup()

# Conditional across
metrics_conditional <- metrics %>%
  mutate(
    across(where(is.numeric), ~ if_else(. > 1200, "High", "Low"), .names = "{.col}_tier")
  )

Performance Considerations and Best Practices

When working with large datasets, consider these optimization strategies:

# Efficient: Single mutate call
efficient <- sales %>%
  mutate(
    col1 = price * 2,
    col2 = quantity + 10,
    col3 = col1 * col2
  )

# Less efficient: Multiple mutate calls
less_efficient <- sales %>%
  mutate(col1 = price * 2) %>%
  mutate(col2 = quantity + 10) %>%
  mutate(col3 = col1 * col2)

# Use mutate with data.table for very large datasets
library(dtplyr)
large_sales <- lazy_dt(sales) %>%
  mutate(
    revenue = price * quantity,
    profit_margin = revenue * 0.2
  ) %>%
  as_tibble()

Always use ungroup() after grouped operations to prevent unexpected behavior in subsequent operations. For complex transformations, break logic into intermediate variables within a single mutate() call rather than chaining multiple mutate() statements.

When dealing with missing values, combine mutate() with coalesce() or replace_na() to handle NAs explicitly rather than letting them propagate through calculations.