R dplyr - across() - Apply Function Across Columns

Key Insights

across() replaces older scoped verbs like mutate_at(), summarise_if(), and mutate_all(), providing a unified interface for applying functions to multiple columns simultaneously
Combines with where(), starts_with(), ends_with(), and other tidyselect helpers to target columns by type, name pattern, or custom conditions
Supports anonymous functions, purrr-style lambda syntax, and named lists for creating multiple transformations with automatic column naming

Basic across() Syntax

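The across() function works inside dplyr verbs such as mutate() and summarise(). For filter(), dplyr provides the companion helpers if_any() and if_all(); calling across() inside filter() is deprecated. Its basic structure takes a column selection and a function to apply: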

library(dplyr)

# Sample dataset
df <- data.frame(
  id = 1:5,
  height = c(170, 165, 180, 175, 168),
  weight = c(70, 65, 85, 80, 72),
  age = c(25, 30, 35, 28, 32)
)

# Apply function to specific columns
df %>%
  mutate(across(c(height, weight), ~ .x / 100))

The tilde ~ creates an anonymous function in which .x stands for the current column. Here both columns are divided by 100: height goes from centimeters to meters, while the weight result (hundreds of kilograms) is only there to show the mechanics.
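For row filtering across several columns, the if_any() and if_all() helpers take the same selection-plus-function arguments as across(). The following is a minimal sketch; the weight values are altered from the sample dataset (an assumption made here) so that the two helpers give different results.

```r
library(dplyr)

# Illustrative data: weights differ from the sample dataset on purpose
df_filter <- data.frame(
  id = 1:5,
  height = c(170, 165, 180, 175, 168),
  weight = c(70, 80, 85, 60, 72)
)

# Keep rows where BOTH columns exceed their own column mean
both_above <- df_filter %>%
  filter(if_all(c(height, weight), ~ .x > mean(.x)))

# Keep rows where AT LEAST ONE column exceeds its own column mean
any_above <- df_filter %>%
  filter(if_any(c(height, weight), ~ .x > mean(.x)))

both_above$id
any_above$id
```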

Column Selection Methods

across() integrates with tidyselect helpers for flexible column targeting:

# Select by type
df %>%
  mutate(across(where(is.numeric), round))

# Select by name pattern
df %>%
  mutate(across(starts_with("h"), log))

# Select by name ending
df %>%
  mutate(across(ends_with("ght"), sqrt))

# Select by name containing pattern
df %>%
  mutate(across(contains("ei"), as.integer))

# Combine selectors
df %>%
  mutate(across(where(is.numeric) & !id, scale))

# Everything except specific columns
df %>%
  mutate(across(!c(id, age), ~ .x * 2))

The where() helper evaluates a predicate function on each column and selects the columns for which it returns TRUE. This enables type-based selection without hardcoding column names.
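The predicate passed to where() does not have to be a built-in type check. As a sketch, the hypothetical helper is_large_numeric() below (not part of dplyr) selects numeric columns whose mean exceeds 50, which in the sample data picks height and weight but not id or age.

```r
library(dplyr)

df <- data.frame(
  id = 1:5,
  height = c(170, 165, 180, 175, 168),
  weight = c(70, 65, 85, 80, 72),
  age = c(25, 30, 35, 28, 32)
)

# Hypothetical predicate: must return a single TRUE or FALSE per column
is_large_numeric <- function(x) {
  is.numeric(x) && mean(x, na.rm = TRUE) > 50
}

# Only columns that satisfy the predicate are rescaled
rescaled <- df %>%
  mutate(across(where(is_large_numeric), ~ .x / 100))

rescaled
```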

Multiple Functions with Named Lists

Apply multiple transformations simultaneously using named lists. The names become suffixes for new columns:

df %>%
  summarise(across(
    c(height, weight),
    list(
      mean = mean,
      sd = sd,
      median = median
    ),
    .names = "{.col}_{.fn}"
  ))

Output structure:

  height_mean height_sd height_median weight_mean weight_sd weight_median
1       171.6   5.94138           170        74.4  8.018728            72

The .names argument controls output naming. Available glue specifications:

{.col}: original column name
{.fn}: function name from list
Custom text and separators

# Custom naming pattern
df %>%
  summarise(across(
    where(is.numeric),
    list(avg = mean, total = sum),
    .names = "{.fn}_of_{.col}"
  ))
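Entries in the named list can also be anonymous functions, which is one way to pass extra arguments such as na.rm. A small sketch, assuming a made-up data frame with missing values (the n_missing name is chosen here for illustration):

```r
library(dplyr)

df_stats <- data.frame(
  x = c(1, 2, NA, 4, 5),
  y = c(NA, 2, 3, 4, NA)
)

# Each list name ends up in the {.fn} slot of the output column name
stats <- df_stats %>%
  summarise(across(
    everything(),
    list(
      mean = \(v) mean(v, na.rm = TRUE),
      n_missing = \(v) sum(is.na(v))
    ),
    .names = "{.col}_{.fn}"
  ))

names(stats)
```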

Anonymous Functions and Lambda Syntax

Three equivalent ways to define functions within across():

# Standard anonymous function
df %>%
  mutate(across(c(height, weight), function(x) x - mean(x)))

# Formula syntax (purrr-style)
df %>%
  mutate(across(c(height, weight), ~ .x - mean(.x)))

# Lambda shorthand (requires R 4.1 or later)
df %>%
  mutate(across(c(height, weight), \(x) x - mean(x)))

The ~ formula and \(x) forms are the most concise for one-line operations. For longer logic, define a named function and pass it by name:

normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

df %>%
  mutate(across(where(is.numeric), normalize))
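One way to confirm that the three function styles behave the same is to compare their results directly. The object names below are illustrative only.

```r
library(dplyr)

df <- data.frame(
  id = 1:5,
  height = c(170, 165, 180, 175, 168),
  weight = c(70, 65, 85, 80, 72)
)

# Same centering operation written three ways
centered_function <- df %>%
  mutate(across(c(height, weight), function(x) x - mean(x)))

centered_formula <- df %>%
  mutate(across(c(height, weight), ~ .x - mean(.x)))

centered_lambda <- df %>%
  mutate(across(c(height, weight), \(x) x - mean(x)))

# Both comparisons should report TRUE
all.equal(centered_function, centered_formula)
all.equal(centered_formula, centered_lambda)
```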

Grouped Operations

across() respects grouping from group_by(), applying functions within each group:

# Extended dataset with groups
df_grouped <- data.frame(
  category = rep(c("A", "B"), each = 5),
  value1 = c(10, 15, 20, 25, 30, 12, 18, 22, 28, 35),
  value2 = c(100, 150, 200, 250, 300, 120, 180, 220, 280, 350)
)

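# Group-wise standardization
# scale() returns a one-column matrix, so as.numeric() flattens it to a plain vector
df_grouped %>%
  group_by(category) %>%
  mutate(across(starts_with("value"), ~ as.numeric(scale(.x)))) %>%
  ungroup()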

# Group-wise summaries
df_grouped %>%
  group_by(category) %>%
  summarise(across(
    starts_with("value"),
    list(mean = mean, sum = sum, n = length)
  ))
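A related detail: inside a grouped mutate() or summarise(), across() does not select the grouping columns, so everything() is safe to use. A brief sketch using the same grouped data:

```r
library(dplyr)

df_grouped <- data.frame(
  category = rep(c("A", "B"), each = 5),
  value1 = c(10, 15, 20, 25, 30, 12, 18, 22, 28, 35),
  value2 = c(100, 150, 200, 250, 300, 120, 180, 220, 280, 350)
)

# everything() covers value1 and value2 only; category is the grouping key
group_means <- df_grouped %>%
  group_by(category) %>%
  summarise(across(everything(), mean))

group_means
```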

Conditional Transformations with if_else

Combine across() with conditional logic for sophisticated transformations:

df %>%
  mutate(across(
    where(is.numeric),
    ~ if_else(.x > mean(.x), "above", "below")
  ))

# Multiple conditions
df %>%
  mutate(across(
    c(height, weight),
    ~ case_when(
      .x < quantile(.x, 0.25) ~ "low",
      .x > quantile(.x, 0.75) ~ "high",
      TRUE ~ "medium"
    )
  ))
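The examples above turn numeric columns into character labels. When the result should stay numeric, both branches of if_else() need a compatible type, for example NA_real_ rather than a bare string. A minimal sketch, with the "_above_mean" suffix chosen here for illustration:

```r
library(dplyr)

df <- data.frame(
  id = 1:5,
  height = c(170, 165, 180, 175, 168),
  weight = c(70, 65, 85, 80, 72)
)

# Keep values above the column mean, blank out the rest with a numeric NA
flagged <- df %>%
  mutate(across(
    c(height, weight),
    ~ if_else(.x > mean(.x), .x, NA_real_),
    .names = "{.col}_above_mean"
  ))

flagged
```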

Working with Missing Values

Handle NA values during transformation:

# Data with missing values
df_na <- data.frame(
  x = c(1, 2, NA, 4, 5),
  y = c(NA, 2, 3, 4, NA),
  z = c(1, NA, 3, NA, 5)
)

# Impute NA with the column mean (computed with NA values excluded)
df_na %>%
  mutate(across(
    everything(),
    ~ if_else(is.na(.x), mean(.x, na.rm = TRUE), .x)
  ))

# Count missing values
df_na %>%
  summarise(across(
    everything(),
    ~ sum(is.na(.x))
  ))

# Replace NA with a specific value (replace_na() comes from tidyr, not dplyr)
df_na %>%
  mutate(across(everything(), ~ tidyr::replace_na(.x, 0)))
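Since replace_na() lives in tidyr, a dplyr-only alternative for a fixed fill value is coalesce(), which returns the first non-missing value at each position. A short sketch:

```r
library(dplyr)

df_na <- data.frame(
  x = c(1, 2, NA, 4, 5),
  y = c(NA, 2, 3, 4, NA),
  z = c(1, NA, 3, NA, 5)
)

# coalesce(.x, 0) keeps existing values and fills NA positions with 0
filled <- df_na %>%
  mutate(across(everything(), ~ coalesce(.x, 0)))

filled
```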

Type Conversions

across() excels at batch type conversions:

df_mixed <- data.frame(
  id = c("1", "2", "3"),
  value = c("10.5", "20.3", "15.7"),
  flag = c("TRUE", "FALSE", "TRUE"),
  date = c("2024-01-01", "2024-01-02", "2024-01-03")
)

df_mixed %>%
  mutate(
    across(c(id), as.integer),
    across(c(value), as.numeric),
    across(c(flag), as.logical),
    across(c(date), as.Date)
  )

# Convert all character columns to factors
df_mixed %>%
  mutate(across(where(is.character), as.factor))
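When the columns to convert are stored as a character vector, for instance read from a configuration file, all_of() turns those names into a selection. The numeric_cols vector below is a hypothetical example of such a list.

```r
library(dplyr)

df_mixed <- data.frame(
  id = c("1", "2", "3"),
  value = c("10.5", "20.3", "15.7"),
  flag = c("TRUE", "FALSE", "TRUE")
)

# Hypothetical list of column names that should become numeric
numeric_cols <- c("id", "value")

# all_of() errors if a listed column is missing, which guards against typos
converted <- df_mixed %>%
  mutate(across(all_of(numeric_cols), as.numeric))

str(converted)
```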

Performance Considerations

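A single across() call keeps a multi-column transformation in one place. Whether it is also faster than separate mutate() calls depends on the data and the dplyr version, so measure both (microbenchmark is a separate package that must be installed):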

library(microbenchmark)

# Single across() call
method1 <- function(df) {
  df %>% mutate(across(c(height, weight, age), scale))
}

# Separate mutate() calls, one per column
method2 <- function(df) {
  df %>%
    mutate(height = scale(height)) %>%
    mutate(weight = scale(weight)) %>%
    mutate(age = scale(age))
}

microbenchmark(
  method1(df),
  method2(df),
  times = 1000
)

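Treat the timings as specific to your data. Chaining several mutate() calls adds some per-call overhead, but across() has evaluation overhead of its own, so the difference is often small.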
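Before comparing speed, it helps to confirm that two variants produce the same result. The sketch below does that and then takes a rough timing with base R's system.time(), so no extra package is needed. The standardize() helper and the function names are assumptions made for this example, and 200 repetitions is an arbitrary choice.

```r
library(dplyr)

df <- data.frame(
  id = 1:5,
  height = c(170, 165, 180, 175, 168),
  weight = c(70, 65, 85, 80, 72),
  age = c(25, 30, 35, 28, 32)
)

# Hypothetical helper: scale() returns a matrix, as.numeric() flattens it
standardize <- function(x) as.numeric(scale(x))

with_across <- function(data) {
  data %>% mutate(across(c(height, weight, age), standardize))
}

with_separate <- function(data) {
  data %>%
    mutate(height = standardize(height)) %>%
    mutate(weight = standardize(weight)) %>%
    mutate(age = standardize(age))
}

# Check correctness first
same_result <- isTRUE(all.equal(with_across(df), with_separate(df)))
same_result

# Rough timing; absolute numbers vary by machine and dplyr version
time_across <- system.time(for (i in 1:200) with_across(df))
time_separate <- system.time(for (i in 1:200) with_separate(df))
```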

Common Patterns

Round numeric columns to specific decimals:

df %>%
  mutate(across(where(is.numeric), ~ round(.x, 2)))

Create z-scores for analysis:

df %>%
  mutate(across(
    where(is.numeric),
    ~ (.x - mean(.x)) / sd(.x),
    .names = "{.col}_zscore"
  ))

Log-transform skewed distributions:

df %>%
  mutate(across(
    c(height, weight),
    ~ log1p(.x),
    .names = "log_{.col}"
  ))
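Patterns like these can be wrapped in a reusable function by forwarding the column selection with the embrace operator {{ }}. The add_zscores() function below is a hypothetical helper written for this sketch, not part of dplyr.

```r
library(dplyr)

df <- data.frame(
  id = 1:5,
  height = c(170, 165, 180, 175, 168),
  weight = c(70, 65, 85, 80, 72)
)

# Hypothetical wrapper: cols accepts any tidyselect expression
add_zscores <- function(data, cols) {
  data %>%
    mutate(across(
      {{ cols }},
      ~ (.x - mean(.x)) / sd(.x),
      .names = "{.col}_zscore"
    ))
}

scored <- add_zscores(df, c(height, weight))
names(scored)
```

The caller writes the selection exactly as it would inside across(), for example add_zscores(df, where(is.double)).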

The across() function consolidates column-wise operations into readable, maintainable code. It removes the need for loops or repeated near-identical statements, which keeps data transformation pipelines shorter and more expressive.