R dplyr - select() Columns

The `select()` function from dplyr extracts columns from data frames using intuitive syntax. Unlike base R's bracket notation, `select()` returns a tibble and allows unquoted column names.

Key Insights

  • The select() function provides over a dozen helper functions for column selection including starts_with(), ends_with(), contains(), and matches() for pattern-based selection
  • Selection syntax supports ranges with :, exclusion with - or !, and the where() predicate for type-based filtering
  • Advanced techniques like all_of() and any_of() enable programmatic column selection from character vectors, critical for production code

Basic Column Selection

The select() function from dplyr extracts columns from data frames using intuitive syntax. Unlike base R’s bracket notation, select() returns a tibble and allows unquoted column names.

library(dplyr)

# Sample dataset
df <- data.frame(
  id = 1:5,
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age = c(25, 30, 35, 28, 32),
  salary = c(50000, 60000, 75000, 55000, 70000),
  department = c("IT", "HR", "IT", "Finance", "HR")
)

# Select single column
df %>% select(name)

# Select multiple columns
df %>% select(name, age, salary)

# Select by position
df %>% select(2, 3)

The positional selection works but breaks when column order changes. Use names for maintainable code.

Range and Exclusion Syntax

Select consecutive columns using the colon operator. Exclude columns with the minus sign or exclamation mark.

# Select range of columns
df %>% select(name:salary)

# Everything except specific columns
df %>% select(-id)
df %>% select(!id)

# Exclude multiple columns
df %>% select(-c(id, department))

# Exclude range
df %>% select(-(age:salary))

The everything() helper selects all remaining columns, useful for reordering:

# Move department to front
df %>% select(department, everything())

# Move id to end
df %>% select(-id, everything(), id)

Pattern Matching Helpers

Selection helpers filter columns based on name patterns. These functions work inside select() and accept regular expressions where applicable.

# Create dataset with patterned names
sales_data <- data.frame(
  id = 1:3,
  sales_2021_q1 = c(100, 150, 200),
  sales_2021_q2 = c(110, 160, 210),
  sales_2022_q1 = c(120, 170, 220),
  revenue_2021 = c(1000, 1500, 2000),
  revenue_2022 = c(1200, 1700, 2200),
  cost_total = c(500, 600, 700)
)

# Columns starting with prefix
sales_data %>% select(starts_with("sales"))

# Columns ending with suffix
sales_data %>% select(ends_with("q1"))

# Columns containing string
sales_data %>% select(contains("2021"))

# Multiple patterns
sales_data %>% select(starts_with("sales") | starts_with("revenue"))

For complex patterns, use matches() with regular expressions:

# Columns matching regex pattern
sales_data %>% select(matches("_\\d{4}_"))

# Match either sales or revenue with year
sales_data %>% select(matches("(sales|revenue)_2022"))

Type-Based Selection

The where() function filters columns by data type or custom predicates. This approach prevents hardcoding column names when structure varies.

# Select numeric columns only
df %>% select(where(is.numeric))

# Select character columns
df %>% select(where(is.character))

# Select factor columns
df %>% select(where(is.factor))

# Custom predicate: columns with no missing values
df_with_na <- df
df_with_na$salary[2] <- NA

df_with_na %>% select(where(~!any(is.na(.))))

# Numeric columns with mean > 1000
df %>% select(where(~is.numeric(.) && mean(.) > 1000))

Combine where() with other helpers:

# Numeric columns starting with 's'
df %>% select(starts_with("s") & where(is.numeric))

Programmatic Selection

Production code often requires dynamic column selection from variables. The all_of() and any_of() functions handle character vectors safely.

# Column names in a vector
cols_to_select <- c("name", "age")
df %>% select(all_of(cols_to_select))

# all_of() errors if column missing
cols_with_typo <- c("name", "ages")  # typo: ages
# df %>% select(all_of(cols_with_typo))  # Error

# any_of() silently ignores missing columns
df %>% select(any_of(cols_with_typo))  # Returns just 'name'

Use all_of() when columns must exist (fail fast). Use any_of() for optional columns:

# Configuration-driven selection
required_cols <- c("id", "name")
optional_cols <- c("email", "phone", "address")

df %>% select(all_of(required_cols), any_of(optional_cols))

Renaming During Selection

Rename columns inline using new_name = old_name syntax:

# Rename while selecting
df %>% select(
  identifier = id,
  employee_name = name,
  employee_age = age
)

# Rename with helpers
sales_data %>% select(
  id,
  q1_2021 = sales_2021_q1,
  q2_2021 = sales_2021_q2
)

For bulk renaming, combine select() with rename_with():

# Convert all names to uppercase
df %>% select(name, age, salary) %>%
  rename_with(toupper)

# Add prefix to selected columns
df %>% select(age, salary) %>%
  rename_with(~paste0("employee_", .))

Advanced Selection Techniques

Combine multiple selection methods using Boolean operators:

# AND: columns starting with 's' that are numeric
df %>% select(starts_with("s") & where(is.numeric))

# OR: columns starting with 's' or ending with 'e'
df %>% select(starts_with("s") | ends_with("e"))

# NOT: all columns except those containing 'a'
df %>% select(!contains("a"))

Use num_range() for numbered column sequences:

# Dataset with numbered columns
survey <- data.frame(
  id = 1:3,
  q1 = c(5, 4, 3),
  q2 = c(3, 4, 5),
  q3 = c(4, 3, 4),
  q4 = c(5, 5, 4)
)

# Select q1 through q3
survey %>% select(id, num_range("q", 1:3))

# With leading zeros
survey_padded <- setNames(survey, c("id", "q01", "q02", "q03", "q04"))
survey_padded %>% select(num_range("q", 1:3, width = 2))

Performance Considerations

Selection operations have minimal overhead, but patterns matter at scale:

library(microbenchmark)

large_df <- as.data.frame(matrix(rnorm(1000000), ncol = 100))
names(large_df) <- paste0("var_", 1:100)

microbenchmark(
  direct = large_df %>% select(var_1, var_2, var_3),
  range = large_df %>% select(var_1:var_3),
  helper = large_df %>% select(starts_with("var_")),
  times = 100
)

Direct column names perform fastest. Pattern matching adds negligible overhead for typical datasets but matters with thousands of columns.

Integration with Tidyverse Workflows

The select() function chains naturally with other dplyr verbs:

# Complete pipeline
df %>%
  select(name, age, salary, department) %>%
  filter(age > 28) %>%
  mutate(salary_k = salary / 1000) %>%
  group_by(department) %>%
  summarize(avg_salary = mean(salary_k))

# Select after grouping preserves groups
df %>%
  group_by(department) %>%
  select(name, age) %>%
  summarize(avg_age = mean(age))

For column selection with row filtering in one step, use select() before filter() to reduce intermediate data size:

# Efficient: select first
df %>%
  select(age, salary) %>%
  filter(age > 30)

# Less efficient: processes all columns
df %>%
  filter(age > 30) %>%
  select(age, salary)

The select() function provides declarative, readable column selection that scales from interactive analysis to production pipelines. Master the helper functions and programmatic selection for maintainable data transformation code.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.