R dplyr - select() Helpers (starts_with, ends_with, contains)

Key Insights

• The select() function in dplyr offers helper functions that match column names by patterns, eliminating tedious manual column specification and reducing errors in data manipulation workflows • Helper functions like starts_with(), ends_with(), contains(), and matches() work with string patterns while everything(), last_col(), and where() provide structural selection capabilities • Combining multiple helpers with set operations (&, |, !) enables complex column selection logic that adapts dynamically to changing datasets

Understanding select() Helpers

The select() function in dplyr provides a clean interface for choosing columns from data frames. While you can specify columns by name, select helpers become essential when working with datasets containing dozens or hundreds of columns. These helpers match patterns in column names, saving time and making code more maintainable.

library(dplyr)

# Sample dataset with various column naming patterns
df <- data.frame(
  customer_id = 1:5,
  customer_name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  customer_age = c(25, 30, 35, 28, 42),
  order_date = as.Date("2024-01-01") + 0:4,
  order_total = c(150.50, 200.00, 175.25, 300.00, 125.75),
  order_status = c("shipped", "pending", "shipped", "delivered", "pending"),
  total_purchases = c(5, 3, 8, 12, 2),
  total_revenue = c(750, 600, 1400, 3600, 250)
)

starts_with() - Prefix Matching

The starts_with() helper selects columns beginning with a specific string. This is particularly useful for datasets with consistent naming conventions where related columns share a common prefix.

# Select all customer-related columns
customer_cols <- df %>%
  select(starts_with("customer"))

print(customer_cols)
#   customer_id customer_name customer_age
# 1           1         Alice           25
# 2           2           Bob           30
# 3           3       Charlie           35
# 4           4         Diana           28
# 5           5           Eve           42

# Select multiple prefixes
transaction_data <- df %>%
  select(starts_with("order"), starts_with("total"))

# Case-insensitive matching
df %>%
  select(starts_with("CUSTOMER", ignore.case = TRUE))

The ignore.case parameter proves valuable when working with data from multiple sources where naming conventions may vary in capitalization.

ends_with() - Suffix Matching

The ends_with() helper identifies columns by their suffix, commonly used for selecting columns with shared data types or measurement units.

# Select all columns ending with specific suffixes
df %>%
  select(ends_with("_id"))
#   customer_id

# Combine with other selections
df %>%
  select(customer_name, ends_with("status"), ends_with("date"))
#   customer_name order_status order_date
# 1         Alice      shipped 2024-01-01
# 2           Bob      pending 2024-01-02
# 3       Charlie      shipped 2024-01-03
# 4         Diana    delivered 2024-01-04
# 5           Eve      pending 2024-01-05

# Multiple suffix patterns
metrics <- df %>%
  select(ends_with("_revenue") | ends_with("_total"))

contains() - Substring Matching

The contains() helper matches any column name containing a specified substring, regardless of position. This provides maximum flexibility for pattern matching.

# Select columns containing "order"
df %>%
  select(contains("order"))
#   order_date order_total order_status
# 1 2024-01-01      150.50      shipped
# 2 2024-01-02      200.00      pending
# 3 2024-01-03      175.25      shipped
# 4 2024-01-04      300.00    delivered
# 5 2024-01-05      125.75      pending

# Find all numeric-related columns by pattern
df %>%
  select(contains("total") | contains("age"))
#   customer_age order_total total_purchases total_revenue
# 1           25      150.50               5           750
# 2           30      200.00               3           600
# 3           35      175.25               8          1400
# 4           28      300.00              12          3600
# 5           42      125.75               2           250

matches() - Regular Expression Matching

For advanced pattern matching, matches() accepts regular expressions, enabling sophisticated column selection logic.

# Select columns with underscore followed by any word
df %>%
  select(matches("_\\w+"))

# Columns ending with numbers or specific patterns
df %>%
  select(matches("(id|date|status)$"))
#   customer_id order_date order_status
# 1           1 2024-01-01      shipped
# 2           2 2024-01-02      pending
# 3           3 2024-01-03      shipped
# 4           4 2024-01-04    delivered
# 5           5 2024-01-05      pending

# Complex pattern: columns starting with "order" or containing "total"
df %>%
  select(matches("^order|total"))

everything() and Positional Helpers

The everything() helper selects all remaining columns, useful for reordering columns while keeping all data.

# Move specific columns to front
df %>%
  select(customer_id, customer_name, everything())

# Move to end
df %>%
  select(-order_status, everything(), order_status)

# Use last_col() for relative positioning
df %>%
  select(last_col(), everything())

# Select range and add more
df %>%
  select(customer_id:customer_age, starts_with("order"))

where() - Type-Based Selection

The where() helper selects columns based on a predicate function, typically used for selecting by data type.

# Select all numeric columns
numeric_cols <- df %>%
  select(where(is.numeric))

print(numeric_cols)
#   customer_id customer_age order_total total_purchases total_revenue
# 1           1           25      150.50               5           750
# 2           2           30      200.00               3           600
# 3           3           35      175.25               8          1400
# 4           4           28      300.00              12          3600
# 5           5           42      125.75               2           250

# Select character columns
df %>%
  select(where(is.character))

# Custom predicate function
has_many_unique <- function(x) length(unique(x)) > 4

df %>%
  select(where(has_many_unique))

Combining Helpers with Set Operations

Select helpers support Boolean operations for complex selection logic.

# AND operation: columns starting with "customer" AND ending with "id"
df %>%
  select(starts_with("customer") & ends_with("id"))
#   customer_id

# OR operation: multiple patterns
df %>%
  select(starts_with("customer") | starts_with("order"))

# NOT operation: exclude patterns
df %>%
  select(!contains("total"))
#   customer_id customer_name customer_age order_date order_status

# Complex combinations
df %>%
  select(
    (starts_with("order") | starts_with("total")) & 
    !ends_with("status")
  )
#   order_date order_total total_purchases total_revenue

Practical Applications

Real-world scenarios often require combining multiple selection strategies.

# Data cleaning: remove ID columns and select metrics
analysis_df <- df %>%
  select(
    !ends_with("_id"),
    where(is.numeric),
    order_status
  )

# Reporting: select specific column groups
report_cols <- df %>%
  select(
    customer_id,
    customer_name,
    starts_with("order"),
    -contains("status")
  )

# Dynamic selection for reshaping
wide_to_long_prep <- df %>%
  select(
    customer_id,
    matches("^(order|total)_")
  )

# Exclude temporary or computed columns
clean_export <- df %>%
  select(
    !matches("^(temp_|calc_|tmp_)")
  )

Performance Considerations

Select helpers operate on column names, not data, making them extremely fast even on large datasets. However, combining with where() requires evaluating the predicate function on each column.

# Efficient: pattern matching only
df %>%
  select(starts_with("customer"))

# Less efficient: evaluates function on each column
df %>%
  select(where(~ n_distinct(.) > 3))

# Optimize by combining approaches
df %>%
  select(
    starts_with("order"),  # Fast pattern match first
    where(is.numeric)      # Then type check
  )

Select helpers transform column selection from error-prone manual specification into declarative, maintainable code. They excel in production environments where column names may change or expand, and in exploratory analysis where quick subsetting accelerates insight discovery.