R tidyr - unite() Columns into One | Application Architect

Key Insights

The unite() function combines multiple columns into a single column with a specified separator, useful for creating composite identifiers, full names, or date strings from separate components
Unlike paste(), unite() integrates seamlessly into tidyverse pipelines and automatically handles NA values with the na.rm parameter
Combined with separate() or separate_wider_delim(), unite() enables bidirectional column transformations for flexible data reshaping workflows

Basic unite() Syntax

The unite() function from the tidyr package merges multiple columns into one. The basic syntax requires the data frame, the name of the new column, and the columns to combine.

library(tidyr)
library(dplyr)

# Sample dataset
employees <- data.frame(
  first_name = c("John", "Sarah", "Michael"),
  last_name = c("Smith", "Johnson", "Williams"),
  department = c("IT", "HR", "Finance")
)

# Unite first and last names
employees %>%
  unite(col = "full_name", first_name, last_name, sep = " ")

      full_name department
1    John Smith         IT
2 Sarah Johnson         HR
3 Michael Williams    Finance

By default, unite() removes the original columns. The sep parameter controls the separator between values—defaulting to underscore if not specified.

Preserving Original Columns

Use the remove parameter to keep source columns after uniting them. This proves valuable when you need both the combined and individual values for different analyses.

employees %>%
  unite(col = "full_name", first_name, last_name, sep = " ", remove = FALSE)

  first_name last_name      full_name department
1       John     Smith     John Smith         IT
2      Sarah   Johnson  Sarah Johnson         HR
3    Michael  Williams Michael Williams    Finance

The original first_name and last_name columns remain intact alongside the new full_name column.

Handling Missing Values

The na.rm parameter controls how unite() processes NA values. By default, NA values appear in the output string. Setting na.rm = TRUE excludes them entirely.

# Dataset with missing values
customer_data <- data.frame(
  prefix = c("Dr.", NA, "Ms."),
  first = c("Jane", "Bob", "Alice"),
  middle = c("Marie", NA, "Kay"),
  last = c("Doe", "Smith", "Brown")
)

# Default behavior - NAs included
customer_data %>%
  unite(col = "full_name", prefix, first, middle, last, sep = " ")

              full_name
1    Dr. Jane Marie Doe
2       NA Bob NA Smith
3 Ms. Alice Kay Brown

# Remove NAs from output
customer_data %>%
  unite(col = "full_name", prefix, first, middle, last, sep = " ", na.rm = TRUE)

           full_name
1 Dr. Jane Marie Doe
2          Bob Smith
3  Ms. Alice Kay Brown

With na.rm = TRUE, the second row shows “Bob Smith” instead of “NA Bob NA Smith”, producing cleaner output for incomplete records.

Creating Composite Keys

Unite multiple identifier columns to create unique composite keys for joins or grouping operations.

# Transaction data
transactions <- data.frame(
  year = c(2023, 2023, 2024, 2024),
  month = c(1, 2, 1, 2),
  day = c(15, 20, 10, 25),
  store_id = c("A01", "A02", "A01", "A03"),
  amount = c(150.50, 200.75, 175.25, 300.00)
)

# Create date and transaction ID
transactions %>%
  unite(col = "date", year, month, day, sep = "-", remove = FALSE) %>%
  unite(col = "transaction_id", date, store_id, sep = "_")

   transaction_id year month day amount
1  2023-1-15_A01 2023     1  15 150.50
2  2023-2-20_A02 2023     2  20 200.75
3  2024-1-10_A01 2024     1  10 175.25
4  2024-2-25_A03 2024     2  25 300.00

This creates human-readable transaction identifiers combining temporal and location data.

Dynamic Column Selection

Use tidyselect helpers to unite columns matching patterns or positions without explicitly naming each column.

# Wide format survey data
survey <- data.frame(
  respondent_id = 1:3,
  q1_rating = c(4, 5, 3),
  q2_rating = c(5, 4, 4),
  q3_rating = c(3, 5, 5),
  q1_comment = c("Good", "Great", "OK"),
  q2_comment = c("Excellent", "Nice", "Fine")
)

# Unite all rating columns
survey %>%
  unite(col = "all_ratings", ends_with("_rating"), sep = "|")

  respondent_id all_ratings q1_comment q2_comment
1             1       4|5|3       Good  Excellent
2             2       5|4|5      Great       Nice
3             3       3|4|5         OK       Fine

Common tidyselect helpers include starts_with(), ends_with(), contains(), and matches() for pattern-based selection.

Combining unite() with separate()

Round-trip transformations let you reshape data in both directions. This pattern appears frequently in data cleaning pipelines.

# Start with separated data
address_data <- data.frame(
  id = 1:3,
  street = c("123 Main St", "456 Oak Ave", "789 Pine Rd"),
  city = c("Boston", "Seattle", "Austin"),
  state = c("MA", "WA", "TX"),
  zip = c("02101", "98101", "78701")
)

# Unite into full address
unified <- address_data %>%
  unite(col = "full_address", street, city, state, zip, sep = ", ")

print(unified)

  id                        full_address
1  1  123 Main St, Boston, MA, 02101
2  2 456 Oak Ave, Seattle, WA, 98101
3  3   789 Pine Rd, Austin, TX, 78701

# Separate back to components
unified %>%
  separate(col = full_address, 
           into = c("street", "city", "state", "zip"), 
           sep = ", ")

  id      street    city state   zip
1  1 123 Main St  Boston    MA 02101
2  2 456 Oak Ave Seattle    WA 98101
3  3 789 Pine Rd  Austin    TX 78701

This bidirectional capability enables flexible data transformations based on analysis requirements.

Real-World Use Case: Log File Processing

Process server log data by combining timestamp components and request parameters into structured fields.

# Simulated log entries
logs <- data.frame(
  year = c(2024, 2024, 2024),
  month = c(3, 3, 3),
  day = c(15, 15, 16),
  hour = c(14, 15, 9),
  minute = c(23, 45, 12),
  second = c(10, 33, 55),
  server = c("web01", "web02", "web01"),
  endpoint = c("/api/users", "/api/orders", "/api/products"),
  status = c(200, 404, 200)
)

# Create ISO timestamp and request identifier
logs %>%
  unite(col = "timestamp", year, month, day, hour, minute, second, 
        sep = "-", remove = FALSE) %>%
  unite(col = "request_id", server, endpoint, sep = "", remove = FALSE) %>%
  select(timestamp, request_id, status, server, endpoint)

            timestamp         request_id status server      endpoint
1 2024-3-15-14-23-10  web01/api/users    200  web01    /api/users
2 2024-3-15-15-45-33 web02/api/orders    404  web02   /api/orders
3  2024-3-16-9-12-55 web01/api/products  200  web01 /api/products

This approach creates searchable identifiers while preserving granular components for filtering and aggregation.

Performance Considerations

For large datasets, unite() performs efficiently as part of tidyverse operations. However, when working with millions of rows, consider these optimizations:

# Efficient for large datasets - single unite operation
large_data %>%
  unite(col = "combined", col1, col2, col3, sep = "_")

# Less efficient - multiple unite operations
large_data %>%
  unite(col = "temp", col1, col2, sep = "_") %>%
  unite(col = "combined", temp, col3, sep = "_")

Combine all columns in a single unite() call rather than chaining multiple operations. The function handles multiple columns efficiently in one pass through the data.

Integration with dplyr Workflows

The unite() function chains naturally with other tidyverse operations for complex transformations.

# Complex pipeline example
sales_data <- data.frame(
  region = c("North", "South", "North", "West"),
  year = c(2023, 2023, 2024, 2024),
  quarter = c("Q1", "Q2", "Q1", "Q2"),
  revenue = c(50000, 75000, 60000, 80000)
)

sales_data %>%
  unite(col = "period", year, quarter, sep = "-") %>%
  group_by(region, period) %>%
  summarise(total_revenue = sum(revenue), .groups = "drop") %>%
  arrange(desc(total_revenue))

  region  period total_revenue
1   West 2024-Q2         80000
2  South 2023-Q2         75000
3  North 2024-Q1         60000
4  North 2023-Q1         50000

This demonstrates unite() as one component in a larger analytical pipeline, combining columns before aggregation and sorting operations.