R - merge() Data Frames

The `merge()` function combines two data frames based on common columns, similar to SQL JOIN operations. The basic syntax requires at least two data frames, with optional parameters controlling join...

Key Insights

  • R provides merge() for SQL-style joins between data frames, supporting inner, left, right, and full outer joins through the all, all.x, and all.y parameters
  • The function automatically detects common column names for merging, but explicit control via by, by.x, and by.y parameters prevents ambiguity and improves code reliability
  • Performance degrades significantly with large datasets; data.table or dplyr alternatives offer 10-100x speed improvements for production workloads

Understanding merge() Fundamentals

The merge() function combines two data frames based on common columns, similar to SQL JOIN operations. The basic syntax requires at least two data frames, with optional parameters controlling join behavior and column matching.

# Create sample data frames
employees <- data.frame(
  emp_id = c(101, 102, 103, 104, 105),
  name = c("Alice", "Bob", "Carol", "David", "Eve"),
  dept_id = c(1, 2, 1, 3, 2)
)

departments <- data.frame(
  dept_id = c(1, 2, 3, 4),
  dept_name = c("Engineering", "Sales", "Marketing", "HR"),
  location = c("NYC", "LA", "Chicago", "Boston")
)

# Basic merge on common column
result <- merge(employees, departments, by = "dept_id")
print(result)

Output:

  dept_id emp_id   name     dept_name location
1       1    101  Alice  Engineering      NYC
2       1    103  Carol  Engineering      NYC
3       2    102    Bob        Sales       LA
4       2    105    Eve        Sales       LA
5       3    104  David    Marketing  Chicago

By default, merge() performs an inner join, returning only rows with matching keys in both data frames. Department 4 (HR) doesn’t appear because no employees belong to it.

Join Types: Inner, Left, Right, and Full Outer

Control join behavior through the all, all.x, and all.y parameters. These boolean flags determine which unmatched rows to retain.

# Inner join (default)
inner <- merge(employees, departments, by = "dept_id")
cat("Inner join rows:", nrow(inner), "\n")

# Left join - keep all employees
left <- merge(employees, departments, by = "dept_id", all.x = TRUE)
cat("Left join rows:", nrow(left), "\n")

# Right join - keep all departments
right <- merge(employees, departments, by = "dept_id", all.y = TRUE)
cat("Right join rows:", nrow(right), "\n")

# Full outer join - keep all rows
full <- merge(employees, departments, by = "dept_id", all = TRUE)
cat("Full outer join rows:", nrow(full), "\n")

# Examine right join with unmatched department
print(right[is.na(right$emp_id), ])

Output:

Inner join rows: 5
Left join rows: 5
Right join rows: 6
Full outer join rows: 6

  dept_id emp_id name dept_name location
6       4     NA <NA>        HR   Boston

The right join includes the HR department with NA values for employee columns. This pattern is essential for identifying missing relationships in your data.

Merging on Different Column Names

When key columns have different names across data frames, use by.x and by.y to specify the matching columns explicitly.

# Create data with different column names
sales <- data.frame(
  employee_id = c(101, 102, 103, 104),
  quarter = c("Q1", "Q1", "Q1", "Q1"),
  revenue = c(50000, 75000, 60000, 45000)
)

emp_info <- data.frame(
  id = c(101, 102, 103, 104, 105),
  employee_name = c("Alice", "Bob", "Carol", "David", "Eve"),
  region = c("East", "West", "East", "South", "West")
)

# Merge with different column names
sales_detail <- merge(
  sales, 
  emp_info, 
  by.x = "employee_id", 
  by.y = "id",
  all.x = TRUE
)

print(sales_detail)

Output:

  employee_id quarter revenue employee_name region
1         101      Q1   50000         Alice   East
2         102      Q1   75000           Bob   West
3         103      Q1   60000         Carol   East
4         104      Q1   45000         David  South

Merging on Multiple Columns

Composite keys require matching on multiple columns simultaneously. Pass a character vector to the by parameter.

# Create data with composite keys
transactions <- data.frame(
  store_id = c(1, 1, 2, 2, 3),
  product_id = c(100, 101, 100, 102, 100),
  quantity = c(5, 3, 8, 2, 6)
)

inventory <- data.frame(
  store_id = c(1, 1, 2, 2, 3, 3),
  product_id = c(100, 101, 100, 102, 100, 103),
  stock = c(50, 30, 80, 20, 60, 40),
  warehouse = c("A", "A", "B", "B", "C", "C")
)

# Merge on multiple columns
stock_check <- merge(
  transactions, 
  inventory, 
  by = c("store_id", "product_id")
)

# Calculate stock after transactions
stock_check$remaining <- stock_check$stock - stock_check$quantity
print(stock_check)

Output:

  store_id product_id quantity stock warehouse remaining
1        1        100        5    50         A        45
2        1        101        3    30         A        27
3        2        100        8    80         B        72
4        2        102        2    20         B        18
5        3        100        6    60         C        54

Handling Duplicate Column Names

When data frames contain columns with identical names (other than merge keys), merge() appends .x and .y suffixes. Use the suffixes parameter to customize these.

# Create data with overlapping column names
q1_sales <- data.frame(
  emp_id = c(101, 102, 103),
  sales = c(50000, 75000, 60000),
  quota = c(45000, 70000, 55000)
)

q2_sales <- data.frame(
  emp_id = c(101, 102, 104),
  sales = c(55000, 80000, 45000),
  quota = c(50000, 75000, 40000)
)

# Merge with custom suffixes
quarterly_comparison <- merge(
  q1_sales, 
  q2_sales, 
  by = "emp_id", 
  suffixes = c("_q1", "_q2"),
  all = TRUE
)

print(quarterly_comparison)

Output:

  emp_id sales_q1 quota_q1 sales_q2 quota_q2
1    101    50000    45000    55000    50000
2    102    75000    70000    80000    75000
3    103    60000    55000       NA       NA
4    104       NA       NA    45000    40000

Performance Considerations and Alternatives

The merge() function works well for small to medium datasets but becomes prohibitively slow with large data. For datasets exceeding 100,000 rows, consider alternatives.

# Performance comparison (requires data.table package)
library(data.table)
library(microbenchmark)

# Create larger datasets
set.seed(42)
large_df1 <- data.frame(
  id = 1:10000,
  value1 = rnorm(10000)
)

large_df2 <- data.frame(
  id = sample(1:10000, 8000),
  value2 = rnorm(8000)
)

# Convert to data.table
dt1 <- as.data.table(large_df1)
dt2 <- as.data.table(large_df2)
setkey(dt1, id)
setkey(dt2, id)

# Benchmark
comparison <- microbenchmark(
  base_merge = merge(large_df1, large_df2, by = "id"),
  dt_merge = dt1[dt2],
  times = 10
)

print(comparison)

For production systems processing large datasets, data.table provides dramatic speed improvements through optimized C code and efficient memory usage. The dplyr package offers similar performance with more readable syntax through left_join(), inner_join(), and related functions.

Practical Pattern: Chaining Multiple Merges

Real-world scenarios often require joining multiple data frames sequentially. Build pipelines by chaining merge operations.

# Create related datasets
orders <- data.frame(
  order_id = c(1001, 1002, 1003),
  customer_id = c(501, 502, 501),
  product_id = c(200, 201, 200)
)

customers <- data.frame(
  customer_id = c(501, 502, 503),
  customer_name = c("TechCorp", "DataInc", "CloudCo"),
  tier = c("Gold", "Silver", "Gold")
)

products <- data.frame(
  product_id = c(200, 201, 202),
  product_name = c("Database", "Analytics", "Storage"),
  price = c(5000, 3000, 2000)
)

# Chain merges
result <- merge(orders, customers, by = "customer_id")
result <- merge(result, products, by = "product_id")

# Calculate order values
result$order_value <- result$price
print(result[, c("order_id", "customer_name", "product_name", "order_value", "tier")])

This pattern creates comprehensive views by progressively enriching data with additional context from related tables, mirroring common database query patterns in R’s data frame paradigm.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.