R dplyr - left_join, right_join, inner_join, full_join

Key Insights

R’s dplyr package provides four join functions that mirror SQL join operations, each handling missing matches differently: left_join() preserves all left table rows, right_join() preserves all right table rows, inner_join() keeps only matching rows, and full_join() keeps everything
Join performance and memory usage vary significantly based on key cardinality and data size—use inner_join() when possible to reduce output size, and always ensure join keys are indexed or pre-sorted for large datasets
One-to-many and many-to-many joins create row multiplication that can silently explode your dataset; validate join cardinality with count() before joining and use multiple and unmatched arguments in dplyr 1.1.0+ for explicit control

Understanding Join Mechanics

Joins combine two dataframes based on shared key columns. Each join type handles non-matching rows differently, which directly impacts your result set size and content.

library(dplyr)

# Sample datasets
customers <- tibble(
  customer_id = c(1, 2, 3, 4),
  name = c("Alice", "Bob", "Charlie", "Diana"),
  region = c("West", "East", "West", "South")
)

orders <- tibble(
  order_id = c(101, 102, 103, 104, 105),
  customer_id = c(1, 1, 2, 5, 5),
  amount = c(250, 180, 420, 310, 95)
)

print(customers)
print(orders)

Customer 4 (Diana) has no orders. Customer 5 has orders but doesn’t exist in the customers table. This mismatch demonstrates how each join type behaves.

left_join: Preserve Left Table Integrity

left_join() keeps all rows from the left dataframe and adds matching columns from the right. Non-matching right rows produce NA values.

# Keep all customers, add order data where available
result <- customers %>%
  left_join(orders, by = "customer_id")

print(result)
#   customer_id name    region order_id amount
# 1           1 Alice   West        101    250
# 2           1 Alice   West        102    180
# 3           2 Bob     East        103    420
# 4           3 Charlie West         NA     NA
# 5           4 Diana   South        NA     NA

Notice customer 1 (Alice) appears twice because she has two orders—this is row multiplication from a one-to-many relationship. Customers 3 and 4 appear once with NA values because they have no orders.

Use left_join() when:

You need a complete customer list regardless of activity
The left table is your primary reference dataset
Missing matches should be explicitly visible as NA

# Calculate total spending per customer, including zero-spend customers
customer_spending <- customers %>%
  left_join(orders, by = "customer_id") %>%
  group_by(customer_id, name) %>%
  summarise(
    total_spent = sum(amount, na.rm = TRUE),
    order_count = sum(!is.na(order_id)),
    .groups = "drop"
  )

print(customer_spending)

right_join: Mirror of left_join

right_join() preserves all right table rows. It’s functionally equivalent to swapping table positions in left_join().

# Keep all orders, add customer data where available
result <- customers %>%
  right_join(orders, by = "customer_id")

print(result)
#   customer_id name  region order_id amount
# 1           1 Alice West        101    250
# 2           1 Alice West        102    180
# 3           2 Bob   East        103    420
# 4           5 NA    NA          104    310
# 5           5 NA    NA          105     95

Orders 104 and 105 belong to non-existent customer 5, resulting in NA values for customer attributes. This reveals data integrity issues—orphaned records in your orders table.

Most developers prefer left_join() for readability and place the primary table first. Use right_join() when:

Piping operations make right-side preservation more natural
You’re explicitly checking for orphaned child records

# Identify orphaned orders (orders without valid customers)
orphaned_orders <- customers %>%
  right_join(orders, by = "customer_id") %>%
  filter(is.na(name))

print(orphaned_orders)

inner_join: Intersection Only

inner_join() returns only rows with matches in both tables. This is the most restrictive join and typically produces the smallest result set.

result <- customers %>%
  inner_join(orders, by = "customer_id")

print(result)
#   customer_id name  region order_id amount
# 1           1 Alice West        101    250
# 2           1 Alice West        102    180
# 3           2 Bob   East        103    420

Customer 3 (Charlie) and 4 (Diana) are excluded because they have no orders. Orders for customer 5 are excluded because customer 5 doesn’t exist.

Use inner_join() when:

You only want complete records with data in both tables
You’re filtering out incomplete or invalid relationships
Performance matters and you want to minimize output size

# Analysis requiring both customer and order data
customer_order_analysis <- customers %>%
  inner_join(orders, by = "customer_id") %>%
  mutate(
    high_value = amount > 300,
    customer_region_label = paste(name, region, sep = " - ")
  ) %>%
  select(customer_region_label, order_id, amount, high_value)

print(customer_order_analysis)

full_join: Complete Union

full_join() keeps all rows from both tables, filling unmatched columns with NA. This produces the largest result set.

result <- customers %>%
  full_join(orders, by = "customer_id")

print(result)
#   customer_id name    region order_id amount
# 1           1 Alice   West        101    250
# 2           1 Alice   West        102    180
# 3           2 Bob     East        103    420
# 4           3 Charlie West         NA     NA
# 5           4 Diana   South        NA     NA
# 6           5 NA      NA          104    310
# 7           5 NA      NA          105     95

All customers appear (including those without orders), and all orders appear (including orphaned ones). This is essential for data auditing.

Use full_join() when:

You need complete visibility into both datasets
You’re performing data quality checks
You’re reconciling two partial datasets

# Data quality report
quality_report <- customers %>%
  full_join(orders, by = "customer_id") %>%
  mutate(
    record_status = case_when(
      is.na(name) ~ "Orphaned Order",
      is.na(order_id) ~ "Customer Without Orders",
      TRUE ~ "Valid Record"
    )
  ) %>%
  count(record_status)

print(quality_report)

Multi-Column Joins

Real-world joins often require multiple key columns. Specify them as a character vector.

products <- tibble(
  product_id = c(1, 1, 2, 2),
  warehouse = c("A", "B", "A", "B"),
  stock = c(100, 50, 200, 75)
)

shipments <- tibble(
  product_id = c(1, 1, 2),
  warehouse = c("A", "B", "A"),
  quantity = c(20, 15, 30)
)

# Join on both product_id AND warehouse
result <- products %>%
  left_join(shipments, by = c("product_id", "warehouse"))

print(result)

Handling Different Column Names

When join keys have different names across tables, use a named vector.

customer_addresses <- tibble(
  cust_id = c(1, 2, 3, 4),
  address = c("123 Main St", "456 Oak Ave", "789 Pine Rd", "321 Elm St")
)

# customer_id in customers maps to cust_id in customer_addresses
result <- customers %>%
  left_join(customer_addresses, by = c("customer_id" = "cust_id"))

print(result)

Controlling Join Behavior

dplyr 1.1.0 introduced relationship, multiple, and unmatched arguments for explicit join control.

# Expect one-to-many relationship, error if violated
safe_join <- customers %>%
  left_join(
    orders, 
    by = "customer_id",
    relationship = "one-to-many"
  )

# Keep only first match for duplicates
first_order_only <- customers %>%
  left_join(
    orders, 
    by = "customer_id",
    multiple = "first"
  )

# Error if left table has unmatched rows
strict_inner <- customers %>%
  inner_join(
    orders,
    by = "customer_id",
    unmatched = "error"
  )

Performance Considerations

Join performance degrades with dataset size and key cardinality. Profile your joins and optimize accordingly.

# Check join cardinality before joining large tables
customers %>% count(customer_id) %>% filter(n > 1)  # Should be empty
orders %>% count(customer_id) %>% arrange(desc(n))  # Shows distribution

# For large datasets, use data.table for better performance
library(data.table)
customers_dt <- as.data.table(customers)
orders_dt <- as.data.table(orders)
setkey(customers_dt, customer_id)
setkey(orders_dt, customer_id)

# Merge syntax similar to dplyr
result_dt <- merge(customers_dt, orders_dt, all.x = TRUE)  # left join

Choose your join type based on business logic, not convenience. Validate assumptions about cardinality and match rates. Monitor result set sizes to catch unexpected row multiplication early.