R dplyr - left_join, right_join, inner_join, full_join

Joins combine two dataframes based on shared key columns. Each join type handles non-matching rows differently, which directly impacts your result set size and content.

Key Insights

  • R’s dplyr package provides four join functions that mirror SQL join operations, each handling missing matches differently: left_join() preserves all left table rows, right_join() preserves all right table rows, inner_join() keeps only matching rows, and full_join() keeps everything
  • Join performance and memory usage vary significantly based on key cardinality and data size—use inner_join() when possible to reduce output size, and always ensure join keys are indexed or pre-sorted for large datasets
  • One-to-many and many-to-many joins create row multiplication that can silently explode your dataset; validate join cardinality with count() before joining and use multiple and unmatched arguments in dplyr 1.1.0+ for explicit control

Understanding Join Mechanics

Joins combine two dataframes based on shared key columns. Each join type handles non-matching rows differently, which directly impacts your result set size and content.

library(dplyr)

# Sample datasets
customers <- tibble(
  customer_id = c(1, 2, 3, 4),
  name = c("Alice", "Bob", "Charlie", "Diana"),
  region = c("West", "East", "West", "South")
)

orders <- tibble(
  order_id = c(101, 102, 103, 104, 105),
  customer_id = c(1, 1, 2, 5, 5),
  amount = c(250, 180, 420, 310, 95)
)

print(customers)
print(orders)

Customer 4 (Diana) has no orders. Customer 5 has orders but doesn’t exist in the customers table. This mismatch demonstrates how each join type behaves.

left_join: Preserve Left Table Integrity

left_join() keeps all rows from the left dataframe and adds matching columns from the right. Non-matching right rows produce NA values.

# Keep all customers, add order data where available
result <- customers %>%
  left_join(orders, by = "customer_id")

print(result)
#   customer_id name    region order_id amount
# 1           1 Alice   West        101    250
# 2           1 Alice   West        102    180
# 3           2 Bob     East        103    420
# 4           3 Charlie West         NA     NA
# 5           4 Diana   South        NA     NA

Notice customer 1 (Alice) appears twice because she has two orders—this is row multiplication from a one-to-many relationship. Customers 3 and 4 appear once with NA values because they have no orders.

Use left_join() when:

  • You need a complete customer list regardless of activity
  • The left table is your primary reference dataset
  • Missing matches should be explicitly visible as NA
# Calculate total spending per customer, including zero-spend customers
customer_spending <- customers %>%
  left_join(orders, by = "customer_id") %>%
  group_by(customer_id, name) %>%
  summarise(
    total_spent = sum(amount, na.rm = TRUE),
    order_count = sum(!is.na(order_id)),
    .groups = "drop"
  )

print(customer_spending)

right_join: Mirror of left_join

right_join() preserves all right table rows. It’s functionally equivalent to swapping table positions in left_join().

# Keep all orders, add customer data where available
result <- customers %>%
  right_join(orders, by = "customer_id")

print(result)
#   customer_id name  region order_id amount
# 1           1 Alice West        101    250
# 2           1 Alice West        102    180
# 3           2 Bob   East        103    420
# 4           5 NA    NA          104    310
# 5           5 NA    NA          105     95

Orders 104 and 105 belong to non-existent customer 5, resulting in NA values for customer attributes. This reveals data integrity issues—orphaned records in your orders table.

Most developers prefer left_join() for readability and place the primary table first. Use right_join() when:

  • Piping operations make right-side preservation more natural
  • You’re explicitly checking for orphaned child records
# Identify orphaned orders (orders without valid customers)
orphaned_orders <- customers %>%
  right_join(orders, by = "customer_id") %>%
  filter(is.na(name))

print(orphaned_orders)

inner_join: Intersection Only

inner_join() returns only rows with matches in both tables. This is the most restrictive join and typically produces the smallest result set.

result <- customers %>%
  inner_join(orders, by = "customer_id")

print(result)
#   customer_id name  region order_id amount
# 1           1 Alice West        101    250
# 2           1 Alice West        102    180
# 3           2 Bob   East        103    420

Customer 3 (Charlie) and 4 (Diana) are excluded because they have no orders. Orders for customer 5 are excluded because customer 5 doesn’t exist.

Use inner_join() when:

  • You only want complete records with data in both tables
  • You’re filtering out incomplete or invalid relationships
  • Performance matters and you want to minimize output size
# Analysis requiring both customer and order data
customer_order_analysis <- customers %>%
  inner_join(orders, by = "customer_id") %>%
  mutate(
    high_value = amount > 300,
    customer_region_label = paste(name, region, sep = " - ")
  ) %>%
  select(customer_region_label, order_id, amount, high_value)

print(customer_order_analysis)

full_join: Complete Union

full_join() keeps all rows from both tables, filling unmatched columns with NA. This produces the largest result set.

result <- customers %>%
  full_join(orders, by = "customer_id")

print(result)
#   customer_id name    region order_id amount
# 1           1 Alice   West        101    250
# 2           1 Alice   West        102    180
# 3           2 Bob     East        103    420
# 4           3 Charlie West         NA     NA
# 5           4 Diana   South        NA     NA
# 6           5 NA      NA          104    310
# 7           5 NA      NA          105     95

All customers appear (including those without orders), and all orders appear (including orphaned ones). This is essential for data auditing.

Use full_join() when:

  • You need complete visibility into both datasets
  • You’re performing data quality checks
  • You’re reconciling two partial datasets
# Data quality report
quality_report <- customers %>%
  full_join(orders, by = "customer_id") %>%
  mutate(
    record_status = case_when(
      is.na(name) ~ "Orphaned Order",
      is.na(order_id) ~ "Customer Without Orders",
      TRUE ~ "Valid Record"
    )
  ) %>%
  count(record_status)

print(quality_report)

Multi-Column Joins

Real-world joins often require multiple key columns. Specify them as a character vector.

products <- tibble(
  product_id = c(1, 1, 2, 2),
  warehouse = c("A", "B", "A", "B"),
  stock = c(100, 50, 200, 75)
)

shipments <- tibble(
  product_id = c(1, 1, 2),
  warehouse = c("A", "B", "A"),
  quantity = c(20, 15, 30)
)

# Join on both product_id AND warehouse
result <- products %>%
  left_join(shipments, by = c("product_id", "warehouse"))

print(result)

Handling Different Column Names

When join keys have different names across tables, use a named vector.

customer_addresses <- tibble(
  cust_id = c(1, 2, 3, 4),
  address = c("123 Main St", "456 Oak Ave", "789 Pine Rd", "321 Elm St")
)

# customer_id in customers maps to cust_id in customer_addresses
result <- customers %>%
  left_join(customer_addresses, by = c("customer_id" = "cust_id"))

print(result)

Controlling Join Behavior

dplyr 1.1.0 introduced relationship, multiple, and unmatched arguments for explicit join control.

# Expect one-to-many relationship, error if violated
safe_join <- customers %>%
  left_join(
    orders, 
    by = "customer_id",
    relationship = "one-to-many"
  )

# Keep only first match for duplicates
first_order_only <- customers %>%
  left_join(
    orders, 
    by = "customer_id",
    multiple = "first"
  )

# Error if left table has unmatched rows
strict_inner <- customers %>%
  inner_join(
    orders,
    by = "customer_id",
    unmatched = "error"
  )

Performance Considerations

Join performance degrades with dataset size and key cardinality. Profile your joins and optimize accordingly.

# Check join cardinality before joining large tables
customers %>% count(customer_id) %>% filter(n > 1)  # Should be empty
orders %>% count(customer_id) %>% arrange(desc(n))  # Shows distribution

# For large datasets, use data.table for better performance
library(data.table)
customers_dt <- as.data.table(customers)
orders_dt <- as.data.table(orders)
setkey(customers_dt, customer_id)
setkey(orders_dt, customer_id)

# Merge syntax similar to dplyr
result_dt <- merge(customers_dt, orders_dt, all.x = TRUE)  # left join

Choose your join type based on business logic, not convenience. Validate assumptions about cardinality and match rates. Monitor result set sizes to catch unexpected row multiplication early.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.