R dplyr - anti_join() and semi_join()

The dplyr package provides two filtering joins that differ fundamentally from mutating joins like `inner_join()` or `left_join()`. While mutating joins combine columns from both tables, filtering...

Key Insights

  • anti_join() returns rows from the left table that have no match in the right table, making it essential for identifying missing records or data inconsistencies
  • semi_join() returns rows from the left table that have a match in the right table, but unlike inner_join(), it never duplicates left table rows and doesn’t add columns from the right table
  • Both filtering joins are more memory-efficient than mutating joins when you only need to filter rows based on the existence of matches, not merge data

Understanding Filtering Joins

The dplyr package provides two filtering joins that differ fundamentally from mutating joins like inner_join() or left_join(). While mutating joins combine columns from both tables, filtering joins use the right table purely as a filter criterion. They return rows from the left table based on whether matches exist in the right table, without adding any columns from the right side.

semi_join(x, y) keeps all rows in x that have a match in y. anti_join(x, y) keeps all rows in x that don’t have a match in y. Both operations preserve the original structure of the left table.

anti_join() - Finding What’s Missing

anti_join() excels at identifying gaps in your data. Common use cases include finding customers who haven’t made purchases, detecting missing records, or validating data completeness.

library(dplyr)

# Customer database
customers <- tibble(
  customer_id = c(101, 102, 103, 104, 105),
  name = c("Alice", "Bob", "Carol", "David", "Eve"),
  region = c("West", "East", "West", "North", "South")
)

# Recent orders
orders <- tibble(
  order_id = c(1, 2, 3),
  customer_id = c(101, 103, 103),
  amount = c(250, 180, 320)
)

# Find customers who haven't ordered
inactive_customers <- customers %>%
  anti_join(orders, by = "customer_id")

print(inactive_customers)

Output:

# A tibble: 3 × 3
  customer_id name  region
        <dbl> <chr> <chr> 
1         102 Bob   East  
2         104 David North 
3         105 Eve   South 

The result contains only customers without matching orders. Notice that the orders table columns aren’t included—you get the complete left table rows, filtered by the absence of matches.

semi_join() - Filtering by Existence

semi_join() returns rows from the left table that have at least one match in the right table. Unlike inner_join(), it never duplicates rows from the left table, regardless of how many matches exist in the right table.

# Products catalog
products <- tibble(
  product_id = c(1, 2, 3, 4, 5),
  product_name = c("Laptop", "Mouse", "Keyboard", "Monitor", "Webcam"),
  category = c("Computer", "Accessory", "Accessory", "Display", "Accessory")
)

# Items currently in stock
inventory <- tibble(
  product_id = c(1, 2, 2, 3, 3, 3),
  warehouse = c("A", "A", "B", "A", "B", "C"),
  quantity = c(10, 5, 8, 12, 6, 4)
)

# Find products that are in stock
available_products <- products %>%
  semi_join(inventory, by = "product_id")

print(available_products)

Output:

# A tibble: 3 × 3
  product_id product_name category  
       <dbl> <chr>        <chr>     
1          1 Laptop       Computer  
2          2 Mouse        Accessory 
3          3 Keyboard     Accessory 

Product ID 2 appears twice in inventory (warehouses A and B), and product ID 3 appears three times (warehouses A, B, and C), but each product appears only once in the result. This is the key distinction from inner_join().

Comparing semi_join() vs inner_join()

Understanding when to use semi_join() versus inner_join() is crucial for writing efficient queries.

# Using inner_join - creates duplicates
inner_result <- products %>%
  inner_join(inventory, by = "product_id") %>%
  select(product_id, product_name, category)

print(inner_result)

Output:

# A tibble: 6 × 3
  product_id product_name category  
       <dbl> <chr>        <chr>     
1          1 Laptop       Computer  
2          2 Mouse        Accessory 
3          2 Mouse        Accessory 
4          3 Keyboard     Accessory 
5          3 Keyboard     Accessory 
6          3 Keyboard     Accessory 

The inner_join() creates one row for each match, resulting in duplicates. If you need the warehouse information, use inner_join(). If you only need to know which products are in stock, semi_join() is more appropriate and efficient.

Multiple Join Keys

Both filtering joins support multiple columns for matching, essential when dealing with composite keys.

# Employee assignments
employees <- tibble(
  emp_id = c(1, 2, 3, 4, 5),
  department = c("Sales", "IT", "Sales", "HR", "IT"),
  name = c("John", "Jane", "Mike", "Sarah", "Tom")
)

# Active projects
projects <- tibble(
  emp_id = c(1, 2, 2, 3),
  department = c("Sales", "IT", "IT", "Marketing"),
  project_name = c("Q4 Campaign", "Cloud Migration", "Security Audit", "Rebranding")
)

# Find employees with projects in their assigned department
active_in_dept <- employees %>%
  semi_join(projects, by = c("emp_id", "department"))

print(active_in_dept)

Output:

# A tibble: 2 × 3
  emp_id department name 
   <dbl> <chr>      <chr>
1      1 Sales      John 
2      2 IT         Jane 

Mike (emp_id 3) doesn’t appear because while he exists in both tables, his department assignment doesn’t match the project department.

Data Quality Validation

anti_join() is invaluable for data validation workflows, helping identify referential integrity issues.

# Order details referencing products
order_details <- tibble(
  order_id = c(1, 1, 2, 2, 3),
  product_id = c(1, 2, 1, 99, 88),  # 99 and 88 don't exist
  quantity = c(2, 1, 3, 1, 2)
)

# Valid products
valid_products <- tibble(
  product_id = c(1, 2, 3, 4, 5),
  name = c("Widget A", "Widget B", "Widget C", "Widget D", "Widget E")
)

# Find orphaned order details
orphaned_orders <- order_details %>%
  anti_join(valid_products, by = "product_id")

print(orphaned_orders)

Output:

# A tibble: 2 × 3
  order_id product_id quantity
     <dbl>      <dbl>    <dbl>
1        2         99        1
2        3         88        2

This immediately reveals data integrity problems where order details reference non-existent products.

Performance Considerations

Filtering joins are generally faster and more memory-efficient than mutating joins when you don’t need columns from the right table.

# Large dataset scenario
set.seed(123)
large_customers <- tibble(
  customer_id = 1:100000,
  customer_data = paste0("data_", 1:100000)
)

active_ids <- tibble(
  customer_id = sample(1:100000, 50000)
)

# Efficient: semi_join
system.time({
  result_semi <- large_customers %>%
    semi_join(active_ids, by = "customer_id")
})

# Less efficient: inner_join with select
system.time({
  result_inner <- large_customers %>%
    inner_join(active_ids, by = "customer_id") %>%
    select(customer_id, customer_data)
})

The semi_join() approach typically performs better because it doesn’t need to merge and then drop columns.

Combining Filtering Joins

You can chain filtering joins to implement complex filtering logic.

# Premium customers
premium_customers <- tibble(
  customer_id = c(101, 102, 103)
)

# Customers with complaints
complaints <- tibble(
  customer_id = c(102, 105)
)

# Find premium customers without complaints
target_customers <- customers %>%
  semi_join(premium_customers, by = "customer_id") %>%
  anti_join(complaints, by = "customer_id")

print(target_customers)

Output:

# A tibble: 2 × 3
  customer_id name  region
        <dbl> <chr> <chr> 
1         101 Alice West  
2         103 Carol West  

This pattern is cleaner than complex WHERE clauses and easier to understand than nested subqueries.

Practical Applications

Use anti_join() for: identifying missing records in ETL pipelines, finding unmatched transactions during reconciliation, detecting data quality issues, or generating exception reports.

Use semi_join() for: filtering datasets based on criteria in another table, implementing efficient EXISTS-style queries, deduplicating while filtering, or subsetting data based on reference tables.

Both filtering joins are essential tools for data analysis workflows where the goal is filtering rather than merging data. They provide cleaner, more efficient alternatives to complex filter conditions or mutating joins followed by column selection.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.