PySpark - Left Semi Join with Examples

Key Insights

  • Left semi joins return only rows from the left DataFrame where matches exist in the right DataFrame, without duplicating rows or adding columns from the right side—making them ideal for filtering based on existence checks
  • They outperform alternatives like inner join + dropDuplicates() or filter + isin() for large datasets because they stop processing once a match is found and never materialize right-side columns
  • Use broadcast hints with left semi joins when the right DataFrame is small (< 10MB) to dramatically improve performance by avoiding expensive shuffle operations

Introduction to Left Semi Join

A left semi join is one of PySpark’s most underutilized join types, yet it solves a common problem elegantly: filtering a DataFrame based on the existence of matching records in another DataFrame. Unlike inner or left joins, a left semi join returns only columns from the left DataFrame and never duplicates rows—even if multiple matches exist in the right DataFrame.

Think of it as asking “which rows in DataFrame A have at least one corresponding row in DataFrame B?” You get back only rows from A, with A’s original schema intact.

Here’s a visual comparison:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("SemiJoinDemo").getOrCreate()

# Sample data
customers = spark.createDataFrame([
    (1, "Alice"),
    (2, "Bob"),
    (3, "Charlie")
], ["customer_id", "name"])

orders = spark.createDataFrame([
    (101, 1),
    (102, 1),
    (103, 2)
], ["order_id", "customer_id"])

# Left join - includes all customers, adds order columns, duplicates customers with multiple orders
left_result = customers.join(orders, "customer_id", "left")
print("Left Join:")
left_result.show()
# +-----------+-------+--------+
# |customer_id|   name|order_id|
# +-----------+-------+--------+
# |          1|  Alice|     101|
# |          1|  Alice|     102|  <- Alice duplicated
# |          2|    Bob|     103|
# |          3|Charlie|    null|  <- Charlie included with null
# +-----------+-------+--------+

# Left semi join - only customers with orders, no duplication, no order columns
semi_result = customers.join(orders, "customer_id", "left_semi")
print("Left Semi Join:")
semi_result.show()
# +-----------+-----+
# |customer_id| name|
# +-----------+-----+
# |          1|Alice|  <- No duplication despite 2 orders
# |          2|  Bob|
# +-----------+-----+

The left semi join gives you a clean, deduplicated list of customers who have placed orders, without bringing in any order details.

Left Semi Join Syntax and Basic Usage

The syntax for left semi joins in PySpark uses the standard join() method with "left_semi" as the join type parameter:

result = left_df.join(right_df, join_condition, "left_semi")

Here’s a practical example with employees and departments:

employees = spark.createDataFrame([
    (1, "John", "Engineering"),
    (2, "Sarah", "Marketing"),
    (3, "Mike", "Sales"),
    (4, "Emma", "Engineering"),
    (5, "Tom", "Finance")
], ["emp_id", "emp_name", "dept_name"])

active_departments = spark.createDataFrame([
    ("Engineering", "Building A"),
    ("Marketing", "Building B")
], ["dept_name", "location"])

# Find employees who work in active departments
result = employees.join(
    active_departments,
    employees.dept_name == active_departments.dept_name,
    "left_semi"
)

result.show()
# +------+--------+------------+
# |emp_id|emp_name|   dept_name|
# +------+--------+------------+
# |     1|    John| Engineering|
# |     2|   Sarah|   Marketing|
# |     4|    Emma| Engineering|
# +------+--------+------------+

Notice that only employees from Engineering and Marketing appear in the result, and the location column from active_departments is not included. This is the defining characteristic of semi joins.

Practical Use Cases

Left semi joins excel in several common scenarios:

1. Filtering based on existence: When you need to filter records that exist in a reference table without needing the reference data itself.

2. Deduplication during joins: When the right DataFrame might have multiple matches but you only care about existence.

3. Performance optimization: Replacing subqueries or complex filter operations with a more efficient join strategy.

Here’s an e-commerce example finding products that have been ordered:

products = spark.createDataFrame([
    (101, "Laptop", 999.99),
    (102, "Mouse", 29.99),
    (103, "Keyboard", 79.99),
    (104, "Monitor", 299.99),
    (105, "Webcam", 89.99)
], ["product_id", "product_name", "price"])

order_items = spark.createDataFrame([
    (1, 101, 2),
    (2, 101, 1),
    (3, 102, 5),
    (4, 103, 3),
    (5, 102, 2)
], ["order_item_id", "product_id", "quantity"])

# Find products that have been ordered at least once
ordered_products = products.join(
    order_items,
    "product_id",
    "left_semi"
)

ordered_products.show()
# +----------+------------+------+
# |product_id|product_name| price|
# +----------+------------+------+
# |       101|      Laptop|999.99|
# |       102|       Mouse| 29.99|
# |       103|    Keyboard| 79.99|
# +----------+------------+------+

This immediately shows which products from your catalog have sales history, without duplicating products that appear in multiple orders or bringing in order quantities.

Left Semi Join vs Alternatives

Understanding when to use left semi join over alternatives is crucial for writing efficient PySpark code.

Alternative 1: Inner Join + Drop Duplicates

# Less efficient approach
alternative1 = products.join(order_items, "product_id", "inner") \
    .select(products.columns) \
    .dropDuplicates()

Alternative 2: Filter with isin()

# Requires collecting data to driver
ordered_ids = [row.product_id for row in order_items.select("product_id").distinct().collect()]
alternative2 = products.filter(col("product_id").isin(ordered_ids))

Left Semi Join (Best)

# Most efficient
best_approach = products.join(order_items, "product_id", "left_semi")

The left semi join is superior because:

  • It stops searching for matches after finding the first one (short-circuit evaluation)
  • It never materializes columns from the right DataFrame
  • It automatically handles deduplication without a separate operation
  • It doesn’t require collecting data to the driver

You can verify this by examining execution plans:

# Compare physical plans
alternative1.explain()
# Shows: Join -> Project -> Aggregate (for dropDuplicates)

best_approach.explain()
# Shows: LeftSemi Join (single optimized operation)

Advanced Examples with Multiple Conditions

Left semi joins support complex join conditions involving multiple columns and boolean expressions:

Multi-column join:

sales = spark.createDataFrame([
    (1, "2024-01", "US", 1000),
    (2, "2024-01", "UK", 500),
    (3, "2024-02", "US", 1200),
    (4, "2024-02", "CA", 800)
], ["sale_id", "month", "country", "amount"])

targets = spark.createDataFrame([
    ("2024-01", "US", 900),
    ("2024-02", "US", 1100)
], ["month", "country", "target"])

# Find sales that met their targets
met_targets = sales.join(
    targets,
    (sales.month == targets.month) & (sales.country == targets.country) & (sales.amount >= targets.target),
    "left_semi"
)

met_targets.show()
# +-------+-------+-------+------+
# |sale_id|  month|country|amount|
# +-------+-------+-------+------+
# |      1|2024-01|     US|  1000|
# |      3|2024-02|     US|  1200|
# +-------+-------+-------+------+

Complex conditions with additional filters:

# Find high-value customers who made purchases in specific categories
customers = spark.createDataFrame([
    (1, "Alice", "Premium"),
    (2, "Bob", "Standard"),
    (3, "Charlie", "Premium")
], ["customer_id", "name", "tier"])

purchases = spark.createDataFrame([
    (101, 1, "Electronics", 500),
    (102, 2, "Electronics", 300),
    (103, 1, "Clothing", 100),
    (104, 3, "Books", 50)
], ["purchase_id", "customer_id", "category", "amount"])

premium_electronics = customers.join(
    purchases.filter((col("category") == "Electronics") & (col("amount") > 400)),
    "customer_id",
    "left_semi"
)

premium_electronics.show()
# +-----------+-----+-------+
# |customer_id| name|   tier|
# +-----------+-----+-------+
# |          1|Alice|Premium|
# +-----------+-----+-------+

Performance Considerations and Best Practices

Use broadcast joins for small lookup tables:

When your right DataFrame is small (typically under 10MB), use broadcast hints to avoid expensive shuffles:

from pyspark.sql.functions import broadcast

# Assume active_departments is small
large_employee_table = spark.read.parquet("s3://data/employees")
small_dept_lookup = spark.read.parquet("s3://data/active_depts")

# Broadcast the small table
result = large_employee_table.join(
    broadcast(small_dept_lookup),
    "dept_name",
    "left_semi"
)

Partition your data appropriately:

# If frequently joining on product_id, partition accordingly
products.repartition("product_id").join(
    order_items.repartition("product_id"),
    "product_id",
    "left_semi"
)

When to use left semi join:

  • The right DataFrame is used purely for filtering
  • You don’t need columns from the right DataFrame
  • The right DataFrame might have multiple matches per left row
  • You’re replacing an EXISTS subquery from SQL

When NOT to use left semi join:

  • You need columns from both DataFrames (use inner or left join)
  • You need to count matches from the right side (use aggregation instead)
  • The right DataFrame is massive and mostly non-matching (consider left anti join for inverse)

Common Pitfalls and Troubleshooting

Null handling in join keys:

Nulls in join keys never match anything — not even other nulls:

data_with_nulls = spark.createDataFrame([
    (1, "A"),
    (2, "B"),
    (None, "C")
], ["id", "value"])

lookup = spark.createDataFrame([
    (1,),
    (None,)
], ["id"])

result = data_with_nulls.join(lookup, "id", "left_semi")
result.show()
# +---+-----+
# | id|value|
# +---+-----+
# |  1|    A|  <- Only non-null match appears
# +---+-----+

# Handle nulls explicitly if needed
result_with_nulls = data_with_nulls.join(
    lookup,
    (data_with_nulls.id == lookup.id) | (data_with_nulls.id.isNull() & lookup.id.isNull()),
    "left_semi"
)

Case sensitivity:

By default, Spark resolves column names case-insensitively, but value comparisons are case-sensitive:

# This works - column names are case-insensitive
df1.join(df2, df1.ProductID == df2.productid, "left_semi")

# This might not match as expected - values are case-sensitive
df1.join(df2, df1.category == df2.category, "left_semi")  # "Electronics" != "electronics"

# Use lower() for case-insensitive value matching
from pyspark.sql.functions import lower
df1.join(df2, lower(df1.category) == lower(df2.category), "left_semi")

Misconception about deduplication:

A left semi join never duplicates a left row when the right side has multiple matches, but it does not remove duplicates that already exist in the left DataFrame:

left_with_dupes = spark.createDataFrame([
    (1, "A"),
    (1, "A"),  # Duplicate in left
    (2, "B")
], ["id", "value"])

right = spark.createDataFrame([(1,)], ["id"])

result = left_with_dupes.join(right, "id", "left_semi")
result.show()
# +---+-----+
# | id|value|
# +---+-----+
# |  1|    A|
# |  1|    A|  <- Both duplicates remain!
# +---+-----+

If you need true deduplication, combine with dropDuplicates() explicitly.

Left semi joins are a powerful tool in PySpark for existence-based filtering. Master them to write cleaner, more performant data transformations that clearly express your intent while leveraging Spark’s optimization capabilities.
