How to Outer Join in PySpark
Key Insights
- Outer joins in PySpark preserve records that don't have matching keys, filling missing values with nulls. Use `left`, `right`, or `full`/`outer` in the `how` parameter to control which DataFrame's unmatched records survive.
- Always handle duplicate column names before joining by aliasing or renaming columns; otherwise you'll end up with ambiguous references that break downstream operations.
- For skewed or large datasets, use broadcast hints on the smaller DataFrame and consider repartitioning on join keys to avoid shuffle-heavy operations that kill performance.
Introduction to Outer Joins in PySpark
Every data engineer eventually hits the same problem: you need to combine two datasets, but they don’t perfectly align. Maybe you’re merging customer records with transactions, and some customers haven’t made purchases yet. Or you’re joining product catalogs where items exist in one system but not another.
Inner joins won’t help here—they drop any record without a match on both sides. That’s where outer joins come in. They let you preserve records from one or both DataFrames even when the join key doesn’t exist in the other dataset.
PySpark supports three outer join types: left outer, right outer, and full outer. Choosing the wrong one means either losing data you need or keeping data you don’t. This article walks through each type with practical examples so you can pick the right join for your use case.
Understanding Outer Join Types
Before diving into code, let’s clarify what each outer join preserves:
| Join Type | Left DataFrame Records | Right DataFrame Records | Unmatched Handling |
|---|---|---|---|
| Left Outer | All | Only matching | Right columns → null |
| Right Outer | Only matching | All | Left columns → null |
| Full Outer | All | All | Both sides → null where unmatched |
The key insight: outer joins are about which side’s orphan records you want to keep. Left keeps left orphans, right keeps right orphans, full keeps everyone.
Let’s create two sample DataFrames to work with throughout this article:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("OuterJoinDemo").getOrCreate()

# Customers DataFrame
customers_data = [
    (1, "Alice", "New York"),
    (2, "Bob", "Los Angeles"),
    (3, "Charlie", "Chicago"),
    (4, "Diana", "Houston")
]
customers = spark.createDataFrame(
    customers_data,
    ["customer_id", "name", "city"]
)

# Orders DataFrame
orders_data = [
    (101, 1, 250.00),
    (102, 2, 175.50),
    (103, 2, 89.99),
    (104, 5, 320.00),  # customer_id 5 doesn't exist in customers
    (105, 6, 450.00)   # customer_id 6 doesn't exist in customers
]
orders = spark.createDataFrame(
    orders_data,
    ["order_id", "customer_id", "amount"]
)

customers.show()
orders.show()
```
Output:
```
+-----------+-------+-----------+
|customer_id|   name|       city|
+-----------+-------+-----------+
|          1|  Alice|   New York|
|          2|    Bob|Los Angeles|
|          3|Charlie|    Chicago|
|          4|  Diana|    Houston|
+-----------+-------+-----------+

+--------+-----------+------+
|order_id|customer_id|amount|
+--------+-----------+------+
|     101|          1| 250.0|
|     102|          2| 175.5|
|     103|          2| 89.99|
|     104|          5| 320.0|
|     105|          6| 450.0|
+--------+-----------+------+
```
Notice the mismatch: customers 3 and 4 have no orders, while orders 104 and 105 reference customers that don’t exist. This is the exact scenario where outer joins matter.
Left Outer Join
Use a left outer join when you want all records from the left DataFrame, regardless of whether they have matches in the right DataFrame. This is the most common outer join in practice.
Typical use case: You have a master table (customers) and want to enrich it with optional data (orders). Customers without orders should still appear in the result.
```python
left_joined = customers.join(
    orders,
    on="customer_id",
    how="left"
)
left_joined.show()
```
Output:
```
+-----------+-------+-----------+--------+------+
|customer_id|   name|       city|order_id|amount|
+-----------+-------+-----------+--------+------+
|          1|  Alice|   New York|     101| 250.0|
|          2|    Bob|Los Angeles|     102| 175.5|
|          2|    Bob|Los Angeles|     103| 89.99|
|          3|Charlie|    Chicago|    NULL|  NULL|
|          4|  Diana|    Houston|    NULL|  NULL|
+-----------+-------+-----------+--------+------+
```
Charlie and Diana appear with null values for order columns because they have no orders. Meanwhile, orders 104 and 105 (with non-existent customer IDs) are dropped entirely.
You can also pass `how="left_outer"` or `how="leftouter"`; all three spellings are equivalent.
Right Outer Join
Right outer joins flip the logic: keep all records from the right DataFrame, filling nulls for unmatched left records.
Typical use case: You’re working with a fact table (orders) and want to identify orphan records—transactions that reference entities not in your dimension table.
```python
right_joined = customers.join(
    orders,
    on="customer_id",
    how="right"
)
right_joined.show()
```
Output:
```
+-----------+-----+-----------+--------+------+
|customer_id| name|       city|order_id|amount|
+-----------+-----+-----------+--------+------+
|          1|Alice|   New York|     101| 250.0|
|          2|  Bob|Los Angeles|     102| 175.5|
|          2|  Bob|Los Angeles|     103| 89.99|
|          5| NULL|       NULL|     104| 320.0|
|          6| NULL|       NULL|     105| 450.0|
+-----------+-----+-----------+--------+------+
```
Now orders 104 and 105 appear, but with null customer details. Charlie and Diana are gone because they have no orders.
In practice, I rarely use right outer joins. You can always swap the DataFrame order and use a left join instead, which reads more naturally. But right joins exist for cases where DataFrame order is fixed by upstream logic.
Full Outer Join
Full outer joins keep everything from both sides. Any record without a match gets null-filled columns from the other DataFrame.
Typical use case: Data reconciliation. You’re comparing two systems and need to identify records that exist in one but not the other.
```python
full_joined = customers.join(
    orders,
    on="customer_id",
    how="full"  # "outer" and "fullouter" also work
)
full_joined.show()
```
Output:
```
+-----------+-------+-----------+--------+------+
|customer_id|   name|       city|order_id|amount|
+-----------+-------+-----------+--------+------+
|          1|  Alice|   New York|     101| 250.0|
|          2|    Bob|Los Angeles|     102| 175.5|
|          2|    Bob|Los Angeles|     103| 89.99|
|          3|Charlie|    Chicago|    NULL|  NULL|
|          4|  Diana|    Houston|    NULL|  NULL|
|          5|   NULL|       NULL|     104| 320.0|
|          6|   NULL|       NULL|     105| 450.0|
+-----------+-------+-----------+--------+------+
```
Every record from both DataFrames appears. This is useful for auditing but creates larger result sets—use it deliberately.
Handling Null Values and Common Pitfalls
Outer joins introduce nulls by design. Here’s how to handle them and avoid common mistakes.
Filling Nulls with coalesce()
Use coalesce() to replace nulls with default values:
```python
from pyspark.sql.functions import coalesce, lit

result = full_joined.withColumn(
    "amount_clean",
    coalesce("amount", lit(0.0))
).withColumn(
    "name_clean",
    coalesce("name", lit("Unknown"))
)
result.select("customer_id", "name_clean", "amount_clean").show()
```
Avoiding Duplicate Column Names
When join keys have different names, or you join on an expression rather than passing a string to `on`, both key columns survive in the result:
```python
orders_renamed = orders.withColumnRenamed("customer_id", "cust_id")

# Joining on an expression keeps BOTH key columns in the result
bad_join = customers.join(
    orders_renamed,
    customers.customer_id == orders_renamed.cust_id,
    how="left"
)
# bad_join now carries both customer_id and cust_id
```
Fix this by dropping the redundant column:
```python
clean_join = customers.join(
    orders_renamed,
    customers.customer_id == orders_renamed.cust_id,
    how="left"
).drop("cust_id")
```
Or use aliases when column names collide:
```python
from pyspark.sql.functions import col

c = customers.alias("c")
o = orders.alias("o")

aliased_join = c.join(o, col("c.customer_id") == col("o.customer_id"), "left")
aliased_join.select("c.customer_id", "c.name", "o.order_id").show()
```
Performance Optimization
Outer joins on large datasets can be expensive. Two techniques help:
Broadcast small DataFrames: If one DataFrame comfortably fits in each executor's memory (roughly under a few hundred MB), broadcast it so Spark ships a copy to every executor instead of shuffling both sides:
```python
from pyspark.sql.functions import broadcast

# Broadcast whichever side is small; in this toy example that's orders,
# though in production a fact table like orders is usually the large side
optimized = customers.join(
    broadcast(orders),
    on="customer_id",
    how="left"
)
```
Repartition on join keys: For large-to-large joins, repartitioning both DataFrames on the join key can reduce shuffle overhead, especially when the repartitioned DataFrames are cached and reused across several joins (Spark will still shuffle for a sort-merge join otherwise, so measure before relying on this):
```python
customers_repartitioned = customers.repartition(200, "customer_id")
orders_repartitioned = orders.repartition(200, "customer_id")

efficient_join = customers_repartitioned.join(
    orders_repartitioned,
    on="customer_id",
    how="full"
)
```
Conclusion
Outer joins are essential when your data doesn’t perfectly align—which is most of the time in real-world data engineering.
Use left outer when enriching a master dataset with optional attributes. Use right outer when you need to preserve the right side (though swapping order with a left join is usually clearer). Use full outer for reconciliation and auditing where you need visibility into both sides’ orphan records.
Always plan for nulls in your downstream logic, alias or rename columns to avoid ambiguity, and apply broadcast hints when one side is small enough to fit in executor memory. Get these fundamentals right, and your PySpark joins will be both correct and performant.