How to Outer Join in Pandas
Key Insights
- Outer joins preserve all records from both DataFrames, filling missing values with NaN—use them when you can’t afford to lose data from either source.
- The merge() function with how='outer' is the most flexible approach, while join() works best for index-based operations.
- Always use indicator=True during development to validate your join results and catch unexpected behavior before it becomes a production bug.
Introduction to Outer Joins
An outer join combines two DataFrames while preserving all records from both sides, regardless of whether a matching key exists. When a row from one DataFrame has no corresponding match in the other, pandas fills the missing columns with NaN values.
This differs fundamentally from an inner join, which only keeps rows where keys match in both DataFrames. Inner joins are exclusive—they throw away data. Outer joins are inclusive—they keep everything.
Use outer joins when:
- You need a complete picture of both datasets
- Missing relationships are meaningful information (e.g., customers without orders)
- You’re building reports that must account for all entities
- Data completeness matters more than clean, matched records
The trade-off is obvious: outer joins produce larger result sets with more null values to handle. But when you need comprehensive data, that’s a price worth paying.
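Before setting anything up, a minimal sketch (with invented key and column names) makes the inclusive/exclusive distinction concrete:

```python
import pandas as pd

# Two tiny frames whose keys only partially overlap
left = pd.DataFrame({'key': [1, 2, 3], 'l': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'r': ['x', 'y', 'z']})

inner = pd.merge(left, right, how='inner', on='key')  # keeps only keys 2 and 3
outer = pd.merge(left, right, how='outer', on='key')  # keeps keys 1-4, with NaN gaps

print(len(inner), len(outer))  # 2 4
```

The inner join silently drops keys 1 and 4; the outer join keeps them and fills the missing side with NaN.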
Setting Up Sample DataFrames
Let’s create two DataFrames that simulate a real scenario: customers and their orders. Some customers have placed orders, some haven’t, and some orders reference customers not in our customer table (perhaps deleted or from a legacy system).
import pandas as pd
import numpy as np
# Customer data - our "left" DataFrame
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'signup_date': ['2023-01-15', '2023-02-20', '2023-03-10', '2023-04-05', '2023-05-12']
})
# Order data - our "right" DataFrame
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 2, 6, 7],  # Note: customers 6 and 7 don't exist in customers table
    'amount': [150.00, 200.50, 75.25, 300.00, 125.75],
    'order_date': ['2023-06-01', '2023-06-05', '2023-06-10', '2023-06-15', '2023-06-20']
})
print("Customers:")
print(customers)
print("\nOrders:")
print(orders)
Notice the intentional mismatch: customers 3, 4, and 5 have no orders, while orders exist for customer IDs 6 and 7 who aren’t in our customer table. This setup lets us see exactly how outer joins handle both scenarios.
Full Outer Join with merge()
The pd.merge() function is your primary tool for outer joins. It’s explicit, flexible, and handles most joining scenarios you’ll encounter.
# Full outer join using merge()
full_result = pd.merge(
    customers,
    orders,
    how='outer',
    on='customer_id'
)
print(full_result)
Output:
   customer_id     name signup_date  order_id  amount order_date
0            1    Alice  2023-01-15     101.0  150.00 2023-06-01
1            2      Bob  2023-02-20     102.0  200.50 2023-06-05
2            2      Bob  2023-02-20     103.0   75.25 2023-06-10
3            3  Charlie  2023-03-10       NaN     NaN        NaN
4            4    Diana  2023-04-05       NaN     NaN        NaN
5            5      Eve  2023-05-12       NaN     NaN        NaN
6            6      NaN         NaN     104.0  300.00 2023-06-15
7            7      NaN         NaN     105.0  125.75 2023-06-20
Every customer appears, every order appears. The NaN values tell a story: Charlie, Diana, and Eve haven’t ordered anything, while orders 104 and 105 belong to unknown customers.
When your key columns have different names, use left_on and right_on:
# Rename the column in orders to simulate different naming
orders_renamed = orders.rename(columns={'customer_id': 'cust_id'})
result = pd.merge(
    customers,
    orders_renamed,
    how='outer',
    left_on='customer_id',
    right_on='cust_id'
)
print(result)
This keeps both key columns in the result. If you want to drop the redundant one:
result = result.drop(columns=['cust_id'])
Outer Join with join()
The join() method is more concise but requires index-based joining. It’s ideal when your DataFrames already use meaningful indexes or when you’re working with time series data.
# Prepare DataFrames with customer_id as index
customers_indexed = customers.set_index('customer_id')
orders_indexed = orders.set_index('customer_id')
# Outer join using join()
join_result = customers_indexed.join(orders_indexed, how='outer', lsuffix='_cust', rsuffix='_ord')
print(join_result)
The lsuffix and rsuffix parameters handle column name collisions. Without them, pandas raises an error if both DataFrames share column names (other than the join key).
One key difference: join() defaults to how='left', while merge() defaults to how='inner'. Always specify how='outer' explicitly to avoid confusion.
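A quick sketch of that default difference, using throwaway indexed frames:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]}, index=['k1', 'k2'])
b = pd.DataFrame({'y': [3, 4]}, index=['k2', 'k3'])

# join() without how= is a LEFT join: only a's index labels survive
default_join = a.join(b)
# Spelling out how='outer' keeps k3 as well
outer_join = a.join(b, how='outer')

print(len(default_join), len(outer_join))  # 2 3
```

The silent left-join default is exactly the kind of thing that loses rows without warning, which is why spelling out how='outer' is worth the keystrokes.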
# Reset index if you need customer_id as a column again
join_result_reset = join_result.reset_index()
Use join() when you’re already working with indexed DataFrames. Use merge() for everything else—it’s more explicit and easier to reason about.
Handling NaN Values After Outer Joins
Outer joins guarantee NaN values. You need a strategy for handling them.
Fill with defaults:
# Fill numeric columns with 0, string columns with 'Unknown'
result_filled = full_result.copy()
result_filled['amount'] = result_filled['amount'].fillna(0)
result_filled['order_id'] = result_filled['order_id'].fillna(0).astype(int)
result_filled['name'] = result_filled['name'].fillna('Unknown Customer')
print(result_filled)
Fill with column-specific defaults using a dictionary:
fill_values = {
    'name': 'Unknown',
    'signup_date': 'N/A',
    'order_id': 0,
    'amount': 0.0,
    'order_date': 'N/A'
}
result_clean = full_result.fillna(fill_values)
print(result_clean)
Type considerations matter. NaN is a float value, so integer columns become floats after an outer join introduces nulls. Pandas 1.0+ offers nullable integer types:
# Convert to nullable integer type
result_filled['order_id'] = result_filled['order_id'].astype('Int64') # Capital I for nullable
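If you control the dtypes before the merge, you can sidestep the float cast entirely. A sketch with shortened, hypothetical versions of the frames above:

```python
import pandas as pd

# Hypothetical mini-versions of the customers/orders frames
customers_mini = pd.DataFrame({'customer_id': [1, 2], 'name': ['Alice', 'Bob']})
orders_mini = pd.DataFrame({
    'customer_id': [1],
    'order_id': pd.array([101], dtype='Int64')  # nullable integer up front
})

out = pd.merge(customers_mini, orders_mini, how='outer', on='customer_id')
print(out['order_id'].dtype)  # stays Int64; Bob's unmatched row gets <NA>, not NaN
```

Converting key and ID columns to nullable types before joining keeps them integers throughout the pipeline instead of round-tripping through float64.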
Selective dropping:
# Drop rows where ALL join columns are NaN (completely unmatched)
result_filtered = full_result.dropna(subset=['name', 'order_id'], how='all')
Multi-Key Outer Joins
Real-world joins often require matching on multiple columns. Pass a list to the on parameter:
# Extended example with region-based data
customers_regional = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'region': ['East', 'East', 'West', 'West'],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana']
})
sales_regional = pd.DataFrame({
    'customer_id': [1, 2, 5, 6],
    'region': ['East', 'West', 'East', 'West'],  # Note: Bob's region doesn't match
    'revenue': [1000, 1500, 2000, 2500]
})
# Join on both customer_id AND region
multi_key_result = pd.merge(
    customers_regional,
    sales_regional,
    how='outer',
    on=['customer_id', 'region']
)
print(multi_key_result)
Output:
   customer_id region     name  revenue
0            1   East    Alice   1000.0
1            2   East      Bob      NaN
2            3   West  Charlie      NaN
3            4   West    Diana      NaN
4            2   West      NaN   1500.0
5            5   East      NaN   2000.0
6            6   West      NaN   2500.0
Bob appears twice: once from customers (East region, no matching sale) and once from sales (West region, no matching customer). Multi-key joins require all keys to match, not just some.
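To see Bob's split explicitly, you can rerun the same multi-key merge with indicator=True (the frames are repeated here so the snippet stands alone):

```python
import pandas as pd

customers_regional = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'region': ['East', 'East', 'West', 'West'],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana']
})
sales_regional = pd.DataFrame({
    'customer_id': [1, 2, 5, 6],
    'region': ['East', 'West', 'East', 'West'],
    'revenue': [1000, 1500, 2000, 2500]
})

flagged = pd.merge(customers_regional, sales_regional, how='outer',
                   on=['customer_id', 'region'], indicator=True)
# Customer 2 appears twice: once per side of the region mismatch
bob = flagged[flagged['customer_id'] == 2]
print(bob[['region', '_merge']])
```

The two rows for customer 2 carry left_only and right_only flags, confirming that a partial key match counts as no match at all.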
Common Pitfalls and Best Practices
Pitfall 1: Duplicate column suffixes cluttering your DataFrame
When both DataFrames have columns with the same name (besides the join key), pandas appends _x and _y suffixes by default. Control this:
result = pd.merge(
    customers,
    orders,
    how='outer',
    on='customer_id',
    suffixes=('_customer', '_order')
)
Pitfall 2: Memory explosion with large DataFrames
Outer joins can produce massive results, especially with many-to-many relationships. Check cardinality before joining:
# Sanity check before joining
print(f"Left rows: {len(customers)}")
print(f"Right rows: {len(orders)}")
print(f"Unique keys left: {customers['customer_id'].nunique()}")
print(f"Unique keys right: {orders['customer_id'].nunique()}")
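Beyond eyeballing counts, merge() also accepts a validate parameter that turns a cardinality assumption into a hard error before the join runs; the frame names here are illustrative:

```python
import pandas as pd

customers_chk = pd.DataFrame({'customer_id': [1, 2, 3], 'name': ['A', 'B', 'C']})
orders_chk = pd.DataFrame({'customer_id': [1, 2, 2], 'amount': [10.0, 20.0, 30.0]})

# 'one_to_many' asserts the left-side keys are unique; a duplicate left key
# raises pandas.errors.MergeError instead of silently exploding the result
checked = pd.merge(customers_chk, orders_chk, how='outer',
                   on='customer_id', validate='one_to_many')
print(len(checked))  # 4: one row per order, plus unmatched customer 3
```

Other accepted values are 'one_to_one', 'many_to_one', and 'many_to_many'; in a pipeline this is cheaper insurance than a post-hoc row-count assertion.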
Pitfall 3: Not validating join results
Use indicator=True to understand what happened during the join:
debug_result = pd.merge(
    customers,
    orders,
    how='outer',
    on='customer_id',
    indicator=True
)
print(debug_result)
print("\nJoin breakdown:")
print(debug_result['_merge'].value_counts())
Output:
Join breakdown:
both          3
left_only     3
right_only    2
Name: _merge, dtype: int64
This tells you exactly how many rows matched (both), how many came only from the left DataFrame (left_only), and how many came only from the right (right_only). Use this during development to catch unexpected join behavior.
Best practice: Validate before removing the indicator
# Assert expected behavior in tests or data pipelines
assert debug_result['_merge'].value_counts()['both'] > 0, "No matching records found!"
# Then drop the indicator column for production
final_result = debug_result.drop(columns=['_merge'])
Outer joins are powerful but require careful handling. Always inspect your results, handle NaN values deliberately, and use the indicator parameter until you trust your data completely.