How to Outer Join in Pandas

Key Insights

  • Outer joins preserve all records from both DataFrames, filling missing values with NaN—use them when you can’t afford to lose data from either source.
  • The merge() function with how='outer' is the most flexible approach, while join() works best for index-based operations.
  • Always use indicator=True during development to validate your join results and catch unexpected behavior before it becomes a production bug.

Introduction to Outer Joins

An outer join combines two DataFrames while preserving all records from both sides, regardless of whether a matching key exists. When a row from one DataFrame has no corresponding match in the other, pandas fills the missing columns with NaN values.

This differs fundamentally from an inner join, which only keeps rows where keys match in both DataFrames. Inner joins are exclusive—they throw away data. Outer joins are inclusive—they keep everything.

Use outer joins when:

  • You need a complete picture of both datasets
  • Missing relationships are meaningful information (e.g., customers without orders)
  • You’re building reports that must account for all entities
  • Data completeness matters more than clean, matched records

The trade-off is obvious: outer joins produce larger result sets with more null values to handle. But when you need comprehensive data, that’s a price worth paying.
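To make that trade-off concrete, here is a minimal sketch (with hypothetical single-key DataFrames) comparing how many rows an inner join and an outer join return from the same inputs:

```python
import pandas as pd

# Two tiny DataFrames with partially overlapping keys (hypothetical data)
left = pd.DataFrame({'key': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

inner = pd.merge(left, right, how='inner', on='key')  # only keys 2 and 3
outer = pd.merge(left, right, how='outer', on='key')  # keys 1-4, NaN gaps

print(len(inner))  # 2 rows - unmatched keys 1 and 4 are discarded
print(len(outer))  # 4 rows - every key survives; missing sides become NaN
```

The outer result is never smaller than the inner one; the extra rows are exactly the unmatched keys, padded with NaN.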

Setting Up Sample DataFrames

Let’s create two DataFrames that simulate a real scenario: customers and their orders. Some customers have placed orders, some haven’t, and some orders reference customers not in our customer table (perhaps deleted or from a legacy system).

import pandas as pd
import numpy as np

# Customer data - our "left" DataFrame
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'signup_date': ['2023-01-15', '2023-02-20', '2023-03-10', '2023-04-05', '2023-05-12']
})

# Order data - our "right" DataFrame
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 2, 6, 7],  # Note: customers 6 and 7 don't exist in customers table
    'amount': [150.00, 200.50, 75.25, 300.00, 125.75],
    'order_date': ['2023-06-01', '2023-06-05', '2023-06-10', '2023-06-15', '2023-06-20']
})

print("Customers:")
print(customers)
print("\nOrders:")
print(orders)

Notice the intentional mismatch: customers 3, 4, and 5 have no orders, while orders exist for customer IDs 6 and 7 who aren’t in our customer table. This setup lets us see exactly how outer joins handle both scenarios.

Full Outer Join with merge()

The pd.merge() function is your primary tool for outer joins. It’s explicit, flexible, and handles most joining scenarios you’ll encounter.

# Full outer join using merge()
full_result = pd.merge(
    customers, 
    orders, 
    how='outer', 
    on='customer_id'
)

print(full_result)

Output:

   customer_id     name signup_date  order_id  amount  order_date
0            1    Alice  2023-01-15     101.0  150.00  2023-06-01
1            2      Bob  2023-02-20     102.0  200.50  2023-06-05
2            2      Bob  2023-02-20     103.0   75.25  2023-06-10
3            3  Charlie  2023-03-10       NaN     NaN         NaN
4            4    Diana  2023-04-05       NaN     NaN         NaN
5            5      Eve  2023-05-12       NaN     NaN         NaN
6            6      NaN         NaN     104.0  300.00  2023-06-15
7            7      NaN         NaN     105.0  125.75  2023-06-20

Every customer appears, every order appears. The NaN values tell a story: Charlie, Diana, and Eve haven’t ordered anything, while orders 104 and 105 belong to unknown customers.
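Those NaN gaps are easy to query directly. A sketch that rebuilds a trimmed version of the sample data and pulls out each unmatched side:

```python
import pandas as pd

# Trimmed versions of the sample DataFrames from above
customers = pd.DataFrame({'customer_id': [1, 2, 3, 4, 5],
                          'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve']})
orders = pd.DataFrame({'order_id': [101, 102, 103, 104, 105],
                       'customer_id': [1, 2, 2, 6, 7],
                       'amount': [150.00, 200.50, 75.25, 300.00, 125.75]})
full_result = pd.merge(customers, orders, how='outer', on='customer_id')

# Customers with no orders: the order columns are NaN on those rows
no_orders = full_result[full_result['order_id'].isna()]
print(no_orders[['customer_id', 'name']])          # Charlie, Diana, Eve

# Orders whose customer is unknown: the customer columns are NaN
orphan_orders = full_result[full_result['name'].isna()]
print(orphan_orders[['customer_id', 'order_id']])  # orders 104 and 105
```

Filtering on `isna()` in a column that only the other side supplies is a quick way to isolate each unmatched group.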

When your key columns have different names, use left_on and right_on:

# Rename the column in orders to simulate different naming
orders_renamed = orders.rename(columns={'customer_id': 'cust_id'})

result = pd.merge(
    customers,
    orders_renamed,
    how='outer',
    left_on='customer_id',
    right_on='cust_id'
)

print(result)

This keeps both key columns in the result. If you want to drop the redundant one:

result = result.drop(columns=['cust_id'])

Outer Join with join()

The join() method is more concise but requires index-based joining. It’s ideal when your DataFrames already use meaningful indexes or when you’re working with time series data.

# Prepare DataFrames with customer_id as index
customers_indexed = customers.set_index('customer_id')
orders_indexed = orders.set_index('customer_id')

# Outer join using join()
join_result = customers_indexed.join(orders_indexed, how='outer', lsuffix='_cust', rsuffix='_ord')

print(join_result)

The lsuffix and rsuffix parameters handle column name collisions. Because join() matches on the index, any shared column name raises a ValueError unless you supply suffixes. (Our two sample DataFrames happen to have no overlapping columns once customer_id becomes the index, so the suffixes here are purely defensive.)

One key difference: join() defaults to how='left', while merge() defaults to how='inner'. Always specify how='outer' explicitly to avoid confusion.
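A minimal sketch of those differing defaults, using two hypothetical indexed DataFrames:

```python
import pandas as pd

# Two tiny indexed DataFrames (hypothetical data) with one overlapping index
left = pd.DataFrame({'a': [10, 20]}, index=[1, 2])
right = pd.DataFrame({'b': [30, 40]}, index=[2, 3])

print(len(left.join(right)))            # 2 - join() defaults to how='left'

inner_default = pd.merge(left, right, left_index=True, right_index=True)
print(len(inner_default))               # 1 - merge() defaults to how='inner'

print(len(left.join(right, how='outer')))  # 3 - explicit outer keeps all keys
```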

# Reset index if you need customer_id as a column again
join_result_reset = join_result.reset_index()

Use join() when you’re already working with indexed DataFrames. Use merge() for everything else—it’s more explicit and easier to reason about.

Handling NaN Values After Outer Joins

Unless every key matches on both sides, an outer join introduces NaN values. You need a strategy for handling them.

Fill with defaults:

# Fill numeric columns with 0, string columns with 'Unknown'
result_filled = full_result.copy()
result_filled['amount'] = result_filled['amount'].fillna(0)
result_filled['order_id'] = result_filled['order_id'].fillna(0).astype(int)
result_filled['name'] = result_filled['name'].fillna('Unknown Customer')

print(result_filled)

Fill with column-specific defaults using a dictionary:

fill_values = {
    'name': 'Unknown',
    'signup_date': 'N/A',
    'order_id': 0,
    'amount': 0.0,
    'order_date': 'N/A'
}

result_clean = full_result.fillna(fill_values)
print(result_clean)

Type considerations matter. NaN is a float value, so integer columns become floats after an outer join introduces nulls. Pandas 1.0+ offers nullable integer types:

# Convert to nullable integer type
result_filled['order_id'] = result_filled['order_id'].astype('Int64')  # Capital I for nullable

Selective dropping:

# Drop rows where both 'name' and 'order_id' are NaN,
# i.e. rows that matched on neither side
result_filtered = full_result.dropna(subset=['name', 'order_id'], how='all')

Multi-Key Outer Joins

Real-world joins often require matching on multiple columns. Pass a list to the on parameter:

# Extended example with region-based data
customers_regional = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'region': ['East', 'East', 'West', 'West'],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana']
})

sales_regional = pd.DataFrame({
    'customer_id': [1, 2, 5, 6],
    'region': ['East', 'West', 'East', 'West'],  # Note: Bob's region doesn't match
    'revenue': [1000, 1500, 2000, 2500]
})

# Join on both customer_id AND region
multi_key_result = pd.merge(
    customers_regional,
    sales_regional,
    how='outer',
    on=['customer_id', 'region']
)

print(multi_key_result)

Output:

   customer_id region     name  revenue
0            1   East    Alice   1000.0
1            2   East      Bob      NaN
2            3   West  Charlie      NaN
3            4   West    Diana      NaN
4            2   West      NaN   1500.0
5            5   East      NaN   2000.0
6            6   West      NaN   2500.0

Bob appears twice: once from customers (East region, no matching sale) and once from sales (West region, no matching customer). Multi-key joins require all keys to match, not just some.
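The indicator parameter (covered in the pitfalls section below) makes this partial-match behavior easy to verify. A sketch reusing the regional DataFrames:

```python
import pandas as pd

customers_regional = pd.DataFrame({'customer_id': [1, 2, 3, 4],
                                   'region': ['East', 'East', 'West', 'West'],
                                   'name': ['Alice', 'Bob', 'Charlie', 'Diana']})
sales_regional = pd.DataFrame({'customer_id': [1, 2, 5, 6],
                               'region': ['East', 'West', 'East', 'West'],
                               'revenue': [1000, 1500, 2000, 2500]})

checked = pd.merge(customers_regional, sales_regional,
                   how='outer', on=['customer_id', 'region'], indicator=True)

# Bob's customer_id matches but his region doesn't, so his customer row
# lands in left_only and his sale lands in right_only - never in 'both'
print(checked['_merge'].value_counts())  # both: 1, left_only: 3, right_only: 3
```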

Common Pitfalls and Best Practices

Pitfall 1: Duplicate column suffixes cluttering your DataFrame

When both DataFrames have columns with the same name (besides the join key), pandas appends _x and _y suffixes by default. Control this:

result = pd.merge(
    customers,
    orders,
    how='outer',
    on='customer_id',
    suffixes=('_customer', '_order')
)

Pitfall 2: Memory explosion with large DataFrames

Outer joins can produce massive results, especially with many-to-many relationships. Check cardinality before joining:

# Sanity check before joining
print(f"Left rows: {len(customers)}")
print(f"Right rows: {len(orders)}")
print(f"Unique keys left: {customers['customer_id'].nunique()}")
print(f"Unique keys right: {orders['customer_id'].nunique()}")

Pitfall 3: Not validating join results

Use indicator=True to understand what happened during the join:

debug_result = pd.merge(
    customers,
    orders,
    how='outer',
    on='customer_id',
    indicator=True
)

print(debug_result)
print("\nJoin breakdown:")
print(debug_result['_merge'].value_counts())

Output:

Join breakdown:
both          3
left_only     3
right_only    2
Name: _merge, dtype: int64

This tells you exactly how many rows matched (both), how many came only from the left DataFrame (left_only), and how many came only from the right (right_only). Use this during development to catch unexpected join behavior.

Best practice: Validate before removing the indicator

# Assert expected behavior in tests or data pipelines
assert debug_result['_merge'].value_counts()['both'] > 0, "No matching records found!"

# Then drop the indicator column for production
final_result = debug_result.drop(columns=['_merge'])

Outer joins are powerful but require careful handling. Always inspect your results, handle NaN values deliberately, and use the indicator parameter until you trust your data completely.
