How to Left Join in Pandas
A left join returns all rows from the left DataFrame and the matched rows from the right DataFrame. When there's no match, the result contains `NaN` values for columns from the right DataFrame.
Key Insights
- Use
pd.merge(left_df, right_df, how='left', on='key')for most left join operations—it’s the most flexible and readable approach - The
join()method is faster for index-based joins but requires more setup; use it when your DataFrames are already indexed appropriately - Always specify
suffixesparameter when joining DataFrames with overlapping column names to avoid confusing_xand_ydefaults
What Is a Left Join?
A left join returns all rows from the left DataFrame and the matched rows from the right DataFrame. When there’s no match, the result contains NaN values for columns from the right DataFrame.
If you’re coming from SQL, this is exactly like SELECT * FROM left_table LEFT JOIN right_table ON left_table.key = right_table.key. The “left” table keeps all its rows, regardless of whether matches exist in the “right” table.
You’ll use left joins constantly in data analysis: enriching transaction data with customer details, adding product information to order records, or combining survey responses with demographic data. The pattern is always the same—you have a primary dataset you want to preserve completely while adding supplementary information where available.
Understanding Left Joins Visually
Consider two DataFrames: one with orders, one with customer information. A left join on customer_id keeps every order while pulling in customer details where they exist.
Left DataFrame (Orders) Right DataFrame (Customers)
+----------+--------+ +-------------+----------+
| order_id | cust_id| | customer_id | name |
+----------+--------+ +-------------+----------+
| 1 | 101 | | 101 | Alice |
| 2 | 102 | | 103 | Charlie |
| 3 | 101 | +-------------+----------+
| 4 | 104 |
+----------+--------+
Result (Left Join on customer_id):
+----------+---------+----------+
| order_id | cust_id | name |
+----------+---------+----------+
| 1 | 101 | Alice |
| 2 | 102 | NaN | <- No match for 102
| 3 | 101 | Alice |
| 4 | 104 | NaN | <- No match for 104
+----------+---------+----------+
Let’s create these sample DataFrames to use throughout the article:
import pandas as pd
# Orders DataFrame (left)
orders = pd.DataFrame({
'order_id': [1, 2, 3, 4],
'customer_id': [101, 102, 101, 104],
'amount': [250.00, 150.00, 300.00, 450.00]
})
# Customers DataFrame (right)
customers = pd.DataFrame({
'customer_id': [101, 103, 105],
'name': ['Alice', 'Charlie', 'Eve'],
'city': ['New York', 'Boston', 'Seattle']
})
print("Orders:")
print(orders)
print("\nCustomers:")
print(customers)
Output:
Orders:
order_id customer_id amount
0 1 101 250.0
1 2 102 150.0
2 3 101 300.0
3 4 104 450.0
Customers:
customer_id name city
0 101 Alice New York
1 103 Charlie Boston
2 105 Eve Seattle
Basic Left Join with merge()
The pd.merge() function is the workhorse for joining DataFrames. It’s explicit, readable, and handles most joining scenarios cleanly.
result = pd.merge(orders, customers, how='left', on='customer_id')
print(result)
Output:
order_id customer_id amount name city
0 1 101 250.0 Alice New York
1 2 102 150.0 NaN NaN
2 3 101 300.0 Alice New York
3 4 104 450.0 NaN NaN
All four orders remain. Customer 101 (Alice) appears twice because she has two orders. Customers 102 and 104 don’t exist in our customers DataFrame, so their name and city values are NaN.
Key Parameters
The on parameter works when both DataFrames share the same column name for the join key. When column names differ, use left_on and right_on:
# Rename column in customers to simulate different naming
customers_renamed = customers.rename(columns={'customer_id': 'cust_id'})
result = pd.merge(
orders,
customers_renamed,
how='left',
left_on='customer_id',
right_on='cust_id'
)
print(result)
Output:
order_id customer_id amount cust_id name city
0 1 101 250.0 101.0 Alice New York
1 2 102 150.0 NaN NaN NaN
2 3 101 300.0 101.0 Alice New York
3 4 104 450.0 NaN NaN NaN
Notice you end up with both customer_id and cust_id columns. You’ll typically drop the redundant one afterward:
result = result.drop(columns=['cust_id'])
Left Join with join() Method
The join() method is an alternative that works directly on DataFrame objects. It joins on indexes by default, which makes it faster but requires your data to be indexed appropriately.
# Set customer_id as index for both DataFrames
orders_indexed = orders.set_index('customer_id')
customers_indexed = customers.set_index('customer_id')
result = orders_indexed.join(customers_indexed, how='left')
print(result)
Output:
order_id amount name city
customer_id
101 1 250.0 Alice New York
102 2 150.0 NaN NaN
101 3 300.0 Alice New York
104 4 450.0 NaN NaN
If you want to join on a column rather than the index, use the on parameter, but the right DataFrame must still be indexed:
# Only right DataFrame needs index
customers_indexed = customers.set_index('customer_id')
result = orders.join(customers_indexed, on='customer_id', how='left')
print(result)
This produces the same result but keeps customer_id as a regular column rather than the index.
When to Use join() vs merge()
Use join() when:
- Your DataFrames are already indexed on the join keys
- You’re doing many joins on the same key (set index once, join multiple times)
- Performance matters and you can preprocess your data
Use merge() when:
- You’re doing ad-hoc analysis
- Join keys are columns, not indexes
- You want explicit, self-documenting code
Handling Common Scenarios
Joining on Multiple Columns
Sometimes a single column isn’t enough to uniquely identify matches. Pass a list to the on parameter:
# Orders with date
orders_dated = pd.DataFrame({
'customer_id': [101, 101, 102],
'order_date': ['2024-01-15', '2024-01-20', '2024-01-15'],
'amount': [250.00, 300.00, 150.00]
})
# Promotions by customer and date
promotions = pd.DataFrame({
'customer_id': [101, 101, 102],
'order_date': ['2024-01-15', '2024-01-25', '2024-01-15'],
'discount': [0.10, 0.15, 0.05]
})
result = pd.merge(
orders_dated,
promotions,
how='left',
on=['customer_id', 'order_date']
)
print(result)
Output:
customer_id order_date amount discount
0 101 2024-01-15 250.0 0.10
1 101 2024-01-20 300.0 NaN
2 102 2024-01-15 150.0 0.05
Dealing with Duplicate Column Names
When both DataFrames have columns with the same name (other than the join key), pandas adds suffixes to distinguish them. The default _x and _y are cryptic—always specify meaningful suffixes:
# Both DataFrames have a 'status' column
orders_status = pd.DataFrame({
'order_id': [1, 2, 3],
'customer_id': [101, 102, 101],
'status': ['shipped', 'pending', 'delivered']
})
customers_status = pd.DataFrame({
'customer_id': [101, 102],
'name': ['Alice', 'Bob'],
'status': ['active', 'inactive']
})
result = pd.merge(
orders_status,
customers_status,
how='left',
on='customer_id',
suffixes=('_order', '_customer')
)
print(result)
Output:
order_id customer_id status_order name status_customer
0 1 101 shipped Alice active
1 2 102 pending Bob inactive
2 3 101 delivered Alice active
Handling NaN Values in Results
After a left join, you’ll often have NaN values where no match existed. Handle them appropriately:
result = pd.merge(orders, customers, how='left', on='customer_id')
# Fill NaN with default values
result['name'] = result['name'].fillna('Unknown')
result['city'] = result['city'].fillna('Not Specified')
# Or check for unmatched rows
unmatched = result[result['name'].isna()]
print(f"Orders without customer data: {len(unmatched)}")
Performance Tips
For DataFrames under 100,000 rows, performance rarely matters. For larger datasets, these optimizations help:
Set indexes before joining: If you’re joining repeatedly on the same key, set it as the index once and use join():
customers_indexed = customers.set_index('customer_id')
# Now multiple joins are faster
result1 = orders1.join(customers_indexed, on='customer_id', how='left')
result2 = orders2.join(customers_indexed, on='customer_id', how='left')
Use appropriate data types: Joining on integers is faster than strings. Convert string IDs to categorical or integer types when possible:
orders['customer_id'] = orders['customer_id'].astype('int32')
customers['customer_id'] = customers['customer_id'].astype('int32')
Sort before joining: For very large DataFrames, sorting both on the join key can improve performance:
orders_sorted = orders.sort_values('customer_id')
customers_sorted = customers.sort_values('customer_id')
result = pd.merge(orders_sorted, customers_sorted, how='left', on='customer_id')
Quick Reference
| Scenario | merge() Syntax | join() Syntax |
|---|---|---|
| Basic left join | pd.merge(left, right, how='left', on='key') |
left.set_index('key').join(right.set_index('key'), how='left') |
| Different column names | pd.merge(left, right, how='left', left_on='a', right_on='b') |
Not directly supported |
| Multiple keys | pd.merge(left, right, how='left', on=['key1', 'key2']) |
left.join(right, on=['key1', 'key2'], how='left') (right must be indexed) |
| Custom suffixes | pd.merge(..., suffixes=('_left', '_right')) |
left.join(..., lsuffix='_left', rsuffix='_right') |
For most use cases, stick with pd.merge(). It’s clearer, more flexible, and the performance difference is negligible for typical datasets. Reserve join() for index-heavy workflows where you’ve already structured your data around specific keys.