How to Left Join in Pandas

A left join returns all rows from the left DataFrame and the matched rows from the right DataFrame. When there's no match, the result contains `NaN` values for columns from the right DataFrame.

Key Insights

  • Use pd.merge(left_df, right_df, how='left', on='key') for most left join operations—it’s the most flexible and readable approach
  • The join() method is faster for index-based joins but requires more setup; use it when your DataFrames are already indexed appropriately
  • Always specify suffixes parameter when joining DataFrames with overlapping column names to avoid confusing _x and _y defaults

What Is a Left Join?

A left join returns all rows from the left DataFrame and the matched rows from the right DataFrame. When there’s no match, the result contains NaN values for columns from the right DataFrame.

If you’re coming from SQL, this is exactly like SELECT * FROM left_table LEFT JOIN right_table ON left_table.key = right_table.key. The “left” table keeps all its rows, regardless of whether matches exist in the “right” table.

You’ll use left joins constantly in data analysis: enriching transaction data with customer details, adding product information to order records, or combining survey responses with demographic data. The pattern is always the same—you have a primary dataset you want to preserve completely while adding supplementary information where available.

Understanding Left Joins Visually

Consider two DataFrames: one with orders, one with customer information. A left join on customer_id keeps every order while pulling in customer details where they exist.

Left DataFrame (Orders)     Right DataFrame (Customers)
+----------+--------+       +-------------+----------+
| order_id | cust_id|       | customer_id | name     |
+----------+--------+       +-------------+----------+
| 1        | 101    |       | 101         | Alice    |
| 2        | 102    |       | 103         | Charlie  |
| 3        | 101    |       +-------------+----------+
| 4        | 104    |
+----------+--------+

Result (Left Join on customer_id):
+----------+---------+----------+
| order_id | cust_id | name     |
+----------+---------+----------+
| 1        | 101     | Alice    |
| 2        | 102     | NaN      |  <- No match for 102
| 3        | 101     | Alice    |
| 4        | 104     | NaN      |  <- No match for 104
+----------+---------+----------+

Let’s create these sample DataFrames to use throughout the article:

import pandas as pd

# Orders DataFrame (left)
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'customer_id': [101, 102, 101, 104],
    'amount': [250.00, 150.00, 300.00, 450.00]
})

# Customers DataFrame (right)
customers = pd.DataFrame({
    'customer_id': [101, 103, 105],
    'name': ['Alice', 'Charlie', 'Eve'],
    'city': ['New York', 'Boston', 'Seattle']
})

print("Orders:")
print(orders)
print("\nCustomers:")
print(customers)

Output:

Orders:
   order_id  customer_id  amount
0         1          101   250.0
1         2          102   150.0
2         3          101   300.0
3         4          104   450.0

Customers:
   customer_id     name      city
0          101    Alice  New York
1          103  Charlie    Boston
2          105      Eve   Seattle

Basic Left Join with merge()

The pd.merge() function is the workhorse for joining DataFrames. It’s explicit, readable, and handles most joining scenarios cleanly.

result = pd.merge(orders, customers, how='left', on='customer_id')
print(result)

Output:

   order_id  customer_id  amount   name      city
0         1          101   250.0  Alice  New York
1         2          102   150.0    NaN       NaN
2         3          101   300.0  Alice  New York
3         4          104   450.0    NaN       NaN

All four orders remain. Customer 101 (Alice) appears twice because she has two orders. Customers 102 and 104 don’t exist in our customers DataFrame, so their name and city values are NaN.

Key Parameters

The on parameter works when both DataFrames share the same column name for the join key. When column names differ, use left_on and right_on:

# Rename column in customers to simulate different naming
customers_renamed = customers.rename(columns={'customer_id': 'cust_id'})

result = pd.merge(
    orders, 
    customers_renamed, 
    how='left', 
    left_on='customer_id', 
    right_on='cust_id'
)
print(result)

Output:

   order_id  customer_id  amount  cust_id   name      city
0         1          101   250.0    101.0  Alice  New York
1         2          102   150.0      NaN    NaN       NaN
2         3          101   300.0    101.0  Alice  New York
3         4          104   450.0      NaN    NaN       NaN

Notice you end up with both customer_id and cust_id columns. You’ll typically drop the redundant one afterward:

result = result.drop(columns=['cust_id'])

Left Join with join() Method

The join() method is an alternative that works directly on DataFrame objects. It joins on indexes by default, which makes it faster but requires your data to be indexed appropriately.

# Set customer_id as index for both DataFrames
orders_indexed = orders.set_index('customer_id')
customers_indexed = customers.set_index('customer_id')

result = orders_indexed.join(customers_indexed, how='left')
print(result)

Output:

             order_id  amount   name      city
customer_id                                   
101                 1   250.0  Alice  New York
102                 2   150.0    NaN       NaN
101                 3   300.0  Alice  New York
104                 4   450.0    NaN       NaN

If you want to join on a column rather than the index, use the on parameter, but the right DataFrame must still be indexed:

# Only right DataFrame needs index
customers_indexed = customers.set_index('customer_id')

result = orders.join(customers_indexed, on='customer_id', how='left')
print(result)

This produces the same result but keeps customer_id as a regular column rather than the index.

When to Use join() vs merge()

Use join() when:

  • Your DataFrames are already indexed on the join keys
  • You’re doing many joins on the same key (set index once, join multiple times)
  • Performance matters and you can preprocess your data

Use merge() when:

  • You’re doing ad-hoc analysis
  • Join keys are columns, not indexes
  • You want explicit, self-documenting code

Handling Common Scenarios

Joining on Multiple Columns

Sometimes a single column isn’t enough to uniquely identify matches. Pass a list to the on parameter:

# Orders with date
orders_dated = pd.DataFrame({
    'customer_id': [101, 101, 102],
    'order_date': ['2024-01-15', '2024-01-20', '2024-01-15'],
    'amount': [250.00, 300.00, 150.00]
})

# Promotions by customer and date
promotions = pd.DataFrame({
    'customer_id': [101, 101, 102],
    'order_date': ['2024-01-15', '2024-01-25', '2024-01-15'],
    'discount': [0.10, 0.15, 0.05]
})

result = pd.merge(
    orders_dated, 
    promotions, 
    how='left', 
    on=['customer_id', 'order_date']
)
print(result)

Output:

   customer_id  order_date  amount  discount
0          101  2024-01-15   250.0      0.10
1          101  2024-01-20   300.0       NaN
2          102  2024-01-15   150.0      0.05

Dealing with Duplicate Column Names

When both DataFrames have columns with the same name (other than the join key), pandas adds suffixes to distinguish them. The default _x and _y are cryptic—always specify meaningful suffixes:

# Both DataFrames have a 'status' column
orders_status = pd.DataFrame({
    'order_id': [1, 2, 3],
    'customer_id': [101, 102, 101],
    'status': ['shipped', 'pending', 'delivered']
})

customers_status = pd.DataFrame({
    'customer_id': [101, 102],
    'name': ['Alice', 'Bob'],
    'status': ['active', 'inactive']
})

result = pd.merge(
    orders_status, 
    customers_status, 
    how='left', 
    on='customer_id',
    suffixes=('_order', '_customer')
)
print(result)

Output:

   order_id  customer_id status_order   name status_customer
0         1          101      shipped  Alice          active
1         2          102      pending    Bob        inactive
2         3          101    delivered  Alice          active

Handling NaN Values in Results

After a left join, you’ll often have NaN values where no match existed. Handle them appropriately:

result = pd.merge(orders, customers, how='left', on='customer_id')

# Fill NaN with default values
result['name'] = result['name'].fillna('Unknown')
result['city'] = result['city'].fillna('Not Specified')

# Or check for unmatched rows
unmatched = result[result['name'].isna()]
print(f"Orders without customer data: {len(unmatched)}")

Performance Tips

For DataFrames under 100,000 rows, performance rarely matters. For larger datasets, these optimizations help:

Set indexes before joining: If you’re joining repeatedly on the same key, set it as the index once and use join():

customers_indexed = customers.set_index('customer_id')
# Now multiple joins are faster
result1 = orders1.join(customers_indexed, on='customer_id', how='left')
result2 = orders2.join(customers_indexed, on='customer_id', how='left')

Use appropriate data types: Joining on integers is faster than strings. Convert string IDs to categorical or integer types when possible:

orders['customer_id'] = orders['customer_id'].astype('int32')
customers['customer_id'] = customers['customer_id'].astype('int32')

Sort before joining: For very large DataFrames, sorting both on the join key can improve performance:

orders_sorted = orders.sort_values('customer_id')
customers_sorted = customers.sort_values('customer_id')
result = pd.merge(orders_sorted, customers_sorted, how='left', on='customer_id')

Quick Reference

Scenario merge() Syntax join() Syntax
Basic left join pd.merge(left, right, how='left', on='key') left.set_index('key').join(right.set_index('key'), how='left')
Different column names pd.merge(left, right, how='left', left_on='a', right_on='b') Not directly supported
Multiple keys pd.merge(left, right, how='left', on=['key1', 'key2']) left.join(right, on=['key1', 'key2'], how='left') (right must be indexed)
Custom suffixes pd.merge(..., suffixes=('_left', '_right')) left.join(..., lsuffix='_left', rsuffix='_right')

For most use cases, stick with pd.merge(). It’s clearer, more flexible, and the performance difference is negligible for typical datasets. Reserve join() for index-heavy workflows where you’ve already structured your data around specific keys.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.