Pandas - Merge on Multiple Columns

Key Insights

  • Merging on multiple columns in Pandas requires matching values across all specified columns simultaneously, enabling complex joins based on composite keys like (customer_id, date) or (store, product).
  • The on parameter accepts a list of column names for multi-column merges, while left_on and right_on handle cases where column names differ between DataFrames.
  • Multi-column merges cost more than single-column merges because every comparison spans the full composite key; for large datasets (roughly 100K rows and up), categorical dtypes and indexed lookups help.

Basic Multi-Column Merge Syntax

Merging on multiple columns follows the same syntax as single-column merges, but passes a list to the on parameter. This creates a composite key where all specified columns must match for rows to join.

import pandas as pd

# Sales data
sales = pd.DataFrame({
    'store_id': [101, 101, 102, 102, 103],
    'product_id': ['A', 'B', 'A', 'C', 'B'],
    'quantity': [50, 30, 45, 20, 35],
    'date': ['2024-01-15', '2024-01-15', '2024-01-15', '2024-01-16', '2024-01-16']
})

# Pricing data
prices = pd.DataFrame({
    'store_id': [101, 101, 102, 102, 103],
    'product_id': ['A', 'B', 'A', 'C', 'A'],
    'price': [10.50, 15.00, 10.00, 12.50, 11.00]
})

# Merge on both store_id and product_id
result = pd.merge(sales, prices, on=['store_id', 'product_id'], how='inner')
print(result)
   store_id product_id  quantity        date  price
0       101          A        50  2024-01-15  10.50
1       101          B        30  2024-01-15  15.00
2       102          A        45  2024-01-15  10.00
3       102          C        20  2024-01-16  12.50

Notice that store 103, product B doesn’t appear because there’s no matching price record for that combination.
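To see exactly which composite keys failed to match, `merge` accepts an `indicator` parameter that tags each result row with its origin. A sketch reusing the sales and prices frames from above (the `date` column is omitted for brevity):

```python
import pandas as pd

sales = pd.DataFrame({
    'store_id': [101, 101, 102, 102, 103],
    'product_id': ['A', 'B', 'A', 'C', 'B'],
    'quantity': [50, 30, 45, 20, 35],
})
prices = pd.DataFrame({
    'store_id': [101, 101, 102, 102, 103],
    'product_id': ['A', 'B', 'A', 'C', 'A'],
    'price': [10.50, 15.00, 10.00, 12.50, 11.00],
})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
diag = pd.merge(sales, prices, on=['store_id', 'product_id'],
                how='outer', indicator=True)
unmatched = diag[diag['_merge'] != 'both']
print(unmatched[['store_id', 'product_id', '_merge']])
```

Here (103, B) shows up as left_only and (103, A) as right_only, making the missing price record easy to spot.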

Handling Different Column Names

When merging columns have different names across DataFrames, use left_on and right_on with lists of equal length. The order matters—first element in left_on matches with first in right_on, and so on.

orders = pd.DataFrame({
    'order_store': [101, 102, 103],
    'order_product': ['A', 'B', 'C'],
    'units': [100, 150, 200]
})

inventory = pd.DataFrame({
    'warehouse_store': [101, 102, 103],
    'warehouse_product': ['A', 'B', 'D'],
    'stock': [500, 300, 400]
})

merged = pd.merge(
    orders, 
    inventory,
    left_on=['order_store', 'order_product'],
    right_on=['warehouse_store', 'warehouse_product'],
    how='left'
)

print(merged)
   order_store order_product  units  warehouse_store warehouse_product  stock
0          101             A    100            101.0                 A  500.0
1          102             B    150            102.0                 B  300.0
2          103             C    200              NaN               NaN    NaN

This creates duplicate columns for the merge keys. Clean them up by dropping redundant columns:

merged = merged.drop(columns=['warehouse_store', 'warehouse_product'])
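An alternative that avoids the duplicate key columns entirely is to rename the right-hand keys before merging, so both frames share names and a plain `on` list works. A sketch with the same orders and inventory frames:

```python
import pandas as pd

orders = pd.DataFrame({
    'order_store': [101, 102, 103],
    'order_product': ['A', 'B', 'C'],
    'units': [100, 150, 200],
})
inventory = pd.DataFrame({
    'warehouse_store': [101, 102, 103],
    'warehouse_product': ['A', 'B', 'D'],
    'stock': [500, 300, 400],
})

# Align the key names up front, then merge on the shared names
inventory_renamed = inventory.rename(columns={
    'warehouse_store': 'order_store',
    'warehouse_product': 'order_product',
})
merged = pd.merge(orders, inventory_renamed,
                  on=['order_store', 'order_product'], how='left')
print(merged)
```

Because the keys now share names, the result carries a single pair of key columns and no cleanup step is needed.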

Join Types with Multiple Columns

The four join types (inner, left, right, outer) work identically with multiple columns, but understanding the behavior becomes critical when dealing with partial matches.

df1 = pd.DataFrame({
    'region': ['North', 'North', 'South', 'East'],
    'category': ['A', 'B', 'A', 'C'],
    'sales': [100, 200, 150, 300]
})

df2 = pd.DataFrame({
    'region': ['North', 'South', 'South', 'West'],
    'category': ['A', 'A', 'B', 'A'],
    'target': [120, 180, 90, 200]
})

# Inner join - only exact matches on both columns
inner = pd.merge(df1, df2, on=['region', 'category'], how='inner')
print("Inner:\n", inner)

# Outer join - all combinations, NaN where no match
outer = pd.merge(df1, df2, on=['region', 'category'], how='outer')
print("\nOuter:\n", outer)
Inner:
   region category  sales  target
0  North        A    100     120
1  South        A    150     180

Outer:
   region category  sales  target
0  North        A  100.0   120.0
1  North        B  200.0     NaN
2  South        A  150.0   180.0
3  South        B    NaN    90.0
4   East        C  300.0     NaN
5   West        A    NaN   200.0
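A common follow-up is an anti-join: keep only the df1 rows that have no target at all. One way to sketch it is to build a composite key with `pd.MultiIndex.from_frame` and filter with `isin` (same df1 and df2 as above):

```python
import pandas as pd

df1 = pd.DataFrame({
    'region': ['North', 'North', 'South', 'East'],
    'category': ['A', 'B', 'A', 'C'],
    'sales': [100, 200, 150, 300],
})
df2 = pd.DataFrame({
    'region': ['North', 'South', 'South', 'West'],
    'category': ['A', 'A', 'B', 'A'],
    'target': [120, 180, 90, 200],
})

# Treat (region, category) as a single composite key on each side
keys1 = pd.MultiIndex.from_frame(df1[['region', 'category']])
keys2 = pd.MultiIndex.from_frame(df2[['region', 'category']])

# Keep df1 rows whose key never appears in df2
missing = df1[~keys1.isin(list(keys2))]
print(missing)
```

This returns the (North, B) and (East, C) rows, the two sales records with no target.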

Merging with Time-Based Columns

When one of your merge columns is a date or timestamp, ensure consistent data types before merging. A dtype mismatch (for example, datetime64 on one side and plain strings on the other) raises a ValueError in recent pandas versions and silently produces zero matches in older ones.
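A minimal sketch of the fix, assuming the right-hand frame stores its dates as strings: normalize with pd.to_datetime before merging so both sides of the composite key agree.

```python
import pandas as pd

left = pd.DataFrame({
    'id': [1, 2],
    'day': pd.to_datetime(['2024-01-15', '2024-01-16']),  # datetime64[ns]
})
right = pd.DataFrame({
    'id': [1, 2],
    'day': ['2024-01-15', '2024-01-16'],  # plain strings
    'note': ['x', 'y'],
})

# Convert the string column so both 'day' columns are datetime64;
# merging datetime64 against object either errors or matches nothing,
# depending on the pandas version
right['day'] = pd.to_datetime(right['day'])
out = pd.merge(left, right, on=['id', 'day'], how='inner')
print(len(out))
```

After the conversion both rows match; without it the inner join could come back empty with no warning on older versions.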

transactions = pd.DataFrame({
    'customer_id': [1001, 1001, 1002, 1003],
    'transaction_date': pd.to_datetime(['2024-01-15', '2024-01-20', '2024-01-15', '2024-01-18']),
    'amount': [250.00, 175.50, 400.00, 125.00]
})

promotions = pd.DataFrame({
    'customer_id': [1001, 1002, 1003],
    'promo_date': pd.to_datetime(['2024-01-15', '2024-01-15', '2024-01-18']),
    'discount_pct': [10, 15, 5]
})

# Merge on customer and exact date match
result = pd.merge(
    transactions,
    promotions,
    left_on=['customer_id', 'transaction_date'],
    right_on=['customer_id', 'promo_date'],
    how='left'
)

# Calculate discounted amount
result['discount'] = result['amount'] * result['discount_pct'].fillna(0) / 100
result['final_amount'] = result['amount'] - result['discount']

print(result[['customer_id', 'transaction_date', 'amount', 'discount_pct', 'final_amount']])
   customer_id transaction_date  amount  discount_pct  final_amount
0         1001       2024-01-15  250.00          10.0        225.00
1         1001       2024-01-20  175.50           NaN        175.50
2         1002       2024-01-15  400.00          15.0        340.00
3         1003       2024-01-18  125.00           5.0        118.75
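When a promotion should apply to the most recent prior date rather than an exact date match, `pd.merge_asof` with `by='customer_id'` is a natural extension. A sketch under the assumption that "latest promo on or before the transaction" is the desired rule; both frames must be sorted on their time keys:

```python
import pandas as pd

transactions = pd.DataFrame({
    'customer_id': [1001, 1001, 1002],
    'transaction_date': pd.to_datetime(['2024-01-15', '2024-01-20', '2024-01-15']),
    'amount': [250.00, 175.50, 400.00],
}).sort_values('transaction_date')

promotions = pd.DataFrame({
    'customer_id': [1001, 1002],
    'promo_date': pd.to_datetime(['2024-01-15', '2024-01-15']),
    'discount_pct': [10, 15],
}).sort_values('promo_date')

# For each transaction, pick the latest promo on or before its date,
# matching customers exactly via `by`
result = pd.merge_asof(
    transactions, promotions,
    left_on='transaction_date', right_on='promo_date',
    by='customer_id', direction='backward',
)
print(result[['customer_id', 'transaction_date', 'discount_pct']])
```

Unlike the exact-date merge above, customer 1001's 2024-01-20 transaction now picks up the January 15 promotion instead of getting NaN.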

Performance Optimization Strategies

Multi-column merges are computationally expensive. For large datasets, these optimizations make a significant difference:

import numpy as np

# Create large test dataset
n_rows = 1_000_000
df_large = pd.DataFrame({
    'store': np.random.randint(1, 100, n_rows),
    'product': np.random.randint(1, 1000, n_rows),
    'sales': np.random.rand(n_rows) * 1000
})

df_reference = pd.DataFrame({
    'store': np.random.randint(1, 100, 10000),
    'product': np.random.randint(1, 1000, 10000),
    'cost': np.random.rand(10000) * 500
})

# Strategy 1: Convert to categorical for repeated values
df_large['store'] = df_large['store'].astype('category')
df_large['product'] = df_large['product'].astype('category')
df_reference['store'] = df_reference['store'].astype('category')
df_reference['product'] = df_reference['product'].astype('category')

# Strategy 2: Set a multi-level index on the reference data
# (drop duplicate composite keys first so each lookup returns one row;
# randomly generated keys almost certainly collide)
df_reference = df_reference.drop_duplicates(['store', 'product'])
df_reference_indexed = df_reference.set_index(['store', 'product'])

# Use join instead of merge when possible
result = df_large.join(
    df_reference_indexed,
    on=['store', 'product'],
    how='left'
)

# Strategy 3: Filter before merging
high_value_sales = df_large[df_large['sales'] > 500]
result_filtered = pd.merge(
    high_value_sales,
    df_reference,
    on=['store', 'product']
)

Setting a multi-index on the reference DataFrame and using .join() typically runs 2-3x faster than standard merge operations for lookups.
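A rough way to check that claim on your own data (timings vary with machine, data shape, and pandas version; this sketch uses smaller frames and a deduplicated reference so both paths return one row per input row):

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200_000
left = pd.DataFrame({
    'store': rng.integers(1, 100, n),
    'product': rng.integers(1, 1000, n),
    'sales': rng.random(n) * 1000,
})
ref = pd.DataFrame({
    'store': rng.integers(1, 100, 5000),
    'product': rng.integers(1, 1000, 5000),
    'cost': rng.random(5000) * 500,
}).drop_duplicates(['store', 'product'])

# Path 1: plain merge on the composite key
t0 = time.perf_counter()
via_merge = pd.merge(left, ref, on=['store', 'product'], how='left')
merge_s = time.perf_counter() - t0

# Path 2: build the MultiIndex once, then join against it
ref_idx = ref.set_index(['store', 'product'])
t0 = time.perf_counter()
via_join = left.join(ref_idx, on=['store', 'product'], how='left')
join_s = time.perf_counter() - t0

print(f"merge: {merge_s:.3f}s  join: {join_s:.3f}s")
```

The join path pays the indexing cost once, so it shines when the same reference table is probed repeatedly.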

Handling Duplicate Keys

When multiple rows share the same composite key, merges create a Cartesian product for those keys. This often indicates data quality issues but is sometimes intentional.

shipments = pd.DataFrame({
    'warehouse': ['W1', 'W1', 'W2'],
    'product': ['SKU100', 'SKU100', 'SKU200'],
    'shipment_id': ['S001', 'S002', 'S003'],
    'quantity': [50, 30, 40]
})

defects = pd.DataFrame({
    'warehouse': ['W1', 'W1', 'W2'],
    'product': ['SKU100', 'SKU100', 'SKU200'],
    'defect_type': ['damaged', 'missing_parts', 'damaged'],
    'count': [5, 2, 3]
})

result = pd.merge(shipments, defects, on=['warehouse', 'product'])
print(result)
print(f"\nOriginal shipments: {len(shipments)}, Result rows: {len(result)}")
  warehouse  product shipment_id  quantity    defect_type  count
0        W1  SKU100        S001        50        damaged      5
1        W1  SKU100        S001        50  missing_parts      2
2        W1  SKU100        S002        30        damaged      5
3        W1  SKU100        S002        30  missing_parts      2
4        W2  SKU200        S003        40        damaged      3

Original shipments: 3, Result rows: 5

Each shipment from W1/SKU100 matches both defect records, creating 4 rows from 2 original shipment rows. Validate this behavior matches your business logic or aggregate before merging:

# Aggregate defects first
defects_agg = defects.groupby(['warehouse', 'product'], as_index=False).agg({
    'count': 'sum'
})

result_clean = pd.merge(shipments, defects_agg, on=['warehouse', 'product'])
print(result_clean)

This produces one row per shipment with total defect counts, preventing unexpected row multiplication.
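To catch unexpected duplicates at merge time rather than after the fact, `merge` also accepts a `validate` argument that raises `MergeError` when the key multiplicity differs from what you declared. A sketch with the same shipments/defects shape:

```python
import pandas as pd

shipments = pd.DataFrame({
    'warehouse': ['W1', 'W1', 'W2'],
    'product': ['SKU100', 'SKU100', 'SKU200'],
    'shipment_id': ['S001', 'S002', 'S003'],
})
defects = pd.DataFrame({
    'warehouse': ['W1', 'W1', 'W2'],
    'product': ['SKU100', 'SKU100', 'SKU200'],
    'count': [5, 2, 3],
})

# 'many_to_one' asserts the right side's composite key is unique; it is not,
# so pandas raises MergeError instead of silently multiplying rows
caught = False
try:
    pd.merge(shipments, defects, on=['warehouse', 'product'],
             validate='many_to_one')
except pd.errors.MergeError as err:
    caught = True
    print('Caught:', err)
```

Declaring the expected relationship ('one_to_one', 'one_to_many', 'many_to_one', or 'many_to_many') turns a subtle data-quality bug into a loud, immediate error.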
