Pandas - Merge Two DataFrames (merge)

The `merge()` function performs SQL-style joins between DataFrames, combining rows on common columns or indexes. The sections below walk through join types, key selection, overlapping columns, the indicator parameter, index merges, performance, and validation.

Key Insights

  • Pandas merge() performs SQL-style joins on DataFrames using one or more keys, supporting inner, outer, left, and right join types with configurable behavior for handling duplicates and missing values
  • The function offers fine-grained control through parameters like on, left_on/right_on, how, indicator, and suffixes to handle complex merging scenarios including many-to-many relationships
  • Performance optimization requires understanding join types, using appropriate key columns with proper data types, and leveraging categorical data or indexes when merging large datasets repeatedly

Basic Merge Operations

The merge() function combines two DataFrames based on common columns or indexes. At its simplest, merge automatically detects common column names and uses them as join keys.

import pandas as pd

# Create sample DataFrames
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'city': ['NYC', 'LA', 'Chicago', 'Houston']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 2, 3, 5],
    'amount': [250, 150, 300, 200, 400]
})

# Inner join (default) - only matching records
result = pd.merge(customers, orders)
print(result)

Output:

   customer_id     name     city  order_id  amount
0            1    Alice      NYC       101     250
1            2      Bob       LA       102     150
2            2      Bob       LA       103     300
3            3  Charlie  Chicago       104     200

Notice customer_id 4 (David) and order 105 are excluded because they don’t have matches in both DataFrames.
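To list those dropped rows explicitly without running a merge, boolean indexing with `isin()` works as a quick check. A small sketch reusing the same sample frames:

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'city': ['NYC', 'LA', 'Chicago', 'Houston']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 2, 3, 5],
    'amount': [250, 150, 300, 200, 400]
})

# Customers whose id never appears in orders
unmatched_customers = customers[~customers['customer_id'].isin(orders['customer_id'])]

# Orders whose customer_id has no matching customer
orphan_orders = orders[~orders['customer_id'].isin(customers['customer_id'])]

print(unmatched_customers)  # David
print(orphan_orders)        # order 105
```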

Join Types with the how Parameter

The how parameter controls which records to include in the result, mirroring SQL join behavior.

# Left join - all customers, matched orders
left_join = pd.merge(customers, orders, how='left')
print("\nLeft Join:")
print(left_join)

# Right join - all orders, matched customers
right_join = pd.merge(customers, orders, how='right')
print("\nRight Join:")
print(right_join)

# Outer join - all records from both
outer_join = pd.merge(customers, orders, how='outer')
print("\nOuter Join:")
print(outer_join)

Left join output shows all customers with NaN for unmatched orders:

   customer_id     name     city  order_id  amount
0            1    Alice      NYC     101.0   250.0
1            2      Bob       LA     102.0   150.0
2            2      Bob       LA     103.0   300.0
3            3  Charlie  Chicago     104.0   200.0
4            4    David  Houston       NaN     NaN
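Note that order_id and amount were upcast to float: NumPy integer columns cannot hold NaN. If integer semantics matter, one option is casting to pandas' nullable Int64 dtype after the merge. A sketch on the same sample data:

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'city': ['NYC', 'LA', 'Chicago', 'Houston']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 2, 3, 5],
    'amount': [250, 150, 300, 200, 400]
})

left_join = pd.merge(customers, orders, how='left')

# Nullable Int64 keeps integers while still representing the missing row as <NA>
left_join[['order_id', 'amount']] = left_join[['order_id', 'amount']].astype('Int64')
print(left_join.dtypes)
```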

Specifying Join Keys

When column names differ or you need explicit control, use on, left_on, and right_on parameters.

# Different column names
products = pd.DataFrame({
    'product_code': ['A1', 'A2', 'A3'],
    'product_name': ['Widget', 'Gadget', 'Tool'],
    'price': [10, 20, 15]
})

inventory = pd.DataFrame({
    'sku': ['A1', 'A2', 'A4'],
    'stock': [100, 50, 75]
})

# Merge on different column names
merged = pd.merge(
    products, 
    inventory, 
    left_on='product_code', 
    right_on='sku',
    how='left'
)
print(merged)

Output:

  product_code product_name  price  sku  stock
0           A1       Widget     10   A1  100.0
1           A2       Gadget     20   A2   50.0
2           A3         Tool     15  NaN    NaN
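After a `left_on`/`right_on` merge, both key columns survive. Since sku duplicates product_code here, it can usually be dropped afterwards; a sketch on the same data:

```python
import pandas as pd

products = pd.DataFrame({
    'product_code': ['A1', 'A2', 'A3'],
    'product_name': ['Widget', 'Gadget', 'Tool'],
    'price': [10, 20, 15]
})

inventory = pd.DataFrame({
    'sku': ['A1', 'A2', 'A4'],
    'stock': [100, 50, 75]
})

merged = pd.merge(
    products,
    inventory,
    left_on='product_code',
    right_on='sku',
    how='left'
)

# sku carries the same values as product_code, so keep only one copy
merged = merged.drop(columns='sku')
print(merged)
```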

Multiple Key Columns

Merge on multiple columns for composite keys, common in normalized databases.

sales = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'product': ['A', 'B', 'A', 'B'],
    'revenue': [1000, 1500, 1200, 1800]
})

targets = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'product': ['A', 'B', 'A'],
    'target': [900, 1400, 1100]
})

# Merge on both region and product
performance = pd.merge(
    sales, 
    targets, 
    on=['region', 'product'],
    how='left'
)

# Calculate performance vs target
performance['vs_target'] = performance['revenue'] - performance['target']
print(performance)

Output:

  region product  revenue  target  vs_target
0  North       A     1000   900.0      100.0
1  North       B     1500  1400.0      100.0
2  South       A     1200  1100.0      100.0
3  South       B     1800     NaN        NaN
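How to treat the NaN target depends on what a missing target means. As one possible policy (shown purely for illustration), this sketch treats a missing target as zero and adds a flag so those rows stay identifiable:

```python
import pandas as pd

sales = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'product': ['A', 'B', 'A', 'B'],
    'revenue': [1000, 1500, 1200, 1800]
})

targets = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'product': ['A', 'B', 'A'],
    'target': [900, 1400, 1100]
})

performance = pd.merge(sales, targets, on=['region', 'product'], how='left')

# Flag rows that actually had a target before filling
performance['has_target'] = performance['target'].notna()
performance['vs_target'] = performance['revenue'] - performance['target'].fillna(0)
print(performance)
```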

Handling Overlapping Column Names

When DataFrames share column names beyond the join key, use suffixes to distinguish them.

q1_sales = pd.DataFrame({
    'product_id': [1, 2, 3],
    'sales': [100, 150, 200],
    'returns': [5, 8, 3]
})

q2_sales = pd.DataFrame({
    'product_id': [1, 2, 4],
    'sales': [120, 140, 180],
    'returns': [3, 6, 4]
})

# Compare quarters
comparison = pd.merge(
    q1_sales, 
    q2_sales, 
    on='product_id',
    how='outer',
    suffixes=('_q1', '_q2')
)

print(comparison)

Output:

   product_id  sales_q1  returns_q1  sales_q2  returns_q2
0           1     100.0         5.0     120.0         3.0
1           2     150.0         8.0     140.0         6.0
2           3     200.0         3.0       NaN         NaN
3           4       NaN         NaN     180.0         4.0
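With the suffixed columns in place, quarter-over-quarter deltas are a simple column subtraction; rows missing either quarter come out as NaN. A sketch on the same frames:

```python
import pandas as pd

q1_sales = pd.DataFrame({
    'product_id': [1, 2, 3],
    'sales': [100, 150, 200],
    'returns': [5, 8, 3]
})

q2_sales = pd.DataFrame({
    'product_id': [1, 2, 4],
    'sales': [120, 140, 180],
    'returns': [3, 6, 4]
})

comparison = pd.merge(
    q1_sales,
    q2_sales,
    on='product_id',
    how='outer',
    suffixes=('_q1', '_q2')
)

# NaN in either quarter propagates into the delta
comparison['sales_change'] = comparison['sales_q2'] - comparison['sales_q1']
print(comparison[['product_id', 'sales_change']])
```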

Using the Indicator Parameter

The indicator parameter adds a column showing the source of each row, useful for debugging and data quality checks.

expected = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'expected': [True, True, True, True]
})

actual = pd.DataFrame({
    'user_id': [2, 3, 5],
    'completed': [True, True, True]
})

# Track merge sources
audit = pd.merge(
    expected, 
    actual, 
    on='user_id',
    how='outer',
    indicator=True
)

print(audit)
print("\nMissing users:")
print(audit[audit['_merge'] == 'left_only'])
print("\nUnexpected users:")
print(audit[audit['_merge'] == 'right_only'])

Output:

   user_id  expected  completed      _merge
0        1      True        NaN   left_only
1        2      True       True        both
2        3      True       True        both
3        4      True        NaN   left_only
4        5       NaN       True  right_only
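The indicator argument also accepts a string, which becomes the column name instead of `_merge` — handy when `_merge` would collide or read poorly. A sketch that also tallies row sources:

```python
import pandas as pd

expected = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'expected': [True, True, True, True]
})

actual = pd.DataFrame({
    'user_id': [2, 3, 5],
    'completed': [True, True, True]
})

# A string indicator names the column instead of the default '_merge'
audit = pd.merge(expected, actual, on='user_id', how='outer', indicator='source')

counts = audit['source'].value_counts()
print(counts)
```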

Merging on Index

Merge on DataFrame indexes instead of columns using left_index and right_index parameters.

# DataFrames with meaningful indexes
user_profiles = pd.DataFrame({
    'username': ['alice', 'bob', 'charlie'],
    'email': ['a@ex.com', 'b@ex.com', 'c@ex.com']
}, index=[101, 102, 103])

user_activity = pd.DataFrame({
    'last_login': ['2024-01-15', '2024-01-16', '2024-01-14'],
    'posts': [5, 12, 3]
}, index=[101, 102, 104])

# Merge on index
user_data = pd.merge(
    user_profiles,
    user_activity,
    left_index=True,
    right_index=True,
    how='outer'
)

print(user_data)
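For index-on-index merges, `DataFrame.join()` is a more concise equivalent. A sketch on the same profile/activity frames:

```python
import pandas as pd

user_profiles = pd.DataFrame({
    'username': ['alice', 'bob', 'charlie'],
    'email': ['a@ex.com', 'b@ex.com', 'c@ex.com']
}, index=[101, 102, 103])

user_activity = pd.DataFrame({
    'last_login': ['2024-01-15', '2024-01-16', '2024-01-14'],
    'posts': [5, 12, 3]
}, index=[101, 102, 104])

# join() merges on the index by default; how='outer' keeps all rows
user_data = user_profiles.join(user_activity, how='outer')
print(user_data)
```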

Performance Considerations

Merge performance degrades with large datasets and inefficient key types. Watch key cardinality as well: a many-to-many merge emits one row per matching pair, so joining two million-row frames that share only a handful of key values can explode the row count far beyond memory. Optimize by converting string keys to categorical data and ensuring proper data types.

import numpy as np

# Create a large transaction table and a smaller lookup table
np.random.seed(42)
n = 1_000_000
keys = [f'K{i:05d}' for i in range(10_000)]

df1 = pd.DataFrame({
    'key': np.random.choice(keys, n),
    'value1': np.random.randn(n)
})

df2 = pd.DataFrame({
    'key': keys,
    'value2': np.random.randn(len(keys))
})

# Optimize with categorical keys
df1['key'] = df1['key'].astype('category')
df2['key'] = df2['key'].astype('category')

# Now merge is faster
result = pd.merge(df1, df2, on='key')
print(f"Result shape: {result.shape}")

For repeated merges on the same key, set it as the index:

# Set index once
df1_indexed = df1.set_index('key')
df2_indexed = df2.set_index('key')

# Faster subsequent merges
result = pd.merge(
    df1_indexed,
    df2_indexed,
    left_index=True,
    right_index=True
)
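The same index-on-index pattern can also be written with `DataFrame.join()`. A self-contained sketch using smaller stand-in frames (so it runs quickly) with the same shape: a many-row table joined against a unique-keyed lookup:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
keys = [f'K{i}' for i in range(100)]

# Transaction-style frame: repeated keys in the index
df1_indexed = pd.DataFrame(
    {'value1': np.random.randn(1000)},
    index=pd.Index(np.random.choice(keys, 1000), name='key')
)

# Lookup-style frame: one row per key
df2_indexed = pd.DataFrame(
    {'value2': np.random.randn(len(keys))},
    index=pd.Index(keys, name='key')
)

# join() performs an index-on-index merge (left join by default)
result = df1_indexed.join(df2_indexed)
print(result.shape)
```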

Handling Validation

Use the validate parameter to ensure merge integrity and catch data quality issues.

# One-to-one relationship expected
users = pd.DataFrame({
    'user_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})

profiles = pd.DataFrame({
    'user_id': [1, 2, 3],
    'bio': ['Bio1', 'Bio2', 'Bio3']
})

# This succeeds
valid_merge = pd.merge(users, profiles, on='user_id', validate='1:1')

# This would raise MergeError if duplicates exist
duplicate_profiles = pd.DataFrame({
    'user_id': [1, 1, 2],
    'bio': ['Bio1a', 'Bio1b', 'Bio2']
})

try:
    invalid_merge = pd.merge(users, duplicate_profiles, on='user_id', validate='1:1')
except pd.errors.MergeError as e:
    print(f"Validation failed: {e}")

Valid options are '1:1', '1:m', 'm:1', and 'm:m' (or the long forms 'one_to_one', 'one_to_many', 'many_to_one', 'many_to_many'), each enforcing the corresponding relationship cardinality.
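'm:1' is the right check for the customer/order shape used earlier: many orders may share one customer, but each order must map to exactly one customer. A sketch with small hypothetical frames:

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 2, 2]
})

customers = pd.DataFrame({
    'customer_id': [1, 2],
    'name': ['Alice', 'Bob']
})

# Passes: right side has unique customer_id, left side may repeat it
checked = pd.merge(orders, customers, on='customer_id', validate='m:1')
print(checked)
```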
