Pandas - Merge Two DataFrames (merge)
Key Insights
- Pandas `merge()` performs SQL-style joins on DataFrames using one or more keys, supporting inner, outer, left, and right join types with configurable behavior for handling duplicates and missing values
- The function offers fine-grained control through parameters like `on`, `left_on`/`right_on`, `how`, `indicator`, and `suffixes` to handle complex merging scenarios including many-to-many relationships
- Performance optimization requires understanding join types, using appropriate key columns with proper data types, and leveraging categorical data or indexes when merging large datasets repeatedly
Basic Merge Operations
The merge() function combines two DataFrames based on common columns or indexes. At its simplest, merge automatically detects common column names and uses them as join keys.
import pandas as pd
# Create sample DataFrames
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'city': ['NYC', 'LA', 'Chicago', 'Houston']
})
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 2, 3, 5],
    'amount': [250, 150, 300, 200, 400]
})
# Inner join (default) - only matching records
result = pd.merge(customers, orders)
print(result)
Output:
customer_id name city order_id amount
0 1 Alice NYC 101 250
1 2 Bob LA 102 150
2 2 Bob LA 103 300
3 3 Charlie Chicago 104 200
Notice customer_id 4 (David) and order 105 are excluded because they don’t have matches in both DataFrames.
Join Types with the how Parameter
The how parameter controls which records to include in the result, mirroring SQL join behavior.
# Left join - all customers, matched orders
left_join = pd.merge(customers, orders, how='left')
print("\nLeft Join:")
print(left_join)
# Right join - all orders, matched customers
right_join = pd.merge(customers, orders, how='right')
print("\nRight Join:")
print(right_join)
# Outer join - all records from both
outer_join = pd.merge(customers, orders, how='outer')
print("\nOuter Join:")
print(outer_join)
Left join output shows all customers with NaN for unmatched orders:
customer_id name city order_id amount
0 1 Alice NYC 101.0 250.0
1 2 Bob LA 102.0 150.0
2 2 Bob LA 103.0 300.0
3 3 Charlie Chicago 104.0 200.0
4 4 David Houston NaN NaN
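Beyond left, right, and outer, pandas 1.2+ also accepts `how='cross'` for a full Cartesian product, which needs no key columns at all. A minimal sketch with two small illustrative frames (names are hypothetical, not from the examples above):

```python
import pandas as pd

sizes = pd.DataFrame({'size': ['S', 'M', 'L']})
colors = pd.DataFrame({'color': ['red', 'blue']})

# Cross join: every size paired with every color; no 'on' key is allowed
grid = pd.merge(sizes, colors, how='cross')
print(grid)  # 6 rows: 3 sizes x 2 colors
```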
Specifying Join Keys
When column names differ or you need explicit control, use the `on`, `left_on`, and `right_on` parameters.
# Different column names
products = pd.DataFrame({
    'product_code': ['A1', 'A2', 'A3'],
    'product_name': ['Widget', 'Gadget', 'Tool'],
    'price': [10, 20, 15]
})
inventory = pd.DataFrame({
    'sku': ['A1', 'A2', 'A4'],
    'stock': [100, 50, 75]
})
# Merge on different column names
merged = pd.merge(
    products,
    inventory,
    left_on='product_code',
    right_on='sku',
    how='left'
)
print(merged)
Output:
product_code product_name price sku stock
0 A1 Widget 10 A1 100.0
1 A2 Gadget 20 A2 50.0
2 A3 Tool 15 NaN NaN
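Note that both key columns (`product_code` and `sku`) survive in the result. If the duplicate is unwanted, one common pattern is to drop it right after the merge; a sketch repeating the frames above:

```python
import pandas as pd

products = pd.DataFrame({
    'product_code': ['A1', 'A2', 'A3'],
    'product_name': ['Widget', 'Gadget', 'Tool'],
    'price': [10, 20, 15]
})
inventory = pd.DataFrame({
    'sku': ['A1', 'A2', 'A4'],
    'stock': [100, 50, 75]
})

# Merge, then drop the redundant right-hand key column
merged = pd.merge(products, inventory,
                  left_on='product_code', right_on='sku',
                  how='left').drop(columns='sku')
print(merged.columns.tolist())
```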
Multiple Key Columns
Merge on multiple columns for composite keys, common in normalized databases.
sales = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'product': ['A', 'B', 'A', 'B'],
    'revenue': [1000, 1500, 1200, 1800]
})
targets = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'product': ['A', 'B', 'A'],
    'target': [900, 1400, 1100]
})
# Merge on both region and product
performance = pd.merge(
    sales,
    targets,
    on=['region', 'product'],
    how='left'
)
# Calculate performance vs target
performance['vs_target'] = performance['revenue'] - performance['target']
print(performance)
Output:
region product revenue target vs_target
0 North A 1000 900.0 100.0
1 North B 1500 1400.0 100.0
2 South A 1200 1100.0 100.0
3 South B 1800 NaN NaN
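The NaN for South/B propagates through the arithmetic, so a common follow-up is to fill the gap before computing derived columns. A sketch, under the assumption that a missing target should count as zero (which may not be the right business rule):

```python
import pandas as pd

sales = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'product': ['A', 'B', 'A', 'B'],
    'revenue': [1000, 1500, 1200, 1800]
})
targets = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'product': ['A', 'B', 'A'],
    'target': [900, 1400, 1100]
})

performance = pd.merge(sales, targets, on=['region', 'product'], how='left')
# Treat a missing target as 0 so vs_target stays numeric (illustrative choice)
performance['target'] = performance['target'].fillna(0)
performance['vs_target'] = performance['revenue'] - performance['target']
print(performance)
```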
Handling Overlapping Column Names
When DataFrames share column names beyond the join key, use suffixes to distinguish them.
q1_sales = pd.DataFrame({
    'product_id': [1, 2, 3],
    'sales': [100, 150, 200],
    'returns': [5, 8, 3]
})
q2_sales = pd.DataFrame({
    'product_id': [1, 2, 4],
    'sales': [120, 140, 180],
    'returns': [3, 6, 4]
})
# Compare quarters
comparison = pd.merge(
    q1_sales,
    q2_sales,
    on='product_id',
    how='outer',
    suffixes=('_q1', '_q2')
)
print(comparison)
Output:
product_id sales_q1 returns_q1 sales_q2 returns_q2
0 1 100.0 5.0 120.0 3.0
1 2 150.0 8.0 140.0 6.0
2 3 200.0 3.0 NaN NaN
3 4 NaN NaN 180.0 4.0
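With the suffixed columns in place, cross-quarter calculations are straightforward. A sketch computing sales change, assuming a product absent from a quarter counts as zero sales:

```python
import pandas as pd

q1_sales = pd.DataFrame({'product_id': [1, 2, 3], 'sales': [100, 150, 200]})
q2_sales = pd.DataFrame({'product_id': [1, 2, 4], 'sales': [120, 140, 180]})

comparison = pd.merge(q1_sales, q2_sales, on='product_id',
                      how='outer', suffixes=('_q1', '_q2'))
# Missing quarter treated as zero sales (illustrative assumption)
comparison['growth'] = (comparison['sales_q2'].fillna(0)
                        - comparison['sales_q1'].fillna(0))
print(comparison)
```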
Using the Indicator Parameter
The indicator parameter adds a column showing the source of each row, useful for debugging and data quality checks.
expected = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'expected': [True, True, True, True]
})
actual = pd.DataFrame({
    'user_id': [2, 3, 5],
    'completed': [True, True, True]
})
# Track merge sources
audit = pd.merge(
    expected,
    actual,
    on='user_id',
    how='outer',
    indicator=True
)
print(audit)
print("\nMissing users:")
print(audit[audit['_merge'] == 'left_only'])
print("\nUnexpected users:")
print(audit[audit['_merge'] == 'right_only'])
Output:
user_id expected completed _merge
0 1 True NaN left_only
1 2 True True both
2 3 True True both
3 4 True NaN left_only
4 5 NaN True right_only
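Since `_merge` is a categorical column with values `left_only`, `right_only`, and `both`, a single `value_counts()` call summarizes match quality at a glance; a sketch reusing the same frames:

```python
import pandas as pd

expected = pd.DataFrame({'user_id': [1, 2, 3, 4], 'expected': [True] * 4})
actual = pd.DataFrame({'user_id': [2, 3, 5], 'completed': [True] * 3})

audit = pd.merge(expected, actual, on='user_id', how='outer', indicator=True)
# One-line data-quality summary: two matched, two missing, one unexpected
counts = audit['_merge'].value_counts()
print(counts)
```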
Merging on Index
Merge on DataFrame indexes instead of columns using left_index and right_index parameters.
# DataFrames with meaningful indexes
user_profiles = pd.DataFrame({
    'username': ['alice', 'bob', 'charlie'],
    'email': ['a@ex.com', 'b@ex.com', 'c@ex.com']
}, index=[101, 102, 103])
user_activity = pd.DataFrame({
    'last_login': ['2024-01-15', '2024-01-16', '2024-01-14'],
    'posts': [5, 12, 3]
}, index=[101, 102, 104])
# Merge on index
user_data = pd.merge(
    user_profiles,
    user_activity,
    left_index=True,
    right_index=True,
    how='outer'
)
print(user_data)
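For index-to-index merges, `DataFrame.join()` is a convenient shorthand (it joins on the index by default, with a left join unless told otherwise). An equivalent sketch using the same frames:

```python
import pandas as pd

user_profiles = pd.DataFrame({
    'username': ['alice', 'bob', 'charlie'],
    'email': ['a@ex.com', 'b@ex.com', 'c@ex.com']
}, index=[101, 102, 103])
user_activity = pd.DataFrame({
    'last_login': ['2024-01-15', '2024-01-16', '2024-01-14'],
    'posts': [5, 12, 3]
}, index=[101, 102, 104])

# join() aligns on the index; how='outer' matches the merge() version above
user_data = user_profiles.join(user_activity, how='outer')
print(user_data)
```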
Performance Considerations
Merge performance degrades with large datasets and inefficient key types. Optimize by converting string keys to categorical data and ensuring proper data types. Be careful with many-to-many merges on low-cardinality keys: the result size is the product of the duplicate counts on each side and can explode far beyond memory.
import numpy as np
# Create a large fact table and a smaller lookup table (many-to-one keys)
np.random.seed(42)
n = 1_000_000
keys = [f'K{i}' for i in range(1000)]
df1 = pd.DataFrame({
    'key': np.random.choice(keys, n),
    'value1': np.random.randn(n)
})
df2 = pd.DataFrame({
    'key': keys,
    'value2': np.random.randn(len(keys))
})
# Optimize with categorical keys; both sides must share the same categories
key_dtype = pd.CategoricalDtype(categories=keys)
df1['key'] = df1['key'].astype(key_dtype)
df2['key'] = df2['key'].astype(key_dtype)
# Now the merge is faster
result = pd.merge(df1, df2, on='key')
print(f"Result shape: {result.shape}")
For repeated merges on the same key, set it as the index:
# Set the index once
df1_indexed = df1.set_index('key')
df2_indexed = df2.set_index('key')
# Faster subsequent merges
result = pd.merge(
    df1_indexed,
    df2_indexed,
    left_index=True,
    right_index=True
)
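Sorting the index once also helps: joins on a monotonic index can use cheaper alignment paths, so the one-time `sort_index()` cost amortizes over repeated merges. A rough sketch on synthetic data (sizes and names are illustrative; actual speedups depend on the data and pandas version):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
n = 100_000
# Two frames sharing the same set of unique integer labels, shuffled
left = pd.DataFrame({'value1': np.random.randn(n)},
                    index=np.random.permutation(n))
right = pd.DataFrame({'value2': np.random.randn(n)},
                     index=np.random.permutation(n))

# Sort once; subsequent index joins operate on monotonic indexes
left_sorted = left.sort_index()
right_sorted = right.sort_index()
joined = left_sorted.join(right_sorted)
print(joined.shape)
```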
Handling Validation
Use the validate parameter to ensure merge integrity and catch data quality issues.
# One-to-one relationship expected
users = pd.DataFrame({
    'user_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})
profiles = pd.DataFrame({
    'user_id': [1, 2, 3],
    'bio': ['Bio1', 'Bio2', 'Bio3']
})
# This succeeds
valid_merge = pd.merge(users, profiles, on='user_id', validate='1:1')
# This raises MergeError because the right side has duplicate keys
duplicate_profiles = pd.DataFrame({
    'user_id': [1, 1, 2],
    'bio': ['Bio1a', 'Bio1b', 'Bio2']
})
try:
    invalid_merge = pd.merge(users, duplicate_profiles, on='user_id', validate='1:1')
except pd.errors.MergeError as e:
    print(f"Validation failed: {e}")
Valid options for enforcing relationship cardinality are '1:1', '1:m', 'm:1', and 'm:m' (or their long forms 'one_to_one', 'one_to_many', 'many_to_one', 'many_to_many'). Note that 'm:m' is accepted but performs no checks.
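The same mechanism covers one-sided uniqueness checks, such as asserting that every order maps back to exactly one customer row. A sketch using the frames from the first example (one customer, possibly many orders):

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2, 3],
                          'name': ['Alice', 'Bob', 'Charlie']})
orders = pd.DataFrame({'order_id': [101, 102, 103],
                       'customer_id': [1, 2, 2],
                       'amount': [250, 150, 300]})

# '1:m' requires unique keys on the left; duplicates on the right are fine
checked = pd.merge(customers, orders, on='customer_id', validate='1:m')
print(checked)
```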