Pandas: Merge vs Join vs Concat
Combining DataFrames is one of the most common operations in data analysis, yet Pandas offers three different methods that seem to do similar things: `concat`, `merge`, and `join`. This creates...
Key Insights
- Use
concatfor simple stacking operations (appending rows or adding columns),mergefor SQL-style joins on column values, andjoinfor index-based combining. mergeis the most flexible and explicit option—when in doubt, use it. Thejoinmethod is essentially syntactic sugar for index-based merges.- Performance differences are negligible for most datasets; prioritize code clarity and correctness over micro-optimizations unless you’re working with millions of rows.
Introduction
Combining DataFrames is one of the most common operations in data analysis, yet Pandas offers three different methods that seem to do similar things: concat, merge, and join. This creates confusion, especially for developers coming from SQL backgrounds who expect a single JOIN operation.
The truth is each method exists for a reason. They solve different problems, and using the wrong one leads to verbose code, unexpected results, or performance issues. This article breaks down when and why to use each method with practical examples you can apply immediately.
Concat: Stacking DataFrames
The concat function is your tool for simple stacking operations. Think of it as gluing DataFrames together either vertically (adding more rows) or horizontally (adding more columns).
Vertical Concatenation (axis=0)
The most common use case is appending DataFrames with identical structures:
import pandas as pd
# Monthly sales data
jan_sales = pd.DataFrame({
'product': ['Widget', 'Gadget', 'Gizmo'],
'revenue': [1000, 1500, 800],
'month': 'January'
})
feb_sales = pd.DataFrame({
'product': ['Widget', 'Gadget', 'Gizmo'],
'revenue': [1200, 1400, 950],
'month': 'February'
})
mar_sales = pd.DataFrame({
'product': ['Widget', 'Gadget', 'Gizmo'],
'revenue': [1100, 1600, 870],
'month': 'March'
})
# Stack all months vertically
all_sales = pd.concat([jan_sales, feb_sales, mar_sales], ignore_index=True)
print(all_sales)
Output:
product revenue month
0 Widget 1000 January
1 Gadget 1500 January
2 Gizmo 800 January
3 Widget 1200 February
4 Gadget 1400 February
5 Gizmo 950 February
6 Widget 1100 March
7 Gadget 1600 March
8 Gizmo 870 March
The ignore_index=True parameter resets the index to avoid duplicates. Without it, you’d have indices 0, 1, 2 repeated three times.
Using Keys for Hierarchical Indexing
If you want to preserve the source of each row:
all_sales = pd.concat(
[jan_sales, feb_sales, mar_sales],
keys=['jan', 'feb', 'mar']
)
print(all_sales.loc['feb'])
Horizontal Concatenation (axis=1)
When combining feature columns from different sources:
# Different feature sets for the same observations
demographics = pd.DataFrame({
'age': [25, 34, 45],
'income': [50000, 75000, 90000]
}, index=['user_1', 'user_2', 'user_3'])
behavior = pd.DataFrame({
'visits': [10, 25, 5],
'purchases': [2, 8, 1]
}, index=['user_1', 'user_2', 'user_3'])
# Combine horizontally
user_features = pd.concat([demographics, behavior], axis=1)
print(user_features)
Output:
age income visits purchases
user_1 25 50000 10 2
user_2 34 75000 25 8
user_3 45 90000 5 1
Key point: concat aligns on the index. If your indexes don’t match, you’ll get NaN values for misaligned rows.
Merge: SQL-Style Joins
The merge function is Pandas’ equivalent of SQL JOIN operations. Use it when you need to combine DataFrames based on column values rather than positions or indexes.
Basic Merge on a Key Column
customers = pd.DataFrame({
'customer_id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'segment': ['Premium', 'Standard', 'Premium', 'Standard']
})
orders = pd.DataFrame({
'order_id': [101, 102, 103, 104, 105],
'customer_id': [1, 2, 1, 3, 5], # Note: customer 5 doesn't exist
'amount': [250, 100, 175, 300, 50]
})
# Inner join (default) - only matching records
result = pd.merge(customers, orders, on='customer_id')
print(result)
Output:
customer_id name segment order_id amount
0 1 Alice Premium 101 250
1 1 Alice Premium 103 175
2 2 Bob Standard 102 100
3 3 Charlie Premium 104 300
Notice that customer 4 (Diana) has no orders and customer 5’s order has no matching customer—both are excluded from an inner join.
Join Types
# Left join - keep all customers
left_result = pd.merge(customers, orders, on='customer_id', how='left')
# Right join - keep all orders
right_result = pd.merge(customers, orders, on='customer_id', how='right')
# Outer join - keep everything
outer_result = pd.merge(customers, orders, on='customer_id', how='outer')
Different Column Names
When the key columns have different names in each DataFrame:
orders_v2 = orders.rename(columns={'customer_id': 'cust_id'})
result = pd.merge(
customers,
orders_v2,
left_on='customer_id',
right_on='cust_id'
)
Handling Duplicate Column Names
When both DataFrames have columns with the same name (other than the join key):
df1 = pd.DataFrame({
'id': [1, 2],
'value': [100, 200],
'date': ['2024-01', '2024-01']
})
df2 = pd.DataFrame({
'id': [1, 2],
'value': [150, 250],
'date': ['2024-02', '2024-02']
})
result = pd.merge(df1, df2, on='id', suffixes=('_jan', '_feb'))
print(result)
Output:
id value_jan date_jan value_feb date_feb
0 1 100 2024-01 150 2024-02
1 2 200 2024-01 250 2024-02
Many-to-Many Merges
Be cautious with many-to-many relationships—they create a Cartesian product of matching rows:
products = pd.DataFrame({
'category': ['Electronics', 'Electronics', 'Clothing'],
'product': ['Phone', 'Laptop', 'Shirt']
})
suppliers = pd.DataFrame({
'category': ['Electronics', 'Electronics', 'Clothing'],
'supplier': ['TechCorp', 'GadgetInc', 'FashionCo']
})
# This creates more rows than either input!
result = pd.merge(products, suppliers, on='category')
print(result)
Output:
category product supplier
0 Electronics Phone TechCorp
1 Electronics Phone GadgetInc
2 Electronics Laptop TechCorp
3 Electronics Laptop GadgetInc
4 Clothing Shirt FashionCo
Join: Index-Based Combining
The join method is a convenience wrapper around merge that defaults to joining on indexes. It’s useful when your DataFrames are already indexed meaningfully.
# Stock prices indexed by ticker
prices = pd.DataFrame({
'price': [150.25, 280.50, 95.75],
'volume': [1000000, 500000, 750000]
}, index=['AAPL', 'MSFT', 'GOOG'])
# Company info indexed by ticker
info = pd.DataFrame({
'name': ['Apple Inc.', 'Microsoft Corp.', 'Alphabet Inc.'],
'sector': ['Technology', 'Technology', 'Technology']
}, index=['AAPL', 'MSFT', 'GOOG'])
# Join on index
combined = prices.join(info)
print(combined)
Output:
price volume name sector
AAPL 150.25 1000000 Apple Inc. Technology
MSFT 280.50 500000 Microsoft Corp. Technology
GOOG 95.75 750000 Alphabet Inc. Technology
The equivalent merge operation would be:
combined = pd.merge(prices, info, left_index=True, right_index=True)
Use lsuffix and rsuffix for overlapping column names (similar to suffixes in merge):
combined = prices.join(other_prices, lsuffix='_today', rsuffix='_yesterday')
Performance Comparison
For most real-world datasets, the performance differences between these methods are negligible. Here’s a benchmark to illustrate:
import numpy as np
# Create test DataFrames
n = 100000
df1 = pd.DataFrame({
'key': np.random.randint(0, 10000, n),
'value1': np.random.randn(n)
})
df2 = pd.DataFrame({
'key': np.random.randint(0, 10000, n),
'value2': np.random.randn(n)
})
# Set index for join test
df1_indexed = df1.set_index('key')
df2_indexed = df2.set_index('key')
# Timing results (your mileage may vary)
%timeit pd.merge(df1, df2, on='key')
# ~15-25ms
%timeit df1_indexed.join(df2_indexed, lsuffix='_1', rsuffix='_2')
# ~10-20ms
%timeit pd.concat([df1_indexed, df2_indexed], axis=1)
# ~5-10ms (but different semantics - no key matching!)
The takeaway: concat is fastest but doesn’t do key matching. join can be slightly faster than merge when data is already indexed, but the difference rarely matters in practice.
Decision Guide: Which Method to Use
| Scenario | Method | Why |
|---|---|---|
| Appending rows from multiple DataFrames | concat |
Simple stacking, no key matching needed |
| Adding columns with aligned indexes | concat(axis=1) |
Straightforward horizontal combination |
| Combining on column values | merge |
Explicit, flexible, handles all join types |
| Combining on index values | join or merge |
join is shorter; merge is more explicit |
| Complex multi-key joins | merge |
Most control over join behavior |
Common Pitfalls
-
Unexpected row multiplication: Many-to-many merges create more rows than you expect. Always check
len(result)after merging. -
Index misalignment with concat: If indexes don’t match during horizontal concat, you get NaN values silently.
-
Forgetting
ignore_index: Vertical concat preserves original indexes by default, leading to duplicate index values. -
Assuming join behavior:
joindefaults to left join, whilemergedefaults to inner join.
Conclusion
Choose concat for simple stacking operations where you’re combining DataFrames with the same structure. Use merge for SQL-style joins on column values—it’s the most explicit and flexible option. Reserve join for when your data is already meaningfully indexed and you want slightly cleaner syntax.
When in doubt, use merge. It’s explicit about what’s happening, works consistently regardless of index state, and handles edge cases predictably. The small performance overhead is worth the clarity and maintainability it provides.