Pandas: Merge vs Join vs Concat

Combining DataFrames is one of the most common operations in data analysis, yet Pandas offers three different methods that seem to do similar things: `concat`, `merge`, and `join`. This creates...

Key Insights

  • Use concat for simple stacking operations (appending rows or adding columns), merge for SQL-style joins on column values, and join for index-based combining.
  • merge is the most flexible and explicit option—when in doubt, use it. The join method is essentially syntactic sugar for index-based merges.
  • Performance differences are negligible for most datasets; prioritize code clarity and correctness over micro-optimizations unless you’re working with millions of rows.

Introduction

Combining DataFrames is one of the most common operations in data analysis, yet Pandas offers three different methods that seem to do similar things: concat, merge, and join. This creates confusion, especially for developers coming from SQL backgrounds who expect a single JOIN operation.

The truth is each method exists for a reason. They solve different problems, and using the wrong one leads to verbose code, unexpected results, or performance issues. This article breaks down when and why to use each method with practical examples you can apply immediately.

Concat: Stacking DataFrames

The concat function is your tool for simple stacking operations. Think of it as gluing DataFrames together either vertically (adding more rows) or horizontally (adding more columns).

Vertical Concatenation (axis=0)

The most common use case is appending DataFrames with identical structures:

import pandas as pd

# Monthly sales data
jan_sales = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Gizmo'],
    'revenue': [1000, 1500, 800],
    'month': 'January'
})

feb_sales = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Gizmo'],
    'revenue': [1200, 1400, 950],
    'month': 'February'
})

mar_sales = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Gizmo'],
    'revenue': [1100, 1600, 870],
    'month': 'March'
})

# Stack all months vertically
all_sales = pd.concat([jan_sales, feb_sales, mar_sales], ignore_index=True)
print(all_sales)

Output:

  product  revenue     month
0  Widget     1000   January
1  Gadget     1500   January
2   Gizmo      800   January
3  Widget     1200  February
4  Gadget     1400  February
5   Gizmo      950  February
6  Widget     1100     March
7  Gadget     1600     March
8   Gizmo      870     March

The ignore_index=True parameter resets the index to avoid duplicates. Without it, you’d have indices 0, 1, 2 repeated three times.

Using Keys for Hierarchical Indexing

If you want to preserve the source of each row:

all_sales = pd.concat(
    [jan_sales, feb_sales, mar_sales],
    keys=['jan', 'feb', 'mar']
)
print(all_sales.loc['feb'])

Horizontal Concatenation (axis=1)

When combining feature columns from different sources:

# Different feature sets for the same observations
demographics = pd.DataFrame({
    'age': [25, 34, 45],
    'income': [50000, 75000, 90000]
}, index=['user_1', 'user_2', 'user_3'])

behavior = pd.DataFrame({
    'visits': [10, 25, 5],
    'purchases': [2, 8, 1]
}, index=['user_1', 'user_2', 'user_3'])

# Combine horizontally
user_features = pd.concat([demographics, behavior], axis=1)
print(user_features)

Output:

        age  income  visits  purchases
user_1   25   50000      10          2
user_2   34   75000      25          8
user_3   45   90000       5          1

Key point: concat aligns on the index. If your indexes don’t match, you’ll get NaN values for misaligned rows.

Merge: SQL-Style Joins

The merge function is Pandas’ equivalent of SQL JOIN operations. Use it when you need to combine DataFrames based on column values rather than positions or indexes.

Basic Merge on a Key Column

customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'segment': ['Premium', 'Standard', 'Premium', 'Standard']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 1, 3, 5],  # Note: customer 5 doesn't exist
    'amount': [250, 100, 175, 300, 50]
})

# Inner join (default) - only matching records
result = pd.merge(customers, orders, on='customer_id')
print(result)

Output:

   customer_id     name   segment  order_id  amount
0            1    Alice   Premium       101     250
1            1    Alice   Premium       103     175
2            2      Bob  Standard       102     100
3            3  Charlie   Premium       104     300

Notice that customer 4 (Diana) has no orders and customer 5’s order has no matching customer—both are excluded from an inner join.

Join Types

# Left join - keep all customers
left_result = pd.merge(customers, orders, on='customer_id', how='left')

# Right join - keep all orders
right_result = pd.merge(customers, orders, on='customer_id', how='right')

# Outer join - keep everything
outer_result = pd.merge(customers, orders, on='customer_id', how='outer')

Different Column Names

When the key columns have different names in each DataFrame:

orders_v2 = orders.rename(columns={'customer_id': 'cust_id'})

result = pd.merge(
    customers, 
    orders_v2, 
    left_on='customer_id', 
    right_on='cust_id'
)

Handling Duplicate Column Names

When both DataFrames have columns with the same name (other than the join key):

df1 = pd.DataFrame({
    'id': [1, 2],
    'value': [100, 200],
    'date': ['2024-01', '2024-01']
})

df2 = pd.DataFrame({
    'id': [1, 2],
    'value': [150, 250],
    'date': ['2024-02', '2024-02']
})

result = pd.merge(df1, df2, on='id', suffixes=('_jan', '_feb'))
print(result)

Output:

   id  value_jan   date_jan  value_feb   date_feb
0   1        100    2024-01        150    2024-02
1   2        200    2024-01        250    2024-02

Many-to-Many Merges

Be cautious with many-to-many relationships—they create a Cartesian product of matching rows:

products = pd.DataFrame({
    'category': ['Electronics', 'Electronics', 'Clothing'],
    'product': ['Phone', 'Laptop', 'Shirt']
})

suppliers = pd.DataFrame({
    'category': ['Electronics', 'Electronics', 'Clothing'],
    'supplier': ['TechCorp', 'GadgetInc', 'FashionCo']
})

# This creates more rows than either input!
result = pd.merge(products, suppliers, on='category')
print(result)

Output:

      category product   supplier
0  Electronics   Phone   TechCorp
1  Electronics   Phone  GadgetInc
2  Electronics  Laptop   TechCorp
3  Electronics  Laptop  GadgetInc
4     Clothing   Shirt  FashionCo

Join: Index-Based Combining

The join method is a convenience wrapper around merge that defaults to joining on indexes. It’s useful when your DataFrames are already indexed meaningfully.

# Stock prices indexed by ticker
prices = pd.DataFrame({
    'price': [150.25, 280.50, 95.75],
    'volume': [1000000, 500000, 750000]
}, index=['AAPL', 'MSFT', 'GOOG'])

# Company info indexed by ticker
info = pd.DataFrame({
    'name': ['Apple Inc.', 'Microsoft Corp.', 'Alphabet Inc.'],
    'sector': ['Technology', 'Technology', 'Technology']
}, index=['AAPL', 'MSFT', 'GOOG'])

# Join on index
combined = prices.join(info)
print(combined)

Output:

       price   volume             name      sector
AAPL  150.25  1000000       Apple Inc.  Technology
MSFT  280.50   500000  Microsoft Corp.  Technology
GOOG   95.75   750000   Alphabet Inc.  Technology

The equivalent merge operation would be:

combined = pd.merge(prices, info, left_index=True, right_index=True)

Use lsuffix and rsuffix for overlapping column names (similar to suffixes in merge):

combined = prices.join(other_prices, lsuffix='_today', rsuffix='_yesterday')

Performance Comparison

For most real-world datasets, the performance differences between these methods are negligible. Here’s a benchmark to illustrate:

import numpy as np

# Create test DataFrames
n = 100000
df1 = pd.DataFrame({
    'key': np.random.randint(0, 10000, n),
    'value1': np.random.randn(n)
})

df2 = pd.DataFrame({
    'key': np.random.randint(0, 10000, n),
    'value2': np.random.randn(n)
})

# Set index for join test
df1_indexed = df1.set_index('key')
df2_indexed = df2.set_index('key')
# Timing results (your mileage may vary)
%timeit pd.merge(df1, df2, on='key')
# ~15-25ms

%timeit df1_indexed.join(df2_indexed, lsuffix='_1', rsuffix='_2')
# ~10-20ms

%timeit pd.concat([df1_indexed, df2_indexed], axis=1)
# ~5-10ms (but different semantics - no key matching!)

The takeaway: concat is fastest but doesn’t do key matching. join can be slightly faster than merge when data is already indexed, but the difference rarely matters in practice.

Decision Guide: Which Method to Use

Scenario Method Why
Appending rows from multiple DataFrames concat Simple stacking, no key matching needed
Adding columns with aligned indexes concat(axis=1) Straightforward horizontal combination
Combining on column values merge Explicit, flexible, handles all join types
Combining on index values join or merge join is shorter; merge is more explicit
Complex multi-key joins merge Most control over join behavior

Common Pitfalls

  1. Unexpected row multiplication: Many-to-many merges create more rows than you expect. Always check len(result) after merging.

  2. Index misalignment with concat: If indexes don’t match during horizontal concat, you get NaN values silently.

  3. Forgetting ignore_index: Vertical concat preserves original indexes by default, leading to duplicate index values.

  4. Assuming join behavior: join defaults to left join, while merge defaults to inner join.

Conclusion

Choose concat for simple stacking operations where you’re combining DataFrames with the same structure. Use merge for SQL-style joins on column values—it’s the most explicit and flexible option. Reserve join for when your data is already meaningfully indexed and you want slightly cleaner syntax.

When in doubt, use merge. It’s explicit about what’s happening, works consistently regardless of index state, and handles edge cases predictably. The small performance overhead is worth the clarity and maintainability it provides.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.