How to Join DataFrames in Pandas

Key Insights

  • Use merge() for SQL-style joins on columns, join() for index-based joins, and concat() for stacking DataFrames—each has distinct performance characteristics and use cases.
  • Always verify your join results with indicator=True and check for unexpected row count changes; silent data duplication from many-to-many joins is a common source of bugs.
  • Setting indexes before joining large DataFrames can improve performance by 2-10x compared to joining on columns directly.

Introduction to DataFrame Joins

Combining data from multiple sources is one of the most common operations in data analysis. Whether you’re merging customer records with transaction data, combining time series from different sensors, or consolidating reports from multiple departments, you’ll need to join DataFrames.

Pandas provides three primary methods for combining data: merge(), join(), and concat(). Each serves a different purpose, and understanding when to use which will save you debugging time and prevent subtle data quality issues.

Let’s create two sample DataFrames that we’ll use throughout this article:

import pandas as pd
import numpy as np

# Customer data
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'city': ['NYC', 'LA', 'Chicago', 'NYC', 'Boston']
})

# Order data (note: customer_id 6 doesn't exist in customers)
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 2, 6, 3],
    'amount': [250, 150, 300, 400, 175]
})

print("Customers:")
print(customers)
print("\nOrders:")
print(orders)

Using merge() for SQL-Style Joins

The merge() function is your go-to tool for combining DataFrames based on column values. It mirrors SQL JOIN syntax and supports all standard join types.

Inner Join (Default)

An inner join returns only rows where the key exists in both DataFrames:

inner_result = pd.merge(customers, orders, on='customer_id')
print(inner_result)

Output:

   customer_id     name     city  order_id  amount
0            1    Alice      NYC       101     250
1            2      Bob       LA       102     150
2            2      Bob       LA       103     300
3            3  Charlie  Chicago       105     175

Notice that customer_id 6 (from orders) and customers 4 and 5 are excluded because they don’t have matches in both tables.

Left, Right, and Outer Joins

# Left join: keep all customers, match orders where available
left_result = pd.merge(customers, orders, on='customer_id', how='left')
print("Left Join:")
print(left_result)

# Right join: keep all orders, match customers where available
right_result = pd.merge(customers, orders, on='customer_id', how='right')
print("\nRight Join:")
print(right_result)

# Outer join: keep everything from both DataFrames
outer_result = pd.merge(customers, orders, on='customer_id', how='outer')
print("\nOuter Join:")
print(outer_result)

Joining on Different Column Names

When your key columns have different names, use left_on and right_on:

# Rename for demonstration
orders_renamed = orders.rename(columns={'customer_id': 'cust_id'})

result = pd.merge(
    customers, 
    orders_renamed, 
    left_on='customer_id', 
    right_on='cust_id'
)
print(result)

This creates both columns in the output. Drop the redundant one if needed:

result = result.drop(columns=['cust_id'])

The join() Method for Index-Based Joins

The join() method is optimized for joining on indexes. It’s more concise than merge() when your join keys are already indexes.

# Set indexes
customers_indexed = customers.set_index('customer_id')
orders_indexed = orders.set_index('customer_id')

# Join on index
joined = customers_indexed.join(orders_indexed, how='inner')
print(joined)

When to Use join() vs merge()

Use join() when:

  • Your DataFrames already have meaningful indexes
  • You’re joining on the index of the left DataFrame
  • You want cleaner syntax for simple joins

Use merge() when:

  • You’re joining on columns (not indexes)
  • You need to join on multiple columns
  • Your key columns have different names

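For example, merging on multiple columns just takes a list for the on parameter. Here's a small sketch with made-up store/date data (the sales and staff frames are illustrative, not from the article's dataset):

```python
import pandas as pd

# Hypothetical per-store data keyed by the (store, date) combination
sales = pd.DataFrame({
    'store': ['A', 'A', 'B'],
    'date': ['2024-01-01', '2024-01-02', '2024-01-01'],
    'revenue': [100, 120, 90]
})
staff = pd.DataFrame({
    'store': ['A', 'B', 'B'],
    'date': ['2024-01-01', '2024-01-01', '2024-01-02'],
    'headcount': [3, 2, 2]
})

# Pass a list to `on` to match rows only when both columns agree
combined = pd.merge(sales, staff, on=['store', 'date'], how='inner')
print(combined)  # matches (A, 2024-01-01) and (B, 2024-01-01)
```

join() offers no direct equivalent for multi-column keys, which is one reason merge() is the more general tool.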
You can also join an indexed DataFrame with a column from another:

# Join customers (indexed) with orders on orders' customer_id column
result = customers_indexed.join(orders.set_index('customer_id'), how='left')

Handling Column Name Conflicts

When both DataFrames have columns with the same name, use the lsuffix and rsuffix parameters:

df1 = pd.DataFrame({'value': [1, 2, 3]}, index=['a', 'b', 'c'])
df2 = pd.DataFrame({'value': [4, 5, 6]}, index=['a', 'b', 'd'])

result = df1.join(df2, lsuffix='_left', rsuffix='_right', how='outer')
print(result)

Concatenating DataFrames with concat()

While merge() and join() combine DataFrames horizontally based on keys, concat() stacks DataFrames either vertically or horizontally.

Vertical Concatenation (Stacking Rows)

# Monthly sales data
jan_sales = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'sales': [100, 200, 150],
    'month': 'Jan'
})

feb_sales = pd.DataFrame({
    'product': ['A', 'B', 'D'],
    'sales': [120, 180, 90],
    'month': 'Feb'
})

# Stack vertically
all_sales = pd.concat([jan_sales, feb_sales], ignore_index=True)
print(all_sales)

The ignore_index=True parameter creates a fresh index. Without it, you’d have duplicate index values.
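You can see the difference with two tiny throwaway frames:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})  # default index: 0, 1
b = pd.DataFrame({'x': [3, 4]})  # default index: 0, 1

stacked = pd.concat([a, b])                    # indexes collide
fresh = pd.concat([a, b], ignore_index=True)   # fresh 0..3 index

print(stacked.index.tolist())  # [0, 1, 0, 1]
print(fresh.index.tolist())    # [0, 1, 2, 3]
```

Duplicate index values can later cause surprising results with .loc lookups, so prefer ignore_index=True unless the original indexes carry meaning.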

Horizontal Concatenation

# Concat along columns
metrics_a = pd.DataFrame({'revenue': [100, 200]}, index=['Q1', 'Q2'])
metrics_b = pd.DataFrame({'costs': [80, 150]}, index=['Q1', 'Q2'])

combined = pd.concat([metrics_a, metrics_b], axis=1)
print(combined)

Handling Mismatched Columns

When DataFrames have different columns, concat() fills missing values with NaN:

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'B': [5, 6], 'C': [7, 8]})

result = pd.concat([df1, df2], ignore_index=True)
print(result)

Output:

     A  B    C
0  1.0  3  NaN
1  2.0  4  NaN
2  NaN  5  7.0
3  NaN  6  8.0

Handling Common Join Issues

Dealing with Duplicate Column Names

Use the suffixes parameter in merge() to handle overlapping column names:

df1 = pd.DataFrame({
    'id': [1, 2],
    'value': [10, 20],
    'status': ['active', 'inactive']
})

df2 = pd.DataFrame({
    'id': [1, 2],
    'value': [100, 200],
    'category': ['A', 'B']
})

result = pd.merge(df1, df2, on='id', suffixes=('_original', '_updated'))
print(result)

Debugging Joins with indicator

The indicator parameter adds a column showing where each row came from:

result = pd.merge(
    customers, 
    orders, 
    on='customer_id', 
    how='outer', 
    indicator=True
)
print(result)
print("\nRow sources:")
print(result['_merge'].value_counts())

This is invaluable for debugging unexpected join results.
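Alongside indicator, merge() also accepts a validate argument that raises a MergeError when the join cardinality isn't what you expect, which catches accidental many-to-many duplication early. A small sketch using the same customer and order data:

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve']
})
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 2, 6, 3]
})

# Customer 2 has two orders, so a strict one-to-one check fails loudly
caught = False
try:
    pd.merge(customers, orders, on='customer_id', validate='one_to_one')
except pd.errors.MergeError:
    caught = True

# One-to-many is the relationship we actually expect, so this passes
checked = pd.merge(customers, orders, on='customer_id', validate='one_to_many')
print(f"one_to_one rejected: {caught}, result rows: {len(checked)}")
```

An exception at merge time is far easier to diagnose than silently duplicated rows discovered downstream.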

Key Column Data Type Mismatches

A common gotcha: a join can fail silently when the key columns have different dtypes. Instead of an error, you simply get zero matches.

# This creates problems
df1 = pd.DataFrame({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
df2 = pd.DataFrame({'id': ['1', '2', '4'], 'score': [10, 20, 30]})

# No matches because int != string
result = pd.merge(df1, df2, on='id')
print(f"Rows in result: {len(result)}")  # 0 rows!

# Fix: ensure matching types
df2['id'] = df2['id'].astype(int)
result = pd.merge(df1, df2, on='id')
print(f"Rows after type fix: {len(result)}")  # 2 rows

Always verify column types before joining:

print(df1['id'].dtype, df2['id'].dtype)

Performance Tips for Large DataFrames

Set Indexes Before Joining

Joining on indexes is significantly faster than joining on columns:

import time

# Create larger DataFrames for benchmarking
n = 100000
large_df1 = pd.DataFrame({
    'key': np.random.randint(0, 10000, n),
    'value1': np.random.randn(n)
})
large_df2 = pd.DataFrame({
    'key': np.random.randint(0, 10000, n),
    'value2': np.random.randn(n)
})

# Merge on columns
start = time.time()
result1 = pd.merge(large_df1, large_df2, on='key')
column_time = time.time() - start

# Merge on index
df1_indexed = large_df1.set_index('key').sort_index()
df2_indexed = large_df2.set_index('key').sort_index()

start = time.time()
result2 = df1_indexed.join(df2_indexed, lsuffix='_1', rsuffix='_2')
index_time = time.time() - start

print(f"Column merge: {column_time:.3f}s")
print(f"Index join: {index_time:.3f}s")
print(f"Speedup: {column_time/index_time:.1f}x")

Memory Considerations

  • Inner joins produce the smallest output
  • Outer joins can explode memory with many-to-many relationships
  • Use copy=False in concat() when you don’t need to modify the result
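One way to estimate the blow-up before committing to a large join is to compare key multiplicities: for each key, the output contains (count in left) times (count in right) rows. A rough sketch with toy data:

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 1, 2, 2, 2]})
right = pd.DataFrame({'key': [1, 1, 1, 2, 2]})

# Per-key output size = (occurrences in left) * (occurrences in right)
left_counts = left['key'].value_counts()
right_counts = right['key'].value_counts()
expected_rows = (left_counts * right_counts).fillna(0).sum()

print(f"Inner join will produce {int(expected_rows)} rows")  # 2*3 + 3*2 = 12
```

If the estimate is orders of magnitude larger than either input, you likely have an unintended many-to-many relationship.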

Quick Reference Summary

Method    Best For                    Join Keys       Performance
merge()   SQL-style joins on columns  Columns         Good
join()    Index-based joins           Index           Better
concat()  Stacking DataFrames         Position/Index  Fast

Decision Guide:

  1. Stacking rows from similar DataFrames? Use concat(axis=0)
  2. Combining columns side-by-side? Use concat(axis=1) or join()
  3. Matching rows by key values? Use merge()
  4. DataFrames already have meaningful indexes? Use join()
  5. Need SQL-style join control? Use merge()

Master these three methods, and you’ll handle any DataFrame combination task efficiently. Start with merge() for most use cases, graduate to join() when working with indexed data, and reach for concat() when stacking similar structures.
