Pandas - Inner Join DataFrames
Key Insights
- Inner joins in Pandas use the merge() or join() methods to combine DataFrames based on common columns, keeping only matching rows from both datasets
- The on parameter specifies join keys, while left_on/right_on handle cases where key columns have different names across DataFrames
- Performance optimization requires proper indexing, especially for large datasets where setting the join column as an index before merging significantly reduces execution time
Understanding Inner Joins in Pandas
An inner join combines two DataFrames by matching rows based on common column values, retaining only the rows where matches exist in both datasets. This is the default join type in Pandas and the most commonly used merge operation.
import pandas as pd
# Create sample DataFrames
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'city': ['NYC', 'LA', 'Chicago', 'Houston', 'Phoenix']
})
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'customer_id': [1, 2, 2, 6],
    'amount': [250, 150, 300, 100]
})
# Perform inner join
result = pd.merge(customers, orders, on='customer_id', how='inner')
print(result)
Output:
   customer_id   name city  order_id  amount
0            1  Alice  NYC       101     250
1            2    Bob   LA       102     150
2            2    Bob   LA       103     300
Notice that customers 3, 4, and 5 are excluded because they have no matching orders, and order 104 (customer_id 6) is excluded because that customer doesn’t exist.
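Because 'inner' is the default value of merge()'s how parameter, the explicit argument above can be omitted; a quick sketch confirming the equivalence:

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
})
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'customer_id': [1, 2, 2, 6],
})

# how defaults to 'inner', so these two calls produce identical results
explicit = pd.merge(customers, orders, on='customer_id', how='inner')
implicit = pd.merge(customers, orders, on='customer_id')
assert explicit.equals(implicit)
```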
Merging on Multiple Columns
Real-world scenarios often require joining on multiple columns to establish unique relationships between datasets.
sales = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'product': ['A', 'B', 'A', 'B'],
    'revenue': [1000, 1500, 1200, 1800]
})
targets = pd.DataFrame({
    'region': ['North', 'North', 'South', 'West'],
    'product': ['A', 'B', 'A', 'A'],
    'target': [900, 1400, 1100, 1000]
})
# Join on both region and product
combined = pd.merge(sales, targets, on=['region', 'product'], how='inner')
print(combined)
Output:
  region product  revenue  target
0  North       A     1000     900
1  North       B     1500    1400
2  South       A     1200    1100
The South-B combination from sales and West-A from targets are excluded because they don’t have matching pairs in both DataFrames.
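When each (region, product) pair is expected to appear at most once per DataFrame, merge's validate parameter can catch accidental duplicates before they silently multiply rows; a minimal sketch using the same data:

```python
import pandas as pd

sales = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'product': ['A', 'B', 'A', 'B'],
    'revenue': [1000, 1500, 1200, 1800],
})
targets = pd.DataFrame({
    'region': ['North', 'North', 'South', 'West'],
    'product': ['A', 'B', 'A', 'A'],
    'target': [900, 1400, 1100, 1000],
})

# validate='one_to_one' raises MergeError if (region, product)
# is duplicated in either DataFrame
combined = pd.merge(sales, targets, on=['region', 'product'],
                    how='inner', validate='one_to_one')
```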
Handling Different Column Names
When join keys have different names across DataFrames, use left_on and right_on parameters.
employees = pd.DataFrame({
    'emp_id': [101, 102, 103, 104],
    'emp_name': ['John', 'Sarah', 'Mike', 'Lisa'],
    'dept_id': [1, 2, 1, 3]
})
departments = pd.DataFrame({
    'department_id': [1, 2, 3, 4],
    'dept_name': ['Engineering', 'Marketing', 'Sales', 'HR']
})
# Join using different column names
result = pd.merge(
    employees,
    departments,
    left_on='dept_id',
    right_on='department_id',
    how='inner'
)
print(result)
Output:
   emp_id emp_name  dept_id  department_id    dept_name
0     101     John        1              1  Engineering
1     102    Sarah        2              2    Marketing
2     103     Mike        1              1  Engineering
3     104     Lisa        3              3        Sales
Note that both dept_id and department_id columns are retained. Remove redundant columns:
result = result.drop('department_id', axis=1)
# Or select specific columns during merge
result = pd.merge(
    employees,
    departments,
    left_on='dept_id',
    right_on='department_id'
)[['emp_id', 'emp_name', 'dept_id', 'dept_name']]
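Another option is renaming the key before the merge so both sides share a single column name, avoiding the duplicate entirely; a sketch using the same employees and departments data:

```python
import pandas as pd

employees = pd.DataFrame({
    'emp_id': [101, 102, 103, 104],
    'emp_name': ['John', 'Sarah', 'Mike', 'Lisa'],
    'dept_id': [1, 2, 1, 3],
})
departments = pd.DataFrame({
    'department_id': [1, 2, 3, 4],
    'dept_name': ['Engineering', 'Marketing', 'Sales', 'HR'],
})

# Align the key names first, then merge on the single shared column
result = pd.merge(
    employees,
    departments.rename(columns={'department_id': 'dept_id'}),
    on='dept_id',
    how='inner',
)
```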
Using Index-Based Joins
The join() method provides a concise syntax for index-based merges, particularly useful when DataFrames are already indexed.
# Create DataFrames with indexes
products = pd.DataFrame({
    'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics']
}, index=[1, 2, 3, 4])
inventory = pd.DataFrame({
    'stock': [50, 200, 150, 30],
    'warehouse': ['A', 'B', 'A', 'C']
}, index=[1, 2, 3, 5])
# Join on index
result = products.join(inventory, how='inner')
print(result)
Output:
  product_name     category  stock warehouse
1       Laptop  Electronics     50         A
2        Mouse  Accessories    200         B
3     Keyboard  Accessories    150         A
For joining on a column while one DataFrame uses an index:
orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003],
    'product_id': [1, 2, 1],
    'quantity': [2, 1, 3]
})
# Join orders on product_id with products index
result = pd.merge(
    orders,
    products,
    left_on='product_id',
    right_index=True,
    how='inner'
)
print(result)
Performance Optimization Strategies
For large datasets, join performance becomes critical. Setting the join column as the index before merging can substantially reduce merge time, especially when the indexed DataFrames are reused across multiple joins.
import numpy as np
import time
# Create large DataFrames
np.random.seed(42)
df1 = pd.DataFrame({
    'key': np.random.randint(0, 100000, 500000),
    'value1': np.random.randn(500000)
})
df2 = pd.DataFrame({
    'key': np.random.randint(0, 100000, 500000),
    'value2': np.random.randn(500000)
})
# Standard merge
start = time.time()
result1 = pd.merge(df1, df2, on='key', how='inner')
standard_time = time.time() - start
# Indexed merge
df1_indexed = df1.set_index('key')
df2_indexed = df2.set_index('key')
start = time.time()
result2 = df1_indexed.join(df2_indexed, how='inner')
indexed_time = time.time() - start
print(f"Standard merge: {standard_time:.4f}s")
print(f"Indexed merge: {indexed_time:.4f}s")
print(f"Speedup: {standard_time/indexed_time:.2f}x")
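One caveat with the benchmark above: it measures only the join itself, not the time spent building the indexes. A sketch that folds set_index into the timed region (with smaller data for brevity):

```python
import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df1 = pd.DataFrame({'key': rng.integers(0, 1000, 5000),
                    'value1': rng.standard_normal(5000)})
df2 = pd.DataFrame({'key': rng.integers(0, 1000, 5000),
                    'value2': rng.standard_normal(5000)})

# Time the full indexed path, including index construction
start = time.time()
result = df1.set_index('key').join(df2.set_index('key'), how='inner')
total_time = time.time() - start

# The indexed approach pays off most when the same indexed frames
# are joined repeatedly, amortizing the set_index cost.
```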
Additional optimization techniques:
# Convert join keys with relatively few unique values to categorical
# dtype; this can reduce memory usage and speed up key comparisons
df1['key'] = df1['key'].astype('category')
df2['key'] = df2['key'].astype('category')
# Sorting by the key keeps merge output predictable and helps
# downstream operations, though merge itself uses a hash join
df1_sorted = df1.sort_values('key')
df2_sorted = df2.sort_values('key')
# indicator adds a _merge column showing each row's origin; with an
# inner join every surviving row is labeled 'both', so it is most
# informative when auditing outer joins
result = pd.merge(df1, df2, on='key', how='inner', indicator=True)
print(result['_merge'].value_counts())
Handling Duplicate Keys
When duplicate keys exist, inner joins create a Cartesian product of matching rows.
customers = pd.DataFrame({
    'customer_id': [1, 2, 2],
    'name': ['Alice', 'Bob', 'Bob Jr.']
})
orders = pd.DataFrame({
    'customer_id': [1, 2, 2],
    'order_id': [101, 102, 103]
})
result = pd.merge(customers, orders, on='customer_id', how='inner')
print(result)
Output:
   customer_id     name  order_id
0            1    Alice       101
1            2      Bob       102
2            2      Bob       103
3            2  Bob Jr.       102
4            2  Bob Jr.       103
Customer ID 2 appears twice in both DataFrames, resulting in 4 combinations (2 × 2). Handle this by:
# Remove duplicates before merging
customers_unique = customers.drop_duplicates(subset='customer_id')
# Or aggregate before joining
customers_agg = customers.groupby('customer_id').first().reset_index()
result = pd.merge(customers_agg, orders, on='customer_id', how='inner')
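To fail fast rather than silently multiplying rows, merge's validate parameter asserts the expected key cardinality and raises MergeError when it is violated; a sketch with the same duplicated data:

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 2],
    'name': ['Alice', 'Bob', 'Bob Jr.'],
})
orders = pd.DataFrame({
    'customer_id': [1, 2, 2],
    'order_id': [101, 102, 103],
})

# Expecting customer_id to be unique on the left side; the duplicate
# entry for customer 2 triggers pandas.errors.MergeError
try:
    pd.merge(customers, orders, on='customer_id',
             how='inner', validate='one_to_many')
    raised = False
except pd.errors.MergeError:
    raised = True
```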
Suffix Handling for Overlapping Columns
When both DataFrames contain columns with identical names (excluding join keys), Pandas adds suffixes to distinguish them.
sales_q1 = pd.DataFrame({
    'product_id': [1, 2, 3],
    'revenue': [1000, 1500, 1200],
    'units': [10, 15, 12]
})
sales_q2 = pd.DataFrame({
    'product_id': [1, 2, 4],
    'revenue': [1100, 1600, 1300],
    'units': [11, 16, 13]
})
# Default suffixes (_x, _y)
result = pd.merge(sales_q1, sales_q2, on='product_id', how='inner')
print(result.columns)
# Index(['product_id', 'revenue_x', 'units_x', 'revenue_y', 'units_y'])
# Custom suffixes
result = pd.merge(
    sales_q1, sales_q2,
    on='product_id',
    how='inner',
    suffixes=('_q1', '_q2')
)
print(result)
This produces clearly labeled columns for comparing quarterly data across the same products.
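With distinct suffixes in place, quarter-over-quarter comparisons follow directly; a sketch computing revenue growth for the products present in both quarters:

```python
import pandas as pd

sales_q1 = pd.DataFrame({'product_id': [1, 2, 3],
                         'revenue': [1000, 1500, 1200],
                         'units': [10, 15, 12]})
sales_q2 = pd.DataFrame({'product_id': [1, 2, 4],
                         'revenue': [1100, 1600, 1300],
                         'units': [11, 16, 13]})

result = pd.merge(sales_q1, sales_q2, on='product_id',
                  how='inner', suffixes=('_q1', '_q2'))

# Only products 1 and 2 appear in both quarters, so the inner join
# keeps exactly those rows; compute their revenue change
result['revenue_growth'] = result['revenue_q2'] - result['revenue_q1']
```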