Pandas - Inner Join DataFrames

An inner join combines two DataFrames by matching rows on shared column values, keeping only the rows that have matches in both datasets. It is the default join type in Pandas and the most commonly used merge operation.

Key Insights

  • Inner joins in Pandas use merge() or join() methods to combine DataFrames based on common columns, keeping only matching rows from both datasets
  • The on parameter specifies join keys, while left_on/right_on handle cases where key columns have different names across DataFrames
  • Performance optimization requires proper indexing, especially for large datasets where setting the join column as an index before merging significantly reduces execution time

Understanding Inner Joins in Pandas

An inner join combines two DataFrames by matching rows based on common column values, retaining only the rows where matches exist in both datasets. This is the default join type in Pandas and the most commonly used merge operation.

import pandas as pd

# Create sample DataFrames
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'city': ['NYC', 'LA', 'Chicago', 'Houston', 'Phoenix']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'customer_id': [1, 2, 2, 6],
    'amount': [250, 150, 300, 100]
})

# Perform inner join
result = pd.merge(customers, orders, on='customer_id', how='inner')
print(result)

Output:

   customer_id     name     city  order_id  amount
0            1    Alice      NYC       101     250
1            2      Bob       LA       102     150
2            2      Bob       LA       103     300

Notice that customers 3, 4, and 5 are excluded because they have no matching orders, and order 104 (customer_id 6) is excluded because that customer doesn’t exist.
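When you need to verify exactly which rows an inner join will discard, one approach (a quick sketch, reusing the same customers and orders frames) is an outer merge with indicator=True, then filtering for rows that did not appear on both sides:

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'customer_id': [1, 2, 2, 6],
    'amount': [250, 150, 300, 100],
})

# An outer merge with indicator=True adds a _merge column that labels
# each row 'left_only', 'right_only', or 'both' -- everything that is
# not 'both' is what an inner join would drop
audit = pd.merge(customers, orders, on='customer_id',
                 how='outer', indicator=True)
dropped = audit[audit['_merge'] != 'both']
print(dropped[['customer_id', '_merge']])
```

This audit step is useful before committing to an inner join on unfamiliar data, since silently losing rows is the most common inner-join surprise.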

Merging on Multiple Columns

Real-world scenarios often require joining on multiple columns to establish unique relationships between datasets.

sales = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'product': ['A', 'B', 'A', 'B'],
    'revenue': [1000, 1500, 1200, 1800]
})

targets = pd.DataFrame({
    'region': ['North', 'North', 'South', 'West'],
    'product': ['A', 'B', 'A', 'A'],
    'target': [900, 1400, 1100, 1000]
})

# Join on both region and product
combined = pd.merge(sales, targets, on=['region', 'product'], how='inner')
print(combined)

Output:

  region product  revenue  target
0  North       A     1000     900
1  North       B     1500    1400
2  South       A     1200    1100

The South-B combination from sales and West-A from targets are excluded because they don’t have matching pairs in both DataFrames.

Handling Different Column Names

When join keys have different names across DataFrames, use left_on and right_on parameters.

employees = pd.DataFrame({
    'emp_id': [101, 102, 103, 104],
    'emp_name': ['John', 'Sarah', 'Mike', 'Lisa'],
    'dept_id': [1, 2, 1, 3]
})

departments = pd.DataFrame({
    'department_id': [1, 2, 3, 4],
    'dept_name': ['Engineering', 'Marketing', 'Sales', 'HR']
})

# Join using different column names
result = pd.merge(
    employees, 
    departments, 
    left_on='dept_id', 
    right_on='department_id', 
    how='inner'
)
print(result)

Output:

   emp_id emp_name  dept_id  department_id       dept_name
0     101     John        1              1     Engineering
1     102    Sarah        2              2       Marketing
2     103     Mike        1              1     Engineering
3     104     Lisa        3              3           Sales

Note that both dept_id and department_id columns are retained. Remove redundant columns:

result = result.drop('department_id', axis=1)
# Or select specific columns during merge
result = pd.merge(
    employees, 
    departments, 
    left_on='dept_id', 
    right_on='department_id'
)[['emp_id', 'emp_name', 'dept_id', 'dept_name']]

Using Index-Based Joins

The join() method provides a concise syntax for index-based merges, particularly useful when DataFrames are already indexed.

# Create DataFrames with indexes
products = pd.DataFrame({
    'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics']
}, index=[1, 2, 3, 4])

inventory = pd.DataFrame({
    'stock': [50, 200, 150, 30],
    'warehouse': ['A', 'B', 'A', 'C']
}, index=[1, 2, 3, 5])

# Join on index
result = products.join(inventory, how='inner')
print(result)

Output:

  product_name     category  stock warehouse
1       Laptop  Electronics     50         A
2        Mouse  Accessories    200         B
3     Keyboard  Accessories    150         A

For joining on a column while one DataFrame uses an index:

orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003],
    'product_id': [1, 2, 1],
    'quantity': [2, 1, 3]
})

# Join orders on product_id with products index
result = pd.merge(
    orders,
    products,
    left_on='product_id',
    right_index=True,
    how='inner'
)
print(result)
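One gotcha worth knowing: unlike merge(), which silently applies _x/_y suffixes, join() raises a ValueError when non-key columns overlap unless you supply suffixes explicitly. A minimal illustration (with made-up frames):

```python
import pandas as pd

left = pd.DataFrame({'value': [1, 2]}, index=['a', 'b'])
right = pd.DataFrame({'value': [10, 20]}, index=['a', 'b'])

# join() requires lsuffix/rsuffix when column names collide;
# omitting them here would raise a ValueError
result = left.join(right, how='inner', lsuffix='_left', rsuffix='_right')
print(result.columns.tolist())
```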

Performance Optimization Strategies

For large datasets, join performance becomes critical. Setting appropriate indexes before merging dramatically improves execution time.

import numpy as np
import time

# Create large DataFrames
np.random.seed(42)
df1 = pd.DataFrame({
    'key': np.random.randint(0, 100000, 500000),
    'value1': np.random.randn(500000)
})

df2 = pd.DataFrame({
    'key': np.random.randint(0, 100000, 500000),
    'value2': np.random.randn(500000)
})

# Standard merge
start = time.time()
result1 = pd.merge(df1, df2, on='key', how='inner')
standard_time = time.time() - start

# Indexed merge (note: set_index cost is excluded from this timing)
df1_indexed = df1.set_index('key')
df2_indexed = df2.set_index('key')

start = time.time()
result2 = df1_indexed.join(df2_indexed, how='inner')
indexed_time = time.time() - start

print(f"Standard merge: {standard_time:.4f}s")
print(f"Indexed merge: {indexed_time:.4f}s")
print(f"Speedup: {standard_time/indexed_time:.2f}x")

Additional optimization techniques:

# Converting join keys to categorical dtype can reduce memory use
# when keys repeat heavily; merge on the categorical column itself
df1['category_key'] = df1['key'].astype('category')
df2['category_key'] = df2['key'].astype('category')

# Pre-sorting by the join key keeps the output ordered and can help
# downstream operations, though merge itself does not require it
df1_sorted = df1.sort_values('key')
df2_sorted = df2.sort_values('key')

# indicator=True adds a _merge column showing each row's origin;
# with an inner join every surviving row is 'both', so this is most
# informative combined with how='outer' when auditing a join
result = pd.merge(df1, df2, on='key', how='inner', indicator=True)
print(result['_merge'].value_counts())

Handling Duplicate Keys

When duplicate keys exist, inner joins create a Cartesian product of matching rows.

customers = pd.DataFrame({
    'customer_id': [1, 2, 2],
    'name': ['Alice', 'Bob', 'Bob Jr.']
})

orders = pd.DataFrame({
    'customer_id': [1, 2, 2],
    'order_id': [101, 102, 103]
})

result = pd.merge(customers, orders, on='customer_id', how='inner')
print(result)

Output:

   customer_id     name  order_id
0            1    Alice       101
1            2      Bob       102
2            2      Bob       103
3            2  Bob Jr.       102
4            2  Bob Jr.       103

Customer ID 2 appears twice in both DataFrames, resulting in 4 combinations (2 × 2). Handle this by:

# Remove duplicates before merging
customers_unique = customers.drop_duplicates(subset='customer_id')

# Or aggregate before joining
customers_agg = customers.groupby('customer_id').first().reset_index()
result = pd.merge(customers_agg, orders, on='customer_id', how='inner')
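If duplicate keys indicate a data problem rather than an intended many-to-many relationship, you can have Pandas catch them for you. The merge() method's validate parameter checks the key cardinality and raises pandas.errors.MergeError when it doesn't hold; a sketch using the duplicated frames above:

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 2],
    'name': ['Alice', 'Bob', 'Bob Jr.'],
})

orders = pd.DataFrame({
    'customer_id': [1, 2, 2],
    'order_id': [101, 102, 103],
})

# validate='one_to_many' requires the left-side keys to be unique;
# the duplicated customer_id 2 in customers makes the merge raise
try:
    pd.merge(customers, orders, on='customer_id',
             how='inner', validate='one_to_many')
except pd.errors.MergeError as exc:
    print('MergeError:', exc)
```

Failing loudly at merge time is usually preferable to discovering an unexpected Cartesian explosion in row counts downstream.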

Suffix Handling for Overlapping Columns

When both DataFrames contain columns with identical names (excluding join keys), Pandas adds suffixes to distinguish them.

sales_q1 = pd.DataFrame({
    'product_id': [1, 2, 3],
    'revenue': [1000, 1500, 1200],
    'units': [10, 15, 12]
})

sales_q2 = pd.DataFrame({
    'product_id': [1, 2, 4],
    'revenue': [1100, 1600, 1300],
    'units': [11, 16, 13]
})

# Default suffixes (_x, _y)
result = pd.merge(sales_q1, sales_q2, on='product_id', how='inner')
print(result.columns)
# Index(['product_id', 'revenue_x', 'units_x', 'revenue_y', 'units_y'], dtype='object')

# Custom suffixes
result = pd.merge(
    sales_q1, sales_q2, 
    on='product_id', 
    how='inner',
    suffixes=('_q1', '_q2')
)
print(result)

This produces clearly labeled columns for comparing quarterly data across the same products.
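With clearly suffixed columns, cross-quarter calculations become straightforward. A brief follow-on sketch (the growth_pct column is an illustrative addition, not part of the original example):

```python
import pandas as pd

sales_q1 = pd.DataFrame({
    'product_id': [1, 2, 3],
    'revenue': [1000, 1500, 1200],
})

sales_q2 = pd.DataFrame({
    'product_id': [1, 2, 4],
    'revenue': [1100, 1600, 1300],
})

result = pd.merge(sales_q1, sales_q2, on='product_id',
                  how='inner', suffixes=('_q1', '_q2'))

# Suffixed columns make quarter-over-quarter growth a one-liner
result['growth_pct'] = (result['revenue_q2'] / result['revenue_q1'] - 1) * 100
print(result[['product_id', 'revenue_q1', 'revenue_q2', 'growth_pct']])
```

Only products 1 and 2 survive the inner join, since they are the only ones sold in both quarters.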
