How to Inner Join in Pandas
An inner join combines two DataFrames by keeping only the rows where the join key exists in both tables. If a key appears in one DataFrame but not the other, that row gets dropped. This makes inner...
Key Insights
- The
merge()function is the most versatile and explicit way to perform inner joins in Pandas, offering fine-grained control over join columns and handling of duplicates. - Use
join()when your DataFrames are already indexed on the join keys—it’s faster and more concise for index-based operations. - Always specify
suffixeswhen joining DataFrames with overlapping column names to avoid confusion and maintain clean, readable output.
Introduction
An inner join combines two DataFrames by keeping only the rows where the join key exists in both tables. If a key appears in one DataFrame but not the other, that row gets dropped. This makes inner joins the strictest type of join—and often the most useful when you need complete data from both sources.
In data analysis, you’ll reach for inner joins when combining related datasets where missing relationships indicate invalid or incomplete data. Think customer orders joined with customer profiles, or sales records joined with product catalogs. If an order references a customer that doesn’t exist, you probably want to exclude it.
Pandas provides two main methods for inner joins: merge() and join(). Both accomplish the same goal, but they differ in syntax, flexibility, and performance characteristics. This article covers both approaches with practical examples you can apply immediately.
Understanding Inner Joins Conceptually
Picture two tables side by side. The left table contains employee IDs and names. The right table contains employee IDs and departments. An inner join matches rows where the employee ID appears in both tables, producing a combined result with name and department for each matching employee.
Left DataFrame Right DataFrame Inner Join Result
+----+-------+ +----+------------+ +----+-------+------------+
| id | name | | id | department | | id | name | department |
+----+-------+ +----+------------+ +----+-------+------------+
| 1 | Alice | | 1 | Engineering| | 1 | Alice | Engineering|
| 2 | Bob | | 3 | Sales | | 3 | Carol | Sales |
| 3 | Carol | | 4 | Marketing | +----+-------+------------+
+----+-------+ +----+------------+
Notice that Bob (id=2) disappears because there’s no matching department record. Similarly, the Marketing department (id=4) disappears because there’s no matching employee. Only the intersection survives.
Let’s create sample DataFrames to use throughout this article:
import pandas as pd
# Employee information
employees = pd.DataFrame({
'emp_id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Carol', 'David', 'Eve'],
'hire_year': [2019, 2020, 2018, 2021, 2019]
})
# Department assignments (not all employees have assignments)
departments = pd.DataFrame({
'emp_id': [1, 2, 4, 6, 7],
'department': ['Engineering', 'Sales', 'Marketing', 'HR', 'Finance'],
'budget': [500000, 300000, 250000, 150000, 400000]
})
print("Employees:")
print(employees)
print("\nDepartments:")
print(departments)
Output:
Employees:
emp_id name hire_year
0 1 Alice 2019
1 2 Bob 2020
2 3 Carol 2018
3 4 David 2021
4 5 Eve 2019
Departments:
emp_id department budget
0 1 Engineering 500000
1 2 Sales 300000
2 4 Marketing 250000
3 6 HR 150000
4 7 Finance 400000
Employees 3 and 5 have no department assignment. Departments with emp_id 6 and 7 reference non-existent employees. An inner join will exclude all of these.
Using merge() for Inner Joins
The merge() function is the workhorse for joining DataFrames in Pandas. It’s explicit, flexible, and handles most joining scenarios you’ll encounter.
# Basic inner join using merge()
result = pd.merge(employees, departments, on='emp_id', how='inner')
print(result)
Output:
emp_id name hire_year department budget
0 1 Alice 2019 Engineering 500000
1 2 Bob 2020 Sales 300000
2 4 David 2021 Marketing 250000
Only three rows survive—the employees who have matching department records.
The how='inner' parameter is actually the default, so you can omit it. However, I recommend always specifying it explicitly. Your future self (and teammates) will thank you for the clarity.
When join columns have different names in each DataFrame, use left_on and right_on:
# Rename column in one DataFrame for demonstration
dept_renamed = departments.rename(columns={'emp_id': 'employee_id'})
result = pd.merge(
employees,
dept_renamed,
left_on='emp_id',
right_on='employee_id',
how='inner'
)
print(result)
Output:
emp_id name hire_year employee_id department budget
0 1 Alice 2019 1 Engineering 500000
1 2 Bob 2020 2 Sales 300000
2 4 David 2021 4 Marketing 250000
Note that both join columns appear in the result. You’ll typically want to drop one of them afterward.
Using join() Method
The join() method provides a more concise syntax when your DataFrames use indexes as join keys. It’s a method called on a DataFrame rather than a standalone function.
# Set indexes for join()
emp_indexed = employees.set_index('emp_id')
dept_indexed = departments.set_index('emp_id')
result = emp_indexed.join(dept_indexed, how='inner')
print(result)
Output:
name hire_year department budget
emp_id
1 Alice 2019 Engineering 500000
2 Bob 2020 Sales 300000
4 David 2021 Marketing 250000
The join() method defaults to joining on the index of the calling DataFrame and the index of the passed DataFrame. This makes it ideal when your data is already indexed appropriately—common when working with time series or when you’ve loaded data with an index column.
You can also join on a column of the left DataFrame against the index of the right DataFrame using the on parameter:
# Left DataFrame uses column, right DataFrame uses index
result = employees.join(dept_indexed, on='emp_id', how='inner')
print(result)
This flexibility makes join() useful in pipelines where one DataFrame is naturally indexed and another isn’t.
Joining on Multiple Columns
Real-world data often requires matching on multiple columns. Consider a scenario where you need to join on both employee ID and year:
# Sales data by employee and year
sales = pd.DataFrame({
'emp_id': [1, 1, 2, 2, 3],
'year': [2022, 2023, 2022, 2023, 2023],
'revenue': [50000, 75000, 30000, 45000, 60000]
})
# Targets by employee and year
targets = pd.DataFrame({
'emp_id': [1, 1, 2, 3, 3],
'year': [2022, 2023, 2023, 2022, 2023],
'target': [40000, 70000, 50000, 55000, 65000]
})
# Join on both columns
result = pd.merge(sales, targets, on=['emp_id', 'year'], how='inner')
print(result)
Output:
emp_id year revenue target
0 1 2022 50000 40000
1 1 2023 75000 70000
2 2 2023 45000 50000
3 3 2023 60000 65000
Only combinations of emp_id and year that exist in both DataFrames appear in the result.
When column names differ between DataFrames, pass lists to left_on and right_on:
result = pd.merge(
sales,
targets.rename(columns={'emp_id': 'employee', 'year': 'fiscal_year'}),
left_on=['emp_id', 'year'],
right_on=['employee', 'fiscal_year'],
how='inner'
)
Handling Common Issues
Duplicate Column Names
When both DataFrames contain columns with the same name (other than join keys), Pandas appends suffixes to distinguish them:
df1 = pd.DataFrame({'id': [1, 2], 'value': [100, 200]})
df2 = pd.DataFrame({'id': [1, 2], 'value': [10, 20]})
result = pd.merge(df1, df2, on='id', how='inner')
print(result)
Output:
id value_x value_y
0 1 100 10
1 2 200 20
The default suffixes _x and _y are cryptic. Always specify meaningful suffixes:
result = pd.merge(
df1, df2,
on='id',
how='inner',
suffixes=('_original', '_updated')
)
print(result)
Debugging Join Issues
When joins produce unexpected results, use the indicator parameter to see which rows matched:
result = pd.merge(
employees,
departments,
on='emp_id',
how='outer', # Use outer to see everything
indicator=True
)
print(result)
Output:
emp_id name hire_year department budget _merge
0 1 Alice 2019.0 Engineering 500000.0 both
1 2 Bob 2020.0 Sales 300000.0 both
2 3 Carol 2018.0 NaN NaN left_only
3 4 David 2021.0 Marketing 250000.0 both
4 5 Eve 2019.0 NaN NaN left_only
5 6 NaN NaN HR 150000.0 right_only
6 7 NaN NaN Finance 400000.0 right_only
The _merge column reveals exactly what happened to each row. Filter for both to get your inner join result, or investigate left_only and right_only to understand missing matches.
Performance Considerations
For large DataFrames, sorting join keys before merging can significantly improve performance:
# Sort both DataFrames on join key before merging
df1_sorted = df1.sort_values('id')
df2_sorted = df2.sort_values('id')
result = pd.merge(df1_sorted, df2_sorted, on='id', how='inner')
Summary
Inner joins in Pandas are straightforward once you understand the two main approaches. Here’s when to use each:
| Feature | merge() |
join() |
|---|---|---|
| Join on columns | Yes | Limited (left only) |
| Join on indexes | Yes | Yes (default) |
| Multiple join keys | Yes | Yes |
| Different key names | Yes (left_on/right_on) |
No |
| Syntax | Explicit, verbose | Concise |
| Best for | General purpose | Index-based joins |
Use merge() as your default choice—it handles every scenario and makes your intent clear. Switch to join() when working with indexed DataFrames where the concise syntax improves readability. Always specify how='inner' explicitly, use meaningful suffixes, and leverage indicator=True when debugging unexpected results.