How to Inner Join in Pandas

Key Insights

The merge() function is the most versatile and explicit way to perform inner joins in Pandas, offering fine-grained control over join columns and handling of duplicates.
Use join() when your DataFrames are already indexed on the join keys—it’s faster and more concise for index-based operations.
Always specify suffixes when joining DataFrames with overlapping column names to avoid confusion and maintain clean, readable output.

Introduction

An inner join combines two DataFrames by keeping only the rows where the join key exists in both tables. If a key appears in one DataFrame but not the other, that row gets dropped. This makes inner joins the strictest type of join—and often the most useful when you need complete data from both sources.

In data analysis, you’ll reach for inner joins when combining related datasets where missing relationships indicate invalid or incomplete data. Think customer orders joined with customer profiles, or sales records joined with product catalogs. If an order references a customer that doesn’t exist, you probably want to exclude it.

Pandas provides two main methods for inner joins: merge() and join(). Both accomplish the same goal, but they differ in syntax, flexibility, and performance characteristics. This article covers both approaches with practical examples you can apply immediately.

Understanding Inner Joins Conceptually

Picture two tables side by side. The left table contains employee IDs and names. The right table contains employee IDs and departments. An inner join matches rows where the employee ID appears in both tables, producing a combined result with name and department for each matching employee.

Left DataFrame          Right DataFrame         Inner Join Result
+----+-------+          +----+------------+     +----+-------+------------+
| id | name  |          | id | department |     | id | name  | department |
+----+-------+          +----+------------+     +----+-------+------------+
| 1  | Alice |          | 1  | Engineering|     | 1  | Alice | Engineering|
| 2  | Bob   |          | 3  | Sales      |     | 3  | Carol | Sales      |
| 3  | Carol |          | 4  | Marketing  |     +----+-------+------------+
+----+-------+          +----+------------+

Notice that Bob (id=2) disappears because there’s no matching department record. Similarly, the Marketing department (id=4) disappears because there’s no matching employee. Only the intersection survives.

Let’s create sample DataFrames to use throughout this article:

import pandas as pd

# Employee information
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Carol', 'David', 'Eve'],
    'hire_year': [2019, 2020, 2018, 2021, 2019]
})

# Department assignments (not all employees have assignments)
departments = pd.DataFrame({
    'emp_id': [1, 2, 4, 6, 7],
    'department': ['Engineering', 'Sales', 'Marketing', 'HR', 'Finance'],
    'budget': [500000, 300000, 250000, 150000, 400000]
})

print("Employees:")
print(employees)
print("\nDepartments:")
print(departments)

Output:

Employees:
   emp_id   name  hire_year
0       1  Alice       2019
1       2    Bob       2020
2       3  Carol       2018
3       4  David       2021
4       5    Eve       2019

Departments:
   emp_id   department  budget
0       1  Engineering  500000
1       2        Sales  300000
2       4    Marketing  250000
3       6           HR  150000
4       7      Finance  400000

Employees 3 and 5 have no department assignment. Departments with emp_id 6 and 7 reference non-existent employees. An inner join will exclude all of these.

Using merge() for Inner Joins

The merge() function is the workhorse for joining DataFrames in Pandas. It’s explicit, flexible, and handles most joining scenarios you’ll encounter.

# Basic inner join using merge()
result = pd.merge(employees, departments, on='emp_id', how='inner')
print(result)

Output:

   emp_id   name  hire_year   department  budget
0       1  Alice       2019  Engineering  500000
1       2    Bob       2020        Sales  300000
2       4  David       2021    Marketing  250000

Only three rows survive—the employees who have matching department records.

The how='inner' parameter is actually the default, so you can omit it. However, I recommend always specifying it explicitly. Your future self (and teammates) will thank you for the clarity.

When join columns have different names in each DataFrame, use left_on and right_on:

# Rename column in one DataFrame for demonstration
dept_renamed = departments.rename(columns={'emp_id': 'employee_id'})

result = pd.merge(
    employees, 
    dept_renamed, 
    left_on='emp_id', 
    right_on='employee_id', 
    how='inner'
)
print(result)

Output:

   emp_id   name  hire_year  employee_id   department  budget
0       1  Alice       2019            1  Engineering  500000
1       2    Bob       2020            2        Sales  300000
2       4  David       2021            4    Marketing  250000

Note that both join columns appear in the result. You’ll typically want to drop one of them afterward.

Using join() Method

The join() method provides a more concise syntax when your DataFrames use indexes as join keys. It’s a method called on a DataFrame rather than a standalone function.

# Set indexes for join()
emp_indexed = employees.set_index('emp_id')
dept_indexed = departments.set_index('emp_id')

result = emp_indexed.join(dept_indexed, how='inner')
print(result)

Output:

         name  hire_year   department  budget
emp_id                                       
1       Alice       2019  Engineering  500000
2         Bob       2020        Sales  300000
4       David       2021    Marketing  250000

The join() method defaults to joining on the index of the calling DataFrame and the index of the passed DataFrame. This makes it ideal when your data is already indexed appropriately—common when working with time series or when you’ve loaded data with an index column.

You can also join on a column of the left DataFrame against the index of the right DataFrame using the on parameter:

# Left DataFrame uses column, right DataFrame uses index
result = employees.join(dept_indexed, on='emp_id', how='inner')
print(result)

This flexibility makes join() useful in pipelines where one DataFrame is naturally indexed and another isn’t.

Joining on Multiple Columns

Real-world data often requires matching on multiple columns. Consider a scenario where you need to join on both employee ID and year:

# Sales data by employee and year
sales = pd.DataFrame({
    'emp_id': [1, 1, 2, 2, 3],
    'year': [2022, 2023, 2022, 2023, 2023],
    'revenue': [50000, 75000, 30000, 45000, 60000]
})

# Targets by employee and year
targets = pd.DataFrame({
    'emp_id': [1, 1, 2, 3, 3],
    'year': [2022, 2023, 2023, 2022, 2023],
    'target': [40000, 70000, 50000, 55000, 65000]
})

# Join on both columns
result = pd.merge(sales, targets, on=['emp_id', 'year'], how='inner')
print(result)

Output:

   emp_id  year  revenue  target
0       1  2022    50000   40000
1       1  2023    75000   70000
2       2  2023    45000   50000
3       3  2023    60000   65000

Only combinations of emp_id and year that exist in both DataFrames appear in the result.

When column names differ between DataFrames, pass lists to left_on and right_on:

result = pd.merge(
    sales, 
    targets.rename(columns={'emp_id': 'employee', 'year': 'fiscal_year'}),
    left_on=['emp_id', 'year'],
    right_on=['employee', 'fiscal_year'],
    how='inner'
)

Handling Common Issues

Duplicate Column Names

When both DataFrames contain columns with the same name (other than join keys), Pandas appends suffixes to distinguish them:

df1 = pd.DataFrame({'id': [1, 2], 'value': [100, 200]})
df2 = pd.DataFrame({'id': [1, 2], 'value': [10, 20]})

result = pd.merge(df1, df2, on='id', how='inner')
print(result)

Output:

   id  value_x  value_y
0   1      100       10
1   2      200       20

The default suffixes _x and _y are cryptic. Always specify meaningful suffixes:

result = pd.merge(
    df1, df2, 
    on='id', 
    how='inner',
    suffixes=('_original', '_updated')
)
print(result)

Debugging Join Issues

When joins produce unexpected results, use the indicator parameter to see which rows matched:

result = pd.merge(
    employees, 
    departments, 
    on='emp_id', 
    how='outer',  # Use outer to see everything
    indicator=True
)
print(result)

Output:

   emp_id   name  hire_year   department   budget      _merge
0       1  Alice     2019.0  Engineering  500000.0        both
1       2    Bob     2020.0        Sales  300000.0        both
2       3  Carol     2018.0          NaN      NaN   left_only
3       4  David     2021.0    Marketing  250000.0        both
4       5    Eve     2019.0          NaN      NaN   left_only
5       6    NaN        NaN           HR  150000.0  right_only
6       7    NaN        NaN      Finance  400000.0  right_only

The _merge column reveals exactly what happened to each row. Filter for both to get your inner join result, or investigate left_only and right_only to understand missing matches.

Performance Considerations

For large DataFrames, sorting join keys before merging can significantly improve performance:

# Sort both DataFrames on join key before merging
df1_sorted = df1.sort_values('id')
df2_sorted = df2.sort_values('id')
result = pd.merge(df1_sorted, df2_sorted, on='id', how='inner')

Summary

Inner joins in Pandas are straightforward once you understand the two main approaches. Here’s when to use each:

Feature	`merge()`	`join()`
Join on columns	Yes	Limited (left only)
Join on indexes	Yes	Yes (default)
Multiple join keys	Yes	Yes
Different key names	Yes (`left_on`/`right_on`)	No
Syntax	Explicit, verbose	Concise
Best for	General purpose	Index-based joins

Use merge() as your default choice—it handles every scenario and makes your intent clear. Switch to join() when working with indexed DataFrames where the concise syntax improves readability. Always specify how='inner' explicitly, use meaningful suffixes, and leverage indicator=True when debugging unexpected results.

Introduction

Understanding Inner Joins Conceptually

Using merge() for Inner Joins

Using join() Method

Joining on Multiple Columns

Handling Common Issues

Duplicate Column Names

Debugging Join Issues

Performance Considerations

Summary

Liked this? There's more.

Similar Articles