How to Inner Join in Polars
Key Insights
- Polars uses df.join(other, on="key", how="inner") for inner joins, returning only rows where keys match in both DataFrames
- Use the left_on and right_on parameters when join columns have different names across DataFrames
- Lazy evaluation with LazyFrame enables Polars to optimize join operations, often yielding 10x+ performance gains on large datasets
Introduction
Inner joins are the workhorse of data analysis. When you need to combine two datasets based on matching keys—customers with their orders, products with their categories, employees with their departments—an inner join is usually your first choice.
Polars has emerged as a serious alternative to pandas for DataFrame operations, particularly when performance matters. Written in Rust with a focus on parallel execution and memory efficiency, Polars can handle joins on datasets that would bring pandas to its knees. But the syntax differs enough from pandas that you’ll need to adjust your mental model.
This article covers everything you need to perform inner joins in Polars effectively, from basic single-column joins to multi-key operations with lazy evaluation.
Inner Join Basics
An inner join returns only the rows where the join key exists in both DataFrames. If a row in the left DataFrame has no matching key in the right DataFrame, it’s excluded from the result. Same goes for unmatched rows in the right DataFrame.
Think of it as the intersection of two datasets based on their keys:
Left DataFrame Right DataFrame Result (Inner Join)
┌─────┬───────┐ ┌─────┬────────┐ ┌─────┬───────┬────────┐
│ id │ name │ │ id │ salary │ │ id │ name │ salary │
├─────┼───────┤ ├─────┼────────┤ ├─────┼───────┼────────┤
│ 1 │ Alice │ │ 1 │ 50000 │ │ 1 │ Alice │ 50000 │
│ 2 │ Bob │ + │ 3 │ 60000 │ = │ 3 │ Carol │ 60000 │
│ 3 │ Carol │ │ 4 │ 70000 │ └─────┴───────┴────────┘
└─────┴───────┘ └─────┴────────┘
(id=2 and id=4 excluded)
Here’s the most basic inner join in Polars:
import polars as pl
# Create sample DataFrames
employees = pl.DataFrame({
"emp_id": [1, 2, 3, 4],
"name": ["Alice", "Bob", "Carol", "David"],
"dept_id": [101, 102, 101, 103]
})
departments = pl.DataFrame({
"dept_id": [101, 102, 104],
"dept_name": ["Engineering", "Marketing", "Sales"]
})
# Inner join on dept_id
result = employees.join(departments, on="dept_id", how="inner")
print(result)
Output:
shape: (3, 4)
┌────────┬───────┬─────────┬─────────────┐
│ emp_id ┆ name ┆ dept_id ┆ dept_name │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ str │
╞════════╪═══════╪═════════╪═════════════╡
│ 1 ┆ Alice ┆ 101 ┆ Engineering │
│ 3 ┆ Carol ┆ 101 ┆ Engineering │
│ 2 ┆ Bob ┆ 102 ┆ Marketing │
└────────┴───────┴─────────┴─────────────┘
David (dept_id=103) is excluded because department 103 doesn’t exist in the departments DataFrame. Department 104 (Sales) is also excluded because no employee belongs to it.
Syntax and Parameters
The join() method in Polars provides several parameters for controlling join behavior:
DataFrame.join(
other, # The right DataFrame to join with
on=None, # Column name(s) to join on (when names match)
left_on=None, # Column name(s) in left DataFrame
right_on=None, # Column name(s) in right DataFrame
how="inner",     # Join type: "inner", "left", "right", "full" (formerly "outer"), "semi", "anti", "cross"
suffix="_right" # Suffix for duplicate column names
)
When your join columns have different names across DataFrames, use left_on and right_on:
# DataFrames with different column names for the join key
orders = pl.DataFrame({
"order_id": [1001, 1002, 1003, 1004],
"customer_id": [1, 2, 1, 3],
"amount": [150.00, 200.00, 75.00, 300.00]
})
customers = pl.DataFrame({
"id": [1, 2, 4], # Note: different column name, and customer 3 is missing
"customer_name": ["Acme Corp", "Beta Inc", "Delta LLC"]
})
# Join using different column names
result = orders.join(
customers,
left_on="customer_id",
right_on="id",
how="inner"
)
print(result)
Output:
shape: (3, 4)
┌──────────┬─────────────┬────────┬───────────────┐
│ order_id ┆ customer_id ┆ amount ┆ customer_name │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ f64 ┆ str │
╞══════════╪═════════════╪════════╪═══════════════╡
│ 1001 ┆ 1 ┆ 150.0 ┆ Acme Corp │
│ 1003 ┆ 1 ┆ 75.0 ┆ Acme Corp │
│ 1002 ┆ 2 ┆ 200.0 ┆ Beta Inc │
└──────────┴─────────────┴────────┴───────────────┘
Order 1004 (customer_id=3) is excluded because customer 3 doesn’t exist in the customers DataFrame.
Joining on Multiple Columns
Real-world data often requires joining on composite keys—multiple columns that together form a unique identifier. Pass a list to the on parameter:
# Sales data with composite key (region + product_id)
sales = pl.DataFrame({
"region": ["East", "East", "West", "West", "North"],
"product_id": [1, 2, 1, 2, 1],
"quantity": [100, 150, 200, 75, 50]
})
# Pricing data with same composite key
pricing = pl.DataFrame({
"region": ["East", "East", "West", "South"],
"product_id": [1, 2, 1, 1],
"unit_price": [10.00, 15.00, 12.00, 11.00]
})
# Join on composite key
result = sales.join(
pricing,
on=["region", "product_id"],
how="inner"
)
print(result)
Output:
shape: (3, 4)
┌────────┬────────────┬──────────┬────────────┐
│ region ┆ product_id ┆ quantity ┆ unit_price │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 │
╞════════╪════════════╪══════════╪════════════╡
│ East ┆ 1 ┆ 100 ┆ 10.0 │
│ East ┆ 2 ┆ 150 ┆ 15.0 │
│ West ┆ 1 ┆ 200 ┆ 12.0 │
└────────┴────────────┴──────────┴────────────┘
When column names differ, use lists for left_on and right_on:
result = df1.join(
df2,
left_on=["region_code", "prod_id"],
right_on=["area", "product_id"],
how="inner"
)
Lazy vs Eager Execution
Polars shines when you use lazy evaluation. Instead of executing operations immediately, a LazyFrame builds a query plan that Polars optimizes before execution. For joins, this means:
- Predicate pushdown: Filters applied after the join can be pushed before it
- Projection pushdown: Only required columns are loaded
- Join order optimization: Polars can reorder operations for efficiency
Here’s how to perform a lazy inner join:
# Construct LazyFrames in memory for illustration
# In practice, you'd start from a lazy scan: pl.scan_csv("employees.csv")
employees_lazy = pl.LazyFrame({
"emp_id": range(1, 100001),
"name": [f"Employee_{i}" for i in range(1, 100001)],
"dept_id": [i % 100 for i in range(1, 100001)]
})
departments_lazy = pl.LazyFrame({
"dept_id": range(0, 50), # Only 50 departments exist
"dept_name": [f"Department_{i}" for i in range(0, 50)]
})
# Build the query plan (no execution yet)
query = (
employees_lazy
.join(departments_lazy, on="dept_id", how="inner")
.filter(pl.col("emp_id") < 1000)  # Pushed below the join by the optimizer
.select(["emp_id", "name", "dept_name"])
)
# Execute the optimized plan
result = query.collect()
print(result.head())
Output:
shape: (5, 3)
┌────────┬────────────┬──────────────┐
│ emp_id ┆ name ┆ dept_name │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞════════╪════════════╪══════════════╡
│ 1 ┆ Employee_1 ┆ Department_1 │
│ 2 ┆ Employee_2 ┆ Department_2 │
│ 3 ┆ Employee_3 ┆ Department_3 │
│ 4 ┆ Employee_4 ┆ Department_4 │
│ 5 ┆ Employee_5 ┆ Department_5 │
└────────┴────────────┴──────────────┘
For file-based workflows, use scan_csv() or scan_parquet() to maintain laziness:
# Real-world lazy join from files
employees = pl.scan_csv("employees.csv")
departments = pl.scan_csv("departments.csv")
result = (
employees
.join(departments, on="dept_id", how="inner")
.collect()
)
Common Pitfalls and Tips
Handling Duplicate Column Names
When both DataFrames have columns with the same name (other than join keys), Polars appends a suffix to avoid collisions:
df1 = pl.DataFrame({
"id": [1, 2, 3],
"value": [100, 200, 300],
"status": ["active", "inactive", "active"]
})
df2 = pl.DataFrame({
"id": [1, 2, 4],
"value": [10, 20, 40], # Same column name as df1
"category": ["A", "B", "C"]
})
# Default suffix is "_right"
result = df1.join(df2, on="id", how="inner")
print(result)
Output:
shape: (2, 5)
┌─────┬───────┬──────────┬─────────────┬──────────┐
│ id ┆ value ┆ status ┆ value_right ┆ category │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str ┆ i64 ┆ str │
╞═════╪═══════╪══════════╪═════════════╪══════════╡
│ 1 ┆ 100 ┆ active ┆ 10 ┆ A │
│ 2 ┆ 200 ┆ inactive ┆ 20 ┆ B │
└─────┴───────┴──────────┴─────────────┴──────────┘
Customize the suffix for clarity:
result = df1.join(df2, on="id", how="inner", suffix="_df2")
Data Type Mismatches
Polars is strict about data types: join keys must have matching types, or the join raises an error:
df1 = pl.DataFrame({"id": [1, 2, 3]}) # Int64
df2 = pl.DataFrame({"id": ["1", "2", "3"]}) # String
# This raises an error because the key dtypes differ (Int64 vs String)
# result = df1.join(df2, on="id", how="inner")
# Fix: Cast to matching types
df2_fixed = df2.with_columns(pl.col("id").cast(pl.Int64))
result = df1.join(df2_fixed, on="id", how="inner")
Memory Considerations
Unlike pandas, Polars uses Apache Arrow memory format and doesn’t copy data unnecessarily. For large joins:
- Use lazy evaluation to let Polars optimize memory usage
- Filter data before joining when possible
- Select only needed columns before the join
- Consider join_asof() for time-series data with approximate matching
Conclusion
Inner joins in Polars follow a straightforward pattern: call join() with your key columns and how="inner". Use on when column names match, or left_on/right_on when they differ. For composite keys, pass lists of column names.
The real power comes from lazy evaluation. Wrap your joins in a LazyFrame pipeline, and Polars will optimize the entire query plan before execution. This matters increasingly as your data grows.
Choose inner joins when you only want records that exist in both datasets. If you need to preserve all rows from one side, look into left or right joins. For keeping everything regardless of matches, use outer joins.