How to Inner Join in Polars
Key Insights
- Polars uses df.join(other, on="key", how="inner") for inner joins, returning only rows where keys match in both DataFrames
- Use the left_on and right_on parameters when join columns have different names across DataFrames
- Lazy evaluation with LazyFrame enables Polars to optimize join operations, often yielding 10x+ performance gains on large datasets
Introduction
Inner joins are the workhorse of data analysis. When you need to combine two datasets based on matching keys—customers with their orders, products with their categories, employees with their departments—an inner join is usually your first choice.
Polars has emerged as a serious alternative to pandas for DataFrame operations, particularly when performance matters. Written in Rust with a focus on parallel execution and memory efficiency, Polars can handle joins on datasets that would bring pandas to its knees. But the syntax differs enough from pandas that you’ll need to adjust your mental model.
This article covers everything you need to perform inner joins in Polars effectively, from basic single-column joins to multi-key operations with lazy evaluation.
Inner Join Basics
An inner join returns only the rows where the join key exists in both DataFrames. If a row in the left DataFrame has no matching key in the right DataFrame, it’s excluded from the result. Same goes for unmatched rows in the right DataFrame.
Think of it as the intersection of two datasets based on their keys:
Left DataFrame Right DataFrame Result (Inner Join)
┌─────┬───────┐ ┌─────┬────────┐ ┌─────┬───────┬────────┐
│ id │ name │ │ id │ salary │ │ id │ name │ salary │
├─────┼───────┤ ├─────┼────────┤ ├─────┼───────┼────────┤
│ 1 │ Alice │ │ 1 │ 50000 │ │ 1 │ Alice │ 50000 │
│ 2 │ Bob │ + │ 3 │ 60000 │ = │ 3 │ Carol │ 60000 │
│ 3 │ Carol │ │ 4 │ 70000 │ └─────┴───────┴────────┘
└─────┴───────┘ └─────┴────────┘
(id=2 and id=4 excluded)
Here’s the most basic inner join in Polars:
import polars as pl
# Create sample DataFrames
employees = pl.DataFrame({
"emp_id": [1, 2, 3, 4],
"name": ["Alice", "Bob", "Carol", "David"],
"dept_id": [101, 102, 101, 103]
})
departments = pl.DataFrame({
"dept_id": [101, 102, 104],
"dept_name": ["Engineering", "Marketing", "Sales"]
})
# Inner join on dept_id
result = employees.join(departments, on="dept_id", how="inner")
print(result)
Output:
shape: (3, 4)
┌────────┬───────┬─────────┬─────────────┐
│ emp_id ┆ name ┆ dept_id ┆ dept_name │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ str │
╞════════╪═══════╪═════════╪═════════════╡
│ 1 ┆ Alice ┆ 101 ┆ Engineering │
│ 3 ┆ Carol ┆ 101 ┆ Engineering │
│ 2 ┆ Bob ┆ 102 ┆ Marketing │
└────────┴───────┴─────────┴─────────────┘
David (dept_id=103) is excluded because department 103 doesn’t exist in the departments DataFrame. Department 104 (Sales) is also excluded because no employee belongs to it.
Syntax and Parameters
The join() method in Polars provides several parameters for controlling join behavior:
DataFrame.join(
other, # The right DataFrame to join with
on=None, # Column name(s) to join on (when names match)
left_on=None, # Column name(s) in left DataFrame
right_on=None, # Column name(s) in right DataFrame
how="inner",     # Join type: "inner", "left", "right", "full" (formerly "outer"), "semi", "anti", "cross"
suffix="_right" # Suffix for duplicate column names
)
When your join columns have different names across DataFrames, use left_on and right_on:
# DataFrames with different column names for the join key
orders = pl.DataFrame({
"order_id": [1001, 1002, 1003, 1004],
"customer_id": [1, 2, 1, 3],
"amount": [150.00, 200.00, 75.00, 300.00]
})
customers = pl.DataFrame({
"id": [1, 2, 4], # Note: different column name, and customer 3 is missing
"customer_name": ["Acme Corp", "Beta Inc", "Delta LLC"]
})
# Join using different column names
result = orders.join(
customers,
left_on="customer_id",
right_on="id",
how="inner"
)
print(result)
Output:
shape: (3, 4)
┌──────────┬─────────────┬────────┬───────────────┐
│ order_id ┆ customer_id ┆ amount ┆ customer_name │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ f64 ┆ str │
╞══════════╪═════════════╪════════╪═══════════════╡
│ 1001 ┆ 1 ┆ 150.0 ┆ Acme Corp │
│ 1003 ┆ 1 ┆ 75.0 ┆ Acme Corp │
│ 1002 ┆ 2 ┆ 200.0 ┆ Beta Inc │
└──────────┴─────────────┴────────┴───────────────┘
Order 1004 (customer_id=3) is excluded because customer 3 doesn’t exist in the customers DataFrame.
Joining on Multiple Columns
Real-world data often requires joining on composite keys—multiple columns that together form a unique identifier. Pass a list to the on parameter:
# Sales data with composite key (region + product_id)
sales = pl.DataFrame({
"region": ["East", "East", "West", "West", "North"],
"product_id": [1, 2, 1, 2, 1],
"quantity": [100, 150, 200, 75, 50]
})
# Pricing data with same composite key
pricing = pl.DataFrame({
"region": ["East", "East", "West", "South"],
"product_id": [1, 2, 1, 1],
"unit_price": [10.00, 15.00, 12.00, 11.00]
})
# Join on composite key
result = sales.join(
pricing,
on=["region", "product_id"],
how="inner"
)
print(result)
Output:
shape: (3, 4)
┌────────┬────────────┬──────────┬────────────┐
│ region ┆ product_id ┆ quantity ┆ unit_price │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 │
╞════════╪════════════╪══════════╪════════════╡
│ East ┆ 1 ┆ 100 ┆ 10.0 │
│ East ┆ 2 ┆ 150 ┆ 15.0 │
│ West ┆ 1 ┆ 200 ┆ 12.0 │
└────────┴────────────┴──────────┴────────────┘
When column names differ, use lists for left_on and right_on:
result = df1.join(
df2,
left_on=["region_code", "prod_id"],
right_on=["area", "product_id"],
how="inner"
)
Lazy vs Eager Execution
Polars shines when you use lazy evaluation. Instead of executing operations immediately, a LazyFrame builds a query plan that Polars optimizes before execution. For joins, this means:
- Predicate pushdown: Filters applied after the join can be pushed before it
- Projection pushdown: Only required columns are loaded
- Join order optimization: Polars can reorder operations for efficiency
Here’s how to perform a lazy inner join:
# Construct LazyFrames in memory for illustration
# In practice, you'd start from a lazy scan: pl.scan_csv("employees.csv")
employees_lazy = pl.LazyFrame({
"emp_id": range(1, 100001),
"name": [f"Employee_{i}" for i in range(1, 100001)],
"dept_id": [i % 100 for i in range(1, 100001)]
})
departments_lazy = pl.LazyFrame({
"dept_id": range(0, 50), # Only 50 departments exist
"dept_name": [f"Department_{i}" for i in range(0, 50)]
})
# Build the query plan (no execution yet)
query = (
employees_lazy
.join(departments_lazy, on="dept_id", how="inner")
.filter(pl.col("emp_id") < 1000)  # Pushed below the join by the optimizer
.select(["emp_id", "name", "dept_name"])
)
# Execute the optimized plan
result = query.collect()
print(result.head())
Output:
shape: (5, 3)
┌────────┬────────────┬──────────────┐
│ emp_id ┆ name ┆ dept_name │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞════════╪════════════╪══════════════╡
│ 1 ┆ Employee_1 ┆ Department_1 │
│ 2 ┆ Employee_2 ┆ Department_2 │
│ 3 ┆ Employee_3 ┆ Department_3 │
│ 4 ┆ Employee_4 ┆ Department_4 │
│ 5 ┆ Employee_5 ┆ Department_5 │
└────────┴────────────┴──────────────┘
For file-based workflows, use scan_csv() or scan_parquet() to maintain laziness:
# Real-world lazy join from files
employees = pl.scan_csv("employees.csv")
departments = pl.scan_csv("departments.csv")
result = (
employees
.join(departments, on="dept_id", how="inner")
.collect()
)
Common Pitfalls and Tips
Handling Duplicate Column Names
When both DataFrames have columns with the same name (other than join keys), Polars appends a suffix to avoid collisions:
df1 = pl.DataFrame({
"id": [1, 2, 3],
"value": [100, 200, 300],
"status": ["active", "inactive", "active"]
})
df2 = pl.DataFrame({
"id": [1, 2, 4],
"value": [10, 20, 40], # Same column name as df1
"category": ["A", "B", "C"]
})
# Default suffix is "_right"
result = df1.join(df2, on="id", how="inner")
print(result)
Output:
shape: (2, 5)
┌─────┬───────┬──────────┬─────────────┬──────────┐
│ id ┆ value ┆ status ┆ value_right ┆ category │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str ┆ i64 ┆ str │
╞═════╪═══════╪══════════╪═════════════╪══════════╡
│ 1 ┆ 100 ┆ active ┆ 10 ┆ A │
│ 2 ┆ 200 ┆ inactive ┆ 20 ┆ B │
└─────┴───────┴──────────┴─────────────┴──────────┘
Customize the suffix for clarity:
result = df1.join(df2, on="id", how="inner", suffix="_df2")
Data Type Mismatches
Polars is strict about data types: join keys must have matching types, or the join raises an error:
df1 = pl.DataFrame({"id": [1, 2, 3]}) # Int64
df2 = pl.DataFrame({"id": ["1", "2", "3"]}) # String
# This raises an error because the key dtypes differ (Int64 vs String)
# result = df1.join(df2, on="id", how="inner")
# Fix: Cast to matching types
df2_fixed = df2.with_columns(pl.col("id").cast(pl.Int64))
result = df1.join(df2_fixed, on="id", how="inner")
Memory Considerations
Unlike pandas, Polars uses Apache Arrow memory format and doesn’t copy data unnecessarily. For large joins:
- Use lazy evaluation to let Polars optimize memory usage
- Filter data before joining when possible
- Select only needed columns before the join
- Consider join_asof() for time-series data with approximate matching
Conclusion
Inner joins in Polars follow a straightforward pattern: call join() with your key columns and how="inner". Use on when column names match, or left_on/right_on when they differ. For composite keys, pass lists of column names.
The real power comes from lazy evaluation. Wrap your joins in a LazyFrame pipeline, and Polars will optimize the entire query plan before execution. This matters increasingly as your data grows.
Choose inner joins when you only want records that exist in both datasets. If you need to preserve all rows from one side, look into left or right joins. For keeping everything regardless of matches, use outer joins.