How to Join DataFrames in Polars
Key Insights
- Polars offers seven join types, including semi and anti joins—filtering operations that many developers overlook but that are incredibly useful for data validation and subsetting.
- Always prefer lazy evaluation with the `scan_*` methods for large datasets—Polars’ query optimizer can push down predicates and eliminate unnecessary columns before the join executes.
- Use `join_asof()` for time-series data where you need inexact matches; it’s significantly faster than workarounds involving cross joins and filters.
Introduction
Polars has earned its reputation as the fastest DataFrame library in the Python ecosystem. Written in Rust and designed from the ground up for parallel execution, it consistently outperforms pandas by 10-100x on common operations. But raw speed means nothing if you can’t express your data transformations correctly.
Joins are where data analysis gets real. You’re rarely working with a single, perfectly structured table. Instead, you’re combining customer data with orders, enriching events with dimension tables, or filtering records based on existence in another dataset. Getting joins wrong means incorrect results, and inefficient joins mean waiting hours instead of seconds.
This article covers everything you need to join DataFrames in Polars effectively. We’ll work through each join type with practical examples, then discuss performance considerations that matter at scale.
Join Types Overview
Polars supports seven join strategies. Understanding when to use each saves you from writing convoluted workarounds.
| Join Type | Returns | Use Case |
|---|---|---|
| `inner` | Rows matching in both DataFrames | Combine related records |
| `left` | All left rows, matching right rows | Preserve primary dataset |
| `right` | All right rows, matching left rows | Preserve secondary dataset |
| `full` | All rows from both DataFrames | Complete picture with gaps |
| `cross` | Cartesian product | Generate combinations |
| `semi` | Left rows that have a match | Filter by existence |
| `anti` | Left rows that have no match | Find orphaned records |
Let’s create two DataFrames we’ll use throughout this article:
```python
import polars as pl

# Customer data
customers = pl.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "region": ["West", "East", "West", "East", "North"],
})

# Order data - note customer_id 6 doesn't exist in customers
orders = pl.DataFrame({
    "order_id": [101, 102, 103, 104, 105, 106],
    "customer_id": [1, 2, 2, 3, 6, 1],
    "amount": [250.00, 150.00, 300.00, 450.00, 200.00, 175.00],
    "order_date": ["2024-01-15", "2024-01-16", "2024-01-17",
                   "2024-01-18", "2024-01-19", "2024-01-20"],
})

print("Customers:")
print(customers)
print("\nOrders:")
print(orders)
```
Notice that customer 6 placed an order but doesn’t exist in our customer table, and customers 4 and 5 haven’t placed any orders. These edge cases will demonstrate how different join types behave.
Basic Join Syntax
The `join()` method is your primary tool. At minimum, you specify the right DataFrame and how to match rows.
Inner Join
An inner join returns only rows where the join key exists in both DataFrames:
```python
# Inner join - only customers who have orders
result = customers.join(orders, on="customer_id", how="inner")
print(result)
```

```
shape: (5, 6)
┌─────────────┬─────────┬────────┬──────────┬────────┬────────────┐
│ customer_id ┆ name    ┆ region ┆ order_id ┆ amount ┆ order_date │
│ ---         ┆ ---     ┆ ---    ┆ ---      ┆ ---    ┆ ---        │
│ i64         ┆ str     ┆ str    ┆ i64      ┆ f64    ┆ str        │
╞═════════════╪═════════╪════════╪══════════╪════════╪════════════╡
│ 1           ┆ Alice   ┆ West   ┆ 101      ┆ 250.0  ┆ 2024-01-15 │
│ 1           ┆ Alice   ┆ West   ┆ 106      ┆ 175.0  ┆ 2024-01-20 │
│ 2           ┆ Bob     ┆ East   ┆ 102      ┆ 150.0  ┆ 2024-01-16 │
│ 2           ┆ Bob     ┆ East   ┆ 103      ┆ 300.0  ┆ 2024-01-17 │
│ 3           ┆ Charlie ┆ West   ┆ 104      ┆ 450.0  ┆ 2024-01-18 │
└─────────────┴─────────┴────────┴──────────┴────────┴────────────┘
```
Customer 6’s order is excluded (no matching customer), and customers 4 and 5 are excluded (no orders).
Left Join with Different Column Names
When your join columns have different names, use `left_on` and `right_on`:
```python
# Rename for demonstration
orders_renamed = orders.rename({"customer_id": "cust_id"})

# Left join preserves all customers
result = customers.join(
    orders_renamed,
    left_on="customer_id",
    right_on="cust_id",
    how="left"
)
print(result)
```

```
shape: (7, 6)
┌─────────────┬─────────┬────────┬──────────┬────────┬────────────┐
│ customer_id ┆ name    ┆ region ┆ order_id ┆ amount ┆ order_date │
│ ---         ┆ ---     ┆ ---    ┆ ---      ┆ ---    ┆ ---        │
│ i64         ┆ str     ┆ str    ┆ i64      ┆ f64    ┆ str        │
╞═════════════╪═════════╪════════╪══════════╪════════╪════════════╡
│ 1           ┆ Alice   ┆ West   ┆ 101      ┆ 250.0  ┆ 2024-01-15 │
│ 1           ┆ Alice   ┆ West   ┆ 106      ┆ 175.0  ┆ 2024-01-20 │
│ 2           ┆ Bob     ┆ East   ┆ 102      ┆ 150.0  ┆ 2024-01-16 │
│ 2           ┆ Bob     ┆ East   ┆ 103      ┆ 300.0  ┆ 2024-01-17 │
│ 3           ┆ Charlie ┆ West   ┆ 104      ┆ 450.0  ┆ 2024-01-18 │
│ 4           ┆ Diana   ┆ East   ┆ null     ┆ null   ┆ null       │
│ 5           ┆ Eve     ┆ North  ┆ null     ┆ null   ┆ null       │
└─────────────┴─────────┴────────┴──────────┴────────┴────────────┘
```
Diana and Eve appear with null values for order columns—exactly what you want when building reports that need all customers regardless of order history.
Joining on Multiple Columns
Real-world data often requires composite keys. Pass a list of column names to join on multiple columns simultaneously:
```python
# Sales data with composite key
sales_2023 = pl.DataFrame({
    "year": [2023, 2023, 2023],
    "month": [1, 2, 3],
    "region": ["West", "East", "West"],
    "revenue": [10000, 15000, 12000],
})

sales_2024 = pl.DataFrame({
    "year": [2024, 2024, 2024],
    "month": [1, 2, 3],
    "region": ["West", "East", "West"],
    "revenue": [11000, 16000, 13000],
})

# Targets by year and month
targets = pl.DataFrame({
    "year": [2023, 2023, 2024, 2024],
    "month": [1, 2, 1, 2],
    "target": [9500, 14000, 10500, 15500],
})

# Combine all sales
all_sales = pl.concat([sales_2023, sales_2024])

# Join on composite key
result = all_sales.join(
    targets,
    on=["year", "month"],
    how="left"
)
print(result)
```

```
shape: (6, 5)
┌──────┬───────┬────────┬─────────┬────────┐
│ year ┆ month ┆ region ┆ revenue ┆ target │
│ ---  ┆ ---   ┆ ---    ┆ ---     ┆ ---    │
│ i64  ┆ i64   ┆ str    ┆ i64     ┆ i64    │
╞══════╪═══════╪════════╪═════════╪════════╡
│ 2023 ┆ 1     ┆ West   ┆ 10000   ┆ 9500   │
│ 2023 ┆ 2     ┆ East   ┆ 15000   ┆ 14000  │
│ 2023 ┆ 3     ┆ West   ┆ 12000   ┆ null   │
│ 2024 ┆ 1     ┆ West   ┆ 11000   ┆ 10500  │
│ 2024 ┆ 2     ┆ East   ┆ 16000   ┆ 15500  │
│ 2024 ┆ 3     ┆ West   ┆ 13000   ┆ null   │
└──────┴───────┴────────┴─────────┴────────┘
```
March has no target in either year, so those rows get null values. Both columns must match for the join to succeed.
Semi and Anti Joins
These are filtering joins—they don’t add columns from the right DataFrame. They simply filter the left DataFrame based on whether matches exist.
Semi Join: Keep Rows That Match
A semi join keeps rows from the left DataFrame where the key exists in the right DataFrame:
```python
# Find customers who have placed at least one order
active_customers = customers.join(
    orders,
    on="customer_id",
    how="semi"
)
print("Active customers:")
print(active_customers)
```

```
shape: (3, 3)
┌─────────────┬─────────┬────────┐
│ customer_id ┆ name    ┆ region │
│ ---         ┆ ---     ┆ ---    │
│ i64         ┆ str     ┆ str    │
╞═════════════╪═════════╪════════╡
│ 1           ┆ Alice   ┆ West   │
│ 2           ┆ Bob     ┆ East   │
│ 3           ┆ Charlie ┆ West   │
└─────────────┴─────────┴────────┘
```
Notice we get each customer once, even though Alice and Bob have multiple orders. Semi joins deduplicate automatically—you’re asking “does a match exist?” not “give me all matches.”
Anti Join: Keep Rows That Don’t Match
An anti join is the inverse—keep rows from the left where no match exists in the right:
```python
# Find customers who have never ordered
inactive_customers = customers.join(
    orders,
    on="customer_id",
    how="anti"
)
print("Inactive customers:")
print(inactive_customers)

# Find orders with invalid customer IDs
orphaned_orders = orders.join(
    customers,
    on="customer_id",
    how="anti"
)
print("\nOrphaned orders:")
print(orphaned_orders)
```

```
Inactive customers:
shape: (2, 3)
┌─────────────┬───────┬────────┐
│ customer_id ┆ name  ┆ region │
│ ---         ┆ ---   ┆ ---    │
│ i64         ┆ str   ┆ str    │
╞═════════════╪═══════╪════════╡
│ 4           ┆ Diana ┆ East   │
│ 5           ┆ Eve   ┆ North  │
└─────────────┴───────┴────────┘

Orphaned orders:
shape: (1, 4)
┌──────────┬─────────────┬────────┬────────────┐
│ order_id ┆ customer_id ┆ amount ┆ order_date │
│ ---      ┆ ---         ┆ ---    ┆ ---        │
│ i64      ┆ i64         ┆ f64    ┆ str        │
╞══════════╪═════════════╪════════╪════════════╡
│ 105      ┆ 6           ┆ 200.0  ┆ 2024-01-19 │
└──────────┴─────────────┴────────┴────────────┘
```
Anti joins are invaluable for data quality checks. Finding orphaned foreign keys, identifying records missing required relationships, or filtering out already-processed items—these operations become trivial.
Practical Tips
Handle Duplicate Column Names
When both DataFrames have columns with the same name (besides the join key), Polars adds a suffix:
```python
# Both have a 'status' column
df1 = pl.DataFrame({"id": [1, 2], "status": ["active", "pending"]})
df2 = pl.DataFrame({"id": [1, 2], "status": ["shipped", "processing"]})

result = df1.join(df2, on="id", how="inner")
print(result)
# Columns: id, status, status_right

# Custom suffix
result = df1.join(df2, on="id", how="inner", suffix="_order")
print(result)
# Columns: id, status, status_order
```
Use Lazy Evaluation for Large Data
For datasets that don’t fit in memory or when you want query optimization, use lazy mode:
```python
# Lazy evaluation with query optimization
lazy_customers = pl.scan_csv("customers.csv")
lazy_orders = pl.scan_csv("orders.csv")

result = (
    lazy_customers
    .join(lazy_orders, on="customer_id", how="inner")
    .filter(pl.col("amount") > 200)
    .select(["name", "order_id", "amount"])
    .collect()  # Execute the query
)
```
Polars’ query optimizer will push the filter and select operations as close to the data source as possible, potentially reading less data from disk.
Time-Series Joins with join_asof
For time-series data where you need the closest match rather than an exact match, use `join_asof()`:
```python
# Stock prices (sparse)
prices = pl.DataFrame({
    "timestamp": ["2024-01-01 09:30", "2024-01-01 10:00", "2024-01-01 10:30"],
    "price": [150.00, 151.50, 149.75],
}).with_columns(pl.col("timestamp").str.to_datetime())

# Trades (can happen any time)
trades = pl.DataFrame({
    "timestamp": ["2024-01-01 09:45", "2024-01-01 10:15"],
    "quantity": [100, 50],
}).with_columns(pl.col("timestamp").str.to_datetime())

# Get price at or before each trade
result = trades.sort("timestamp").join_asof(
    prices.sort("timestamp"),
    on="timestamp",
    strategy="backward"  # Use most recent price
)
print(result)
```
The `strategy` parameter controls matching: `"backward"` finds the nearest earlier match, `"forward"` finds the nearest later match, and `"nearest"` picks whichever is closer in either direction.
Conclusion
Polars gives you the full spectrum of join operations with a clean, consistent API. For most work, you’ll use inner and left joins. But don’t overlook semi and anti joins—they express filtering logic that would otherwise require subqueries or multiple operations.
When performance matters, remember these principles: use lazy evaluation for large datasets, let Polars optimize your query plan, and prefer `join_asof()` over workarounds for time-series alignment.
The official Polars documentation covers additional options like join validation, coalescing join keys, and cartesian products. But with the patterns in this article, you’re equipped to handle the vast majority of real-world joining scenarios efficiently.