How to Outer Join in Polars

Key Insights

  • Polars provides three outer join variants (full, left, right) through the join() method, with syntax that’s cleaner and more explicit than pandas
  • Lazy evaluation with LazyFrame can dramatically improve outer join performance by allowing Polars to optimize the query plan before execution
  • Always handle null values after outer joins explicitly—Polars won’t silently fill them, which prevents subtle bugs but requires intentional null management

Introduction

Outer joins are essential when you need to combine datasets while preserving records that don’t have matches in both tables. Unlike inner joins that discard non-matching rows, outer joins keep them and fill in nulls where data is missing.

Polars has emerged as a serious alternative to pandas for DataFrame operations, particularly when performance matters. Written in Rust with a focus on parallelism and memory efficiency, Polars consistently outperforms pandas on join operations—often by 5-10x on larger datasets. The API is also more explicit, which reduces the “magic” that can lead to unexpected behavior.

This article covers everything you need to perform outer joins in Polars effectively, from basic syntax to lazy evaluation patterns that squeeze maximum performance from your hardware.

Understanding Outer Join Types

Before diving into code, let’s clarify the three types of outer joins:

Left Outer Join: Returns all rows from the left DataFrame plus matching rows from the right. Non-matching right rows are excluded; non-matching left rows get nulls for right columns.

Right Outer Join: The mirror of left join. Returns all rows from the right DataFrame plus matching rows from the left.

Full Outer Join: Returns all rows from both DataFrames. Non-matching rows from either side get nulls for the columns from the other table.

Think of it this way: if you’re joining customers to orders, a left join on customers keeps all customers (even those without orders), a right join keeps all orders (even orphaned ones), and a full outer join keeps everything.

Let’s create sample data to work with throughout this article:

import polars as pl

# Customers DataFrame - some have no orders
customers = pl.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "region": ["West", "East", "West", "East", "North"]
})

# Orders DataFrame - includes orders from customers not in our customer list
orders = pl.DataFrame({
    "order_id": [101, 102, 103, 104, 105],
    "customer_id": [1, 2, 2, 6, 7],  # Note: customers 6 and 7 don't exist
    "amount": [150.00, 200.00, 75.00, 300.00, 50.00]
})

print("Customers:")
print(customers)
print("\nOrders:")
print(orders)

Notice that customers 3, 4, and 5 have no orders, while orders from customers 6 and 7 reference non-existent customers. This setup demonstrates why outer joins matter.

Basic Outer Join Syntax

Polars uses the join() method with a how parameter to specify join type. The syntax is straightforward:

# Full outer join - keeps all records from both DataFrames.
# coalesce=True merges the left and right key columns into one customer_id;
# without it, Polars keeps both customer_id and customer_id_right.
full_join = customers.join(
    orders,
    on="customer_id",
    how="full",
    coalesce=True
)
print("Full Outer Join:")
print(full_join)

Output:

Full Outer Join:
shape: (8, 5)
┌─────────────┬─────────┬────────┬──────────┬────────┐
│ customer_id ┆ name    ┆ region ┆ order_id ┆ amount │
│ ---         ┆ ---     ┆ ---    ┆ ---      ┆ ---    │
│ i64         ┆ str     ┆ str    ┆ i64      ┆ f64    │
╞═════════════╪═════════╪════════╪══════════╪════════╡
│ 1           ┆ Alice   ┆ West   ┆ 101      ┆ 150.0  │
│ 2           ┆ Bob     ┆ East   ┆ 102      ┆ 200.0  │
│ 2           ┆ Bob     ┆ East   ┆ 103      ┆ 75.0   │
│ 3           ┆ Charlie ┆ West   ┆ null     ┆ null   │
│ 4           ┆ Diana   ┆ East   ┆ null     ┆ null   │
│ 5           ┆ Eve     ┆ North  ┆ null     ┆ null   │
│ 6           ┆ null    ┆ null   ┆ 104      ┆ 300.0  │
│ 7           ┆ null    ┆ null   ┆ 105      ┆ 50.0   │
└─────────────┴─────────┴────────┴──────────┴────────┘

For left and right joins, simply change the how parameter:

# Left outer join - keeps all customers
left_join = customers.join(
    orders,
    on="customer_id",
    how="left"
)

# Right outer join - keeps all orders
right_join = customers.join(
    orders,
    on="customer_id",
    how="right"
)

print("Left Join (all customers kept):")
print(left_join)
print("\nRight Join (all orders kept):")
print(right_join)

The left join produces 6 rows (one per customer, with Bob appearing twice because he has two orders), while the right join produces 5 rows (one per order).

Handling Multiple Join Keys

Real-world joins often require matching on multiple columns. Pass a list to the on parameter:

# Sales data with composite keys
sales_2023 = pl.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "quarter": ["Q1", "Q2", "Q1", "Q3"],
    "revenue": [1000, 1500, 2000, 800]
})

sales_2024 = pl.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "quarter": ["Q1", "Q1", "Q2", "Q1"],
    "revenue": [1200, 2200, 1800, 500]
})

# Join on both customer_id AND quarter
yoy_comparison = sales_2023.join(
    sales_2024,
    on=["customer_id", "quarter"],
    how="full",
    suffix="_2024",
    coalesce=True  # merge the duplicated key columns from each side
)

print("Year-over-Year Comparison:")
print(yoy_comparison)

When join columns have different names in each DataFrame, use left_on and right_on:

# Different column names
df_left = pl.DataFrame({
    "user_id": [1, 2, 3],
    "score": [85, 90, 78]
})

df_right = pl.DataFrame({
    "participant_id": [2, 3, 4],
    "bonus": [10, 15, 20]
})

result = df_left.join(
    df_right,
    left_on="user_id",
    right_on="participant_id",
    how="full"
)
print(result)

Managing Column Name Conflicts

When both DataFrames have columns with the same name (beyond the join keys), Polars appends a suffix to disambiguate. The default is _right, but you can customize it:

# Both DataFrames have a 'status' column
employees = pl.DataFrame({
    "emp_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "status": ["active", "active", "inactive"]
})

projects = pl.DataFrame({
    "emp_id": [1, 2, 4],
    "project": ["Alpha", "Beta", "Gamma"],
    "status": ["ongoing", "completed", "ongoing"]
})

# Default suffix (coalesce=True keeps a single emp_id key column)
default_result = employees.join(projects, on="emp_id", how="full", coalesce=True)
print("Default suffix (_right):")
print(default_result)

# Custom suffix for clarity
custom_result = employees.join(
    projects,
    on="emp_id",
    how="full",
    suffix="_project",
    coalesce=True
)
print("\nCustom suffix (_project):")
print(custom_result)

For better readability, I recommend using descriptive suffixes like _2023/_2024 for temporal comparisons or _expected/_actual for validation scenarios.

Outer Joins in Lazy Mode

Polars really shines when you use lazy evaluation. Instead of executing operations immediately, LazyFrame builds a query plan that Polars optimizes before execution:

import time

# Create larger DataFrames for meaningful timing
n_rows = 1_000_000
large_left = pl.DataFrame({
    "key": range(n_rows),
    "value_a": range(n_rows)
})

large_right = pl.DataFrame({
    "key": range(500_000, 1_500_000),  # Partial overlap
    "value_b": range(1_000_000)
})

# Eager execution
start = time.perf_counter()
eager_result = large_left.join(large_right, on="key", how="full", coalesce=True)
eager_time = time.perf_counter() - start

# Lazy execution
start = time.perf_counter()
lazy_result = (
    large_left.lazy()
    .join(large_right.lazy(), on="key", how="full", coalesce=True)
    .collect()
)
lazy_time = time.perf_counter() - start

print(f"Eager execution: {eager_time:.3f}s")
print(f"Lazy execution:  {lazy_time:.3f}s")

# Join output order is not guaranteed, so sort before comparing
print(f"Results match: {eager_result.sort('key').equals(lazy_result.sort('key'))}")

The performance difference becomes more pronounced when you chain operations. Lazy mode allows Polars to push down filters, eliminate unnecessary columns, and parallelize more effectively:

# Chained operations benefit most from lazy evaluation
result = (
    large_left.lazy()
    .join(large_right.lazy(), on="key", how="full")
    .filter(pl.col("value_a").is_not_null())
    .select(["key", "value_a", "value_b"])
    .collect()
)

In lazy mode, Polars knows about the filter before executing the join and can optimize accordingly.

Common Pitfalls and Best Practices

Handling Nulls After Joins

Outer joins introduce nulls by design. Handle them explicitly:

# After a full outer join, fill nulls appropriately
result = customers.join(orders, on="customer_id", how="full", coalesce=True)

# Fill numeric nulls with zero
result_filled = result.with_columns([
    pl.col("amount").fill_null(0),
    pl.col("order_id").fill_null(-1)  # Sentinel value
])

# Or use coalesce for conditional filling
result_coalesced = result.with_columns(
    pl.coalesce(pl.col("name"), pl.lit("Unknown Customer")).alias("name")
)

print(result_coalesced)

Memory Considerations

Full outer joins can produce significantly more rows than either input DataFrame. Before joining large datasets:

# Estimate result size (this simple formula assumes the join keys
# are unique on each side; duplicated keys multiply matched rows)
left_unique = large_left.select("key").n_unique()
right_unique = large_right.select("key").n_unique()
overlap = large_left.join(large_right, on="key", how="inner").height

est_rows = left_unique + right_unique - overlap
print(f"Estimated rows after full join: {est_rows}")

When to Use Alternatives

Outer joins aren’t always the right tool:

  • Use concat() for stacking: If you’re combining DataFrames with identical schemas, pl.concat() is faster and clearer.
  • Use join_asof() for time-series: When joining on timestamps with tolerance, join_asof() handles the fuzzy matching.
  • Consider filtering first: If you only need a subset of data, filter before joining to reduce memory usage.

# Instead of joining then filtering
bad_approach = customers.join(orders, on="customer_id", how="left").filter(
    pl.col("region") == "West"
)

# Filter first, then join
good_approach = customers.filter(pl.col("region") == "West").join(
    orders, on="customer_id", how="left"
)

Polars outer joins are powerful and performant, but they require intentional use. Understand your data’s cardinality, handle nulls explicitly, and leverage lazy evaluation for complex pipelines. The explicit API might feel verbose compared to pandas at first, but that verbosity prevents the subtle bugs that plague data pipelines in production.
