How to Left Join in Polars
Key Insights
- Polars uses join() with how="left" for left joins, offering significant performance improvements over pandas through Rust-based execution and lazy evaluation
- Always verify that join key columns have matching data types before joining—Polars is stricter than pandas and won't silently coerce types
- Use lazy mode (lazy() or scan_csv()) for large datasets to let Polars optimize the query plan before execution
Introduction
Left joins are fundamental to data analysis. You have a primary dataset and want to enrich it with information from a secondary dataset, keeping all rows from the left table regardless of whether a match exists on the right. Simple concept, but execution matters when you’re dealing with millions of rows.
Polars has emerged as the go-to pandas alternative for performance-critical workloads. Built in Rust with a Python API, it handles large datasets efficiently through columnar memory layout, multi-threaded execution, and lazy evaluation. If you’re still using pandas for joins on datasets that make your laptop fan spin up, Polars is worth your attention.
This article covers everything you need to perform left joins in Polars effectively, from basic syntax to lazy evaluation patterns.
Basic Left Join Syntax
The join() method is your entry point for all join operations in Polars. For a left join, you specify how="left" and define your join keys.
Here’s the basic signature:
df_left.join(df_right, on="key_column", how="left")
The key parameters are:
- on: Column name to join on when both DataFrames share the same key column name
- left_on and right_on: Use these when the key columns have different names
- how: Join type—"left", "inner", "full" (called "outer" in older Polars releases), "cross", "semi", or "anti"
Let’s work through a practical example. You have an orders table and want to add customer information:
import polars as pl
# Create sample DataFrames
orders = pl.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "customer_id": [101, 102, 101, 103, 104],
    "amount": [250.00, 150.00, 300.00, 450.00, 200.00],
    "order_date": ["2024-01-15", "2024-01-16", "2024-01-17", "2024-01-18", "2024-01-19"]
})

customers = pl.DataFrame({
    "customer_id": [101, 102, 103],
    "customer_name": ["Alice", "Bob", "Charlie"],
    "region": ["West", "East", "West"]
})
# Perform left join
result = orders.join(customers, on="customer_id", how="left")
print(result)
Output:
shape: (5, 6)
┌──────────┬─────────────┬────────┬────────────┬───────────────┬────────┐
│ order_id ┆ customer_id ┆ amount ┆ order_date ┆ customer_name ┆ region │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ f64 ┆ str ┆ str ┆ str │
╞══════════╪═════════════╪════════╪════════════╪═══════════════╪════════╡
│ 1 ┆ 101 ┆ 250.0 ┆ 2024-01-15 ┆ Alice ┆ West │
│ 2 ┆ 102 ┆ 150.0 ┆ 2024-01-16 ┆ Bob ┆ East │
│ 3 ┆ 101 ┆ 300.0 ┆ 2024-01-17 ┆ Alice ┆ West │
│ 4 ┆ 103 ┆ 450.0 ┆ 2024-01-18 ┆ Charlie ┆ West │
│ 5 ┆ 104 ┆ 200.0 ┆ 2024-01-19 ┆ null ┆ null │
└──────────┴─────────────┴────────┴────────────┴───────────────┴────────┘
Notice that order 5 (customer_id 104) has null values for customer_name and region because that customer doesn’t exist in the customers table. That’s the defining behavior of a left join—all left rows preserved, unmatched right values become null.
When your key columns have different names, use left_on and right_on:
orders_v2 = orders.rename({"customer_id": "cust_id"})
result = orders_v2.join(
    customers,
    left_on="cust_id",
    right_on="customer_id",
    how="left"
)
Joining on Multiple Columns
Real-world data often requires composite keys. Pass a list of column names to join on multiple columns simultaneously.
Consider a scenario where you’re tracking product sales across years and need to pull in pricing data:
sales = pl.DataFrame({
    "year": [2023, 2023, 2024, 2024, 2024],
    "product_id": ["A001", "A002", "A001", "A002", "A003"],
    "units_sold": [150, 200, 180, 220, 50]
})

pricing = pl.DataFrame({
    "year": [2023, 2023, 2024, 2024],
    "product_id": ["A001", "A002", "A001", "A002"],
    "unit_price": [29.99, 49.99, 32.99, 52.99]
})

# Join on composite key
result = sales.join(
    pricing,
    on=["year", "product_id"],
    how="left"
)
print(result)
Output:
shape: (5, 4)
┌──────┬────────────┬────────────┬────────────┐
│ year ┆ product_id ┆ units_sold ┆ unit_price │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ f64 │
╞══════╪════════════╪════════════╪════════════╡
│ 2023 ┆ A001 ┆ 150 ┆ 29.99 │
│ 2023 ┆ A002 ┆ 200 ┆ 49.99 │
│ 2024 ┆ A001 ┆ 180 ┆ 32.99 │
│ 2024 ┆ A002 ┆ 220 ┆ 52.99 │
│ 2024 ┆ A003 ┆ 50 ┆ null │
└──────┴────────────┴────────────┴────────────┘
Product A003 in 2024 has no pricing data, so unit_price is null. The join correctly matches on both year and product_id together.
Handling Column Name Conflicts
When both DataFrames contain columns with identical names (beyond the join keys), Polars needs to differentiate them. The suffix parameter controls what gets appended to the right DataFrame’s conflicting column names.
employees = pl.DataFrame({
    "emp_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "department_id": [10, 20, 10]
})

departments = pl.DataFrame({
    "department_id": [10, 20, 30],
    "name": ["Engineering", "Sales", "Marketing"],
    "budget": [500000, 300000, 200000]
})

# Both DataFrames have a "name" column
result = employees.join(
    departments,
    on="department_id",
    how="left",
    suffix="_dept"
)
print(result)
Output:
shape: (3, 5)
┌────────┬─────────┬───────────────┬─────────────┬────────┐
│ emp_id ┆ name ┆ department_id ┆ name_dept ┆ budget │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ str ┆ i64 │
╞════════╪═════════╪═══════════════╪═════════════╪════════╡
│ 1 ┆ Alice ┆ 10 ┆ Engineering ┆ 500000 │
│ 2 ┆ Bob ┆ 20 ┆ Sales ┆ 300000 │
│ 3 ┆ Charlie ┆ 10 ┆ Engineering ┆ 500000 │
└────────┴─────────┴───────────────┴─────────────┴────────┘
The default suffix is _right, but using descriptive suffixes like _dept makes your code more readable. If you need to rename columns afterward, chain a rename() call:
result = (
    employees
    .join(departments, on="department_id", how="left", suffix="_dept")
    .rename({"name": "employee_name", "name_dept": "department_name"})
)
Left Join in Lazy Mode
Polars really shines when you use lazy evaluation. Instead of executing operations immediately, lazy mode builds a query plan that Polars optimizes before execution. This enables predicate pushdown, projection pushdown, and other optimizations that can dramatically reduce memory usage and execution time.
Convert an eager DataFrame to lazy with .lazy(), or read files directly into lazy mode with scan_csv(), scan_parquet(), etc. Execute the plan with .collect().
# Create lazy frames
orders_lazy = pl.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "customer_id": [101, 102, 101, 103, 104],
    "amount": [250.00, 150.00, 300.00, 450.00, 200.00]
}).lazy()

customers_lazy = pl.DataFrame({
    "customer_id": [101, 102, 103],
    "customer_name": ["Alice", "Bob", "Charlie"],
    "region": ["West", "East", "West"]
}).lazy()

# Build the query plan
query = (
    orders_lazy
    .join(customers_lazy, on="customer_id", how="left")
    .filter(pl.col("amount") > 200)
    .select(["order_id", "customer_name", "amount"])
)
# Execute
result = query.collect()
print(result)
For file-based workflows, use scan_csv() to avoid loading entire files into memory:
# Reading from CSV files in lazy mode
orders_lazy = pl.scan_csv("orders.csv")
customers_lazy = pl.scan_csv("customers.csv")
result = (
    orders_lazy
    .join(customers_lazy, on="customer_id", how="left")
    .collect()
)
Polars will only read the columns and rows it actually needs based on your query plan. On large datasets, this difference is substantial.
Common Pitfalls and Tips
Data Type Mismatches
Polars is strict about types. If your join keys have different data types, the join will fail. This commonly happens when one column is read as Int64 and another as Utf8 (string).
# This will fail - type mismatch
df1 = pl.DataFrame({"id": [1, 2, 3]}) # Int64
df2 = pl.DataFrame({"id": ["1", "2", "3"], "value": ["a", "b", "c"]}) # Utf8
# Fix by casting before joining
df2_fixed = df2.with_columns(pl.col("id").cast(pl.Int64))
result = df1.join(df2_fixed, on="id", how="left")
Always check your schema with df.schema before joining if you’re getting unexpected errors.
Null Handling
Polars treats null values differently than pandas. By default, nulls do not match other nulls in joins. If you need null keys to match, you’ll need to handle this explicitly:
# Nulls won't match by default
df1 = pl.DataFrame({"key": [1, 2, None], "val1": ["a", "b", "c"]})
df2 = pl.DataFrame({"key": [1, None], "val2": ["x", "y"]})
# Standard join - null doesn't match null
result = df1.join(df2, on="key", how="left")
# Row with key=None in df1 will have null for val2
Memory Considerations
Left joins can expand your dataset significantly if there are multiple matches on the right side. Before joining large datasets, check the cardinality of your join keys:
# Check for duplicates in join key
print(f"Unique keys in right table: {customers['customer_id'].n_unique()}")
print(f"Total rows in right table: {len(customers)}")
If these numbers differ significantly, you might be creating a many-to-many join unintentionally, which can explode your row count.
Conclusion
Left joins in Polars follow a clean, consistent API. Use join() with how="left", specify your keys with on or left_on/right_on, and handle column conflicts with suffix. For production workloads with large datasets, lazy mode is non-negotiable—it lets Polars optimize your query before touching the data.
The key syntax patterns to remember:
- Single key: df1.join(df2, on="key", how="left")
- Multiple keys: df1.join(df2, on=["key1", "key2"], how="left")
- Different key names: df1.join(df2, left_on="id", right_on="key", how="left")
- Lazy mode: df1.lazy().join(df2.lazy(), on="key", how="left").collect()
Polars supports other join types—inner, full (formerly "outer"), cross, semi, and anti—using the same join() method with different how values. The official Polars documentation covers these in detail, along with more advanced features like asof joins for time-series data.