How to Cross Join in Polars
Key Insights
- Polars cross joins use join(other, how="cross") syntax with no join keys required, producing a Cartesian product of all row combinations
- Always prefer LazyFrames for cross joins on larger datasets—Polars can optimize the query plan and manage memory more efficiently
- Cross joins explode row counts multiplicatively (n × m rows), so filter aggressively after joining or reconsider your approach for large datasets
What Is a Cross Join?
A cross join produces the Cartesian product of two tables—every row from the first table paired with every row from the second. If table A has 10 rows and table B has 5 rows, the result contains 50 rows.
This sounds like a recipe for disaster, and often it is. But cross joins solve specific problems elegantly:
- Combinatorial generation: Create all product variants from separate attribute tables
- Parameter grids: Build hyperparameter combinations for machine learning
- Pairwise comparisons: Compare every item against every other item
- Dense matrices: Generate all date-product or user-item combinations for time series analysis
Polars handles cross joins efficiently, especially through its lazy execution engine. Let’s explore how to use them effectively.
Basic Cross Join Syntax
Polars implements cross joins through the standard join method with how="cross". Unlike other join types, you don’t specify join keys—the operation pairs every row with every other row by definition.
import polars as pl
# Create sample DataFrames
colors = pl.DataFrame({
    "color": ["red", "blue", "green"],
    "color_code": ["#FF0000", "#0000FF", "#00FF00"]
})
sizes = pl.DataFrame({
    "size": ["S", "M", "L", "XL"],
    "size_order": [1, 2, 3, 4]
})
# Cross join to create all product variants
variants = colors.join(sizes, how="cross")
print(variants)
Output:
shape: (12, 4)
┌───────┬────────────┬──────┬────────────┐
│ color ┆ color_code ┆ size ┆ size_order │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ i64 │
╞═══════╪════════════╪══════╪════════════╡
│ red ┆ #FF0000 ┆ S ┆ 1 │
│ red ┆ #FF0000 ┆ M ┆ 2 │
│ red ┆ #FF0000 ┆ L ┆ 3 │
│ red ┆ #FF0000 ┆ XL ┆ 4 │
│ blue ┆ #0000FF ┆ S ┆ 1 │
│ … ┆ … ┆ … ┆ … │
└───────┴────────────┴──────┴────────────┘
Three colors times four sizes equals twelve variants. Every combination exists exactly once.
Cross Join with LazyFrames
For production workloads, use Polars’ lazy API. LazyFrames defer execution until you call collect(), allowing Polars to optimize the entire query plan. This matters especially for cross joins, where subsequent filters can potentially be pushed down.
import polars as pl
# Create LazyFrames directly or from eager DataFrames
colors_lazy = pl.LazyFrame({
    "color": ["red", "blue", "green"],
    "color_code": ["#FF0000", "#0000FF", "#00FF00"]
})
sizes_lazy = pl.LazyFrame({
    "size": ["S", "M", "L", "XL"],
    "size_order": [1, 2, 3, 4]
})
# Build the query plan
variants_lazy = colors_lazy.join(sizes_lazy, how="cross")
# Execute when ready
variants = variants_lazy.collect()
print(variants)
When reading from files, use scan_csv or scan_parquet to stay lazy from the start:
# Lazy from disk
products = pl.scan_csv("products.csv")
regions = pl.scan_csv("regions.csv")
# Cross join stays lazy
product_regions = products.join(regions, how="cross")
# Only materializes on collect
result = product_regions.collect()
The lazy approach becomes critical when you chain operations after the cross join. Polars can sometimes optimize filter predicates, reducing the intermediate result size.
Practical Use Cases
Parameter Grid for Hyperparameter Tuning
Machine learning practitioners often need every combination of hyperparameters. Cross joins make this trivial:
import polars as pl
# Define hyperparameter options
learning_rates = pl.DataFrame({"learning_rate": [0.001, 0.01, 0.1]})
batch_sizes = pl.DataFrame({"batch_size": [16, 32, 64, 128]})
dropout_rates = pl.DataFrame({"dropout": [0.0, 0.1, 0.2, 0.3]})
# Chain cross joins for full grid
param_grid = (
    learning_rates
    .join(batch_sizes, how="cross")
    .join(dropout_rates, how="cross")
)
print(f"Total combinations: {param_grid.height}")
print(param_grid.head(10))
Output:
Total combinations: 48
shape: (10, 3)
┌───────────────┬────────────┬─────────┐
│ learning_rate ┆ batch_size ┆ dropout │
│ --- ┆ --- ┆ --- │
│ f64 ┆ i64 ┆ f64 │
╞═══════════════╪════════════╪═════════╡
│ 0.001 ┆ 16 ┆ 0.0 │
│ 0.001 ┆ 16 ┆ 0.1 │
│ 0.001 ┆ 16 ┆ 0.2 │
│ 0.001 ┆ 16 ┆ 0.3 │
│ 0.001 ┆ 32 ┆ 0.0 │
│ … ┆ … ┆ … │
└───────────────┴────────────┴─────────┘
Date × Product Matrix for Time Series
When analyzing sales data, you often need a row for every date-product combination, even when no sales occurred:
import polars as pl
from datetime import date
# All dates in range
dates = pl.DataFrame({
    "date": pl.date_range(date(2024, 1, 1), date(2024, 1, 7), eager=True)
})
# All products
products = pl.DataFrame({
    "product_id": ["SKU001", "SKU002", "SKU003"],
    "product_name": ["Widget", "Gadget", "Gizmo"]
})
# Create dense matrix
date_product_matrix = dates.join(products, how="cross")
print(date_product_matrix)
This creates a scaffold you can left-join actual sales data onto, with nulls representing zero-sale days.
Filtering After Cross Join
The most common cross join pattern involves immediate filtering. You generate all combinations, then keep only those meeting specific criteria.
import polars as pl
# Employees with their skills
employees = pl.DataFrame({
    "employee_id": [1, 2, 3, 4],
    "name": ["Alice", "Bob", "Carol", "David"],
    "department": ["Engineering", "Engineering", "Marketing", "Marketing"],
    "skill_level": [3, 2, 3, 1]
})
# Projects with requirements
projects = pl.DataFrame({
    "project_id": ["P1", "P2", "P3"],
    "project_name": ["API Rebuild", "Brand Campaign", "Data Pipeline"],
    "required_department": ["Engineering", "Marketing", "Engineering"],
    "min_skill_level": [2, 2, 3]
})
# Find all valid employee-project assignments
assignments = (
    employees.lazy()
    .join(projects.lazy(), how="cross")
    .filter(
        (pl.col("department") == pl.col("required_department")) &
        (pl.col("skill_level") >= pl.col("min_skill_level"))
    )
    .select([
        "name",
        "project_name",
        "department",
        "skill_level"
    ])
    .collect()
)
print(assignments)
Output:
shape: (4, 4)
┌───────┬────────────────┬─────────────┬─────────────┐
│ name ┆ project_name ┆ department ┆ skill_level │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ i64 │
╞═══════╪════════════════╪═════════════╪═════════════╡
│ Alice ┆ API Rebuild ┆ Engineering ┆ 3 │
│ Alice ┆ Data Pipeline ┆ Engineering ┆ 3 │
│ Bob ┆ API Rebuild ┆ Engineering ┆ 2 │
│ Carol ┆ Brand Campaign ┆ Marketing ┆ 3 │
└───────┴────────────────┴─────────────┴─────────────┘
The cross join created 12 combinations (4 employees × 3 projects), and filtering reduced it to 4 valid assignments.
Performance Considerations
Cross joins are dangerous. The output size grows multiplicatively, and it’s easy to accidentally create billions of rows.
import polars as pl
# Demonstrate row explosion
def calculate_cross_join_size(rows_a: int, rows_b: int) -> None:
    result_rows = rows_a * rows_b
    # Rough estimate: 100 bytes per row
    estimated_mb = (result_rows * 100) / (1024 * 1024)
    print(f"Table A: {rows_a:,} rows")
    print(f"Table B: {rows_b:,} rows")
    print(f"Cross join result: {result_rows:,} rows")
    print(f"Estimated memory: {estimated_mb:,.0f} MB")
# Seemingly small tables
calculate_cross_join_size(10_000, 10_000)
Output:
Table A: 10,000 rows
Table B: 10,000 rows
Cross join result: 100,000,000 rows
Estimated memory: 9,537 MB
Two 10,000-row tables produce 100 million rows. That’s nearly 10 GB of memory for what seemed like modest inputs.
Mitigation strategies:
- Filter early: If possible, reduce input tables before the cross join
- Use lazy evaluation: Polars can sometimes optimize filter predicates
- Sample for testing: Develop with small subsets before running on full data
- Question the approach: If you need a cross join of large tables, reconsider whether there’s a better algorithm
Compared to pandas’ merge(how='cross'), Polars generally performs better due to its Rust backend and lazy optimization. But no library can save you from algorithmic complexity—a 10,000 × 10,000 cross join produces 100 million rows regardless of implementation.
Conclusion
Cross joins in Polars use straightforward syntax: df1.join(df2, how="cross") with no join keys. The operation produces every possible row combination, making it ideal for combinatorial problems like parameter grids, dense matrices, and pairwise comparisons.
For production code, prefer LazyFrames. The lazy API lets Polars optimize your query plan, which matters when you filter after cross joining. Always respect the multiplicative nature of cross joins—small inputs create large outputs, and large inputs create impossibly large outputs.
Use cross joins when you genuinely need all combinations. When you don’t, use a more targeted join type. The Polars documentation on joins covers additional options for when cross joins aren’t the right tool.