How to Cross Join in Polars
Key Insights
- Polars cross joins use join(other, how="cross") syntax with no join keys required, producing a Cartesian product of all row combinations
- Always prefer LazyFrames for cross joins on larger datasets—Polars can optimize the query plan and manage memory more efficiently
- Cross joins explode row counts multiplicatively (n × m rows), so filter aggressively after joining or reconsider your approach for large datasets
What Is a Cross Join?
A cross join produces the Cartesian product of two tables—every row from the first table paired with every row from the second. If table A has 10 rows and table B has 5 rows, the result contains 50 rows.
This sounds like a recipe for disaster, and often it is. But cross joins solve specific problems elegantly:
- Combinatorial generation: Create all product variants from separate attribute tables
- Parameter grids: Build hyperparameter combinations for machine learning
- Pairwise comparisons: Compare every item against every other item
- Dense matrices: Generate all date-product or user-item combinations for time series analysis
Polars handles cross joins efficiently, especially through its lazy execution engine. Let’s explore how to use them effectively.
Basic Cross Join Syntax
Polars implements cross joins through the standard join method with how="cross". Unlike other join types, you don’t specify join keys—the operation pairs every row with every other row by definition.
import polars as pl
# Create sample DataFrames
colors = pl.DataFrame({
    "color": ["red", "blue", "green"],
    "color_code": ["#FF0000", "#0000FF", "#00FF00"]
})
sizes = pl.DataFrame({
    "size": ["S", "M", "L", "XL"],
    "size_order": [1, 2, 3, 4]
})
# Cross join to create all product variants
variants = colors.join(sizes, how="cross")
print(variants)
Output:
shape: (12, 4)
┌───────┬────────────┬──────┬────────────┐
│ color ┆ color_code ┆ size ┆ size_order │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ i64 │
╞═══════╪════════════╪══════╪════════════╡
│ red ┆ #FF0000 ┆ S ┆ 1 │
│ red ┆ #FF0000 ┆ M ┆ 2 │
│ red ┆ #FF0000 ┆ L ┆ 3 │
│ red ┆ #FF0000 ┆ XL ┆ 4 │
│ blue ┆ #0000FF ┆ S ┆ 1 │
│ … ┆ … ┆ … ┆ … │
└───────┴────────────┴──────┴────────────┘
Three colors times four sizes equals twelve variants. Every combination exists exactly once.
Cross Join with LazyFrames
For production workloads, use Polars’ lazy API. LazyFrames defer execution until you call collect(), allowing Polars to optimize the entire query plan. This matters especially for cross joins, where subsequent filters can potentially be pushed down.
import polars as pl
# Create LazyFrames directly or from eager DataFrames
colors_lazy = pl.LazyFrame({
    "color": ["red", "blue", "green"],
    "color_code": ["#FF0000", "#0000FF", "#00FF00"]
})
sizes_lazy = pl.LazyFrame({
    "size": ["S", "M", "L", "XL"],
    "size_order": [1, 2, 3, 4]
})
# Build the query plan
variants_lazy = colors_lazy.join(sizes_lazy, how="cross")
# Execute when ready
variants = variants_lazy.collect()
print(variants)
When reading from files, use scan_csv or scan_parquet to stay lazy from the start:
# Lazy from disk
products = pl.scan_csv("products.csv")
regions = pl.scan_csv("regions.csv")
# Cross join stays lazy
product_regions = products.join(regions, how="cross")
# Only materializes on collect
result = product_regions.collect()
The lazy approach becomes critical when you chain operations after the cross join. Polars can sometimes optimize filter predicates, reducing the intermediate result size.
Practical Use Cases
Parameter Grid for Hyperparameter Tuning
Machine learning practitioners often need every combination of hyperparameters. Cross joins make this trivial:
import polars as pl
# Define hyperparameter options
learning_rates = pl.DataFrame({"learning_rate": [0.001, 0.01, 0.1]})
batch_sizes = pl.DataFrame({"batch_size": [16, 32, 64, 128]})
dropout_rates = pl.DataFrame({"dropout": [0.0, 0.1, 0.2, 0.3]})
# Chain cross joins for full grid
param_grid = (
    learning_rates
    .join(batch_sizes, how="cross")
    .join(dropout_rates, how="cross")
)
print(f"Total combinations: {param_grid.height}")
print(param_grid.head(10))
Output:
Total combinations: 48
shape: (10, 3)
┌───────────────┬────────────┬─────────┐
│ learning_rate ┆ batch_size ┆ dropout │
│ --- ┆ --- ┆ --- │
│ f64 ┆ i64 ┆ f64 │
╞═══════════════╪════════════╪═════════╡
│ 0.001 ┆ 16 ┆ 0.0 │
│ 0.001 ┆ 16 ┆ 0.1 │
│ 0.001 ┆ 16 ┆ 0.2 │
│ 0.001 ┆ 16 ┆ 0.3 │
│ 0.001 ┆ 32 ┆ 0.0 │
│ … ┆ … ┆ … │
└───────────────┴────────────┴─────────┘
Date × Product Matrix for Time Series
When analyzing sales data, you often need a row for every date-product combination, even when no sales occurred:
import polars as pl
from datetime import date
# All dates in range
dates = pl.DataFrame({
    "date": pl.date_range(date(2024, 1, 1), date(2024, 1, 7), eager=True)
})
# All products
products = pl.DataFrame({
    "product_id": ["SKU001", "SKU002", "SKU003"],
    "product_name": ["Widget", "Gadget", "Gizmo"]
})
# Create dense matrix
date_product_matrix = dates.join(products, how="cross")
print(date_product_matrix)
This creates a scaffold you can left-join actual sales data onto, with nulls representing zero-sale days.
Filtering After Cross Join
The most common cross join pattern involves immediate filtering. You generate all combinations, then keep only those meeting specific criteria.
import polars as pl
# Employees with their skills
employees = pl.DataFrame({
    "employee_id": [1, 2, 3, 4],
    "name": ["Alice", "Bob", "Carol", "David"],
    "department": ["Engineering", "Engineering", "Marketing", "Marketing"],
    "skill_level": [3, 2, 3, 1]
})
# Projects with requirements
projects = pl.DataFrame({
    "project_id": ["P1", "P2", "P3"],
    "project_name": ["API Rebuild", "Brand Campaign", "Data Pipeline"],
    "required_department": ["Engineering", "Marketing", "Engineering"],
    "min_skill_level": [2, 2, 3]
})
# Find all valid employee-project assignments
assignments = (
    employees.lazy()
    .join(projects.lazy(), how="cross")
    .filter(
        (pl.col("department") == pl.col("required_department")) &
        (pl.col("skill_level") >= pl.col("min_skill_level"))
    )
    .select([
        "name",
        "project_name",
        "department",
        "skill_level"
    ])
    .collect()
)
print(assignments)
Output:
shape: (4, 4)
┌───────┬────────────────┬─────────────┬─────────────┐
│ name ┆ project_name ┆ department ┆ skill_level │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ i64 │
╞═══════╪════════════════╪═════════════╪═════════════╡
│ Alice ┆ API Rebuild ┆ Engineering ┆ 3 │
│ Alice ┆ Data Pipeline ┆ Engineering ┆ 3 │
│ Bob ┆ API Rebuild ┆ Engineering ┆ 2 │
│ Carol ┆ Brand Campaign ┆ Marketing ┆ 3 │
└───────┴────────────────┴─────────────┴─────────────┘
The cross join created 12 combinations (4 employees × 3 projects), and filtering reduced it to 4 valid assignments.
Performance Considerations
Cross joins are dangerous. The output size grows multiplicatively, and it’s easy to accidentally create billions of rows.
import polars as pl
# Demonstrate row explosion
def calculate_cross_join_size(rows_a: int, rows_b: int) -> None:
    result_rows = rows_a * rows_b
    # Rough estimate: 100 bytes per row
    estimated_mb = (result_rows * 100) / (1024 * 1024)
    print(f"Table A: {rows_a:,} rows")
    print(f"Table B: {rows_b:,} rows")
    print(f"Cross join result: {result_rows:,} rows")
    print(f"Estimated memory: {estimated_mb:,.0f} MB")
# Seemingly small tables
calculate_cross_join_size(10_000, 10_000)
Output:
Table A: 10,000 rows
Table B: 10,000 rows
Cross join result: 100,000,000 rows
Estimated memory: 9,537 MB
Two 10,000-row tables produce 100 million rows. That’s nearly 10 GB of memory for what seemed like modest inputs.
Mitigation strategies:
- Filter early: If possible, reduce input tables before the cross join
- Use lazy evaluation: Polars can sometimes optimize filter predicates
- Sample for testing: Develop with small subsets before running on full data
- Question the approach: If you need a cross join of large tables, reconsider whether there’s a better algorithm
Compared to pandas’ merge(how='cross'), Polars generally performs better due to its Rust backend and lazy optimization. But no library can save you from algorithmic complexity—a 10,000 × 10,000 cross join produces 100 million rows regardless of implementation.
Conclusion
Cross joins in Polars use straightforward syntax: df1.join(df2, how="cross") with no join keys. The operation produces every possible row combination, making it ideal for combinatorial problems like parameter grids, dense matrices, and pairwise comparisons.
For production code, prefer LazyFrames. The lazy API lets Polars optimize your query plan, which matters when you filter after cross joining. Always respect the multiplicative nature of cross joins—small inputs create large outputs, and large inputs create impossibly large outputs.
Use cross joins when you genuinely need all combinations. When you don’t, use a more targeted join type. The Polars documentation on joins covers additional options for when cross joins aren’t the right tool.