How to Cross Join in PySpark

Key Insights

  • Cross joins produce a Cartesian product where every row from the first DataFrame pairs with every row from the second, making output size grow multiplicatively (1,000 × 1,000 = 1,000,000 rows)
  • Spark 2.x disables implicit cross joins by default—you must call crossJoin() explicitly or enable spark.sql.crossJoin.enabled (on by default since Spark 3.0); explicit crossJoin() remains the safest way to prevent accidental data explosions
  • Always broadcast the smaller DataFrame and filter results as early as possible to keep cross join operations manageable in production

What Is a Cross Join?

A cross join, also called a Cartesian product, combines every row from one dataset with every row from another. Unlike inner or left joins that match rows based on key columns, cross joins have no matching condition—they simply produce all possible combinations.

If DataFrame A has 100 rows and DataFrame B has 50 rows, a cross join produces 5,000 rows. This multiplicative relationship makes cross joins both powerful and dangerous. You need them when generating exhaustive combinations, but they can easily overwhelm your cluster if used carelessly.
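
The multiplicative growth is the same Cartesian product you get from Python's own itertools.product. This plain-Python sketch (not PySpark) just illustrates the row-count math:

```python
from itertools import product

# 100 rows × 50 rows -> 5,000 combinations, exactly what a cross join would produce
rows_a = range(100)  # stand-in for DataFrame A's rows
rows_b = range(50)   # stand-in for DataFrame B's rows

combinations = list(product(rows_a, rows_b))
print(len(combinations))  # 5000 = 100 * 50
```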

Common use cases include generating all date-product combinations for sales forecasting, creating test matrices that cover every parameter combination, building user-item pairs for recommendation systems, and expanding dimension tables for reporting.

Cross Join Syntax in PySpark

PySpark provides two approaches for cross joins: the explicit crossJoin() method and the join() method with no condition.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CrossJoinDemo").getOrCreate()

# Create sample DataFrames
colors = spark.createDataFrame([
    ("red",),
    ("blue",),
    ("green",)
], ["color"])

sizes = spark.createDataFrame([
    ("small",),
    ("medium",),
    ("large",)
], ["size"])

# Method 1: Explicit crossJoin()
combinations = colors.crossJoin(sizes)
combinations.show()

Output:

+-----+------+
|color|  size|
+-----+------+
|  red| small|
|  red|medium|
|  red| large|
| blue| small|
| blue|medium|
| blue| large|
|green| small|
|green|medium|
|green| large|
+-----+------+

The explicit crossJoin() method is the preferred approach. It communicates intent clearly and works without additional configuration.

# Method 2: join() with no condition (requires spark.sql.crossJoin.enabled in Spark 2.x)
spark.conf.set("spark.sql.crossJoin.enabled", "true")
combinations_alt = colors.join(sizes)

I recommend always using crossJoin(). The second method depends on a configuration flag (required in Spark 2.x) and hides your intent. Explicit is better than implicit.

Practical Use Cases

Cross joins shine when you need exhaustive combinations. Here’s a realistic example: creating a calendar-product grid for sales analysis where you want a row for every product on every date, even when no sales occurred.

from pyspark.sql.functions import explode, sequence, to_date, col

# Generate date range
dates = spark.createDataFrame([
    ("2024-01-01", "2024-01-07")
], ["start_date", "end_date"])

date_range = dates.select(
    explode(
        sequence(
            to_date(col("start_date")),
            to_date(col("end_date"))
        )
    ).alias("date")
)

# Product catalog
products = spark.createDataFrame([
    ("SKU001", "Widget A", "Electronics"),
    ("SKU002", "Widget B", "Electronics"),
    ("SKU003", "Gadget X", "Accessories")
], ["sku", "product_name", "category"])

# Create the grid: every product × every date
sales_grid = products.crossJoin(date_range)
sales_grid.show(10)

Output:

+------+------------+-----------+----------+
|   sku|product_name|   category|      date|
+------+------------+-----------+----------+
|SKU001|    Widget A|Electronics|2024-01-01|
|SKU001|    Widget A|Electronics|2024-01-02|
|SKU001|    Widget A|Electronics|2024-01-03|
|SKU001|    Widget A|Electronics|2024-01-04|
|SKU001|    Widget A|Electronics|2024-01-05|
|SKU001|    Widget A|Electronics|2024-01-06|
|SKU001|    Widget A|Electronics|2024-01-07|
|SKU002|    Widget B|Electronics|2024-01-01|
|SKU002|    Widget B|Electronics|2024-01-02|
|SKU002|    Widget B|Electronics|2024-01-03|
+------+------------+-----------+----------+

This grid becomes the foundation for left-joining actual sales data, ensuring you capture zero-sale days in your analysis.
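
In PySpark that final step would be a left join from the grid onto your sales DataFrame followed by a zero-fill, roughly sales_grid.join(sales, ["sku", "date"], "left").fillna(0), where sales is a hypothetical DataFrame of actual transactions. The gap-filling idea itself can be sketched in plain Python (the sales figures below are made up for illustration):

```python
from itertools import product

# Hypothetical sparse sales: only days with actual sales appear
sales = {("SKU001", "2024-01-02"): 5, ("SKU002", "2024-01-03"): 2}

skus = ["SKU001", "SKU002"]
dates = [f"2024-01-0{d}" for d in range(1, 8)]

# Cross-join-style grid, then a "left join" via dict lookup with a zero default
grid = {(sku, date): sales.get((sku, date), 0) for sku, date in product(skus, dates)}

print(len(grid))                       # 14 rows: 2 SKUs × 7 dates
print(grid[("SKU001", "2024-01-01")])  # 0 — a zero-sale day is preserved
```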

Another common pattern is generating parameter combinations for testing or simulation:

# Model hyperparameter grid
learning_rates = spark.createDataFrame([(0.01,), (0.1,), (1.0,)], ["learning_rate"])
batch_sizes = spark.createDataFrame([(32,), (64,), (128,)], ["batch_size"])
epochs = spark.createDataFrame([(10,), (50,), (100,)], ["epochs"])

# All combinations to test
param_grid = learning_rates.crossJoin(batch_sizes).crossJoin(epochs)
print(f"Total combinations: {param_grid.count()}")  # 27 combinations

Performance Considerations and Warnings

Cross joins are expensive. The output size grows as O(n × m), and Spark must shuffle data across the cluster to produce every combination. A cross join between two 10,000-row DataFrames produces 100 million rows. Between two 100,000-row DataFrames? 10 billion rows.

Spark 2.x protects you from accidental cross joins (Spark 3.x relaxed this default, though the guard can be restored). If you use join() without a condition while spark.sql.crossJoin.enabled is false, Spark throws an error:

# Reproduce the Spark 2.x default behavior
spark.conf.set("spark.sql.crossJoin.enabled", "false")

try:
    result = colors.join(sizes)  # No join condition
    result.show()
except Exception as e:
    print(f"Error: {e}")
# AnalysisException: Detected implicit cartesian product...

Always calculate expected output size before executing:

def estimate_cross_join_size(df1, df2):
    """Estimate cross join output size and warn if large."""
    count1 = df1.count()
    count2 = df2.count()
    result_size = count1 * count2
    
    print(f"DataFrame 1: {count1:,} rows")
    print(f"DataFrame 2: {count2:,} rows")
    print(f"Cross join result: {result_size:,} rows")
    
    if result_size > 10_000_000:
        print("WARNING: Result exceeds 10M rows. Consider filtering or sampling.")
    
    return result_size

# Check before executing
estimate_cross_join_size(products, date_range)

Memory pressure is the primary concern. Each executor must hold portions of both DataFrames to produce combinations. Monitor your Spark UI for spill to disk, which indicates memory exhaustion and dramatically slows processing.
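
A back-of-envelope size estimate helps you anticipate that pressure before the job runs. This plain-Python helper (the row counts and bytes-per-row figure are assumptions you would tune for your own schema) multiplies expected rows by an average serialized row size:

```python
def estimate_cross_join_memory(rows1: int, rows2: int, avg_row_bytes: int = 100) -> float:
    """Rough cross join output size in gigabytes."""
    total_rows = rows1 * rows2
    return total_rows * avg_row_bytes / 1e9

# Two 100,000-row DataFrames at ~100 bytes/row -> roughly a terabyte of output
print(f"{estimate_cross_join_memory(100_000, 100_000):.0f} GB")  # 1000 GB
```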

Optimizing Cross Joins

When you must perform a cross join on larger datasets, these techniques minimize resource consumption.

Broadcast the smaller DataFrame. Broadcasting sends the entire smaller DataFrame to every executor, eliminating shuffle overhead for that side of the join.

from pyspark.sql.functions import broadcast

# Small dimension table (fits in memory)
regions = spark.createDataFrame([
    ("NA", "North America"),
    ("EU", "Europe"),
    ("APAC", "Asia Pacific")
], ["region_code", "region_name"])

# Larger fact table
customers = spark.createDataFrame([
    (1, "Alice", "Premium"),
    (2, "Bob", "Standard"),
    (3, "Carol", "Premium"),
    # ... imagine thousands more
], ["customer_id", "name", "tier"])

# Broadcast the small table
all_customer_regions = customers.crossJoin(broadcast(regions))

The broadcast() hint tells Spark to send the entire regions DataFrame to all executors, regardless of its size—so reserve it for DataFrames that comfortably fit in executor memory. Automatic broadcasting, by contrast, is governed by spark.sql.autoBroadcastJoinThreshold (10MB by default).

Filter early. If you’ll filter the cross join result anyway, apply filters to input DataFrames first:

# Bad: Cross join everything, then filter
result = large_df1.crossJoin(large_df2).filter(col("category") == "Electronics")

# Good: Filter first, then cross join
filtered_df1 = large_df1.filter(col("category") == "Electronics")
result = filtered_df1.crossJoin(large_df2)

Partition strategically. When both sides are too large to broadcast, repartition the DataFrames before the cross join to balance work across executors:

# Repartition both sides before the cross join
df1_partitioned = large_df1.repartition(200)
df2_partitioned = large_df2.repartition(200)
result = df1_partitioned.crossJoin(df2_partitioned)

Alternatives to Full Cross Joins

Often, you don’t actually need every combination. Before reaching for crossJoin(), consider whether these alternatives solve your problem.

Conditional join instead of cross join + filter. If you’re cross joining and immediately filtering, you probably want a regular join:

# Inefficient: Cross join then filter
orders = spark.createDataFrame([
    (1, "SKU001", 100),
    (2, "SKU002", 200)
], ["order_id", "sku", "amount"])

products = spark.createDataFrame([
    ("SKU001", "Widget", 10.0),
    ("SKU002", "Gadget", 20.0),
    ("SKU003", "Gizmo", 30.0)
], ["sku", "name", "price"])

# Bad approach
bad_result = orders.crossJoin(products).filter(orders.sku == products.sku)

# Good approach: Direct join
good_result = orders.join(products, "sku")

Window functions for row comparisons. When comparing rows within the same DataFrame, window functions often eliminate the need for self-cross-joins:

from pyspark.sql.window import Window
from pyspark.sql.functions import lag, lead

# Instead of cross joining a table with itself to compare consecutive rows
window_spec = Window.orderBy("timestamp")

df_with_comparison = df.withColumn(
    "prev_value", lag("value").over(window_spec)
).withColumn(
    "next_value", lead("value").over(window_spec)
)

Sampling for exploratory analysis. During development, sample your DataFrames before cross joining:

# Sample 1% of each DataFrame for testing (fix the seed for reproducible runs)
sample1 = large_df1.sample(fraction=0.01, seed=42)
sample2 = large_df2.sample(fraction=0.01, seed=42)
test_result = sample1.crossJoin(sample2)

Conclusion

Cross joins are a specialized tool that generates every possible combination between two DataFrames. Use them when you genuinely need exhaustive pairing—date-product grids, parameter matrices, or user-item combinations for recommendations.

Remember the key principles: output size grows multiplicatively, so always calculate expected row counts before execution. Use the explicit crossJoin() method for clarity. Broadcast smaller DataFrames to reduce shuffle overhead. Filter input DataFrames before joining, not after. And question whether you truly need a cross join—often a conditional join or window function solves the actual problem more efficiently.

Cross joins aren’t inherently bad, but they demand respect. Understand the cost, optimize where possible, and your Spark jobs will thank you.
