How to Cross Join in Pandas

A cross join, also called a Cartesian product, combines every row from one table with every row from another table. If DataFrame A has 3 rows and DataFrame B has 4 rows, the result contains 12...

Key Insights

  • Pandas 1.2.0 introduced merge(how='cross'), making cross joins a one-liner—use this method unless you’re stuck on an older version.
  • Cross joins produce m × n rows, so a join between two 1,000-row DataFrames creates 1,000,000 rows—always calculate the output size before executing.
  • The most common legitimate use cases are generating parameter grids, creating test data combinations, and building pricing or scheduling matrices.

What Is a Cross Join?

A cross join, also called a Cartesian product, combines every row from one table with every row from another table. If DataFrame A has 3 rows and DataFrame B has 4 rows, the result contains 12 rows—every possible pairing.

Unlike inner, left, or outer joins, cross joins don’t require a matching key. You’re not looking for relationships; you’re generating combinations.

Common use cases include:

  • Parameter grids: Testing all combinations of hyperparameters in machine learning
  • Test data generation: Creating comprehensive test cases from input variations
  • Pricing matrices: Combining products with regions, currencies, or customer tiers
  • Scheduling: Pairing employees with time slots or tasks with resources

Let’s explore the different ways to accomplish this in Pandas.

The Modern Approach: Using merge() with how=‘cross’

Pandas 1.2.0 (released December 2020) introduced native cross join support. This is now the recommended approach for any modern codebase.

import pandas as pd

# Create sample DataFrames
colors = pd.DataFrame({'color': ['red', 'blue', 'green']})
sizes = pd.DataFrame({'size': ['S', 'M', 'L', 'XL']})

# Cross join with how='cross'
combinations = pd.merge(colors, sizes, how='cross')
print(combinations)

Output:

   color size
0    red    S
1    red    M
2    red    L
3    red   XL
4   blue    S
5   blue    M
6   blue    L
7   blue   XL
8  green    S
9  green    M
10 green    L
11 green   XL

The syntax is clean and self-documenting. Anyone reading your code immediately understands the intent. No dummy columns, no workarounds—just declare what you want.

You can also cross join DataFrames with multiple columns:

products = pd.DataFrame({
    'product_id': [101, 102],
    'product_name': ['Widget', 'Gadget']
})

regions = pd.DataFrame({
    'region_code': ['NA', 'EU', 'APAC'],
    'currency': ['USD', 'EUR', 'JPY']
})

product_regions = pd.merge(products, regions, how='cross')
print(product_regions)

Output:

   product_id product_name region_code currency
0         101       Widget          NA      USD
1         101       Widget          EU      EUR
2         101       Widget        APAC      JPY
3         102       Gadget          NA      USD
4         102       Gadget          EU      EUR
5         102       Gadget        APAC      JPY

This gives you every product-region combination with all associated columns preserved.

Legacy Method: Temporary Key Column

If you’re maintaining code that runs on Pandas versions before 1.2.0, you’ll need a workaround. The standard technique involves adding a temporary constant column to both DataFrames, joining on that column, then removing it.

import pandas as pd

colors = pd.DataFrame({'color': ['red', 'blue', 'green']})
sizes = pd.DataFrame({'size': ['S', 'M', 'L', 'XL']})

# Add temporary key column to both DataFrames
colors['_tmp_key'] = 1
sizes['_tmp_key'] = 1

# Merge on the temporary key
combinations = pd.merge(colors, sizes, on='_tmp_key')

# Remove the temporary key
combinations = combinations.drop('_tmp_key', axis=1)

print(combinations)

This produces identical output to the how='cross' method. The trick works because every row in both DataFrames shares the same key value (1), so every row matches every other row.

You can wrap this in a reusable function for cleaner code:

def cross_join_legacy(df1, df2):
    """
    Perform a cross join on two DataFrames.
    Compatible with Pandas versions before 1.2.0.
    """
    tmp_key = '_cross_join_key'
    
    df1 = df1.copy()
    df2 = df2.copy()
    
    df1[tmp_key] = 1
    df2[tmp_key] = 1
    
    result = pd.merge(df1, df2, on=tmp_key)
    result = result.drop(tmp_key, axis=1)
    
    return result

# Usage
combinations = cross_join_legacy(colors, sizes)

The .copy() calls prevent modifying the original DataFrames—a defensive practice that avoids subtle bugs.

Alternative: Using itertools.product()

Python’s itertools.product() function generates Cartesian products natively. You can leverage this by converting DataFrame rows to tuples, computing the product, and reconstructing a DataFrame.

import pandas as pd
from itertools import product

colors = pd.DataFrame({'color': ['red', 'blue', 'green']})
sizes = pd.DataFrame({'size': ['S', 'M', 'L', 'XL']})

# Generate Cartesian product using itertools
color_values = colors['color'].tolist()
size_values = sizes['size'].tolist()

combinations_list = list(product(color_values, size_values))

# Convert back to DataFrame
combinations = pd.DataFrame(combinations_list, columns=['color', 'size'])
print(combinations)

This approach works well for simple cases but becomes cumbersome with multiple columns:

import pandas as pd
from itertools import product

products = pd.DataFrame({
    'product_id': [101, 102],
    'product_name': ['Widget', 'Gadget']
})

regions = pd.DataFrame({
    'region_code': ['NA', 'EU', 'APAC'],
    'currency': ['USD', 'EUR', 'JPY']
})

# Convert rows to tuples
product_rows = [tuple(row) for row in products.values]
region_rows = [tuple(row) for row in regions.values]

# Generate product
combinations_list = [p + r for p, r in product(product_rows, region_rows)]

# Reconstruct DataFrame with combined columns
all_columns = list(products.columns) + list(regions.columns)
combinations = pd.DataFrame(combinations_list, columns=all_columns)
print(combinations)

I recommend this method only when you’re working outside Pandas anyway or need fine-grained control over the iteration process. For typical DataFrame operations, stick with merge().

Performance Considerations

Cross joins are memory-intensive operations. The output size grows multiplicatively, not additively. This catches many developers off guard.

import pandas as pd

def calculate_cross_join_size(df1_rows, df2_rows, bytes_per_row=100):
    """
    Estimate the memory impact of a cross join.
    """
    output_rows = df1_rows * df2_rows
    estimated_memory_mb = (output_rows * bytes_per_row) / (1024 * 1024)
    
    return output_rows, estimated_memory_mb

# Example calculations
scenarios = [
    (100, 100),
    (1_000, 1_000),
    (10_000, 10_000),
    (100_000, 100_000),
]

print("DataFrame Sizes -> Output Rows (Estimated Memory)")
print("-" * 55)

for rows1, rows2 in scenarios:
    output_rows, memory_mb = calculate_cross_join_size(rows1, rows2)
    print(f"{rows1:>7,} × {rows2:>7,} -> {output_rows:>15,} rows (~{memory_mb:,.0f} MB)")

Output:

DataFrame Sizes -> Output Rows (Estimated Memory)
-------------------------------------------------------
    100 ×     100 ->          10,000 rows (~1 MB)
  1,000 ×   1,000 ->       1,000,000 rows (~95 MB)
 10,000 ×  10,000 ->     100,000,000 rows (~9,537 MB)
100,000 × 100,000 ->  10,000,000,000 rows (~953,674 MB)

That last scenario would require nearly a terabyte of memory. Before executing any cross join, calculate the expected output size:

expected_rows = len(df1) * len(df2)
print(f"Cross join will produce {expected_rows:,} rows")

if expected_rows > 10_000_000:
    print("WARNING: Consider filtering inputs or using chunked processing")

If you need a cross join on large datasets, consider these strategies:

  1. Filter first: Reduce input DataFrames before joining
  2. Process in chunks: Generate combinations in batches
  3. Use a database: Let PostgreSQL or another RDBMS handle the heavy lifting
  4. Reconsider the approach: Sometimes a cross join indicates a design problem

Practical Example: Building a Pricing Matrix

Let’s walk through a realistic scenario. You’re building an e-commerce system that needs region-specific pricing for all products across multiple customer tiers.

import pandas as pd

# Product catalog
products = pd.DataFrame({
    'sku': ['LAPTOP-001', 'PHONE-002', 'TABLET-003', 'WATCH-004'],
    'base_price': [999.99, 799.99, 499.99, 299.99],
    'category': ['Electronics', 'Electronics', 'Electronics', 'Wearables']
})

# Regional pricing factors
regions = pd.DataFrame({
    'region': ['US', 'EU', 'UK', 'JP'],
    'currency': ['USD', 'EUR', 'GBP', 'JPY'],
    'price_multiplier': [1.0, 0.95, 0.88, 110.0]  # JP uses yen conversion
})

# Customer tiers with discounts
tiers = pd.DataFrame({
    'tier': ['Standard', 'Premium', 'Enterprise'],
    'discount_pct': [0, 10, 20]
})

# Step 1: Cross join products with regions
product_regions = pd.merge(products, regions, how='cross')

# Step 2: Cross join result with customer tiers
pricing_matrix = pd.merge(product_regions, tiers, how='cross')

# Step 3: Calculate final prices
pricing_matrix['regional_price'] = (
    pricing_matrix['base_price'] * pricing_matrix['price_multiplier']
)
pricing_matrix['final_price'] = (
    pricing_matrix['regional_price'] * (1 - pricing_matrix['discount_pct'] / 100)
)

# Round for display
pricing_matrix['final_price'] = pricing_matrix['final_price'].round(2)

print(f"Total pricing combinations: {len(pricing_matrix)}")
print("\nSample output (first 10 rows):")
print(pricing_matrix[['sku', 'region', 'tier', 'currency', 'final_price']].head(10))

Output:

Total pricing combinations: 48

Sample output (first 10 rows):
          sku region        tier currency  final_price
0  LAPTOP-001     US    Standard      USD       999.99
1  LAPTOP-001     US     Premium      USD       899.99
2  LAPTOP-001     US  Enterprise      USD       799.99
3  LAPTOP-001     EU    Standard      EUR       949.99
4  LAPTOP-001     EU     Premium      EUR       854.99
5  LAPTOP-001     EU  Enterprise      EUR       759.99
6  LAPTOP-001     UK    Standard      GBP       879.99
7  LAPTOP-001     UK     Premium      GBP       791.99
8  LAPTOP-001     UK  Enterprise      GBP       703.99
9  LAPTOP-001     JP    Standard      JPY    109998.90

This pattern—cross join followed by calculated columns—is the foundation for many business applications involving combinatorial data.

Summary

Here’s a quick reference for choosing your cross join method:

Method Pandas Version Readability Performance Best For
merge(how='cross') 1.2.0+ Excellent Good All modern projects
Temporary key column Any Moderate Good Legacy codebases
itertools.product() Any Poor Variable Non-DataFrame workflows

My recommendation: Use merge(how='cross') unless you have a specific reason not to. It’s the clearest expression of intent, requires no cleanup, and performs well for reasonable dataset sizes.

Always calculate your expected output size before executing. Cross joins are powerful but unforgiving—a moment of carelessness can crash your kernel or exhaust your memory. Filter your inputs, validate your assumptions, and test with small samples first.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.