How to Cross Join in Pandas
Key Insights
- Pandas 1.2.0 introduced `merge(how='cross')`, making cross joins a one-liner. Use this method unless you're stuck on an older version.
- Cross joins produce m × n rows, so a join between two 1,000-row DataFrames creates 1,000,000 rows. Always calculate the output size before executing.
- The most common legitimate use cases are generating parameter grids, creating test data combinations, and building pricing or scheduling matrices.
What Is a Cross Join?
A cross join, also called a Cartesian product, combines every row from one table with every row from another table. If DataFrame A has 3 rows and DataFrame B has 4 rows, the result contains 12 rows—every possible pairing.
Unlike inner, left, or outer joins, cross joins don’t require a matching key. You’re not looking for relationships; you’re generating combinations.
Common use cases include:
- Parameter grids: Testing all combinations of hyperparameters in machine learning
- Test data generation: Creating comprehensive test cases from input variations
- Pricing matrices: Combining products with regions, currencies, or customer tiers
- Scheduling: Pairing employees with time slots or tasks with resources
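The parameter-grid case maps directly onto chained cross joins. A minimal sketch (the hyperparameter names and values here are illustrative):

```python
import pandas as pd

# Hypothetical hyperparameter values to combine
learning_rates = pd.DataFrame({'lr': [0.01, 0.1]})
batch_sizes = pd.DataFrame({'batch_size': [32, 64]})
depths = pd.DataFrame({'depth': [2, 4, 8]})

# Chain cross joins to enumerate every combination: 2 x 2 x 3 = 12 rows
grid = learning_rates.merge(batch_sizes, how='cross').merge(depths, how='cross')
print(len(grid))  # 12
```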
Let’s explore the different ways to accomplish this in Pandas.
The Modern Approach: Using merge() with how='cross'
Pandas 1.2.0 (released December 2020) introduced native cross join support. This is now the recommended approach for any modern codebase.
```python
import pandas as pd

# Create sample DataFrames
colors = pd.DataFrame({'color': ['red', 'blue', 'green']})
sizes = pd.DataFrame({'size': ['S', 'M', 'L', 'XL']})

# Cross join with how='cross'
combinations = pd.merge(colors, sizes, how='cross')
print(combinations)
```
Output:
```
    color size
0     red    S
1     red    M
2     red    L
3     red   XL
4    blue    S
5    blue    M
6    blue    L
7    blue   XL
8   green    S
9   green    M
10  green    L
11  green   XL
```
The syntax is clean and self-documenting. Anyone reading your code immediately understands the intent. No dummy columns, no workarounds—just declare what you want.
You can also cross join DataFrames with multiple columns:
```python
products = pd.DataFrame({
    'product_id': [101, 102],
    'product_name': ['Widget', 'Gadget']
})
regions = pd.DataFrame({
    'region_code': ['NA', 'EU', 'APAC'],
    'currency': ['USD', 'EUR', 'JPY']
})

product_regions = pd.merge(products, regions, how='cross')
print(product_regions)
```
Output:
```
   product_id product_name region_code currency
0         101       Widget          NA      USD
1         101       Widget          EU      EUR
2         101       Widget        APAC      JPY
3         102       Gadget          NA      USD
4         102       Gadget          EU      EUR
5         102       Gadget        APAC      JPY
```
This gives you every product-region combination with all associated columns preserved.
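The same call also works for cross joining a DataFrame with itself, which is handy for building pairwise comparisons. When both sides share column names, `merge` disambiguates them via the `suffixes` parameter (the team data below is illustrative):

```python
import pandas as pd

teams = pd.DataFrame({'team': ['A', 'B', 'C']})

# Pair every team with every team, including itself: 3 x 3 = 9 rows
matchups = pd.merge(teams, teams, how='cross', suffixes=('_home', '_away'))

# Optionally drop self-pairings, leaving 6 distinct matchups
matchups = matchups[matchups['team_home'] != matchups['team_away']]
print(len(matchups))  # 6
```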
Legacy Method: Temporary Key Column
If you’re maintaining code that runs on Pandas versions before 1.2.0, you’ll need a workaround. The standard technique involves adding a temporary constant column to both DataFrames, joining on that column, then removing it.
```python
import pandas as pd

colors = pd.DataFrame({'color': ['red', 'blue', 'green']})
sizes = pd.DataFrame({'size': ['S', 'M', 'L', 'XL']})

# Add a temporary key column to both DataFrames
colors['_tmp_key'] = 1
sizes['_tmp_key'] = 1

# Merge on the temporary key
combinations = pd.merge(colors, sizes, on='_tmp_key')

# Remove the temporary key
combinations = combinations.drop('_tmp_key', axis=1)
print(combinations)
```
This produces identical output to the how='cross' method. The trick works because every row in both DataFrames shares the same key value (1), so every row on one side matches every row on the other.
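If you want to confirm the two approaches agree, a quick equality check works, since both methods happen to produce rows in the same order here:

```python
import pandas as pd

colors = pd.DataFrame({'color': ['red', 'blue', 'green']})
sizes = pd.DataFrame({'size': ['S', 'M', 'L', 'XL']})

# Modern method
modern = pd.merge(colors, sizes, how='cross')

# Legacy method, run on copies so the originals stay untouched
c, s = colors.copy(), sizes.copy()
c['_tmp_key'] = 1
s['_tmp_key'] = 1
legacy = pd.merge(c, s, on='_tmp_key').drop('_tmp_key', axis=1)

print(modern.equals(legacy))  # True
```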
You can wrap this in a reusable function for cleaner code:
```python
def cross_join_legacy(df1, df2):
    """
    Perform a cross join on two DataFrames.
    Compatible with Pandas versions before 1.2.0.
    """
    tmp_key = '_cross_join_key'
    df1 = df1.copy()
    df2 = df2.copy()
    df1[tmp_key] = 1
    df2[tmp_key] = 1
    result = pd.merge(df1, df2, on=tmp_key)
    result = result.drop(tmp_key, axis=1)
    return result

# Usage
combinations = cross_join_legacy(colors, sizes)
```
The .copy() calls prevent modifying the original DataFrames—a defensive practice that avoids subtle bugs.
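To see why the copies matter: without them, the temporary key would be written straight into the caller's DataFrames. A small demonstration of the buggy variant (the function name is illustrative):

```python
import pandas as pd

def cross_join_no_copy(df1, df2):
    # BUG: mutates the caller's DataFrames in place
    df1['_k'] = 1
    df2['_k'] = 1
    return pd.merge(df1, df2, on='_k').drop('_k', axis=1)

colors = pd.DataFrame({'color': ['red', 'blue']})
sizes = pd.DataFrame({'size': ['S', 'M']})

cross_join_no_copy(colors, sizes)

# The leftover key pollutes later operations on the originals
print(colors.columns.tolist())  # ['color', '_k']
```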
Alternative: Using itertools.product()
Python’s itertools.product() function generates Cartesian products natively. You can leverage this by converting DataFrame rows to tuples, computing the product, and reconstructing a DataFrame.
```python
import pandas as pd
from itertools import product

colors = pd.DataFrame({'color': ['red', 'blue', 'green']})
sizes = pd.DataFrame({'size': ['S', 'M', 'L', 'XL']})

# Generate the Cartesian product with itertools
color_values = colors['color'].tolist()
size_values = sizes['size'].tolist()
combinations_list = list(product(color_values, size_values))

# Convert back to a DataFrame
combinations = pd.DataFrame(combinations_list, columns=['color', 'size'])
print(combinations)
```
This approach works well for simple cases but becomes cumbersome with multiple columns:
```python
import pandas as pd
from itertools import product

products = pd.DataFrame({
    'product_id': [101, 102],
    'product_name': ['Widget', 'Gadget']
})
regions = pd.DataFrame({
    'region_code': ['NA', 'EU', 'APAC'],
    'currency': ['USD', 'EUR', 'JPY']
})

# Convert rows to tuples
product_rows = [tuple(row) for row in products.values]
region_rows = [tuple(row) for row in regions.values]

# Generate the product, concatenating each pair of tuples
combinations_list = [p + r for p, r in product(product_rows, region_rows)]

# Reconstruct a DataFrame with the combined columns
all_columns = list(products.columns) + list(regions.columns)
combinations = pd.DataFrame(combinations_list, columns=all_columns)
print(combinations)
```
I recommend this method only when you’re working outside Pandas anyway or need fine-grained control over the iteration process. For typical DataFrame operations, stick with merge().
Performance Considerations
Cross joins are memory-intensive operations. The output size grows multiplicatively, not additively. This catches many developers off guard.
```python
def calculate_cross_join_size(df1_rows, df2_rows, bytes_per_row=100):
    """
    Estimate the memory impact of a cross join.
    """
    output_rows = df1_rows * df2_rows
    estimated_memory_mb = (output_rows * bytes_per_row) / (1024 * 1024)
    return output_rows, estimated_memory_mb

# Example calculations
scenarios = [
    (100, 100),
    (1_000, 1_000),
    (10_000, 10_000),
    (100_000, 100_000),
]

print("DataFrame Sizes -> Output Rows (Estimated Memory)")
print("-" * 55)
for rows1, rows2 in scenarios:
    output_rows, memory_mb = calculate_cross_join_size(rows1, rows2)
    print(f"{rows1:>7,} × {rows2:>7,} -> {output_rows:>15,} rows (~{memory_mb:,.0f} MB)")
```
Output:
```
DataFrame Sizes -> Output Rows (Estimated Memory)
-------------------------------------------------------
    100 ×     100 ->          10,000 rows (~1 MB)
  1,000 ×   1,000 ->       1,000,000 rows (~95 MB)
 10,000 ×  10,000 ->     100,000,000 rows (~9,537 MB)
100,000 × 100,000 ->  10,000,000,000 rows (~953,674 MB)
```
That last scenario would require nearly a terabyte of memory. Before executing any cross join, calculate the expected output size:
```python
expected_rows = len(df1) * len(df2)
print(f"Cross join will produce {expected_rows:,} rows")

if expected_rows > 10_000_000:
    print("WARNING: Consider filtering inputs or using chunked processing")
```
If you need a cross join on large datasets, consider these strategies:
- Filter first: Reduce input DataFrames before joining
- Process in chunks: Generate combinations in batches
- Use a database: Let PostgreSQL or another RDBMS handle the heavy lifting
- Reconsider the approach: Sometimes a cross join indicates a design problem
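The chunked strategy can be sketched as a generator that cross joins one slice of the left DataFrame at a time, so only chunk_size × len(df2) rows are in memory at once (the function name and chunk size below are illustrative):

```python
import pandas as pd

def cross_join_chunked(df1, df2, chunk_size=10_000):
    """Yield the cross join of df1 and df2 in row-limited batches."""
    for start in range(0, len(df1), chunk_size):
        chunk = df1.iloc[start:start + chunk_size]
        yield pd.merge(chunk, df2, how='cross')

# Each batch can be aggregated, filtered, or written to disk, then discarded
left = pd.DataFrame({'a': range(25)})
right = pd.DataFrame({'b': range(4)})
total = sum(len(batch) for batch in cross_join_chunked(left, right, chunk_size=10))
print(total)  # 100
```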
Practical Example: Building a Pricing Matrix
Let’s walk through a realistic scenario. You’re building an e-commerce system that needs region-specific pricing for all products across multiple customer tiers.
```python
import pandas as pd

# Product catalog
products = pd.DataFrame({
    'sku': ['LAPTOP-001', 'PHONE-002', 'TABLET-003', 'WATCH-004'],
    'base_price': [999.99, 799.99, 499.99, 299.99],
    'category': ['Electronics', 'Electronics', 'Electronics', 'Wearables']
})

# Regional pricing factors
regions = pd.DataFrame({
    'region': ['US', 'EU', 'UK', 'JP'],
    'currency': ['USD', 'EUR', 'GBP', 'JPY'],
    'price_multiplier': [1.0, 0.95, 0.88, 110.0]  # JP uses yen conversion
})

# Customer tiers with discounts
tiers = pd.DataFrame({
    'tier': ['Standard', 'Premium', 'Enterprise'],
    'discount_pct': [0, 10, 20]
})

# Step 1: Cross join products with regions
product_regions = pd.merge(products, regions, how='cross')

# Step 2: Cross join the result with customer tiers
pricing_matrix = pd.merge(product_regions, tiers, how='cross')

# Step 3: Calculate final prices
pricing_matrix['regional_price'] = (
    pricing_matrix['base_price'] * pricing_matrix['price_multiplier']
)
pricing_matrix['final_price'] = (
    pricing_matrix['regional_price'] * (1 - pricing_matrix['discount_pct'] / 100)
)

# Round for display
pricing_matrix['final_price'] = pricing_matrix['final_price'].round(2)

print(f"Total pricing combinations: {len(pricing_matrix)}")
print("\nSample output (first 10 rows):")
print(pricing_matrix[['sku', 'region', 'tier', 'currency', 'final_price']].head(10))
```
Output:
```
Total pricing combinations: 48

Sample output (first 10 rows):
          sku region        tier currency  final_price
0  LAPTOP-001     US    Standard      USD       999.99
1  LAPTOP-001     US     Premium      USD       899.99
2  LAPTOP-001     US  Enterprise      USD       799.99
3  LAPTOP-001     EU    Standard      EUR       949.99
4  LAPTOP-001     EU     Premium      EUR       854.99
5  LAPTOP-001     EU  Enterprise      EUR       759.99
6  LAPTOP-001     UK    Standard      GBP       879.99
7  LAPTOP-001     UK     Premium      GBP       791.99
8  LAPTOP-001     UK  Enterprise      GBP       703.99
9  LAPTOP-001     JP    Standard      JPY    109998.90
```
This pattern—cross join followed by calculated columns—is the foundation for many business applications involving combinatorial data.
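A common follow-up to this pattern is pruning combinations that don't apply. A small sketch on simplified pricing data (the exclusion rule here is made up purely for illustration):

```python
import pandas as pd

products = pd.DataFrame({
    'sku': ['LAPTOP-001', 'WATCH-004'],
    'category': ['Electronics', 'Wearables']
})
regions = pd.DataFrame({'region': ['US', 'JP']})

# Cross join yields 2 x 2 = 4 combinations
matrix = pd.merge(products, regions, how='cross')

# Hypothetical business rule: Wearables are not sold in JP
matrix = matrix[~((matrix['category'] == 'Wearables') & (matrix['region'] == 'JP'))]
print(len(matrix))  # 3
```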
Summary
Here’s a quick reference for choosing your cross join method:
| Method | Pandas Version | Readability | Performance | Best For |
|---|---|---|---|---|
| `merge(how='cross')` | 1.2.0+ | Excellent | Good | All modern projects |
| Temporary key column | Any | Moderate | Good | Legacy codebases |
| `itertools.product()` | Any | Poor | Variable | Non-DataFrame workflows |
My recommendation: Use merge(how='cross') unless you have a specific reason not to. It’s the clearest expression of intent, requires no cleanup, and performs well for reasonable dataset sizes.
Always calculate your expected output size before executing. Cross joins are powerful but unforgiving—a moment of carelessness can crash your kernel or exhaust your memory. Filter your inputs, validate your assumptions, and test with small samples first.