Pandas - Cross Tabulation (crosstab)
• Cross tabulation transforms categorical data into frequency tables, revealing relationships between two or more variables that simple groupby operations miss
Key Insights
• Cross tabulation transforms categorical data into frequency tables, revealing relationships between two or more variables that simple groupby operations miss
• The pd.crosstab() function provides built-in normalization, margin calculations, and aggregation functions that eliminate manual pivot table configuration
• Understanding when to use crosstab versus pivot_table depends on whether you’re counting occurrences or aggregating existing numeric columns
Understanding Cross Tabulation Fundamentals
Cross tabulation creates a frequency distribution table showing the relationship between categorical variables. Unlike pivot_table(), which aggregates values from a column, crosstab() counts occurrences by default.
import pandas as pd
import numpy as np
# Sample dataset: customer purchases
data = {
'customer_id': range(1, 21),
'product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone',
'Tablet', 'Laptop', 'Phone', 'Tablet', 'Laptop',
'Phone', 'Tablet', 'Laptop', 'Phone', 'Tablet',
'Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone'],
'region': ['North', 'South', 'East', 'West', 'North',
'South', 'East', 'West', 'North', 'South',
'East', 'West', 'North', 'South', 'East',
'West', 'North', 'South', 'East', 'West'],
'payment_method': ['Credit', 'Debit', 'Credit', 'PayPal', 'Credit',
'Debit', 'PayPal', 'Credit', 'Debit', 'Credit',
'PayPal', 'Debit', 'Credit', 'PayPal', 'Debit',
'Credit', 'PayPal', 'Debit', 'Credit', 'PayPal']
}
df = pd.DataFrame(data)
# Basic crosstab: product by region
ct = pd.crosstab(df['product'], df['region'])
print(ct)
Output:
region East North South West
product
Laptop 2 2 1 2
Phone 2 2 2 2
Tablet 2 1 2 0
Normalization and Percentages
Normalization converts raw counts into proportions, essential for comparing distributions across different sample sizes.
# Normalize by rows (each row sums to 1)
ct_row_norm = pd.crosstab(df['product'], df['region'], normalize='index')
print(ct_row_norm)
print()
# Normalize by columns (each column sums to 1)
ct_col_norm = pd.crosstab(df['product'], df['region'], normalize='columns')
print(ct_col_norm)
print()
# Normalize by total (entire table sums to 1)
ct_all_norm = pd.crosstab(df['product'], df['region'], normalize='all')
print(ct_all_norm)
Output (row normalization):
region East North South West
product
Laptop 0.286 0.286 0.143 0.286
Phone 0.250 0.250 0.250 0.250
Tablet 0.400 0.200 0.400 0.000
This reveals that 40% of Tablet sales occur in the East, while Phones distribute evenly across regions.
Adding Margins and Subtotals
Margins provide row and column totals, offering context for interpreting frequencies.
# Add margins (totals)
ct_margins = pd.crosstab(df['product'], df['region'], margins=True, margins_name='Total')
print(ct_margins)
print()
# Combine margins with normalization
ct_norm_margins = pd.crosstab(
df['product'],
df['region'],
normalize='index',
margins=True
)
print(ct_norm_margins)
Output:
region East North South West Total
product
Laptop 2 2 1 2 7
Phone 2 2 2 2 8
Tablet 2 1 2 0 5
Total 6 5 5 4 20
Multi-Dimensional Cross Tabulation
Analyze relationships across three or more variables by passing lists to the index and columns parameters.
# Three-way crosstab: product and payment method by region
ct_3way = pd.crosstab(
[df['product'], df['payment_method']],
df['region']
)
print(ct_3way)
print()
# More readable with column hierarchies
ct_3way_cols = pd.crosstab(
df['product'],
[df['region'], df['payment_method']]
)
print(ct_3way_cols)
Output:
region East North South West
product payment_method
Laptop Credit 1 2 0 1
PayPal 1 0 0 1
Phone Credit 0 1 0 1
Debit 0 0 1 0
PayPal 2 1 1 1
Tablet Credit 1 0 0 0
Debit 1 0 2 0
Custom Aggregation Functions
While crosstab counts by default, you can aggregate numeric values using the values and aggfunc parameters.
# Add transaction amounts
df['amount'] = np.random.randint(100, 1000, size=len(df))
# Sum amounts by product and region
ct_sum = pd.crosstab(
df['product'],
df['region'],
values=df['amount'],
aggfunc='sum'
)
print(ct_sum)
print()
# Multiple aggregation functions
ct_multi_agg = pd.crosstab(
df['product'],
df['region'],
values=df['amount'],
aggfunc=['sum', 'mean', 'count']
)
print(ct_multi_agg)
Filtering and Conditional Analysis
Apply boolean indexing before crosstab to analyze specific subsets.
# High-value transactions only (>500)
high_value = df[df['amount'] > 500]
ct_high_value = pd.crosstab(high_value['product'], high_value['region'])
print("High-value transactions:")
print(ct_high_value)
print()
# Compare with low-value transactions
low_value = df[df['amount'] <= 500]
ct_low_value = pd.crosstab(low_value['product'], low_value['region'])
print("Low-value transactions:")
print(ct_low_value)
Practical Pattern: A/B Test Analysis
Cross tabulation excels at analyzing categorical outcomes in A/B tests.
# Simulate A/B test data
ab_data = {
'user_id': range(1, 201),
'variant': np.random.choice(['A', 'B'], 200),
'converted': np.random.choice([True, False], 200, p=[0.15, 0.85]),
'device': np.random.choice(['Mobile', 'Desktop', 'Tablet'], 200)
}
ab_df = pd.DataFrame(ab_data)
# Conversion rate by variant
ct_conversion = pd.crosstab(
ab_df['variant'],
ab_df['converted'],
normalize='index'
)
print("Conversion rates by variant:")
print(ct_conversion)
print()
# Conversion by variant and device
ct_device = pd.crosstab(
[ab_df['variant'], ab_df['device']],
ab_df['converted'],
normalize='index'
)
print("Conversion rates by variant and device:")
print(ct_device)
Crosstab vs Pivot Table Decision Matrix
Use crosstab() when:
- Counting occurrences of categorical combinations
- Variables exist as separate Series or arrays (not necessarily in a DataFrame)
- You need quick frequency distributions without pre-aggregated values
Use pivot_table() when:
- Aggregating existing numeric columns
- All data exists within a single DataFrame
- You need more complex hierarchical indexing with multiple aggregation functions
# Same result, different approaches
# Crosstab approach
ct_approach = pd.crosstab(df['product'], df['region'])
# Pivot table approach (requires counting column)
df['count'] = 1
pivot_approach = df.pivot_table(
values='count',
index='product',
columns='region',
aggfunc='sum',
fill_value=0
)
print("Results are identical:", ct_approach.equals(pivot_approach))
Performance Considerations
For large datasets, crosstab performance depends on cardinality of categorical variables.
# Large dataset example
large_data = {
'category_a': np.random.choice(['X', 'Y', 'Z'], 100000),
'category_b': np.random.choice(range(50), 100000)
}
large_df = pd.DataFrame(large_data)
# Time comparison
import time
start = time.time()
ct_large = pd.crosstab(large_df['category_a'], large_df['category_b'])
crosstab_time = time.time() - start
start = time.time()
grouped = large_df.groupby(['category_a', 'category_b']).size().unstack(fill_value=0)
groupby_time = time.time() - start
print(f"Crosstab: {crosstab_time:.4f}s")
print(f"Groupby+unstack: {groupby_time:.4f}s")
Cross tabulation transforms categorical data into actionable insights. Master normalization for percentage-based analysis, use margins for context, and leverage multi-dimensional crosstabs for complex relationships. The choice between crosstab and pivot_table hinges on whether you’re counting occurrences or aggregating values—understanding this distinction prevents unnecessary data manipulation.