How to Create a Cross-Tabulation in Python
Cross-tabulation, also called a contingency table, is a method for summarizing the relationship between two or more categorical variables. It displays the frequency distribution of variables in a...
Key Insights
- Cross-tabulation with
pd.crosstab()is the fastest way to analyze relationships between categorical variables, revealing patterns that summary statistics alone would miss. - The
normalizeparameter transforms raw counts into percentages, which is essential for comparing groups of different sizes—always normalize when your categories have unequal sample sizes. - Combining cross-tabulation with heatmap visualization creates an immediately interpretable view of your data that stakeholders can understand without statistical training.
Introduction to Cross-Tabulation
Cross-tabulation, also called a contingency table, is a method for summarizing the relationship between two or more categorical variables. It displays the frequency distribution of variables in a matrix format, showing how often each combination of categories occurs.
If you’ve ever asked questions like “Do customers in different regions prefer different products?” or “Is there a relationship between education level and voting preference?”, cross-tabulation is your tool. It’s foundational in survey analysis, A/B testing, market research, and any domain where you need to understand how categorical variables interact.
Unlike correlation coefficients that work with continuous data, cross-tabulation handles categorical data natively. It gives you actual counts and proportions, making it easy to spot patterns, identify segments, and prepare data for chi-square tests of independence.
Setting Up Your Environment
You need pandas for the heavy lifting. NumPy is useful for data generation, and seaborn handles visualization. Here’s the setup:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Set random seed for reproducibility
np.random.seed(42)
# Create sample survey data
n_respondents = 500
data = {
'region': np.random.choice(['North', 'South', 'East', 'West'], n_respondents),
'age_group': np.random.choice(['18-25', '26-35', '36-50', '51+'], n_respondents),
'product_preference': np.random.choice(['Basic', 'Premium', 'Enterprise'], n_respondents),
'satisfaction': np.random.choice(['Low', 'Medium', 'High'], n_respondents),
'purchase_amount': np.random.randint(50, 500, n_respondents),
'repeat_customer': np.random.choice(['Yes', 'No'], n_respondents)
}
df = pd.DataFrame(data)
print(df.head(10))
region age_group product_preference satisfaction purchase_amount repeat_customer
0 West 36-50 Basic Medium 168 Yes
1 South 26-35 Enterprise High 318 No
2 East 36-50 Premium Low 188 Yes
3 West 51+ Basic Medium 438 No
4 East 18-25 Enterprise Medium 306 Yes
5 North 26-35 Premium High 128 No
6 West 51+ Enterprise Low 267 Yes
7 South 36-50 Basic Medium 497 No
8 East 18-25 Premium High 145 Yes
9 North 26-35 Basic Low 392 No
This synthetic dataset simulates survey responses with region, age demographics, product preferences, satisfaction levels, and purchase behavior. Real-world data would come from your database or CSV files, but the cross-tabulation techniques remain identical.
Basic Cross-Tabulation with pandas.crosstab()
The pd.crosstab() function is purpose-built for contingency tables. Its syntax is straightforward: pass the row variable as the first argument and the column variable as the second.
# Basic cross-tabulation: Region vs Product Preference
basic_crosstab = pd.crosstab(df['region'], df['product_preference'])
print(basic_crosstab)
product_preference Basic Enterprise Premium
region
East 43 42 45
North 45 38 36
South 44 42 43
West 38 44 40
This table shows the count of respondents for each combination of region and product preference. Reading across the East row, you see 43 people chose Basic, 42 chose Enterprise, and 45 chose Premium.
The function automatically handles the grouping and counting. You don’t need to write groupby operations or pivot tables manually. It also sorts the index and columns alphabetically by default, which you can override with the dropna and reindexing options.
# Specify custom ordering for categorical display
region_order = ['North', 'South', 'East', 'West']
product_order = ['Basic', 'Premium', 'Enterprise']
ordered_crosstab = pd.crosstab(
df['region'],
df['product_preference']
).reindex(index=region_order, columns=product_order)
print(ordered_crosstab)
product_preference Basic Premium Enterprise
region
North 45 36 38
South 44 43 42
East 43 45 42
West 38 40 44
Adding Margins and Normalizing Data
Raw counts are useful, but they don’t tell the whole story. Adding margins gives you row and column totals. Normalizing converts counts to proportions, essential when comparing groups of different sizes.
# Add margins (row and column totals)
with_margins = pd.crosstab(
df['region'],
df['product_preference'],
margins=True,
margins_name='Total'
)
print(with_margins)
product_preference Basic Enterprise Premium Total
region
East 43 42 45 130
North 45 38 36 119
South 44 42 43 129
West 38 44 40 122
Total 170 166 164 500
Now you can see that the East region has 130 total respondents, and 170 people overall chose the Basic product. The margins provide context for interpreting individual cells.
Normalization is where cross-tabulation becomes analytically powerful:
# Normalize by rows (each row sums to 1)
row_normalized = pd.crosstab(
df['region'],
df['product_preference'],
normalize='index'
).round(3)
print("Row-normalized (proportions within each region):")
print(row_normalized)
# Normalize by columns (each column sums to 1)
col_normalized = pd.crosstab(
df['region'],
df['product_preference'],
normalize='columns'
).round(3)
print("\nColumn-normalized (proportions within each product):")
print(col_normalized)
# Normalize by all (entire table sums to 1)
all_normalized = pd.crosstab(
df['region'],
df['product_preference'],
normalize='all'
).round(3)
print("\nTable-normalized (proportions of total):")
print(all_normalized)
Row-normalized (proportions within each region):
product_preference Basic Enterprise Premium
region
East 0.331 0.323 0.346
North 0.378 0.319 0.303
South 0.341 0.326 0.333
West 0.311 0.361 0.328
Column-normalized (proportions within each product):
product_preference Basic Enterprise Premium
region
East 0.253 0.253 0.274
North 0.265 0.229 0.220
South 0.259 0.253 0.262
West 0.224 0.265 0.244
Table-normalized (proportions of total):
product_preference Basic Enterprise Premium
region
East 0.086 0.084 0.090
North 0.090 0.076 0.072
South 0.088 0.084 0.086
West 0.076 0.088 0.080
Use row normalization (normalize='index') when you want to compare product preferences across regions. Use column normalization (normalize='columns') when you want to see which regions dominate each product category.
Multi-Variable Cross-Tabulation
Real analysis often involves more than two variables. You can pass lists to create hierarchical cross-tabulations:
# Three-variable cross-tabulation
multi_crosstab = pd.crosstab(
[df['region'], df['age_group']],
df['product_preference']
)
print(multi_crosstab)
product_preference Basic Enterprise Premium
region age_group
East 18-25 8 13 12
26-35 13 13 10
36-50 13 9 11
51+ 9 7 12
North 18-25 15 11 10
26-35 11 11 10
36-50 10 9 10
51+ 9 7 6
South 18-25 12 11 8
26-35 10 10 12
36-50 13 10 14
51+ 9 11 9
West 18-25 9 12 13
26-35 8 11 9
36-50 13 11 10
51+ 8 10 8
This creates a hierarchical index with region and age group combinations. You can also add multiple columns:
# Multiple variables on both axes
complex_crosstab = pd.crosstab(
df['region'],
[df['product_preference'], df['repeat_customer']]
)
print(complex_crosstab)
product_preference Basic Enterprise Premium
repeat_customer No Yes No Yes No Yes
region
East 22 21 17 25 21 24
North 22 23 21 17 16 20
South 24 20 19 23 17 26
West 18 20 23 21 21 19
This shows product preference broken down by repeat customer status for each region. The hierarchical columns let you see patterns like whether repeat customers gravitate toward Premium products.
Applying Aggregation Functions
Cross-tabulation isn’t limited to counting. The aggfunc and values parameters let you compute means, sums, or custom aggregations:
# Average purchase amount by region and product preference
mean_purchase = pd.crosstab(
df['region'],
df['product_preference'],
values=df['purchase_amount'],
aggfunc='mean'
).round(2)
print("Average purchase amount:")
print(mean_purchase)
Average purchase amount:
product_preference Basic Enterprise Premium
region
East 270.16 286.55 268.40
North 253.62 290.29 282.53
South 276.82 256.71 257.63
West 280.47 270.27 273.35
Now instead of counts, you see the average purchase amount for each segment. This is invaluable for understanding which customer segments drive revenue.
# Multiple aggregations using a custom function
def purchase_stats(x):
return pd.Series({
'mean': x.mean(),
'median': x.median(),
'count': len(x)
})
# For multiple stats, use pivot_table instead
stats_table = df.pivot_table(
values='purchase_amount',
index='region',
columns='product_preference',
aggfunc=['mean', 'median', 'count']
).round(2)
print(stats_table)
mean median count
product_preference Basic Enterprise Premium Basic Enterprise Premium Basic Enterprise Premium
region
East 270.16 286.55 268.40 256.0 311.0 284.0 43 42 45
North 253.62 290.29 282.53 234.0 310.5 303.0 45 38 36
South 276.82 256.71 257.63 285.0 251.5 254.0 44 42 43
West 280.47 270.27 273.35 296.0 258.5 271.5 38 44 40
Visualizing Cross-Tabulation Results
Numbers in tables are precise but hard to scan. Heatmaps make patterns jump out immediately:
# Create a heatmap of the basic cross-tabulation
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Raw counts heatmap
counts = pd.crosstab(df['region'], df['product_preference'])
sns.heatmap(counts, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_title('Product Preference by Region (Counts)')
# Normalized heatmap
normalized = pd.crosstab(
df['region'],
df['product_preference'],
normalize='index'
)
sns.heatmap(normalized, annot=True, fmt='.1%', cmap='YlOrRd', ax=axes[1])
axes[1].set_title('Product Preference by Region (Row %)')
plt.tight_layout()
plt.savefig('crosstab_heatmap.png', dpi=150)
plt.show()
The annot=True parameter displays values in each cell. Use fmt='d' for integers and fmt='.1%' for percentages. The colormap choice matters—sequential colormaps like ‘Blues’ work well for counts, while diverging colormaps highlight deviations from a midpoint.
For quick styling without matplotlib, pandas has built-in options:
# Styled pandas output for notebooks
styled = pd.crosstab(
df['region'],
df['product_preference'],
normalize='index'
).style.format('{:.1%}').background_gradient(cmap='Blues')
styled
This renders directly in Jupyter notebooks with color gradients applied to cells, making high and low values immediately visible.
Cross-tabulation is a fundamental technique that belongs in every data analyst’s toolkit. Master pd.crosstab() and you’ll find yourself reaching for it constantly—it’s the fastest path from raw categorical data to actionable insights.