How to Create a Crosstab in Pandas
A crosstab—short for cross-tabulation—is a table that displays the frequency distribution of variables. Think of it as a pivot table specifically designed for categorical data. When you need to...
Key Insights
- Crosstabs are the fastest way to analyze relationships between categorical variables, revealing patterns that summary statistics miss
- The
normalizeparameter transforms raw counts into percentages, making comparisons across groups with different sizes meaningful - Combining
valuesandaggfuncparameters turns crosstabs from simple frequency tables into powerful aggregation tools for numeric data
Introduction to Crosstabs
A crosstab—short for cross-tabulation—is a table that displays the frequency distribution of variables. Think of it as a pivot table specifically designed for categorical data. When you need to answer questions like “How many employees in each department are full-time versus part-time?” or “What’s the breakdown of customer segments by region?”, a crosstab gives you the answer in seconds.
Pandas provides pd.crosstab() as a dedicated function for this task. While you can achieve similar results with pivot_table() or groupby(), crosstabs are more concise for frequency analysis and offer built-in normalization that other methods lack.
The key distinction: use crosstabs when you’re analyzing the relationship between two or more categorical variables. If you’re aggregating numeric values, pivot_table() might be more appropriate—though crosstabs can handle that too, as we’ll see.
Basic Crosstab Syntax
The pd.crosstab() function requires two arguments: index (row labels) and columns (column labels). Both should be array-like objects, typically DataFrame columns.
import pandas as pd
import numpy as np
# Sample employee data
data = {
'department': ['Engineering', 'Sales', 'Engineering', 'HR', 'Sales',
'Engineering', 'HR', 'Sales', 'Engineering', 'HR'],
'status': ['Full-time', 'Full-time', 'Part-time', 'Full-time', 'Part-time',
'Full-time', 'Part-time', 'Full-time', 'Full-time', 'Full-time'],
'level': ['Senior', 'Junior', 'Junior', 'Senior', 'Junior',
'Senior', 'Senior', 'Senior', 'Junior', 'Junior'],
'salary': [95000, 55000, 45000, 72000, 48000,
105000, 38000, 78000, 75000, 52000]
}
df = pd.DataFrame(data)
# Basic crosstab: department vs employment status
ct = pd.crosstab(df['department'], df['status'])
print(ct)
Output:
status Full-time Part-time
department
Engineering 3 1
HR 2 1
Sales 2 1
This immediately tells you that Engineering has the most full-time employees and that part-time workers are evenly distributed across departments. No loops, no manual counting—just one function call.
Adding Margins and Totals
Raw counts are useful, but you often need totals for context. The margins parameter adds row and column sums, while margins_name lets you customize the label.
# Crosstab with margins (totals)
ct_margins = pd.crosstab(
df['department'],
df['status'],
margins=True,
margins_name='Total'
)
print(ct_margins)
Output:
status Full-time Part-time Total
department
Engineering 3 1 4
HR 2 1 3
Sales 2 1 3
Total 7 3 10
Now you can see that 70% of employees are full-time (7 out of 10) and that Engineering is the largest department. The margins provide essential context that raw crosstabs lack.
A practical tip: always include margins when presenting crosstabs to stakeholders. The totals anchor the data and prevent misinterpretation.
Normalizing Values (Percentages)
Counts can be misleading when group sizes differ significantly. If Engineering has 100 employees and HR has 10, comparing raw counts is meaningless. The normalize parameter converts counts to proportions.
Three normalization options exist:
'all': percentages of the grand total'index': percentages within each row'columns': percentages within each column
# Normalize across all values (percentage of grand total)
ct_all = pd.crosstab(
df['department'],
df['status'],
normalize='all'
)
print("Normalized by all:")
print(ct_all.round(2))
# Normalize by index (row percentages)
ct_index = pd.crosstab(
df['department'],
df['status'],
normalize='index'
)
print("\nNormalized by index (row):")
print(ct_index.round(2))
# Normalize by columns (column percentages)
ct_columns = pd.crosstab(
df['department'],
df['status'],
normalize='columns'
)
print("\nNormalized by columns:")
print(ct_columns.round(2))
Output:
Normalized by all:
status Full-time Part-time
department
Engineering 0.3 0.1
HR 0.2 0.1
Sales 0.2 0.1
Normalized by index (row):
status Full-time Part-time
department
Engineering 0.75 0.25
HR 0.67 0.33
Sales 0.67 0.33
Normalized by columns:
status Full-time Part-time
department
Engineering 0.43 0.33
HR 0.29 0.33
Sales 0.29 0.33
Row normalization (normalize='index') is typically most useful—it answers “Within each department, what percentage are full-time?” This lets you compare departments fairly regardless of size.
Applying Aggregation Functions
Crosstabs aren’t limited to counting. The values and aggfunc parameters let you aggregate numeric data, transforming the crosstab into something closer to a pivot table.
# Average salary by department and status
ct_salary = pd.crosstab(
df['department'],
df['status'],
values=df['salary'],
aggfunc='mean'
)
print("Average salary:")
print(ct_salary.round(0))
# Multiple aggregations using a custom function
ct_multi = pd.crosstab(
df['department'],
df['status'],
values=df['salary'],
aggfunc=['mean', 'count', 'sum']
)
print("\nMultiple aggregations:")
print(ct_multi.round(0))
Output:
Average salary:
status Full-time Part-time
department
Engineering 91667.0 45000.0
HR 62000.0 38000.0
Sales 66500.0 48000.0
Multiple aggregations:
mean count sum
status Full-time Part-time Full-time Part-time Full-time Part-time
department
Engineering 91667.0 45000.0 3.0 1.0 275000.0 45000.0
HR 62000.0 38000.0 2.0 1.0 124000.0 38000.0
Sales 66500.0 48000.0 2.0 1.0 133000.0 48000.0
This reveals that full-time Engineering employees earn significantly more on average than other groups. You can use any aggregation function that works with groupby(): 'sum', 'min', 'max', 'std', or even custom functions.
Multi-Level Crosstabs
Real analysis often requires more than two dimensions. Pass lists to index or columns to create hierarchical crosstabs.
# Multi-level index: department and level vs status
ct_multi_index = pd.crosstab(
[df['department'], df['level']],
df['status'],
margins=True
)
print("Multi-level index:")
print(ct_multi_index)
# Multi-level columns: status vs department and level
ct_multi_cols = pd.crosstab(
df['status'],
[df['department'], df['level']]
)
print("\nMulti-level columns:")
print(ct_multi_cols)
Output:
Multi-level index:
status Full-time Part-time All
department level
Engineering Junior 1 1 2
Senior 2 0 2
HR Junior 1 1 2
Senior 1 0 1
Sales Junior 0 1 1
Senior 2 0 2
All 7 3 10
Multi-level columns:
department Engineering HR Sales
level Junior Senior Junior Senior Junior Senior
status
Full-time 1 2 1 1 0 2
Part-time 1 0 1 0 1 0
Multi-level crosstabs answer complex questions: “Among full-time senior employees, which department has the most?” The hierarchical structure keeps related data together while showing all combinations.
Styling and Exporting
Raw numbers work for analysis, but presentations require polish. Pandas styling transforms crosstabs into visual tools.
# Create a crosstab for styling
ct_style = pd.crosstab(
df['department'],
df['status'],
values=df['salary'],
aggfunc='mean'
).round(0)
# Apply heatmap styling
styled = ct_style.style.background_gradient(
cmap='YlOrRd',
axis=None
).format('${:,.0f}')
# Display in Jupyter or export
styled
For export, you have several options:
# Export to Excel with formatting preserved
ct_margins.to_excel('crosstab_report.xlsx', sheet_name='Department Analysis')
# Export to HTML for web embedding
html_output = ct_style.style.background_gradient(cmap='Blues').to_html()
with open('crosstab.html', 'w') as f:
f.write(html_output)
# Export to clipboard for pasting into other applications
ct_margins.to_clipboard()
The background_gradient() method creates heatmaps that make patterns immediately visible. High values get darker colors, drawing attention to outliers and trends that numbers alone might hide.
Practical Recommendations
After years of using crosstabs in production analysis, here’s what I’ve learned:
Start with counts, then normalize. Raw frequencies establish baseline understanding. Normalization comes second to answer comparative questions.
Always include margins for stakeholder presentations. Without totals, viewers lack context and may draw incorrect conclusions.
Use normalize='index' for most comparisons. Row percentages answer the most common business questions: “What percentage of X falls into category Y?”
Combine with .fillna(0) for sparse data. Missing combinations appear as NaN by default. Fill them explicitly when zeros are meaningful.
# Handle missing combinations
ct_filled = pd.crosstab(
df['department'],
df['status']
).fillna(0).astype(int)
Crosstabs are deceptively simple. Master them, and you’ll answer categorical questions faster than colleagues reaching for pivot tables or writing complex groupby chains.