How to Create a Contingency Table in Python
Key Insights
- Pandas’ pd.crosstab() is the fastest way to create contingency tables from raw categorical data, while pivot_table() offers more flexibility when you need custom aggregations
- Always include margins for totals and use the normalize parameter to convert raw counts to proportions; this makes patterns immediately visible and comparisons meaningful
- Contingency tables are the prerequisite for chi-square tests; scipy’s chi2_contingency() takes your crosstab output directly to determine if categorical variables are statistically independent
What Are Contingency Tables and When Do You Need Them?
A contingency table (also called a cross-tabulation or crosstab) displays the frequency distribution of two or more categorical variables in a matrix format. Each cell shows how many observations fall into a specific combination of categories.
You’ll reach for contingency tables when you need to:
- Explore relationships between categorical variables before modeling
- Prepare data for chi-square independence tests
- Summarize survey responses across demographic groups
- Identify patterns in categorical data that summary statistics can’t reveal
If you’re asking “does product preference vary by age group?” or “is there a relationship between region and customer churn?”, a contingency table is your starting point.
Creating Contingency Tables with Pandas
The pd.crosstab() function is purpose-built for contingency tables. It takes two Series (or array-like objects) and returns a frequency table showing counts for each combination.
Let’s work with a realistic dataset—customer survey responses:
import pandas as pd
import numpy as np
# Sample survey data
np.random.seed(42)
n = 500
data = pd.DataFrame({
'customer_id': range(1, n + 1),
'age_group': np.random.choice(['18-25', '26-35', '36-45', '46-55', '56+'], n),
'product_preference': np.random.choice(['Basic', 'Standard', 'Premium'], n,
p=[0.3, 0.45, 0.25]),
'region': np.random.choice(['North', 'South', 'East', 'West'], n),
'satisfaction': np.random.choice(['Low', 'Medium', 'High'], n)
})
# Basic contingency table
contingency = pd.crosstab(data['age_group'], data['product_preference'])
print(contingency)
Output:
product_preference Basic Premium Standard
age_group
18-25 28 22 45
26-35 33 27 46
36-45 30 27 51
46-55 28 25 42
56+ 32 24 40
This immediately shows you the distribution of product preferences across age groups. The syntax is straightforward: the first argument becomes the row index, the second becomes the column headers.
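Note that crosstab() does not require a DataFrame at all; any pair of equal-length array-likes works. A minimal sketch with hypothetical list data (the color/size values here are toy examples, not from the survey dataset):

```python
import pandas as pd

# crosstab() accepts plain lists (or arrays) as inputs.
# These toy values are purely for illustration.
colors = ['red', 'red', 'blue', 'blue', 'red']
sizes = ['S', 'M', 'S', 'S', 'M']

table = pd.crosstab(colors, sizes)  # first arg -> rows, second -> columns
print(table)
```

Because the inputs have no names, pandas labels the axes row_0 and col_0; passing named Series (or the rownames/colnames parameters) gives you readable labels instead.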
For more descriptive output, use the rownames and colnames parameters:
contingency = pd.crosstab(
data['age_group'],
data['product_preference'],
rownames=['Age Group'],
colnames=['Product Tier']
)
Using pivot_table() as an Alternative
While crosstab() works directly with Series, pivot_table() operates on DataFrames and shines when you need custom aggregations or when your data structure suits it better.
For a simple frequency table, use aggfunc='count' on any column:
# Frequency table using pivot_table
freq_table = data.pivot_table(
index='age_group',
columns='product_preference',
values='customer_id',
aggfunc='count',
fill_value=0
)
print(freq_table)
The output matches crosstab(), but pivot_table() becomes essential when you need aggregations beyond counting:
# If you had a numeric column like 'purchase_amount'
# you could calculate means per cell
data['purchase_amount'] = np.random.uniform(50, 500, n)
avg_purchase = data.pivot_table(
index='age_group',
columns='product_preference',
values='purchase_amount',
aggfunc='mean'
)
print(avg_purchase.round(2))
When to use which:
- crosstab(): Quick frequency tables from raw categorical data
- pivot_table(): When you need mean, sum, or custom aggregations, or when working within a larger DataFrame-centric workflow
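One nuance worth knowing: crosstab() itself also accepts values and aggfunc parameters, so a simple aggregation doesn't force a switch to pivot_table(). A minimal sketch with hypothetical toy data (the group/tier/amount columns are invented for illustration):

```python
import pandas as pd

# Toy data, invented for illustration
df = pd.DataFrame({
    'group':  ['A', 'A', 'B', 'B'],
    'tier':   ['X', 'Y', 'X', 'Y'],
    'amount': [10.0, 20.0, 30.0, 40.0],
})

# crosstab() can aggregate too: pass values= and aggfunc=
mean_table = pd.crosstab(df['group'], df['tier'],
                         values=df['amount'], aggfunc='mean')
print(mean_table)
```

For anything beyond a single aggregation per cell, pivot_table() remains the more flexible tool, but for one-off means or sums the choice is largely stylistic.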
Adding Margins and Normalizing
Raw counts are useful, but margins (totals) and proportions often tell a clearer story.
Adding Row and Column Totals
contingency_with_totals = pd.crosstab(
data['age_group'],
data['product_preference'],
margins=True,
margins_name='Total'
)
print(contingency_with_totals)
Output:
product_preference Basic Premium Standard Total
age_group
18-25 28 22 45 95
26-35 33 27 46 106
36-45 30 27 51 108
46-55 28 25 42 95
56+ 32 24 40 96
Total 151 125 224 500
Converting to Proportions
The normalize parameter converts counts to proportions. It accepts three values:
# Normalize over all values (proportions sum to 1.0)
prop_all = pd.crosstab(
data['age_group'],
data['product_preference'],
normalize='all'
)
print("Proportion of total:")
print(prop_all.round(3))
# Normalize over rows (each row sums to 1.0)
prop_rows = pd.crosstab(
data['age_group'],
data['product_preference'],
normalize='index'
)
print("\nProportion within each age group:")
print(prop_rows.round(3))
# Normalize over columns (each column sums to 1.0)
prop_cols = pd.crosstab(
data['age_group'],
data['product_preference'],
normalize='columns'
)
print("\nProportion within each product tier:")
print(prop_cols.round(3))
Practical guidance: Use normalize='index' when your question is “given age group X, what’s the distribution of preferences?” Use normalize='columns' when asking “given product tier Y, what’s the age distribution?”
For percentages, multiply by 100:
pct_table = pd.crosstab(
data['age_group'],
data['product_preference'],
normalize='index'
) * 100
print(pct_table.round(1))
Multi-Way Contingency Tables
Real analysis often involves more than two variables. Pandas handles this by accepting lists for rows and/or columns.
# Three-way contingency table
multi_way = pd.crosstab(
[data['age_group'], data['region']], # Multiple row variables
data['product_preference'],
margins=True
)
print(multi_way)
This creates a hierarchical row index. For even more dimensions:
# Four variables: two in rows, two in columns
four_way = pd.crosstab(
[data['age_group'], data['region']],
[data['product_preference'], data['satisfaction']]
)
print(four_way)
Multi-way tables get unwieldy fast. For exploration, they’re valuable. For presentation, consider filtering to specific categories or creating separate two-way tables.
# Filter to specific subset for clarity
subset = data[data['region'].isin(['North', 'South'])]
focused_table = pd.crosstab(
[subset['age_group'], subset['region']],
subset['product_preference'],
normalize='index'
) * 100
print(focused_table.round(1))
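Once you have a hierarchical row index, you don't need to rebuild the table to inspect one slice; the DataFrame .xs() method pulls out a single label from a named level. A sketch using the same kind of simulated survey data (regenerated here so the snippet stands alone; the exact counts will differ from the tables above because the columns are simulated independently):

```python
import numpy as np
import pandas as pd

# Regenerate simulated survey data so this snippet is self-contained
np.random.seed(42)
n = 500
data = pd.DataFrame({
    'age_group': np.random.choice(['18-25', '26-35', '36-45', '46-55', '56+'], n),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n),
    'product_preference': np.random.choice(['Basic', 'Standard', 'Premium'], n,
                                           p=[0.3, 0.45, 0.25]),
})

multi_way = pd.crosstab([data['age_group'], data['region']],
                        data['product_preference'])

# .xs() selects one label from a named level of the hierarchical index,
# returning an ordinary two-way table for that region only
north_only = multi_way.xs('North', level='region')
print(north_only)
```

The same method works on a hierarchical column index (as in the four-way example) by adding axis=1.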
Statistical Analysis with Contingency Tables
Contingency tables set up the chi-square test of independence, which answers: “Are these two categorical variables independent, or is there a statistically significant relationship?”
from scipy import stats
# Create contingency table (without margins for chi-square)
observed = pd.crosstab(data['age_group'], data['product_preference'])
# Perform chi-square test
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"\nExpected frequencies:")
print(pd.DataFrame(
expected,
index=observed.index,
columns=observed.columns
).round(2))
Interpreting results:
- If p-value < 0.05 (typical threshold), reject the null hypothesis—the variables are likely not independent
- The expected frequencies show what counts you’d see if variables were independent
- Compare observed vs. expected to identify which cells drive the relationship
# Calculate residuals to see where differences are largest
residuals = (observed - expected) / np.sqrt(expected)
print("\nStandardized residuals (>2 or <-2 are notable):")
print(residuals.round(2))
Cells with standardized residuals beyond ±2 indicate combinations that occur more or less frequently than expected under independence.
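A significant chi-square tells you an association exists, but not how strong it is. Cramér's V, computed directly from the chi-square statistic, gives a rough effect size between 0 (no association) and 1 (perfect association). A sketch using a small hypothetical 2×2 table, with the common uncorrected formula (note that chi2_contingency applies Yates' continuity correction to 2×2 tables by default, which slightly lowers the statistic):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical 2x2 table, invented for illustration
observed = pd.DataFrame({'Basic': [30, 10], 'Premium': [10, 30]},
                        index=['18-25', '56+'])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

# Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1)))
n_obs = observed.to_numpy().sum()
min_dim = min(observed.shape) - 1
cramers_v = np.sqrt(chi2 / (n_obs * min_dim))
print(f"Cramér's V: {cramers_v:.3f}")
```

Recent scipy versions also expose this as scipy.stats.contingency.association with method="cramer", if you prefer a built-in.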
Visualization
Numbers in tables are precise but patterns emerge faster with visualization.
Heatmaps for Contingency Tables
import seaborn as sns
import matplotlib.pyplot as plt
# Create contingency table with percentages
pct_table = pd.crosstab(
data['age_group'],
data['product_preference'],
normalize='index'
) * 100
# Heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(
pct_table,
annot=True,
fmt='.1f',
cmap='YlOrRd',
cbar_kws={'label': 'Percentage'}
)
plt.title('Product Preference by Age Group (%)')
plt.xlabel('Product Tier')
plt.ylabel('Age Group')
plt.tight_layout()
plt.savefig('contingency_heatmap.png', dpi=150)
plt.show()
Stacked Bar Charts
For comparing distributions across groups, stacked bars work well:
# Stacked bar chart
pct_table.plot(
kind='bar',
stacked=True,
figsize=(10, 6),
colormap='viridis'
)
plt.title('Product Preference Distribution by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Percentage')
plt.legend(title='Product Tier', bbox_to_anchor=(1.02, 1))
plt.tight_layout()
plt.savefig('contingency_stacked_bar.png', dpi=150)
plt.show()
Grouped Bar Charts
When you want to compare absolute frequencies:
contingency = pd.crosstab(data['age_group'], data['product_preference'])
contingency.plot(
kind='bar',
figsize=(10, 6),
width=0.8
)
plt.title('Product Preference Counts by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Count')
plt.legend(title='Product Tier')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('contingency_grouped_bar.png', dpi=150)
plt.show()
Practical Recommendations
After building hundreds of contingency tables, here’s what I’ve learned:
- Always check margins first. Unbalanced groups (one category with 90% of observations) will skew your interpretation.
- Use proportions for comparison, counts for context. Show both when presenting to stakeholders.
- Don’t chi-square everything. The test tells you if a relationship exists, not if it’s meaningful. A p-value of 0.001 with a 2% difference in proportions may be statistically significant but practically irrelevant.
- Order categories intentionally. Age groups should be chronological, satisfaction levels should be ordinal. Random ordering obscures patterns.
- Filter before creating multi-way tables. A 5×4×3×3 table has 180 cells. Nobody can interpret that. Ask a specific question and filter accordingly.
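On ordering categories, the cleanest approach is to declare the column as an ordered Categorical before building the table; crosstab then respects the declared order instead of sorting alphabetically. A sketch with hypothetical data:

```python
import pandas as pd

# Toy data; alphabetical sorting would yield High, Low, Medium
df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'South'],
    'satisfaction': ['Low', 'High', 'Medium', 'Medium', 'High'],
})

# Declare the logical order once; crosstab respects it from then on
order = ['Low', 'Medium', 'High']
df['satisfaction'] = pd.Categorical(df['satisfaction'],
                                    categories=order, ordered=True)

table = pd.crosstab(df['region'], df['satisfaction'])
print(table)  # columns appear as Low, Medium, High
```

For a table you've already built, reindexing the columns (table[order] or table.reindex(columns=order)) achieves the same presentation fix.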
Contingency tables are foundational. Master them, and categorical data analysis becomes straightforward.