How to Create a Contingency Table in Python
Key Insights
- Pandas’ pd.crosstab() is the fastest way to create contingency tables from raw categorical data, while pivot_table() offers more flexibility when you need custom aggregations
- Always include margins for totals and use the normalize parameter to convert raw counts to proportions; this makes patterns immediately visible and comparisons meaningful
- Contingency tables are the prerequisite for chi-square tests; scipy’s chi2_contingency() takes your crosstab output directly to determine if categorical variables are statistically independent
What Are Contingency Tables and When Do You Need Them?
A contingency table (also called a cross-tabulation or crosstab) displays the frequency distribution of two or more categorical variables in a matrix format. Each cell shows how many observations fall into a specific combination of categories.
You’ll reach for contingency tables when you need to:
- Explore relationships between categorical variables before modeling
- Prepare data for chi-square independence tests
- Summarize survey responses across demographic groups
- Identify patterns in categorical data that summary statistics can’t reveal
If you’re asking “does product preference vary by age group?” or “is there a relationship between region and customer churn?”, a contingency table is your starting point.
Creating Contingency Tables with Pandas
The pd.crosstab() function is purpose-built for contingency tables. It takes two Series (or array-like objects) and returns a frequency table showing counts for each combination.
Let’s work with a realistic dataset—customer survey responses:
import pandas as pd
import numpy as np
# Sample survey data
np.random.seed(42)
n = 500
data = pd.DataFrame({
'customer_id': range(1, n + 1),
'age_group': np.random.choice(['18-25', '26-35', '36-45', '46-55', '56+'], n),
'product_preference': np.random.choice(['Basic', 'Standard', 'Premium'], n,
p=[0.3, 0.45, 0.25]),
'region': np.random.choice(['North', 'South', 'East', 'West'], n),
'satisfaction': np.random.choice(['Low', 'Medium', 'High'], n)
})
# Basic contingency table
contingency = pd.crosstab(data['age_group'], data['product_preference'])
print(contingency)
Output:
product_preference Basic Premium Standard
age_group
18-25 28 22 45
26-35 33 27 46
36-45 30 27 51
46-55 28 25 42
56+ 32 24 40
This immediately shows you the distribution of product preferences across age groups. The syntax is straightforward: the first argument becomes the row index, the second becomes the column headers.
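Note that crosstab() does not require a DataFrame at all; any pair of equal-length array-likes works. A minimal sketch with hypothetical list data (the color/size values here are toy examples, not from the survey dataset):

```python
import pandas as pd

# crosstab() accepts plain lists (or arrays) as inputs.
# These toy values are purely for illustration.
colors = ['red', 'red', 'blue', 'blue', 'red']
sizes = ['S', 'M', 'S', 'S', 'M']

table = pd.crosstab(colors, sizes)  # first arg -> rows, second -> columns
print(table)
```

Because the inputs have no names, pandas labels the axes row_0 and col_0; passing named Series (or the rownames/colnames parameters) gives you readable labels instead.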
For more descriptive output, use the rownames and colnames parameters:
contingency = pd.crosstab(
data['age_group'],
data['product_preference'],
rownames=['Age Group'],
colnames=['Product Tier']
)
Using pivot_table() as an Alternative
While crosstab() works directly with Series, pivot_table() operates on DataFrames and shines when you need custom aggregations or when your data structure suits it better.
For a simple frequency table, use aggfunc='count' on any column:
# Frequency table using pivot_table
freq_table = data.pivot_table(
index='age_group',
columns='product_preference',
values='customer_id',
aggfunc='count',
fill_value=0
)
print(freq_table)
The output matches crosstab(), but pivot_table() becomes essential when you need aggregations beyond counting:
# If you had a numeric column like 'purchase_amount'
# you could calculate means per cell
data['purchase_amount'] = np.random.uniform(50, 500, n)
avg_purchase = data.pivot_table(
index='age_group',
columns='product_preference',
values='purchase_amount',
aggfunc='mean'
)
print(avg_purchase.round(2))
When to use which:
- crosstab(): Quick frequency tables from raw categorical data
- pivot_table(): When you need mean, sum, or custom aggregations, or when working within a larger DataFrame-centric workflow
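One nuance worth knowing: crosstab() itself also accepts values and aggfunc parameters, so a simple aggregation doesn't force a switch to pivot_table(). A minimal sketch with hypothetical toy data (the group/tier/amount columns are invented for illustration):

```python
import pandas as pd

# Toy data, invented for illustration
df = pd.DataFrame({
    'group':  ['A', 'A', 'B', 'B'],
    'tier':   ['X', 'Y', 'X', 'Y'],
    'amount': [10.0, 20.0, 30.0, 40.0],
})

# crosstab() can aggregate too: pass values= and aggfunc=
mean_table = pd.crosstab(df['group'], df['tier'],
                         values=df['amount'], aggfunc='mean')
print(mean_table)
```

For anything beyond a single aggregation per cell, pivot_table() remains the more flexible tool, but for one-off means or sums the choice is largely stylistic.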
Adding Margins and Normalizing
Raw counts are useful, but margins (totals) and proportions often tell a clearer story.
Adding Row and Column Totals
contingency_with_totals = pd.crosstab(
data['age_group'],
data['product_preference'],
margins=True,
margins_name='Total'
)
print(contingency_with_totals)
Output:
product_preference Basic Premium Standard Total
age_group
18-25 28 22 45 95
26-35 33 27 46 106
36-45 30 27 51 108
46-55 28 25 42 95
56+ 32 24 40 96
Total 151 125 224 500
Converting to Proportions
The normalize parameter converts counts to proportions. It accepts three values:
# Normalize over all values (proportions sum to 1.0)
prop_all = pd.crosstab(
data['age_group'],
data['product_preference'],
normalize='all'
)
print("Proportion of total:")
print(prop_all.round(3))
# Normalize over rows (each row sums to 1.0)
prop_rows = pd.crosstab(
data['age_group'],
data['product_preference'],
normalize='index'
)
print("\nProportion within each age group:")
print(prop_rows.round(3))
# Normalize over columns (each column sums to 1.0)
prop_cols = pd.crosstab(
data['age_group'],
data['product_preference'],
normalize='columns'
)
print("\nProportion within each product tier:")
print(prop_cols.round(3))
Practical guidance: Use normalize='index' when your question is “given age group X, what’s the distribution of preferences?” Use normalize='columns' when asking “given product tier Y, what’s the age distribution?”
For percentages, multiply by 100:
pct_table = pd.crosstab(
data['age_group'],
data['product_preference'],
normalize='index'
) * 100
print(pct_table.round(1))
Multi-Way Contingency Tables
Real analysis often involves more than two variables. Pandas handles this by accepting lists for rows and/or columns.
# Three-way contingency table
multi_way = pd.crosstab(
[data['age_group'], data['region']], # Multiple row variables
data['product_preference'],
margins=True
)
print(multi_way)
This creates a hierarchical row index. For even more dimensions:
# Four variables: two in rows, two in columns
four_way = pd.crosstab(
[data['age_group'], data['region']],
[data['product_preference'], data['satisfaction']]
)
print(four_way)
Multi-way tables get unwieldy fast. For exploration, they’re valuable. For presentation, consider filtering to specific categories or creating separate two-way tables.
# Filter to specific subset for clarity
subset = data[data['region'].isin(['North', 'South'])]
focused_table = pd.crosstab(
[subset['age_group'], subset['region']],
subset['product_preference'],
normalize='index'
) * 100
print(focused_table.round(1))
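Once you have a hierarchical row index, you don't need to rebuild the table to inspect one slice; the DataFrame .xs() method pulls out a single label from a named level. A sketch using the same kind of simulated survey data (regenerated here so the snippet stands alone; the exact counts will differ from the tables above because the columns are simulated independently):

```python
import numpy as np
import pandas as pd

# Regenerate simulated survey data so this snippet is self-contained
np.random.seed(42)
n = 500
data = pd.DataFrame({
    'age_group': np.random.choice(['18-25', '26-35', '36-45', '46-55', '56+'], n),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n),
    'product_preference': np.random.choice(['Basic', 'Standard', 'Premium'], n,
                                           p=[0.3, 0.45, 0.25]),
})

multi_way = pd.crosstab([data['age_group'], data['region']],
                        data['product_preference'])

# .xs() selects one label from a named level of the hierarchical index,
# returning an ordinary two-way table for that region only
north_only = multi_way.xs('North', level='region')
print(north_only)
```

The same method works on a hierarchical column index (as in the four-way example) by adding axis=1.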
Statistical Analysis with Contingency Tables
Contingency tables set up the chi-square test of independence, which answers: “Are these two categorical variables independent, or is there a statistically significant relationship?”
from scipy import stats
# Create contingency table (without margins for chi-square)
observed = pd.crosstab(data['age_group'], data['product_preference'])
# Perform chi-square test
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"\nExpected frequencies:")
print(pd.DataFrame(
expected,
index=observed.index,
columns=observed.columns
).round(2))
Interpreting results:
- If p-value < 0.05 (typical threshold), reject the null hypothesis—the variables are likely not independent
- The expected frequencies show what counts you’d see if variables were independent
- Compare observed vs. expected to identify which cells drive the relationship
# Calculate residuals to see where differences are largest
residuals = (observed - expected) / np.sqrt(expected)
print("\nStandardized residuals (>2 or <-2 are notable):")
print(residuals.round(2))
Cells with standardized residuals beyond ±2 indicate combinations that occur more or less frequently than expected under independence.
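A significant chi-square tells you an association exists, but not how strong it is. Cramér's V, computed directly from the chi-square statistic, gives a rough effect size between 0 (no association) and 1 (perfect association). A sketch using a small hypothetical 2×2 table, with the common uncorrected formula (note that chi2_contingency applies Yates' continuity correction to 2×2 tables by default, which slightly lowers the statistic):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical 2x2 table, invented for illustration
observed = pd.DataFrame({'Basic': [30, 10], 'Premium': [10, 30]},
                        index=['18-25', '56+'])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

# Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1)))
n_obs = observed.to_numpy().sum()
min_dim = min(observed.shape) - 1
cramers_v = np.sqrt(chi2 / (n_obs * min_dim))
print(f"Cramér's V: {cramers_v:.3f}")
```

Recent scipy versions also expose this as scipy.stats.contingency.association with method="cramer", if you prefer a built-in.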
Visualization
Numbers in tables are precise but patterns emerge faster with visualization.
Heatmaps for Contingency Tables
import seaborn as sns
import matplotlib.pyplot as plt
# Create contingency table with percentages
pct_table = pd.crosstab(
data['age_group'],
data['product_preference'],
normalize='index'
) * 100
# Heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(
pct_table,
annot=True,
fmt='.1f',
cmap='YlOrRd',
cbar_kws={'label': 'Percentage'}
)
plt.title('Product Preference by Age Group (%)')
plt.xlabel('Product Tier')
plt.ylabel('Age Group')
plt.tight_layout()
plt.savefig('contingency_heatmap.png', dpi=150)
plt.show()
Stacked Bar Charts
For comparing distributions across groups, stacked bars work well:
# Stacked bar chart
pct_table.plot(
kind='bar',
stacked=True,
figsize=(10, 6),
colormap='viridis'
)
plt.title('Product Preference Distribution by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Percentage')
plt.legend(title='Product Tier', bbox_to_anchor=(1.02, 1))
plt.tight_layout()
plt.savefig('contingency_stacked_bar.png', dpi=150)
plt.show()
Grouped Bar Charts
When you want to compare absolute frequencies:
contingency = pd.crosstab(data['age_group'], data['product_preference'])
contingency.plot(
kind='bar',
figsize=(10, 6),
width=0.8
)
plt.title('Product Preference Counts by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Count')
plt.legend(title='Product Tier')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('contingency_grouped_bar.png', dpi=150)
plt.show()
Practical Recommendations
After building hundreds of contingency tables, here’s what I’ve learned:
- Always check margins first. Unbalanced groups (one category with 90% of observations) will skew your interpretation.
- Use proportions for comparison, counts for context. Show both when presenting to stakeholders.
- Don’t chi-square everything. The test tells you if a relationship exists, not if it’s meaningful. A p-value of 0.001 with a 2% difference in proportions may be statistically significant but practically irrelevant.
- Order categories intentionally. Age groups should be chronological, satisfaction levels should be ordinal. Random ordering obscures patterns.
- Filter before creating multi-way tables. A 5×4×3×3 table has 180 cells. Nobody can interpret that. Ask a specific question and filter accordingly.
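On ordering categories, the cleanest approach is to declare the column as an ordered Categorical before building the table; crosstab then respects the declared order instead of sorting alphabetically. A sketch with hypothetical data:

```python
import pandas as pd

# Toy data; alphabetical sorting would yield High, Low, Medium
df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'South'],
    'satisfaction': ['Low', 'High', 'Medium', 'Medium', 'High'],
})

# Declare the logical order once; crosstab respects it from then on
order = ['Low', 'Medium', 'High']
df['satisfaction'] = pd.Categorical(df['satisfaction'],
                                    categories=order, ordered=True)

table = pd.crosstab(df['region'], df['satisfaction'])
print(table)  # columns appear as Low, Medium, High
```

For a table you've already built, reindexing the columns (table[order] or table.reindex(columns=order)) achieves the same presentation fix.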
Contingency tables are foundational. Master them, and categorical data analysis becomes straightforward.