How to Create a Cross-Tabulation in Python

Cross-tabulation, also called a contingency table, is a method for summarizing the relationship between two or more categorical variables. It displays the frequency distribution of variables in a...

Key Insights

  • Cross-tabulation with pd.crosstab() is the fastest way to analyze relationships between categorical variables, revealing patterns that summary statistics alone would miss.
  • The normalize parameter transforms raw counts into percentages, which is essential for comparing groups of different sizes—always normalize when your categories have unequal sample sizes.
  • Combining cross-tabulation with heatmap visualization creates an immediately interpretable view of your data that stakeholders can understand without statistical training.

Introduction to Cross-Tabulation

Cross-tabulation, also called a contingency table, is a method for summarizing the relationship between two or more categorical variables. It displays the frequency distribution of variables in a matrix format, showing how often each combination of categories occurs.

If you’ve ever asked questions like “Do customers in different regions prefer different products?” or “Is there a relationship between education level and voting preference?”, cross-tabulation is your tool. It’s foundational in survey analysis, A/B testing, market research, and any domain where you need to understand how categorical variables interact.

Unlike correlation coefficients that work with continuous data, cross-tabulation handles categorical data natively. It gives you actual counts and proportions, making it easy to spot patterns, identify segments, and prepare data for chi-square tests of independence.

Setting Up Your Environment

You need pandas for the heavy lifting. NumPy is useful for data generation, and seaborn handles visualization. Here’s the setup:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

# Create sample survey data
n_respondents = 500

data = {
    'region': np.random.choice(['North', 'South', 'East', 'West'], n_respondents),
    'age_group': np.random.choice(['18-25', '26-35', '36-50', '51+'], n_respondents),
    'product_preference': np.random.choice(['Basic', 'Premium', 'Enterprise'], n_respondents),
    'satisfaction': np.random.choice(['Low', 'Medium', 'High'], n_respondents),
    'purchase_amount': np.random.randint(50, 500, n_respondents),
    'repeat_customer': np.random.choice(['Yes', 'No'], n_respondents)
}

df = pd.DataFrame(data)
print(df.head(10))
   region age_group product_preference satisfaction  purchase_amount repeat_customer
0   West     36-50              Basic       Medium              168             Yes
1  South     26-35          Enterprise         High              318              No
2   East     36-50            Premium          Low              188             Yes
3   West       51+              Basic       Medium              438              No
4   East     18-25          Enterprise       Medium              306             Yes
5  North     26-35            Premium         High              128              No
6   West       51+          Enterprise          Low              267             Yes
7  South     36-50              Basic       Medium              497              No
8   East     18-25            Premium         High              145             Yes
9  North     26-35              Basic          Low              392              No

This synthetic dataset simulates survey responses with region, age demographics, product preferences, satisfaction levels, and purchase behavior. Real-world data would come from your database or CSV files, but the cross-tabulation techniques remain identical.

Basic Cross-Tabulation with pandas.crosstab()

The pd.crosstab() function is purpose-built for contingency tables. Its syntax is straightforward: pass the row variable as the first argument and the column variable as the second.

# Basic cross-tabulation: Region vs Product Preference
basic_crosstab = pd.crosstab(df['region'], df['product_preference'])
print(basic_crosstab)
product_preference  Basic  Enterprise  Premium
region                                        
East                   43          42       45
North                  45          38       36
South                  44          42       43
West                   38          44       40

This table shows the count of respondents for each combination of region and product preference. Reading across the East row, you see 43 people chose Basic, 42 chose Enterprise, and 45 chose Premium.

The function automatically handles the grouping and counting. You don’t need to write groupby operations or pivot tables manually. It also sorts the index and columns alphabetically by default, which you can override with the dropna and reindexing options.

# Specify custom ordering for categorical display
region_order = ['North', 'South', 'East', 'West']
product_order = ['Basic', 'Premium', 'Enterprise']

ordered_crosstab = pd.crosstab(
    df['region'], 
    df['product_preference']
).reindex(index=region_order, columns=product_order)
print(ordered_crosstab)
product_preference  Basic  Premium  Enterprise
region                                        
North                  45       36          38
South                  44       43          42
East                   43       45          42
West                   38       40          44

Adding Margins and Normalizing Data

Raw counts are useful, but they don’t tell the whole story. Adding margins gives you row and column totals. Normalizing converts counts to proportions, essential when comparing groups of different sizes.

# Add margins (row and column totals)
with_margins = pd.crosstab(
    df['region'], 
    df['product_preference'],
    margins=True,
    margins_name='Total'
)
print(with_margins)
product_preference  Basic  Enterprise  Premium  Total
region                                               
East                   43          42       45    130
North                  45          38       36    119
South                  44          42       43    129
West                   38          44       40    122
Total                 170         166      164    500

Now you can see that the East region has 130 total respondents, and 170 people overall chose the Basic product. The margins provide context for interpreting individual cells.

Normalization is where cross-tabulation becomes analytically powerful:

# Normalize by rows (each row sums to 1)
row_normalized = pd.crosstab(
    df['region'], 
    df['product_preference'],
    normalize='index'
).round(3)
print("Row-normalized (proportions within each region):")
print(row_normalized)

# Normalize by columns (each column sums to 1)
col_normalized = pd.crosstab(
    df['region'], 
    df['product_preference'],
    normalize='columns'
).round(3)
print("\nColumn-normalized (proportions within each product):")
print(col_normalized)

# Normalize by all (entire table sums to 1)
all_normalized = pd.crosstab(
    df['region'], 
    df['product_preference'],
    normalize='all'
).round(3)
print("\nTable-normalized (proportions of total):")
print(all_normalized)
Row-normalized (proportions within each region):
product_preference  Basic  Enterprise  Premium
region                                        
East                0.331       0.323    0.346
North               0.378       0.319    0.303
South               0.341       0.326    0.333
West                0.311       0.361    0.328

Column-normalized (proportions within each product):
product_preference  Basic  Enterprise  Premium
region                                        
East                0.253       0.253    0.274
North               0.265       0.229    0.220
South               0.259       0.253    0.262
West                0.224       0.265    0.244

Table-normalized (proportions of total):
product_preference  Basic  Enterprise  Premium
region                                        
East                0.086       0.084    0.090
North               0.090       0.076    0.072
South               0.088       0.084    0.086
West                0.076       0.088    0.080

Use row normalization (normalize='index') when you want to compare product preferences across regions. Use column normalization (normalize='columns') when you want to see which regions dominate each product category.

Multi-Variable Cross-Tabulation

Real analysis often involves more than two variables. You can pass lists to create hierarchical cross-tabulations:

# Three-variable cross-tabulation
multi_crosstab = pd.crosstab(
    [df['region'], df['age_group']], 
    df['product_preference']
)
print(multi_crosstab)
product_preference    Basic  Enterprise  Premium
region age_group                                
East   18-25             8          13       12
       26-35            13          13       10
       36-50            13           9       11
       51+               9           7       12
North  18-25            15          11       10
       26-35            11          11       10
       36-50            10           9       10
       51+               9           7        6
South  18-25            12          11        8
       26-35            10          10       12
       36-50            13          10       14
       51+               9          11        9
West   18-25             9          12       13
       26-35             8          11        9
       36-50            13          11       10
       51+               8          10        8

This creates a hierarchical index with region and age group combinations. You can also add multiple columns:

# Multiple variables on both axes
complex_crosstab = pd.crosstab(
    df['region'], 
    [df['product_preference'], df['repeat_customer']]
)
print(complex_crosstab)
product_preference Basic     Enterprise     Premium    
repeat_customer       No Yes         No Yes      No Yes
region                                                 
East                  22  21         17  25      21  24
North                 22  23         21  17      16  20
South                 24  20         19  23      17  26
West                  18  20         23  21      21  19

This shows product preference broken down by repeat customer status for each region. The hierarchical columns let you see patterns like whether repeat customers gravitate toward Premium products.

Applying Aggregation Functions

Cross-tabulation isn’t limited to counting. The aggfunc and values parameters let you compute means, sums, or custom aggregations:

# Average purchase amount by region and product preference
mean_purchase = pd.crosstab(
    df['region'], 
    df['product_preference'],
    values=df['purchase_amount'],
    aggfunc='mean'
).round(2)
print("Average purchase amount:")
print(mean_purchase)
Average purchase amount:
product_preference   Basic  Enterprise  Premium
region                                         
East                270.16      286.55   268.40
North               253.62      290.29   282.53
South               276.82      256.71   257.63
West                280.47      270.27   273.35

Now instead of counts, you see the average purchase amount for each segment. This is invaluable for understanding which customer segments drive revenue.

# Multiple aggregations using a custom function
def purchase_stats(x):
    return pd.Series({
        'mean': x.mean(),
        'median': x.median(),
        'count': len(x)
    })

# For multiple stats, use pivot_table instead
stats_table = df.pivot_table(
    values='purchase_amount',
    index='region',
    columns='product_preference',
    aggfunc=['mean', 'median', 'count']
).round(2)
print(stats_table)
              mean                    median                    count              
product_preference  Basic Enterprise Premium  Basic Enterprise Premium Basic Enterprise Premium
region                                                                                         
East               270.16     286.55  268.40  256.0      311.0   284.0    43         42      45
North              253.62     290.29  282.53  234.0      310.5   303.0    45         38      36
South              276.82     256.71  257.63  285.0      251.5   254.0    44         42      43
West               280.47     270.27  273.35  296.0      258.5   271.5    38         44      40

Visualizing Cross-Tabulation Results

Numbers in tables are precise but hard to scan. Heatmaps make patterns jump out immediately:

# Create a heatmap of the basic cross-tabulation
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Raw counts heatmap
counts = pd.crosstab(df['region'], df['product_preference'])
sns.heatmap(counts, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_title('Product Preference by Region (Counts)')

# Normalized heatmap
normalized = pd.crosstab(
    df['region'], 
    df['product_preference'],
    normalize='index'
)
sns.heatmap(normalized, annot=True, fmt='.1%', cmap='YlOrRd', ax=axes[1])
axes[1].set_title('Product Preference by Region (Row %)')

plt.tight_layout()
plt.savefig('crosstab_heatmap.png', dpi=150)
plt.show()

The annot=True parameter displays values in each cell. Use fmt='d' for integers and fmt='.1%' for percentages. The colormap choice matters—sequential colormaps like ‘Blues’ work well for counts, while diverging colormaps highlight deviations from a midpoint.

For quick styling without matplotlib, pandas has built-in options:

# Styled pandas output for notebooks
styled = pd.crosstab(
    df['region'], 
    df['product_preference'],
    normalize='index'
).style.format('{:.1%}').background_gradient(cmap='Blues')
styled

This renders directly in Jupyter notebooks with color gradients applied to cells, making high and low values immediately visible.

Cross-tabulation is a fundamental technique that belongs in every data analyst’s toolkit. Master pd.crosstab() and you’ll find yourself reaching for it constantly—it’s the fastest path from raw categorical data to actionable insights.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.