Pandas - Cross Tabulation (crosstab)

• Cross tabulation transforms categorical data into frequency tables, revealing relationships between two or more variables that simple groupby operations miss

Key Insights

• Cross tabulation transforms categorical data into frequency tables, revealing relationships between two or more variables that simple groupby operations miss • The pd.crosstab() function provides built-in normalization, margin calculations, and aggregation functions that eliminate manual pivot table configuration • Understanding when to use crosstab versus pivot_table depends on whether you’re counting occurrences or aggregating existing numeric columns

Understanding Cross Tabulation Fundamentals

Cross tabulation creates a frequency distribution table showing the relationship between categorical variables. Unlike pivot_table(), which aggregates values from a column, crosstab() counts occurrences by default.

import pandas as pd
import numpy as np

# Sample dataset: customer purchases
data = {
    'customer_id': range(1, 21),
    'product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone', 
                'Tablet', 'Laptop', 'Phone', 'Tablet', 'Laptop',
                'Phone', 'Tablet', 'Laptop', 'Phone', 'Tablet',
                'Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone'],
    'region': ['North', 'South', 'East', 'West', 'North',
               'South', 'East', 'West', 'North', 'South',
               'East', 'West', 'North', 'South', 'East',
               'West', 'North', 'South', 'East', 'West'],
    'payment_method': ['Credit', 'Debit', 'Credit', 'PayPal', 'Credit',
                       'Debit', 'PayPal', 'Credit', 'Debit', 'Credit',
                       'PayPal', 'Debit', 'Credit', 'PayPal', 'Debit',
                       'Credit', 'PayPal', 'Debit', 'Credit', 'PayPal']
}

df = pd.DataFrame(data)

# Basic crosstab: product by region
ct = pd.crosstab(df['product'], df['region'])
print(ct)

Output:

region    East  North  South  West
product                           
Laptop       2      2      1     2
Phone        2      2      2     2
Tablet       2      1      2     0

Normalization and Percentages

Normalization converts raw counts into proportions, essential for comparing distributions across different sample sizes.

# Normalize by rows (each row sums to 1)
ct_row_norm = pd.crosstab(df['product'], df['region'], normalize='index')
print(ct_row_norm)
print()

# Normalize by columns (each column sums to 1)
ct_col_norm = pd.crosstab(df['product'], df['region'], normalize='columns')
print(ct_col_norm)
print()

# Normalize by total (entire table sums to 1)
ct_all_norm = pd.crosstab(df['product'], df['region'], normalize='all')
print(ct_all_norm)

Output (row normalization):

region      East  North  South  West
product                             
Laptop     0.286  0.286  0.143  0.286
Phone      0.250  0.250  0.250  0.250
Tablet     0.400  0.200  0.400  0.000

This reveals that 40% of Tablet sales occur in the East, while Phones distribute evenly across regions.

Adding Margins and Subtotals

Margins provide row and column totals, offering context for interpreting frequencies.

# Add margins (totals)
ct_margins = pd.crosstab(df['product'], df['region'], margins=True, margins_name='Total')
print(ct_margins)
print()

# Combine margins with normalization
ct_norm_margins = pd.crosstab(
    df['product'], 
    df['region'], 
    normalize='index',
    margins=True
)
print(ct_norm_margins)

Output:

region    East  North  South  West  Total
product                                  
Laptop       2      2      1     2      7
Phone        2      2      2     2      8
Tablet       2      1      2     0      5
Total        6      5      5     4     20

Multi-Dimensional Cross Tabulation

Analyze relationships across three or more variables by passing lists to the index and columns parameters.

# Three-way crosstab: product and payment method by region
ct_3way = pd.crosstab(
    [df['product'], df['payment_method']], 
    df['region']
)
print(ct_3way)
print()

# More readable with column hierarchies
ct_3way_cols = pd.crosstab(
    df['product'],
    [df['region'], df['payment_method']]
)
print(ct_3way_cols)

Output:

region                East  North  South  West
product payment_method                        
Laptop  Credit            1      2      0     1
        PayPal            1      0      0     1
Phone   Credit            0      1      0     1
        Debit             0      0      1     0
        PayPal            2      1      1     1
Tablet  Credit            1      0      0     0
        Debit             1      0      2     0

Custom Aggregation Functions

While crosstab counts by default, you can aggregate numeric values using the values and aggfunc parameters.

# Add transaction amounts
df['amount'] = np.random.randint(100, 1000, size=len(df))

# Sum amounts by product and region
ct_sum = pd.crosstab(
    df['product'],
    df['region'],
    values=df['amount'],
    aggfunc='sum'
)
print(ct_sum)
print()

# Multiple aggregation functions
ct_multi_agg = pd.crosstab(
    df['product'],
    df['region'],
    values=df['amount'],
    aggfunc=['sum', 'mean', 'count']
)
print(ct_multi_agg)

Filtering and Conditional Analysis

Apply boolean indexing before crosstab to analyze specific subsets.

# High-value transactions only (>500)
high_value = df[df['amount'] > 500]
ct_high_value = pd.crosstab(high_value['product'], high_value['region'])
print("High-value transactions:")
print(ct_high_value)
print()

# Compare with low-value transactions
low_value = df[df['amount'] <= 500]
ct_low_value = pd.crosstab(low_value['product'], low_value['region'])
print("Low-value transactions:")
print(ct_low_value)

Practical Pattern: A/B Test Analysis

Cross tabulation excels at analyzing categorical outcomes in A/B tests.

# Simulate A/B test data
ab_data = {
    'user_id': range(1, 201),
    'variant': np.random.choice(['A', 'B'], 200),
    'converted': np.random.choice([True, False], 200, p=[0.15, 0.85]),
    'device': np.random.choice(['Mobile', 'Desktop', 'Tablet'], 200)
}
ab_df = pd.DataFrame(ab_data)

# Conversion rate by variant
ct_conversion = pd.crosstab(
    ab_df['variant'],
    ab_df['converted'],
    normalize='index'
)
print("Conversion rates by variant:")
print(ct_conversion)
print()

# Conversion by variant and device
ct_device = pd.crosstab(
    [ab_df['variant'], ab_df['device']],
    ab_df['converted'],
    normalize='index'
)
print("Conversion rates by variant and device:")
print(ct_device)

Crosstab vs Pivot Table Decision Matrix

Use crosstab() when:

  • Counting occurrences of categorical combinations
  • Variables exist as separate Series or arrays (not necessarily in a DataFrame)
  • You need quick frequency distributions without pre-aggregated values

Use pivot_table() when:

  • Aggregating existing numeric columns
  • All data exists within a single DataFrame
  • You need more complex hierarchical indexing with multiple aggregation functions
# Same result, different approaches
# Crosstab approach
ct_approach = pd.crosstab(df['product'], df['region'])

# Pivot table approach (requires counting column)
df['count'] = 1
pivot_approach = df.pivot_table(
    values='count',
    index='product',
    columns='region',
    aggfunc='sum',
    fill_value=0
)

print("Results are identical:", ct_approach.equals(pivot_approach))

Performance Considerations

For large datasets, crosstab performance depends on cardinality of categorical variables.

# Large dataset example
large_data = {
    'category_a': np.random.choice(['X', 'Y', 'Z'], 100000),
    'category_b': np.random.choice(range(50), 100000)
}
large_df = pd.DataFrame(large_data)

# Time comparison
import time

start = time.time()
ct_large = pd.crosstab(large_df['category_a'], large_df['category_b'])
crosstab_time = time.time() - start

start = time.time()
grouped = large_df.groupby(['category_a', 'category_b']).size().unstack(fill_value=0)
groupby_time = time.time() - start

print(f"Crosstab: {crosstab_time:.4f}s")
print(f"Groupby+unstack: {groupby_time:.4f}s")

Cross tabulation transforms categorical data into actionable insights. Master normalization for percentage-based analysis, use margins for context, and leverage multi-dimensional crosstabs for complex relationships. The choice between crosstab and pivot_table hinges on whether you’re counting occurrences or aggregating values—understanding this distinction prevents unnecessary data manipulation.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.