Pandas - Pivot Table with Examples | Application Architect

Key Insights

Pivot tables transform long-format data into wide-format summaries by aggregating values across row and column dimensions, essential for exploratory data analysis and reporting
The pivot_table() function provides flexible aggregation with multiple value columns, index levels, and aggregation functions including custom operations
Understanding the difference between pivot() and pivot_table() prevents common errors—use pivot() for reshaping without aggregation, pivot_table() when duplicate entries require aggregation

Basic Pivot Table Structure

A pivot table reorganizes data from a DataFrame by specifying which columns become the new index (rows), which become columns, and what values to aggregate. The fundamental syntax requires three components: values to aggregate, index for rows, and columns for the new column headers.

import pandas as pd
import numpy as np

# Sample sales data
data = {
    'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02', '2024-01-03'],
    'region': ['East', 'West', 'East', 'West', 'East'],
    'product': ['A', 'B', 'A', 'B', 'A'],
    'sales': [100, 150, 120, 180, 110],
    'quantity': [5, 7, 6, 9, 5]
}

df = pd.DataFrame(data)

# Basic pivot table
pivot = df.pivot_table(
    values='sales',
    index='region',
    columns='product',
    aggfunc='sum'
)

print(pivot)

Output:

product     A      B
region              
East      330.0    NaN
West      NaN    330.0

The aggfunc parameter defaults to 'mean' but accepts any aggregation function: 'sum', 'count', 'min', 'max', 'std', or custom functions.

Multiple Aggregation Functions

Apply different aggregation functions simultaneously to analyze data from multiple perspectives. Pass a list of functions to aggfunc or use a dictionary to map specific functions to specific value columns.

# Multiple aggregation functions on single column
pivot_multi = df.pivot_table(
    values='sales',
    index='region',
    columns='product',
    aggfunc=['sum', 'mean', 'count']
)

print(pivot_multi)

# Different functions for different columns
pivot_dict = df.pivot_table(
    values=['sales', 'quantity'],
    index='region',
    columns='product',
    aggfunc={
        'sales': 'sum',
        'quantity': 'mean'
    }
)

print(pivot_dict)

This creates hierarchical columns where the top level represents the aggregation function and the second level represents the product categories.

Multi-Level Indexing

Create pivot tables with multiple row or column levels for complex hierarchical analysis. This enables drilling down into data across multiple dimensions.

# Expanded dataset with more dimensions
data_extended = {
    'date': pd.date_range('2024-01-01', periods=20, freq='D'),
    'region': np.random.choice(['East', 'West', 'North'], 20),
    'product': np.random.choice(['A', 'B', 'C'], 20),
    'category': np.random.choice(['Electronics', 'Furniture'], 20),
    'sales': np.random.randint(100, 500, 20),
    'quantity': np.random.randint(1, 10, 20)
}

df_extended = pd.DataFrame(data_extended)

# Multi-level row index
pivot_multi_index = df_extended.pivot_table(
    values='sales',
    index=['region', 'category'],
    columns='product',
    aggfunc='sum',
    fill_value=0
)

print(pivot_multi_index)

# Multi-level column index
pivot_multi_cols = df_extended.pivot_table(
    values=['sales', 'quantity'],
    index='region',
    columns=['category', 'product'],
    aggfunc='sum',
    fill_value=0
)

print(pivot_multi_cols)

The fill_value parameter replaces NaN values with a specified value, crucial for numeric operations that fail on missing data.

Margins and Subtotals

Add row and column totals using the margins parameter. This appends an ‘All’ row and column showing aggregated totals across all categories.

pivot_margins = df_extended.pivot_table(
    values='sales',
    index='region',
    columns='product',
    aggfunc='sum',
    fill_value=0,
    margins=True,
    margins_name='Total'
)

print(pivot_margins)

# Margins with multiple aggregation functions
pivot_margins_multi = df_extended.pivot_table(
    values='sales',
    index=['region', 'category'],
    columns='product',
    aggfunc=['sum', 'mean'],
    margins=True,
    margins_name='Subtotal'
)

print(pivot_margins_multi)

The margins calculation respects the specified aggregation function. For 'sum', it sums all values; for 'mean', it calculates the mean across all data points, not the mean of means.

Custom Aggregation Functions

Define custom aggregation logic using lambda functions or named functions for specialized calculations beyond built-in aggregations.

# Custom aggregation function
def sales_range(x):
    return x.max() - x.min()

def weighted_avg(x):
    # Assumes access to quantity column
    return (x * df_extended.loc[x.index, 'quantity']).sum() / df_extended.loc[x.index, 'quantity'].sum()

pivot_custom = df_extended.pivot_table(
    values='sales',
    index='region',
    columns='product',
    aggfunc=[sales_range, 'mean', 'std'],
    fill_value=0
)

print(pivot_custom)

# Using lambda for percentile calculations
pivot_percentile = df_extended.pivot_table(
    values='sales',
    index='region',
    columns='product',
    aggfunc=lambda x: x.quantile(0.75),
    fill_value=0
)

print(pivot_percentile)

Custom functions receive a Series of values for each group and must return a scalar value. Access to the original DataFrame within custom functions requires careful index management.

Filtering and Observed Categories

Control which categories appear in the pivot table using the observed parameter for categorical data and post-pivot filtering for conditional display.

# Convert to categorical
df_extended['product'] = pd.Categorical(
    df_extended['product'],
    categories=['A', 'B', 'C', 'D']  # 'D' doesn't exist in data
)

# observed=False shows all categories (default for older pandas)
pivot_all = df_extended.pivot_table(
    values='sales',
    index='region',
    columns='product',
    aggfunc='sum',
    observed=False
)

print(pivot_all)  # Shows column for 'D' with NaN values

# observed=True shows only observed categories
pivot_observed = df_extended.pivot_table(
    values='sales',
    index='region',
    columns='product',
    aggfunc='sum',
    observed=True
)

print(pivot_observed)  # No column for 'D'

Practical Example: Sales Analysis Dashboard

Combine techniques for a real-world sales analysis scenario with multiple dimensions and aggregations.

# Generate realistic sales data
np.random.seed(42)
dates = pd.date_range('2024-01-01', '2024-03-31', freq='D')
sales_data = []

for date in dates:
    for region in ['North', 'South', 'East', 'West']:
        for product in ['Laptop', 'Phone', 'Tablet']:
            sales_data.append({
                'date': date,
                'region': region,
                'product': product,
                'sales_amount': np.random.randint(1000, 10000),
                'units_sold': np.random.randint(1, 50),
                'returns': np.random.randint(0, 5)
            })

df_sales = pd.DataFrame(sales_data)
df_sales['month'] = df_sales['date'].dt.to_period('M')

# Comprehensive pivot table
pivot_dashboard = df_sales.pivot_table(
    values=['sales_amount', 'units_sold', 'returns'],
    index=['month', 'region'],
    columns='product',
    aggfunc={
        'sales_amount': 'sum',
        'units_sold': 'sum',
        'returns': 'mean'
    },
    fill_value=0,
    margins=True
)

print(pivot_dashboard)

# Calculate return rate
pivot_simple = df_sales.pivot_table(
    values=['returns', 'units_sold'],
    index='region',
    columns='product',
    aggfunc='sum'
)

return_rate = (pivot_simple['returns'] / pivot_simple['units_sold'] * 100).round(2)
print("\nReturn Rate (%):")
print(return_rate)

This dashboard-style pivot table provides monthly regional breakdowns across products with different metrics aggregated appropriately—sums for revenue and units, means for return rates.

Performance Considerations

For large datasets, pivot tables can be memory-intensive. Use categorical data types and consider alternatives for massive data.

# Optimize with categorical types
df_sales['region'] = df_sales['region'].astype('category')
df_sales['product'] = df_sales['product'].astype('category')

# For very large datasets, consider groupby
grouped = df_sales.groupby(['region', 'product'])['sales_amount'].sum().unstack(fill_value=0)
print(grouped)  # Similar result, potentially faster

The pivot_table() method internally uses groupby() operations, so direct groupby with unstack() can offer better performance when you need only basic aggregations without pivot table’s additional features.