Pandas - Pivot (Reshape Long to Wide)

Key Insights

  • Pivot transforms long-format data into wide-format by rotating unique values from one column into multiple columns, essential for creating summary tables and cross-tabulations
  • The pivot() method requires specifying index (row identifier), columns (values to become new column headers), and values (data to populate the cells)
  • Use pivot_table() when dealing with duplicate entries or when you need aggregation functions, as standard pivot() will raise errors on non-unique index-column combinations

Understanding Long vs Wide Format

Long format stores each observation as a separate row with a variable column indicating what’s being measured. Wide format spreads observations across multiple columns. Consider sales data: long format has one row per product-date-metric combination, while wide format has one row per product with separate columns for each metric.

import pandas as pd
import numpy as np

# Long format example
long_data = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'product': ['Widget', 'Gadget', 'Widget', 'Gadget'],
    'sales': [100, 150, 120, 160]
})

print("Long format:")
print(long_data)

Output:

         date product  sales
0  2024-01-01  Widget    100
1  2024-01-01  Gadget    150
2  2024-01-02  Widget    120
3  2024-01-02  Gadget    160

Basic Pivot Operation

The pivot() method takes three key parameters: index defines rows, columns specifies which column’s unique values become new columns, and values indicates what data fills the cells.

# Pivot to wide format
wide_data = long_data.pivot(
    index='date',
    columns='product',
    values='sales'
)

print("\nWide format:")
print(wide_data)

Output:

product     Gadget  Widget
date                      
2024-01-01     150     100
2024-01-02     160     120

The resulting DataFrame has dates as the index, products as column headers, and sales figures populating the cells. The column name ‘product’ becomes the columns’ name attribute.
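If that lingering axis name is unwanted in reports, it can be dropped with rename_axis. A small sketch, rebuilding the frame above:

```python
import pandas as pd

long_data = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'product': ['Widget', 'Gadget', 'Widget', 'Gadget'],
    'sales': [100, 150, 120, 160]
})

wide_data = long_data.pivot(index='date', columns='product', values='sales')

# The pivoted columns carry the name of the source column
print(wide_data.columns.name)  # product

# Passing columns=None explicitly removes the axis name
clean = wide_data.rename_axis(columns=None)
print(clean.columns.name)  # None
```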

Multiple Value Columns

When your dataset contains multiple measurement columns, specify a list for the values parameter. This creates a hierarchical column structure.

# Data with multiple metrics
multi_metric = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'product': ['Widget', 'Gadget', 'Widget', 'Gadget'],
    'sales': [100, 150, 120, 160],
    'units': [10, 12, 11, 13]
})

pivoted_multi = multi_metric.pivot(
    index='date',
    columns='product',
    values=['sales', 'units']
)

print(pivoted_multi)

Output:

           sales       units      
product   Gadget Widget Gadget Widget
date                                 
2024-01-01   150    100     12     10
2024-01-02   160    120     13     11

Access specific metric-product combinations using tuple indexing:

# Access sales for Widget
widget_sales = pivoted_multi[('sales', 'Widget')]

# Flatten column names if needed
pivoted_multi.columns = ['_'.join(col) for col in pivoted_multi.columns]
print(pivoted_multi.columns.tolist())
# ['sales_Gadget', 'sales_Widget', 'units_Gadget', 'units_Widget']

Handling Duplicate Entries with pivot_table

Standard pivot() fails when index-column combinations aren’t unique. Use pivot_table() with an aggregation function instead.

# Data with duplicates
duplicate_data = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-01', '2024-01-01'],
    'product': ['Widget', 'Widget', 'Gadget', 'Gadget'],
    'region': ['North', 'South', 'North', 'South'],
    'sales': [100, 80, 150, 140]
})

# This would fail: duplicate_data.pivot(index='date', columns='product', values='sales')

# Use pivot_table with aggregation
summary = duplicate_data.pivot_table(
    index='date',
    columns='product',
    values='sales',
    aggfunc='sum'
)

print(summary)

Output:

product     Gadget  Widget
date                      
2024-01-01     290     180

Specify different aggregation functions:

# Multiple aggregations
multi_agg = duplicate_data.pivot_table(
    index='date',
    columns='product',
    values='sales',
    aggfunc=['sum', 'mean', 'count']
)

print(multi_agg)
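aggfunc also accepts a dict mapping each value column to its own function, so different metrics can be aggregated differently in one call. A minimal sketch, using hypothetical data with a units column alongside sales:

```python
import pandas as pd

mixed = pd.DataFrame({
    'date': ['2024-01-01'] * 4,
    'product': ['Widget', 'Widget', 'Gadget', 'Gadget'],
    'sales': [100, 80, 150, 140],
    'units': [10, 8, 15, 14]
})

# Sum sales, but average units, per product
per_metric = mixed.pivot_table(
    index='date',
    columns='product',
    aggfunc={'sales': 'sum', 'units': 'mean'}
)

print(per_metric)
```

The result has hierarchical columns like the multi-value pivot earlier, with each metric aggregated by its own function.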

Multi-Index Pivoting

Create more complex pivots by using lists for index or columns parameters, generating hierarchical indices.

# Extended dataset
extended_data = pd.DataFrame({
    'year': [2023, 2023, 2023, 2023, 2024, 2024, 2024, 2024],
    'quarter': ['Q1', 'Q1', 'Q2', 'Q2', 'Q1', 'Q1', 'Q2', 'Q2'],
    'product': ['Widget', 'Gadget', 'Widget', 'Gadget', 'Widget', 'Gadget', 'Widget', 'Gadget'],
    'revenue': [1000, 1200, 1100, 1300, 1050, 1250, 1150, 1350]
})

hierarchical = extended_data.pivot(
    index=['year', 'quarter'],
    columns='product',
    values='revenue'
)

print(hierarchical)

Output:

product         Gadget  Widget
year quarter                  
2023 Q1           1200    1000
     Q2           1300    1100
2024 Q1           1250    1050
     Q2           1350    1150

Access data using tuple indexing for multi-index:

# Get Q1 2024 data
q1_2024 = hierarchical.loc[(2024, 'Q1')]

# Reset index to flatten
flat = hierarchical.reset_index()
print(flat)
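The reverse transformation, wide back to long, is handled by melt: the id_vars columns stay fixed while the remaining columns collapse into variable/value pairs. A brief round-trip sketch on a reduced version of the data above:

```python
import pandas as pd

extended_data = pd.DataFrame({
    'year': [2023, 2023, 2024, 2024],
    'quarter': ['Q1', 'Q1', 'Q1', 'Q1'],
    'product': ['Widget', 'Gadget', 'Widget', 'Gadget'],
    'revenue': [1000, 1200, 1050, 1250]
})

hierarchical = extended_data.pivot(
    index=['year', 'quarter'], columns='product', values='revenue'
)

# Flatten the multi-index, then melt back to one row per observation
flat = hierarchical.reset_index()
back_to_long = flat.melt(
    id_vars=['year', 'quarter'],
    var_name='product',
    value_name='revenue'
)

print(back_to_long)
```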

Practical Example: Time Series Analysis

Transform time series data for comparative analysis across categories.

# Simulated temperature readings
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=90, freq='D')
cities = ['NYC', 'LA', 'Chicago']

temp_data = pd.DataFrame({
    'date': np.tile(dates, len(cities)),
    'city': np.repeat(cities, len(dates)),
    'temperature': np.random.randint(30, 90, size=len(dates) * len(cities))
})

# Pivot for city comparison
temp_wide = temp_data.pivot(
    index='date',
    columns='city',
    values='temperature'
)

# Calculate monthly averages
temp_wide['month'] = temp_wide.index.to_period('M')
monthly_avg = temp_wide.groupby('month')[['NYC', 'LA', 'Chicago']].mean()

print(monthly_avg.head())

Combining Pivot with Fill Operations

Handle missing values in pivoted data using the fill_value parameter or post-pivot operations.

# Sparse data
sparse_data = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02'],
    'product': ['Widget', 'Gadget', 'Widget'],
    'sales': [100, 150, 120]
})

# Pivot with fill_value
filled_pivot = sparse_data.pivot_table(
    index='date',
    columns='product',
    values='sales',
    fill_value=0
)

print(filled_pivot)

Output:

product     Gadget  Widget
date                      
2024-01-01     150     100
2024-01-02       0     120

For forward-fill or interpolation:

# Create pivot with NaN
pivot_with_nan = sparse_data.pivot(
    index='date',
    columns='product',
    values='sales'
)

# Forward fill missing values (fillna(method='ffill') is deprecated)
forward_filled = pivot_with_nan.ffill()

# Or use interpolation
interpolated = pivot_with_nan.interpolate(method='linear')

Performance Considerations

For large datasets, pivot() is faster than pivot_table() when aggregation isn’t needed. Ensure your data types are optimized before pivoting.

# Optimize before pivoting
large_data = pd.DataFrame({
    'id': range(1000000),
    'category': np.random.choice(['A', 'B', 'C'], 1000000),
    'value': np.random.randn(1000000)
})

# Convert to categorical for memory efficiency
large_data['category'] = large_data['category'].astype('category')

# Use appropriate aggregation
result = large_data.pivot_table(
    index='id',
    columns='category',
    values='value',
    aggfunc='mean'
)

Pivoting is fundamental for reshaping data into analysis-ready formats. Master both pivot() for clean, unique data and pivot_table() for aggregated summaries to handle real-world datasets effectively.
