Pandas - Pivot (Reshape Long to Wide)
Key Insights
- Pivot transforms long-format data into wide format by rotating unique values from one column into multiple columns, essential for creating summary tables and cross-tabulations
- The pivot() method requires specifying index (row identifier), columns (values to become new column headers), and values (data to populate the cells)
- Use pivot_table() when dealing with duplicate entries or when you need aggregation functions, as standard pivot() will raise errors on non-unique index-column combinations
Understanding Long vs Wide Format
Long format stores each observation as a separate row with a variable column indicating what’s being measured. Wide format spreads observations across multiple columns. Consider sales data: long format has one row per product-date-metric combination, while wide format has one row per product with separate columns for each metric.
import pandas as pd
import numpy as np
# Long format example
long_data = pd.DataFrame({
'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
'product': ['Widget', 'Gadget', 'Widget', 'Gadget'],
'sales': [100, 150, 120, 160]
})
print("Long format:")
print(long_data)
Output:
date product sales
0 2024-01-01 Widget 100
1 2024-01-01 Gadget 150
2 2024-01-02 Widget 120
3 2024-01-02 Gadget 160
Basic Pivot Operation
The pivot() method takes three key parameters: index defines rows, columns specifies which column’s unique values become new columns, and values indicates what data fills the cells.
# Pivot to wide format
wide_data = long_data.pivot(
index='date',
columns='product',
values='sales'
)
print("\nWide format:")
print(wide_data)
Output:
product Gadget Widget
date
2024-01-01 150 100
2024-01-02 160 120
The resulting DataFrame has dates as the index, products as column headers, and sales figures populating the cells. The column name ‘product’ becomes the columns’ name attribute.
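You can inspect that name attribute directly, and strip it when you want a plain DataFrame for downstream use. A minimal sketch, reusing the data above (the `clean` variable name is just for illustration):

```python
import pandas as pd

long_data = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'product': ['Widget', 'Gadget', 'Widget', 'Gadget'],
    'sales': [100, 150, 120, 160]
})
wide_data = long_data.pivot(index='date', columns='product', values='sales')

# The original column name survives as the columns axis label
print(wide_data.columns.name)  # product

# Drop the axis label and restore 'date' to a regular column
clean = wide_data.rename_axis(columns=None).reset_index()
print(clean.columns.tolist())  # ['date', 'Gadget', 'Widget']
```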
Multiple Value Columns
When your dataset contains multiple measurement columns, specify a list for the values parameter. This creates a hierarchical column structure.
# Data with multiple metrics
multi_metric = pd.DataFrame({
'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
'product': ['Widget', 'Gadget', 'Widget', 'Gadget'],
'sales': [100, 150, 120, 160],
'units': [10, 12, 11, 13]
})
pivoted_multi = multi_metric.pivot(
index='date',
columns='product',
values=['sales', 'units']
)
print(pivoted_multi)
Output:
sales units
product Gadget Widget Gadget Widget
date
2024-01-01 150 100 12 10
2024-01-02 160 120 13 11
Access specific metric-product combinations using tuple indexing:
# Access sales for Widget
widget_sales = pivoted_multi[('sales', 'Widget')]
# Flatten column names if needed
pivoted_multi.columns = ['_'.join(col) for col in pivoted_multi.columns]
print(pivoted_multi.columns.tolist())
# ['sales_Gadget', 'sales_Widget', 'units_Gadget', 'units_Widget']
Handling Duplicate Entries with pivot_table
Standard pivot() fails when index-column combinations aren’t unique. Use pivot_table() with an aggregation function instead.
# Data with duplicates
duplicate_data = pd.DataFrame({
'date': ['2024-01-01', '2024-01-01', '2024-01-01', '2024-01-01'],
'product': ['Widget', 'Widget', 'Gadget', 'Gadget'],
'region': ['North', 'South', 'North', 'South'],
'sales': [100, 80, 150, 140]
})
# This would fail: duplicate_data.pivot(index='date', columns='product', values='sales')
# Use pivot_table with aggregation
summary = duplicate_data.pivot_table(
index='date',
columns='product',
values='sales',
aggfunc='sum'
)
print(summary)
Output:
product Gadget Widget
date
2024-01-01 290 180

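To see the failure mode concretely, you can catch the ValueError that pivot() raises on the same data; a quick sketch:

```python
import pandas as pd

duplicate_data = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-01', '2024-01-01'],
    'product': ['Widget', 'Widget', 'Gadget', 'Gadget'],
    'region': ['North', 'South', 'North', 'South'],
    'sales': [100, 80, 150, 140]
})

# Two rows share the (2024-01-01, Widget) combination, so pivot() cannot
# decide which value belongs in the cell and raises a ValueError
try:
    duplicate_data.pivot(index='date', columns='product', values='sales')
except ValueError as e:
    print(f"pivot() failed: {e}")
```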
Specify different aggregation functions:
# Multiple aggregations
multi_agg = duplicate_data.pivot_table(
index='date',
columns='product',
values='sales',
aggfunc=['sum', 'mean', 'count']
)
print(multi_agg)
Multi-Index Pivoting
Create more complex pivots by using lists for index or columns parameters, generating hierarchical indices.
# Extended dataset
extended_data = pd.DataFrame({
'year': [2023, 2023, 2023, 2023, 2024, 2024, 2024, 2024],
'quarter': ['Q1', 'Q1', 'Q2', 'Q2', 'Q1', 'Q1', 'Q2', 'Q2'],
'product': ['Widget', 'Gadget', 'Widget', 'Gadget', 'Widget', 'Gadget', 'Widget', 'Gadget'],
'revenue': [1000, 1200, 1100, 1300, 1050, 1250, 1150, 1350]
})
hierarchical = extended_data.pivot(
index=['year', 'quarter'],
columns='product',
values='revenue'
)
print(hierarchical)
Output:
product Gadget Widget
year quarter
2023 Q1 1200 1000
Q2 1300 1100
2024 Q1 1250 1050
Q2 1350 1150
Access data using tuple indexing for multi-index:
# Get Q1 2024 data
q1_2024 = hierarchical.loc[(2024, 'Q1')]
# Reset index to flatten
flat = hierarchical.reset_index()
print(flat)
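Beyond single-tuple lookups, you can slice whole levels of the hierarchical index. A sketch using the same extended dataset: `.loc[2024]` selects every row under the outer level, while `.xs()` cross-sections on the inner level:

```python
import pandas as pd

extended_data = pd.DataFrame({
    'year': [2023, 2023, 2023, 2023, 2024, 2024, 2024, 2024],
    'quarter': ['Q1', 'Q1', 'Q2', 'Q2', 'Q1', 'Q1', 'Q2', 'Q2'],
    'product': ['Widget', 'Gadget', 'Widget', 'Gadget',
                'Widget', 'Gadget', 'Widget', 'Gadget'],
    'revenue': [1000, 1200, 1100, 1300, 1050, 1250, 1150, 1350]
})
hierarchical = extended_data.pivot(
    index=['year', 'quarter'], columns='product', values='revenue'
)

# All quarters for 2024 (outer level)
year_2024 = hierarchical.loc[2024]

# Every Q1 across years (inner level, selected by name)
q1_all = hierarchical.xs('Q1', level='quarter')
print(q1_all)
```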
Practical Example: Time Series Analysis
Transform time series data for comparative analysis across categories.
# Simulated temperature readings
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=90, freq='D')
cities = ['NYC', 'LA', 'Chicago']
temp_data = pd.DataFrame({
'date': np.tile(dates, len(cities)),
'city': np.repeat(cities, len(dates)),
'temperature': np.random.randint(30, 90, size=len(dates) * len(cities))
})
# Pivot for city comparison
temp_wide = temp_data.pivot(
index='date',
columns='city',
values='temperature'
)
# Calculate monthly averages
temp_wide['month'] = temp_wide.index.to_period('M')
monthly_avg = temp_wide.groupby('month')[['NYC', 'LA', 'Chicago']].mean()
print(monthly_avg.head())
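An equivalent approach groups on a key derived from the index, which avoids adding a helper 'month' column to the pivoted frame; a sketch with the same simulated data:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=90, freq='D')
cities = ['NYC', 'LA', 'Chicago']
temp_data = pd.DataFrame({
    'date': np.tile(dates, len(cities)),
    'city': np.repeat(cities, len(dates)),
    'temperature': np.random.randint(30, 90, size=len(dates) * len(cities))
})
temp_wide = temp_data.pivot(index='date', columns='city', values='temperature')

# Group directly on a monthly PeriodIndex derived from the DatetimeIndex;
# temp_wide itself stays untouched
monthly = temp_wide.groupby(temp_wide.index.to_period('M')).mean()
print(monthly)
```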
Combining Pivot with Fill Operations
Handle missing values in pivoted data using fill_value parameter or post-pivot operations.
# Sparse data
sparse_data = pd.DataFrame({
'date': ['2024-01-01', '2024-01-01', '2024-01-02'],
'product': ['Widget', 'Gadget', 'Widget'],
'sales': [100, 150, 120]
})
# Pivot with fill_value
filled_pivot = sparse_data.pivot_table(
index='date',
columns='product',
values='sales',
fill_value=0
)
print(filled_pivot)
Output:
product Gadget Widget
date
2024-01-01 150 100
2024-01-02 0 120
For forward-fill or interpolation:
# Create pivot with NaN
pivot_with_nan = sparse_data.pivot(
index='date',
columns='product',
values='sales'
)
# Forward fill missing values
forward_filled = pivot_with_nan.ffill()  # fillna(method='ffill') is deprecated
# Or use interpolation
interpolated = pivot_with_nan.interpolate(method='linear')
Performance Considerations
For large datasets, pivot() is faster than pivot_table() when aggregation isn’t needed. Ensure your data types are optimized before pivoting.
# Optimize before pivoting
large_data = pd.DataFrame({
'id': range(1000000),
'category': np.random.choice(['A', 'B', 'C'], 1000000),
'value': np.random.randn(1000000)
})
# Convert to categorical for memory efficiency
large_data['category'] = large_data['category'].astype('category')
# Use appropriate aggregation
result = large_data.pivot_table(
index='id',
columns='category',
values='value',
aggfunc='mean',
observed=True  # with categorical dtypes, skip unobserved category combinations
)
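The speed claim can be checked with an illustrative micro-benchmark; absolute timings vary by machine, and this sketch uses a smaller frame than the 1M-row example above. When every (id, category) pair is unique, both methods return the same result:

```python
import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    'id': range(n),                               # unique per row
    'category': rng.choice(['A', 'B', 'C'], n),
    'value': rng.standard_normal(n)
})

t0 = time.perf_counter()
fast = df.pivot(index='id', columns='category', values='value')
t_pivot = time.perf_counter() - t0

t0 = time.perf_counter()
slow = df.pivot_table(index='id', columns='category',
                      values='value', aggfunc='mean')
t_table = time.perf_counter() - t0

# Identical output; pivot() skips the aggregation machinery entirely
print(f"pivot: {t_pivot:.3f}s  pivot_table: {t_table:.3f}s")
```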
Pivoting is fundamental for reshaping data into analysis-ready formats. Master both pivot() for clean, unique data and pivot_table() for aggregated summaries to handle real-world datasets effectively.