Pandas: Reshaping Data with Pivot and Melt
Key Insights
- Pivot transforms long data to wide format (rows become columns), while melt does the opposite—mastering both gives you complete control over DataFrame structure
- Use pivot_table() instead of pivot() when your data has duplicate entries that need aggregation; it’s more robust for real-world messy data
- Reshape operations are not always perfectly reversible—understanding when data loss occurs prevents frustrating debugging sessions
Introduction to Data Reshaping
Data rarely arrives in the format you need. Your visualization library wants wide format, your machine learning model expects long format, and your database export looks nothing like either. Reshaping data is one of those unglamorous skills that separates productive data work from hours of frustration.
The core distinction is simple: wide format stores related measurements across multiple columns (one column per time period, category, or variable), while long format stacks them into fewer columns with an additional identifier column.
Consider survey responses. Wide format gives you Q1, Q2, Q3, Q4 as separate columns—easy to read, hard to aggregate. Long format gives you a single quarter column with values 1-4 and a response column—verbose to read, trivial to filter and group.
Neither format is inherently better. The right format depends on what you’re doing next.
Understanding Pivot: Long to Wide
The pivot() function transforms long data into wide format by turning unique values from one column into new column headers.
import pandas as pd
# Sales data in long format
sales = pd.DataFrame({
'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
'product': ['Widget', 'Gadget', 'Widget', 'Gadget'],
'revenue': [100, 150, 120, 140]
})
print("Long format:")
print(sales)
# date product revenue
# 0 2024-01-01 Widget 100
# 1 2024-01-01 Gadget 150
# 2 2024-01-02 Widget 120
# 3 2024-01-02 Gadget 140
# Pivot: dates as rows, products as columns
wide_sales = sales.pivot(index='date', columns='product', values='revenue')
print("\nWide format:")
print(wide_sales)
# product Gadget Widget
# date
# 2024-01-01 150 100
# 2024-01-02 140 120
The three parameters are straightforward: index becomes row labels, columns values become column headers, and values fills the cells.
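A useful variant: if you omit values, pivot() reshapes every remaining column at once and returns hierarchical (MultiIndex) columns. A minimal sketch, where the extra units column is an assumption added for illustration:

```python
import pandas as pd

# Long-format data with two measure columns (units is hypothetical)
sales = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'product': ['Widget', 'Gadget', 'Widget', 'Gadget'],
    'revenue': [100, 150, 120, 140],
    'units': [10, 5, 12, 4],
})

# Omitting `values` pivots both revenue and units, producing
# MultiIndex columns like ('revenue', 'Gadget') and ('units', 'Widget')
wide = sales.pivot(index='date', columns='product')
print(wide.columns.tolist())
```

This is handy when you want several measures side by side, though you may need to flatten the resulting column MultiIndex afterwards.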
Here’s where beginners hit a wall: pivot() fails with duplicate entries.
# Data with duplicates (two Widget sales on same date)
sales_dupes = pd.DataFrame({
'date': ['2024-01-01', '2024-01-01', '2024-01-01'],
'product': ['Widget', 'Widget', 'Gadget'],
'revenue': [100, 50, 150]
})
# This raises ValueError: Index contains duplicate entries
# sales_dupes.pivot(index='date', columns='product', values='revenue')
When you have duplicate index-column combinations, pivot() doesn’t know which value to use. You need pivot_table().
Pivot Tables for Aggregation
pivot_table() handles duplicates by aggregating them. It’s the more robust choice for real-world data.
# Detailed sales with multiple transactions per day/product
detailed_sales = pd.DataFrame({
'date': ['2024-01-01', '2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
'region': ['North', 'North', 'South', 'North', 'South'],
'product': ['Widget', 'Widget', 'Gadget', 'Gadget', 'Widget'],
'revenue': [100, 50, 200, 175, 125]
})
# Sum revenue by region and product
summary = pd.pivot_table(
detailed_sales,
values='revenue',
index='region',
columns='product',
aggfunc='sum',
fill_value=0,
margins=True,
margins_name='Total'
)
print(summary)
# product Gadget Widget Total
# region
# North 175 150 325
# South 200 125 325
# Total 375 275 650
The margins=True parameter adds row and column totals—extremely useful for reports. The fill_value=0 replaces cells for missing index-column combinations with 0 instead of NaN; every region-product pair happens to have sales in this example, but sparser data would otherwise produce NaN cells.
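To see fill_value actually trigger, here is a sketch with hypothetical data where one region-product combination has no rows at all:

```python
import pandas as pd

# Sparse sales: no Gadget sales in the South at all (hypothetical numbers)
sparse = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'product': ['Widget', 'Gadget', 'Widget'],
    'revenue': [100, 200, 125],
})

# Without fill_value, the South/Gadget cell is NaN
with_nan = pd.pivot_table(sparse, values='revenue',
                          index='region', columns='product', aggfunc='sum')

# With fill_value=0, the same cell becomes 0
filled = pd.pivot_table(sparse, values='revenue',
                        index='region', columns='product',
                        aggfunc='sum', fill_value=0)
print(filled)
```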
You can also create multi-index pivots by passing lists:
multi_pivot = pd.pivot_table(
detailed_sales,
values='revenue',
index=['region', 'date'],
columns='product',
aggfunc=['sum', 'count']
)
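Passing a list of aggfuncs produces hierarchical columns like ('sum', 'Gadget'). A common follow-up is flattening them into plain strings; this sketch rebuilds the same table and joins the levels with underscores (the underscore naming convention is a choice, not a pandas requirement):

```python
import pandas as pd

detailed_sales = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'region': ['North', 'North', 'South', 'North', 'South'],
    'product': ['Widget', 'Widget', 'Gadget', 'Gadget', 'Widget'],
    'revenue': [100, 50, 200, 175, 125],
})

multi_pivot = pd.pivot_table(
    detailed_sales, values='revenue',
    index=['region', 'date'], columns='product',
    aggfunc=['sum', 'count'],
)

# Flatten ('sum', 'Gadget') -> 'sum_Gadget' for easier downstream access
multi_pivot.columns = ['_'.join(col) for col in multi_pivot.columns]
print(multi_pivot.reset_index())
```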
Understanding Melt: Wide to Long
melt() is pivot’s inverse. It unpivots wide data by collapsing multiple columns into key-value pairs.
# Survey data in wide format
survey = pd.DataFrame({
'respondent_id': [1, 2, 3],
'name': ['Alice', 'Bob', 'Carol'],
'Q1': [4, 5, 3],
'Q2': [3, 4, 4],
'Q3': [5, 3, 5],
'Q4': [4, 4, 4]
})
print("Wide format:")
print(survey)
# respondent_id name Q1 Q2 Q3 Q4
# 0 1 Alice 4 3 5 4
# 1 2 Bob 5 4 3 4
# 2 3 Carol 3 4 5 4
# Melt: Q1-Q4 columns become rows
long_survey = survey.melt(
id_vars=['respondent_id', 'name'],
value_vars=['Q1', 'Q2', 'Q3', 'Q4'],
var_name='quarter',
value_name='score'
)
print("\nLong format:")
print(long_survey)
# respondent_id name quarter score
# 0 1 Alice Q1 4
# 1 2 Bob Q1 5
# 2 3 Carol Q1 3
# 3 1 Alice Q2 3
# ...
The parameters: id_vars are columns to keep as identifiers, value_vars are columns to unpivot, var_name names the new column holding former column headers, and value_name names the column holding the values.
If you omit value_vars, melt unpivots all columns not in id_vars. This is convenient but dangerous—be explicit when your DataFrame has columns you want to preserve.
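A quick sketch of the pitfall: forget one identifier in id_vars and it gets melted along with the question columns, mixing text into your value column:

```python
import pandas as pd

survey = pd.DataFrame({
    'respondent_id': [1, 2],
    'name': ['Alice', 'Bob'],
    'Q1': [4, 5],
    'Q2': [3, 4],
})

# 'name' was meant to be an identifier, but without value_vars it
# melts too: the value column now mixes scores with respondent names
oops = survey.melt(id_vars=['respondent_id'])
print(sorted(oops['variable'].unique()))
```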
Pivot and Melt as Inverses
In theory, pivot and melt should be perfect inverses. In practice, you need to handle index resets and column ordering.
# Start with long format
original = pd.DataFrame({
'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
'metric': ['views', 'clicks', 'views', 'clicks'],
'value': [1000, 50, 1200, 65]
})
# Pivot to wide
wide = original.pivot(index='date', columns='metric', values='value')
# Melt back to long
recovered = wide.reset_index().melt(
id_vars='date',
var_name='metric',
value_name='value'
)
# Sort for comparison
original_sorted = original.sort_values(['date', 'metric']).reset_index(drop=True)
recovered_sorted = recovered.sort_values(['date', 'metric']).reset_index(drop=True)
print(original_sorted.equals(recovered_sorted)) # True
Data loss occurs when pivoting aggregates values. If you sum duplicate entries during pivot, you can’t recover the original individual values with melt.
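A concrete sketch of that loss, reusing the duplicate-sales data from earlier:

```python
import pandas as pd

sales_dupes = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-01'],
    'product': ['Widget', 'Widget', 'Gadget'],
    'revenue': [100, 50, 150],
})

# pivot_table sums the two Widget rows into a single 150 cell...
wide = pd.pivot_table(sales_dupes, values='revenue',
                      index='date', columns='product', aggfunc='sum')

# ...so melting back yields one Widget row of 150,
# not the original 100 and 50 transactions
recovered = wide.reset_index().melt(id_vars='date', value_name='revenue')
print(recovered)
```

If you need to reverse a reshape exactly, pivot on data that has no duplicate index-column combinations, or keep the original long-format DataFrame around.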
Real-World Reshaping Patterns
Preparing Time Series for Visualization
Seaborn’s heatmap expects a wide-format matrix of numeric values:
import numpy as np
# Weather station data in long format
weather = pd.DataFrame({
'station': ['Downtown', 'Downtown', 'Downtown', 'Airport', 'Airport', 'Airport'],
'month': ['Jan', 'Feb', 'Mar', 'Jan', 'Feb', 'Mar'],
'avg_temp': [32, 35, 45, 30, 33, 43]
})
# Reshape for heatmap
heatmap_data = weather.pivot(
index='station',
columns='month',
values='avg_temp'
)
# Reorder columns chronologically
heatmap_data = heatmap_data[['Jan', 'Feb', 'Mar']]
print(heatmap_data)
# month Jan Feb Mar
# station
# Airport 30 33 43
# Downtown 32 35 45
# Now ready for: sns.heatmap(heatmap_data, annot=True)
Normalizing Denormalized Exports
Database exports often repeat data for readability. Melt helps normalize:
# Denormalized export with repeated customer info
export = pd.DataFrame({
'customer_id': [101, 102],
'customer_name': ['Acme Corp', 'Globex'],
'jan_sales': [5000, 7500],
'feb_sales': [5500, 7200],
'mar_sales': [6000, 8000]
})
# Normalize to proper relational format
normalized = export.melt(
id_vars=['customer_id', 'customer_name'],
value_vars=['jan_sales', 'feb_sales', 'mar_sales'],
var_name='month',
value_name='sales'
)
# Clean up month column
normalized['month'] = normalized['month'].str.replace('_sales', '').str.title()
print(normalized)
Performance Tips and Common Pitfalls
Memory matters. Pivoting can explode memory usage. If you have 1 million rows with 1000 unique column values, your pivoted DataFrame has 1 billion potential cells. Use pivot_table() with aggregation to reduce dimensionality first.
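A cheap sanity check before pivoting is to multiply the cardinalities involved; this sketch uses hypothetical randomly generated data:

```python
import numpy as np
import pandas as pd

# Hypothetical large long-format frame
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'user_id': rng.integers(0, 100_000, size=1_000_000),
    'item_id': rng.integers(0, 1_000, size=1_000_000),
    'value': rng.random(1_000_000),
})

# Estimate the pivoted size before committing to it:
# roughly n_rows * n_cols cells, at 8 bytes each for float64
n_rows = df['user_id'].nunique()
n_cols = df['item_id'].nunique()
print(f"pivoted shape ~ {n_rows} x {n_cols} = {n_rows * n_cols:,} cells")
```

Here the pivoted result would hold on the order of 100 million float64 cells, roughly 800 MB, so aggregating or filtering first is usually the better move.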
Reset your index. After pivoting, the index is often the former index column. Many operations expect a default integer index. Get in the habit of calling reset_index():
pivoted = df.pivot(index='date', columns='product', values='sales')
pivoted = pivoted.reset_index() # date becomes a regular column
pivoted.columns.name = None # removes the "product" label from columns
Consider stack/unstack. For MultiIndex DataFrames, stack() and unstack() offer more control:
# unstack moves an index level to columns (like pivot)
# stack moves columns to an index level (like melt)
df.set_index(['date', 'product'])['value'].unstack()
These are particularly useful when you’re already working with hierarchical indices.
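For instance, here is a round trip through unstack and stack using the same kind of sales data as above:

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'product': ['Widget', 'Gadget', 'Widget', 'Gadget'],
    'value': [100, 150, 120, 140],
})

# Series with a (date, product) MultiIndex
indexed = df.set_index(['date', 'product'])['value']

wide = indexed.unstack()  # product level -> columns (like pivot)
back = wide.stack()       # columns -> index level (like melt)

# The round trip preserves every value
print((indexed.sort_index() == back.sort_index()).all())  # True
```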
Watch for silent type changes. Pivoting can introduce NaN values, which forces integer columns to float. If you need integers, use fillna(0).astype(int) after pivoting.
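A minimal demonstration of the float coercion and the fix:

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02'],
    'product': ['Widget', 'Gadget', 'Widget'],
    'units': [10, 5, 12],  # clean integers going in
})

# 2024-01-02 has no Gadget row, so pivoting introduces a NaN,
# which silently coerces both columns to float64
wide = df.pivot(index='date', columns='product', values='units')
print(wide.dtypes)

# Fill the gap and restore integer dtype (treating missing as 0)
fixed = wide.fillna(0).astype(int)
print(fixed.dtypes)
```

If treating missing values as 0 is wrong for your data, pandas' nullable Int64 dtype is an alternative that keeps the gaps as proper missing values.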
The reshape operations in pandas are powerful but unforgiving. Master the basics with small test DataFrames before applying them to production data. Print intermediate results. Verify row counts. Your future self will thank you.