Pandas: Reshaping Data with Pivot and Melt


Key Insights

  • Pivot transforms long data to wide format (rows become columns), while melt does the opposite—mastering both gives you complete control over DataFrame structure
  • Use pivot_table() instead of pivot() when your data has duplicate entries that need aggregation; it’s more robust for real-world messy data
  • Reshape operations are not always perfectly reversible—understanding when data loss occurs prevents frustrating debugging sessions

Introduction to Data Reshaping

Data rarely arrives in the format you need. Your visualization library wants wide format, your machine learning model expects long format, and your database export looks nothing like either. Reshaping data is one of those unglamorous skills that separates productive data work from hours of frustration.

The core distinction is simple: wide format stores related measurements across multiple columns (one column per time period, category, or variable), while long format stacks them into fewer columns with an additional identifier column.

Consider survey responses. Wide format gives you Q1, Q2, Q3, Q4 as separate columns—easy to read, hard to aggregate. Long format gives you a single quarter column with values 1-4 and a response column—verbose to read, trivial to filter and group.

Neither format is inherently better. The right format depends on what you’re doing next.

Understanding Pivot: Long to Wide

The pivot() function transforms long data into wide format by turning unique values from one column into new column headers.

import pandas as pd

# Sales data in long format
sales = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'product': ['Widget', 'Gadget', 'Widget', 'Gadget'],
    'revenue': [100, 150, 120, 140]
})

print("Long format:")
print(sales)
#          date product  revenue
# 0  2024-01-01  Widget      100
# 1  2024-01-01  Gadget      150
# 2  2024-01-02  Widget      120
# 3  2024-01-02  Gadget      140

# Pivot: dates as rows, products as columns
wide_sales = sales.pivot(index='date', columns='product', values='revenue')

print("\nWide format:")
print(wide_sales)
# product     Gadget  Widget
# date                      
# 2024-01-01     150     100
# 2024-01-02     140     120

The three parameters are straightforward: index becomes row labels, columns values become column headers, and values fills the cells.

Here’s where beginners hit a wall: pivot() fails with duplicate entries.

# Data with duplicates (two Widget sales on same date)
sales_dupes = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-01'],
    'product': ['Widget', 'Widget', 'Gadget'],
    'revenue': [100, 50, 150]
})

# This raises ValueError: Index contains duplicate entries
# sales_dupes.pivot(index='date', columns='product', values='revenue')

When you have duplicate index-column combinations, pivot() doesn’t know which value to use. You need pivot_table().
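Before reaching for pivot_table(), it can help to confirm that duplicates are really the problem. A minimal sketch, assuming you want duplicate sales summed, is to check with duplicated() and aggregate with groupby() before pivoting:

```python
import pandas as pd

sales_dupes = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-01'],
    'product': ['Widget', 'Widget', 'Gadget'],
    'revenue': [100, 50, 150]
})

# True if any (date, product) pair appears more than once
has_dupes = sales_dupes.duplicated(subset=['date', 'product']).any()
print(has_dupes)  # True

# Aggregate the duplicates explicitly, then pivot safely
deduped = sales_dupes.groupby(['date', 'product'], as_index=False)['revenue'].sum()
wide = deduped.pivot(index='date', columns='product', values='revenue')
print(wide)
```

This groupby-then-pivot pattern produces the same numbers as pivot_table() with aggfunc='sum'; it just makes the aggregation step explicit.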

Pivot Tables for Aggregation

pivot_table() handles duplicates by aggregating them. It’s the more robust choice for real-world data.

# Detailed sales with multiple transactions per day/product
detailed_sales = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'region': ['North', 'North', 'South', 'North', 'South'],
    'product': ['Widget', 'Widget', 'Gadget', 'Gadget', 'Widget'],
    'revenue': [100, 50, 200, 175, 125]
})

# Sum revenue by region and product
summary = pd.pivot_table(
    detailed_sales,
    values='revenue',
    index='region',
    columns='product',
    aggfunc='sum',
    fill_value=0,
    margins=True,
    margins_name='Total'
)

print(summary)
# product  Gadget  Widget  Total
# region                        
# North       175     150    325
# South       200     125    325
# Total       375     275    650

The margins=True parameter adds row and column totals, which is extremely useful for reports. The fill_value=0 parameter fills in zeros for any region-product combination with no data at all; every pair happens to occur in this dataset, but sparser real-world data hits this constantly.
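To see what fill_value actually does, here is a small sketch with hypothetical data in which one combination is genuinely absent:

```python
import pandas as pd

# Hypothetical data where South never sells Widgets
sparse = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'product': ['Widget', 'Gadget', 'Gadget'],
    'revenue': [100, 175, 200]
})

# Without fill_value the missing South/Widget cell is NaN
with_nan = pd.pivot_table(sparse, values='revenue', index='region',
                          columns='product', aggfunc='sum')

# With fill_value=0 it becomes an explicit zero
filled = pd.pivot_table(sparse, values='revenue', index='region',
                        columns='product', aggfunc='sum', fill_value=0)
print(filled)
```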

You can also create multi-index pivots by passing lists:

multi_pivot = pd.pivot_table(
    detailed_sales,
    values='revenue',
    index=['region', 'date'],
    columns='product',
    aggfunc=['sum', 'count']
)
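Passing a list of aggregation functions gives the result two-level column headers, which can be awkward downstream. One common cleanup, sketched here with a join convention of my own choosing, is to flatten them into plain names:

```python
import pandas as pd

detailed_sales = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'region': ['North', 'North', 'South', 'North', 'South'],
    'product': ['Widget', 'Widget', 'Gadget', 'Gadget', 'Widget'],
    'revenue': [100, 50, 200, 175, 125]
})

multi_pivot = pd.pivot_table(
    detailed_sales, values='revenue',
    index=['region', 'date'], columns='product',
    aggfunc=['sum', 'count']
)

# Columns are a MultiIndex of (aggfunc, product) tuples, e.g. ('sum', 'Widget');
# join the levels into flat single-level names
multi_pivot.columns = ['_'.join(col) for col in multi_pivot.columns]
print(multi_pivot.columns.tolist())
```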

Understanding Melt: Wide to Long

melt() is pivot’s inverse. It unpivots wide data by collapsing multiple columns into key-value pairs.

# Survey data in wide format
survey = pd.DataFrame({
    'respondent_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Carol'],
    'Q1': [4, 5, 3],
    'Q2': [3, 4, 4],
    'Q3': [5, 3, 5],
    'Q4': [4, 4, 4]
})

print("Wide format:")
print(survey)
#    respondent_id   name  Q1  Q2  Q3  Q4
# 0              1  Alice   4   3   5   4
# 1              2    Bob   5   4   3   4
# 2              3  Carol   3   4   5   4

# Melt: Q1-Q4 columns become rows
long_survey = survey.melt(
    id_vars=['respondent_id', 'name'],
    value_vars=['Q1', 'Q2', 'Q3', 'Q4'],
    var_name='quarter',
    value_name='score'
)

print("\nLong format:")
print(long_survey)
#     respondent_id   name quarter  score
# 0               1  Alice      Q1      4
# 1               2    Bob      Q1      5
# 2               3  Carol      Q1      3
# 3               1  Alice      Q2      3
# ...

The parameters: id_vars are columns to keep as identifiers, value_vars are columns to unpivot, var_name names the new column holding former column headers, and value_name names the column holding the values.

If you omit value_vars, melt unpivots all columns not in id_vars. This is convenient but dangerous—be explicit when your DataFrame has columns you want to preserve.
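When the columns to unpivot share a common stub, as Q1 through Q4 do here, pd.wide_to_long() is an alternative that parses the suffix out of the column names for you:

```python
import pandas as pd

survey = pd.DataFrame({
    'respondent_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Carol'],
    'Q1': [4, 5, 3],
    'Q2': [3, 4, 4],
    'Q3': [5, 3, 5],
    'Q4': [4, 4, 4]
})

# stubnames='Q' matches Q1..Q4; the numeric suffix lands in the j column
long_survey = pd.wide_to_long(
    survey, stubnames='Q', i=['respondent_id', 'name'], j='quarter'
).reset_index().rename(columns={'Q': 'score'})

print(long_survey.head())
```

The quarter column receives the suffix with the Q already stripped, so no string cleanup is needed afterward.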

Pivot and Melt as Inverses

In theory, pivot and melt should be perfect inverses. In practice, you need to handle index resets and column ordering.

# Start with long format
original = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'metric': ['views', 'clicks', 'views', 'clicks'],
    'value': [1000, 50, 1200, 65]
})

# Pivot to wide
wide = original.pivot(index='date', columns='metric', values='value')

# Melt back to long
recovered = wide.reset_index().melt(
    id_vars='date',
    var_name='metric',
    value_name='value'
)

# Sort for comparison
original_sorted = original.sort_values(['date', 'metric']).reset_index(drop=True)
recovered_sorted = recovered.sort_values(['date', 'metric']).reset_index(drop=True)

print(original_sorted.equals(recovered_sorted))  # True

Data loss occurs when pivoting aggregates values. If you sum duplicate entries during pivot, you can’t recover the original individual values with melt.
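A short round trip makes the loss concrete:

```python
import pandas as pd

dupes = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01'],
    'product': ['Widget', 'Widget'],
    'revenue': [100, 50]
})

# pivot_table collapses the two transactions into one summed cell
wide = pd.pivot_table(dupes, index='date', columns='product',
                      values='revenue', aggfunc='sum')

# Melting back yields a single row of 150; the 100/50 split is unrecoverable
back = wide.reset_index().melt(id_vars='date', var_name='product',
                               value_name='revenue')

print(len(dupes), len(back))  # 2 1
```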

Real-World Reshaping Patterns

Preparing Time Series for Visualization

Seaborn’s heatmap wants a rectangular grid of numeric values, which is exactly what a pivoted DataFrame provides:

# Weather station data in long format
weather = pd.DataFrame({
    'station': ['Downtown', 'Downtown', 'Downtown', 'Airport', 'Airport', 'Airport'],
    'month': ['Jan', 'Feb', 'Mar', 'Jan', 'Feb', 'Mar'],
    'avg_temp': [32, 35, 45, 30, 33, 43]
})

# Reshape for heatmap
heatmap_data = weather.pivot(
    index='station',
    columns='month',
    values='avg_temp'
)

# Reorder columns chronologically
heatmap_data = heatmap_data[['Jan', 'Feb', 'Mar']]

print(heatmap_data)
# month       Jan  Feb  Mar
# station                   
# Airport      30   33   43
# Downtown     32   35   45

# Now ready for: sns.heatmap(heatmap_data, annot=True)
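Listing columns by hand works for three months; with a full year it gets tedious. A sketch that sorts whatever months are present against a calendar order (month_order is my own helper list, not a pandas feature):

```python
import pandas as pd

weather = pd.DataFrame({
    'station': ['Downtown', 'Downtown', 'Downtown', 'Airport', 'Airport', 'Airport'],
    'month': ['Jan', 'Feb', 'Mar', 'Jan', 'Feb', 'Mar'],
    'avg_temp': [32, 35, 45, 30, 33, 43]
})

heatmap_data = weather.pivot(index='station', columns='month', values='avg_temp')

# Sort whichever months appear into calendar order
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
heatmap_data = heatmap_data[sorted(heatmap_data.columns, key=month_order.index)]

print(heatmap_data.columns.tolist())  # ['Jan', 'Feb', 'Mar']
```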

Normalizing Denormalized Exports

Spreadsheet-style exports often arrive denormalized, with one column per time period. Melt restores a tidy relational shape:

# Denormalized export with one sales column per month
export = pd.DataFrame({
    'customer_id': [101, 102],
    'customer_name': ['Acme Corp', 'Globex'],
    'jan_sales': [5000, 7500],
    'feb_sales': [5500, 7200],
    'mar_sales': [6000, 8000]
})

# Normalize to proper relational format
normalized = export.melt(
    id_vars=['customer_id', 'customer_name'],
    value_vars=['jan_sales', 'feb_sales', 'mar_sales'],
    var_name='month',
    value_name='sales'
)

# Clean up month column
normalized['month'] = normalized['month'].str.replace('_sales', '').str.title()

print(normalized)

Performance Tips and Common Pitfalls

Memory matters. Pivoting can explode memory usage. With 1 million unique index values and 1,000 unique column values, the pivoted DataFrame holds 1 billion cells, most of them NaN. Use pivot_table() with aggregation to reduce dimensionality first.

Reset your index. After pivoting, the column you passed as index becomes the DataFrame’s index, not a regular column. Many operations expect a default integer index. Get in the habit of calling reset_index():

pivoted = df.pivot(index='date', columns='product', values='sales')
pivoted = pivoted.reset_index()  # date becomes a regular column
pivoted.columns.name = None       # removes the "product" label from columns

Consider stack/unstack. For MultiIndex DataFrames, stack() and unstack() offer more control:

# unstack moves an index level to columns (like pivot)
# stack moves columns into an index level (like melt)
wide = sales.set_index(['date', 'product'])['revenue'].unstack()
long_again = wide.stack()

These are particularly useful when you’re already working with hierarchical indices.

Watch for silent type changes. Pivoting can introduce NaN values, which forces integer columns to float. If you need integers, use fillna(0).astype(int) after pivoting.
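A quick sketch of the upcast and two ways to deal with it; the nullable 'Int64' dtype is the one to reach for when zero-filling would be wrong:

```python
import pandas as pd

sparse = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02'],
    'product': ['Widget', 'Gadget', 'Widget'],
    'units': [10, 5, 7]
})

wide = sparse.pivot(index='date', columns='product', values='units')
print(wide.dtypes)  # NaN in the missing Gadget cell forces float64

# Option 1: treat missing as zero, then restore plain integers
as_int = wide.fillna(0).astype(int)

# Option 2: keep the gap, using pandas' nullable integer dtype
as_nullable = wide.astype('Int64')
```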

The reshape operations in pandas are powerful but unforgiving. Master the basics with small test DataFrames before applying them to production data. Print intermediate results. Verify row counts. Your future self will thank you.
