How to Pivot a DataFrame in Pandas
Pivoting transforms data from a 'long' format (many rows, few columns) to a 'wide' format (fewer rows, more columns). If you've ever received transactional data where each row represents a single...
Key Insights
- Use
pivot()for simple reshaping when your index/column combinations are unique; usepivot_table()when you need to aggregate duplicate entries - The
pivot_table()function is essentially a combination ofgroupby()andpivot(), making it the more versatile choice for real-world data - Always pair your pivoting knowledge with
melt()to reverse the operation—data pipelines often require converting between long and wide formats multiple times
Introduction
Pivoting transforms data from a “long” format (many rows, few columns) to a “wide” format (fewer rows, more columns). If you’ve ever received transactional data where each row represents a single observation and needed to create a summary table with categories as columns, you’ve needed a pivot operation.
This transformation is fundamental to data analysis. Reporting tools expect wide format. Visualization libraries often work better with pivoted data. And sometimes you simply need to compare values side-by-side rather than scanning through hundreds of rows.
Pandas provides two primary methods for pivoting: pivot() for straightforward reshaping and pivot_table() for reshaping with aggregation. Understanding when to use each will save you hours of debugging and produce cleaner analysis code.
Understanding the Source Data
Before pivoting, you need data in long format. This is the typical structure you get from databases, CSV exports, and APIs—one observation per row with categorical columns that could become your new column headers.
Let’s create a realistic dataset representing quarterly sales data:
import pandas as pd
import numpy as np
# Create sample sales data in long format
data = {
'date': ['2024-Q1', '2024-Q1', '2024-Q1', '2024-Q1',
'2024-Q2', '2024-Q2', '2024-Q2', '2024-Q2',
'2024-Q3', '2024-Q3', '2024-Q3', '2024-Q3'],
'product': ['Widget', 'Widget', 'Gadget', 'Gadget',
'Widget', 'Widget', 'Gadget', 'Gadget',
'Widget', 'Widget', 'Gadget', 'Gadget'],
'region': ['North', 'South', 'North', 'South',
'North', 'South', 'North', 'South',
'North', 'South', 'North', 'South'],
'revenue': [15000, 12000, 8000, 9500,
18000, 14000, 11000, 10000,
21000, 16000, 13000, 12500],
'units_sold': [150, 120, 40, 47,
180, 140, 55, 50,
210, 160, 65, 62]
}
df = pd.DataFrame(data)
print(df)
date product region revenue units_sold
0 2024-Q1 Widget North 15000 150
1 2024-Q1 Widget South 12000 120
2 2024-Q1 Gadget North 8000 40
3 2024-Q1 Gadget South 9500 47
4 2024-Q2 Widget North 18000 180
5 2024-Q2 Widget South 14000 140
6 2024-Q2 Gadget North 11000 55
7 2024-Q2 Gadget South 10000 50
8 2024-Q3 Widget North 21000 210
9 2024-Q3 Widget South 16000 160
10 2024-Q3 Gadget North 13000 65
11 2024-Q3 Gadget South 12500 62
This long format works well for storage but makes comparison difficult. Try quickly answering: “How did Widget revenue in the North change over time?” You’d need to scan multiple rows. A pivoted table makes this trivial.
Basic Pivoting with pivot()
The pivot() method performs a pure reshape operation. It takes three arguments: index (what becomes row labels), columns (what becomes column headers), and values (what fills the cells).
# Pivot to show revenue by date and product
pivoted = df.pivot(index='date', columns='product', values='revenue')
print(pivoted)
Wait—this will fail. Our data has duplicate combinations of date and product (North and South regions for each). The pivot() method cannot handle this:
# This raises ValueError: Index contains duplicate entries, cannot reshape
try:
pivoted = df.pivot(index='date', columns='product', values='revenue')
except ValueError as e:
print(f"Error: {e}")
For pivot() to work, each index-column combination must be unique. Let’s create a subset where this holds:
# Filter to North region only—now combinations are unique
df_north = df[df['region'] == 'North']
pivoted_north = df_north.pivot(index='date', columns='product', values='revenue')
print(pivoted_north)
product Gadget Widget
date
2024-Q1 8000 15000
2024-Q2 11000 18000
2024-Q3 13000 21000
Now you can instantly see the revenue trend for each product. The pivot() method is fast and simple, but its uniqueness requirement limits its usefulness with real-world data that typically contains duplicates.
Aggregating with pivot_table()
The pivot_table() function solves the duplicate problem by aggregating values. When multiple rows match the same index-column combination, it applies an aggregation function to combine them.
# Pivot with sum aggregation—combines North and South revenue
pivot_sum = df.pivot_table(
index='date',
columns='product',
values='revenue',
aggfunc='sum'
)
print(pivot_sum)
product Gadget Widget
date
2024-Q1 17500 27000
2024-Q2 21000 32000
2024-Q3 25500 37000
The revenue for each product now represents the total across all regions. You can use any aggregation function—'mean', 'count', 'min', 'max', 'std', or even custom functions:
# Average revenue per product per quarter
pivot_mean = df.pivot_table(
index='date',
columns='product',
values='revenue',
aggfunc='mean'
)
print(pivot_mean)
product Gadget Widget
date
2024-Q1 8750.0 13500.0
2024-Q2 10500.0 16000.0
2024-Q3 12750.0 18500.0
Need multiple aggregations? Pass a list:
# Multiple aggregation functions
pivot_multi = df.pivot_table(
index='date',
columns='product',
values='revenue',
aggfunc=['sum', 'mean', 'count']
)
print(pivot_multi)
sum mean count
product Gadget Widget Gadget Widget Gadget Widget
date
2024-Q1 17500 27000 8750.0 13500.0 2 2
2024-Q2 21000 32000 10500.0 16000.0 2 2
2024-Q3 25500 37000 12750.0 18500.0 2 2
This creates a hierarchical column index, which you can flatten if needed:
# Flatten multi-level columns
pivot_multi.columns = ['_'.join(col).strip() for col in pivot_multi.columns.values]
print(pivot_multi.columns.tolist())
# ['sum_Gadget', 'sum_Widget', 'mean_Gadget', 'mean_Widget', 'count_Gadget', 'count_Widget']
Advanced Options
Real-world pivot tables often need totals, handling for missing data, and multiple value columns. The pivot_table() function handles all of these.
Adding Margins for Totals
The margins parameter adds row and column totals:
# Add totals row and column
pivot_with_totals = df.pivot_table(
index='date',
columns='product',
values='revenue',
aggfunc='sum',
margins=True,
margins_name='Total'
)
print(pivot_with_totals)
product Gadget Widget Total
date
2024-Q1 17500 27000 44500
2024-Q2 21000 32000 53000
2024-Q3 25500 37000 62500
Total 64000 96000 160000
Handling Missing Values
When certain index-column combinations don’t exist in your data, the result contains NaN. Use fill_value to replace these:
# Create data with a missing combination
df_sparse = df[~((df['date'] == '2024-Q2') & (df['product'] == 'Gadget'))]
pivot_sparse = df_sparse.pivot_table(
index='date',
columns='product',
values='revenue',
aggfunc='sum',
fill_value=0 # Replace NaN with 0
)
print(pivot_sparse)
product Gadget Widget
date
2024-Q1 17500 27000
2024-Q2 0 32000
2024-Q3 25500 37000
Multiple Value Columns
You can pivot multiple value columns simultaneously:
# Pivot both revenue and units_sold
pivot_multiple_values = df.pivot_table(
index='date',
columns='product',
values=['revenue', 'units_sold'],
aggfunc='sum'
)
print(pivot_multiple_values)
revenue units_sold
product Gadget Widget Gadget Widget
date
2024-Q1 17500 27000 87 270
2024-Q2 21000 32000 105 320
2024-Q3 25500 37000 127 370
Reversing the Pivot with melt()
Data pipelines often require converting between formats. The melt() function reverses a pivot operation, converting wide data back to long format:
# Start with our pivoted data
pivoted = df.pivot_table(
index='date',
columns='product',
values='revenue',
aggfunc='sum'
).reset_index()
print("Wide format:")
print(pivoted)
# Melt back to long format
melted = pivoted.melt(
id_vars=['date'], # Columns to keep as-is
value_vars=['Gadget', 'Widget'], # Columns to unpivot
var_name='product', # Name for the new category column
value_name='total_revenue' # Name for the new value column
)
print("\nLong format:")
print(melted)
Wide format:
product date Gadget Widget
0 2024-Q1 17500 27000
1 2024-Q2 21000 32000
2 2024-Q3 25500 37000
Long format:
date product total_revenue
0 2024-Q1 Gadget 17500
1 2024-Q2 Gadget 21000
2 2024-Q3 Gadget 25500
3 2024-Q1 Widget 27000
4 2024-Q2 Widget 32000
5 2024-Q3 Widget 37000
Note that melting aggregated data doesn’t recover the original granularity—we now have total revenue per product, not the regional breakdown we started with.
Conclusion
Choose pivot() when your data has unique index-column combinations and you need a simple reshape without aggregation. It’s faster and the code is more explicit about your intent.
Choose pivot_table() for everything else. It handles duplicates through aggregation, supports margins for totals, fills missing values, and works with multiple value columns. The slight overhead is negligible for most datasets.
For related operations, explore stack() and unstack() which pivot between row and column indices in hierarchical DataFrames, and crosstab() for computing frequency tables. These tools, combined with melt() for the reverse operation, give you complete control over DataFrame shape transformations.