Pandas - Stack and Unstack
Key Insights
• Stack converts column labels into row index levels (wide to long), while unstack does the reverse (long to wide), making them essential for reshaping hierarchical data structures
• These operations preserve all data relationships through MultiIndex structures, enabling seamless transitions between different analytical perspectives of the same dataset
• Understanding level parameters and fill_value options prevents common pitfalls like lost data or unexpected NaN propagation during reshaping operations
Understanding Stack and Unstack Fundamentals
Stack and unstack are complementary reshaping operations in pandas that manipulate the structure of DataFrames with hierarchical indices. Stack “compresses” a DataFrame by rotating column labels into the innermost row index level. Unstack performs the inverse operation, pivoting row index levels into columns.
import pandas as pd
import numpy as np
# Create a DataFrame with MultiIndex columns
data = {
    ('Sales', 'Q1'): [100, 150, 200],
    ('Sales', 'Q2'): [110, 160, 210],
    ('Profit', 'Q1'): [20, 30, 40],
    ('Profit', 'Q2'): [25, 35, 45]
}
df = pd.DataFrame(data, index=['Store_A', 'Store_B', 'Store_C'])
print(df)
Output:
Sales Profit
Q1 Q2 Q1 Q2
Store_A 100 110 20 25
Store_B 150 160 30 35
Store_C 200 210 40 45
Applying stack:
stacked = df.stack()
print(stacked)
Output:
Sales Profit
Store_A Q1 100 20
Q2 110 25
Store_B Q1 150 30
Q2 160 35
Store_C Q1 200 40
Q2 210 45
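Because the two operations are inverses, a stack followed by an unstack recovers the original frame. A minimal sketch of the round trip, rebuilding the same example data:

```python
import pandas as pd

# Rebuild the example frame from above
data = {
    ('Sales', 'Q1'): [100, 150, 200],
    ('Sales', 'Q2'): [110, 160, 210],
    ('Profit', 'Q1'): [20, 30, 40],
    ('Profit', 'Q2'): [25, 35, 45]
}
df = pd.DataFrame(data, index=['Store_A', 'Store_B', 'Store_C'])

# stack() then unstack() restores the original wide layout;
# column order can differ after the round trip, so align before comparing
round_trip = df.stack().unstack()
print(round_trip[df.columns].equals(df))
```

No values are lost in the round trip; only the ordering of column levels may need realigning.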
Working with Level Parameters
The level parameter controls which index level gets stacked or unstacked. This becomes critical when working with DataFrames containing multiple index levels.
# Create a DataFrame with 3-level MultiIndex
arrays = [
    ['North', 'North', 'South', 'South'],
    ['Store_1', 'Store_2', 'Store_1', 'Store_2'],
    ['2023', '2023', '2023', '2023']
]
index = pd.MultiIndex.from_arrays(arrays, names=['Region', 'Store', 'Year'])
df_multi = pd.DataFrame({
    'Revenue': [1000, 1200, 800, 900],
    'Costs': [600, 700, 500, 550]
}, index=index)
print(df_multi)
Output:
Revenue Costs
Region Store Year
North Store_1 2023 1000 600
Store_2 2023 1200 700
South Store_1 2023 800 500
Store_2 2023 900 550
Unstacking different levels:
# Unstack the innermost level (Year)
unstacked_year = df_multi.unstack(level='Year')
print(unstacked_year)
# Unstack the Store level
unstacked_store = df_multi.unstack(level='Store')
print(unstacked_store)
The level parameter accepts integers (0-based position), level names, or lists for multiple levels. Negative integers count from the innermost level (-1 is the last level).
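As a quick check (a sketch, not exhaustive), unstacking by negative position and by level name should give identical results on the frame above:

```python
import pandas as pd

# Rebuild the 3-level MultiIndex frame from above
arrays = [
    ['North', 'North', 'South', 'South'],
    ['Store_1', 'Store_2', 'Store_1', 'Store_2'],
    ['2023', '2023', '2023', '2023']
]
index = pd.MultiIndex.from_arrays(arrays, names=['Region', 'Store', 'Year'])
df_multi = pd.DataFrame({
    'Revenue': [1000, 1200, 800, 900],
    'Costs': [600, 700, 500, 550]
}, index=index)

# level=-1 addresses the innermost level, which is named 'Year' here
by_position = df_multi.unstack(level=-1)
by_name = df_multi.unstack(level='Year')
print(by_position.equals(by_name))
```

Using level names rather than positions is usually safer: the code keeps working even if the index gains or loses levels.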
# Unstack multiple levels
unstacked_multi = df_multi.unstack(level=['Store', 'Year'])
print(unstacked_multi)
Handling Missing Data with fill_value
When unstacking creates new column combinations that don’t exist in the original data, pandas inserts NaN values. The fill_value parameter provides control over this behavior.
# Create sparse data
data = {
    'Product': ['Widget', 'Widget', 'Gadget', 'Gadget'],
    'Quarter': ['Q1', 'Q3', 'Q2', 'Q4'],
    'Sales': [100, 150, 200, 175]
}
df_sparse = pd.DataFrame(data).set_index(['Product', 'Quarter'])
print(df_sparse)
# Unstack without fill_value
unstacked_nan = df_sparse.unstack()
print(unstacked_nan)
# Unstack with fill_value
unstacked_filled = df_sparse.unstack(fill_value=0)
print(unstacked_filled)
Output with fill_value=0:
Sales
Quarter Q1 Q2 Q3 Q4
Product
Gadget 0 200 0 175
Widget 100 0 150 0
This is particularly useful for time series data where missing periods should be treated as zero rather than unknown values.
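A side effect worth noting: the introduced NaNs force integer columns to float, while fill_value=0 preserves the integer dtype. A quick sketch using the sparse frame above:

```python
import pandas as pd

# Rebuild the sparse frame from above
data = {
    'Product': ['Widget', 'Widget', 'Gadget', 'Gadget'],
    'Quarter': ['Q1', 'Q3', 'Q2', 'Q4'],
    'Sales': [100, 150, 200, 175]
}
df_sparse = pd.DataFrame(data).set_index(['Product', 'Quarter'])

# Without fill_value: 4 of the 8 cells are NaN, and the columns become float
unstacked_nan = df_sparse.unstack()
# With fill_value=0: the same cells hold 0 and the integer dtype survives
unstacked_filled = df_sparse.unstack(fill_value=0)

print(unstacked_nan.dtypes.unique())
print(unstacked_filled.dtypes.unique())
```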
Practical Application: Sales Data Transformation
Real-world datasets often require reshaping for different analytical perspectives. Here’s a complete workflow transforming sales data between formats.
# Create realistic sales data
sales_data = {
    # 'ME' is the month-end alias in pandas >= 2.2; use 'M' on older versions
    'Date': pd.date_range('2023-01-01', periods=12, freq='ME'),
    'Region': ['East', 'West', 'East', 'West'] * 3,
    'Product': ['A', 'A', 'B', 'B'] * 3,
    'Revenue': np.random.randint(1000, 5000, 12),
    'Units': np.random.randint(50, 200, 12)
}
sales_df = pd.DataFrame(sales_data)
# Reshape for regional comparison
regional_view = sales_df.pivot_table(
    values='Revenue',
    index='Date',
    columns=['Region', 'Product'],
    aggfunc='sum'
)
print(regional_view.head())
# Stack to get long format for modeling
long_format = regional_view.stack(level=['Region', 'Product'])
print(long_format.head(10))
# Unstack to create Product-focused view
product_view = long_format.unstack(level='Product')
print(product_view.head())
Combining Stack/Unstack with GroupBy Operations
Stack and unstack integrate seamlessly with groupby operations for complex aggregations.
# Multi-dimensional aggregation
data = {
    'Department': ['Sales', 'Sales', 'IT', 'IT', 'Sales', 'IT'],
    'Employee': ['John', 'Jane', 'Bob', 'Alice', 'John', 'Bob'],
    'Metric': ['Hours', 'Hours', 'Hours', 'Hours', 'Tasks', 'Tasks'],
    'Value': [40, 38, 42, 40, 25, 30]
}
df_emp = pd.DataFrame(data)
# Pivot and stack for analysis
pivoted = df_emp.pivot_table(
    values='Value',
    index=['Department', 'Employee'],
    columns='Metric',
    aggfunc='sum'
)
print(pivoted)
# Stack to calculate per-employee metrics
stacked_metrics = pivoted.stack()
print(stacked_metrics)
# Unstack Department for cross-department comparison
dept_comparison = stacked_metrics.unstack(level='Department')
print(dept_comparison)
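The same wide table can also be produced without pivot_table: groupby followed by unstack is an equivalent, and sometimes clearer, spelling. A sketch using the employee data above:

```python
import pandas as pd

# Rebuild the employee data from above
df_emp = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'IT', 'IT', 'Sales', 'IT'],
    'Employee': ['John', 'Jane', 'Bob', 'Alice', 'John', 'Bob'],
    'Metric': ['Hours', 'Hours', 'Hours', 'Hours', 'Tasks', 'Tasks'],
    'Value': [40, 38, 42, 40, 25, 30]
})

# groupby + unstack: aggregate over all keys, then pivot 'Metric' to columns
via_groupby = (df_emp.groupby(['Department', 'Employee', 'Metric'])['Value']
               .sum()
               .unstack(level='Metric'))

# Same result as pivot_table with aggfunc='sum'
via_pivot = df_emp.pivot_table(values='Value',
                               index=['Department', 'Employee'],
                               columns='Metric', aggfunc='sum')
print(via_groupby.equals(via_pivot))
```

Which spelling to prefer is largely taste; groupby makes the aggregation step explicit, pivot_table is more compact.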
Performance Considerations and Best Practices
Stack and unstack operations create new DataFrames, so memory usage matters with large datasets.
import time
# Create large DataFrame
large_df = pd.DataFrame(
    np.random.randn(10000, 50),
    columns=[f'col_{i}' for i in range(50)]
)
large_df['category'] = np.random.choice(['A', 'B', 'C'], 10000)
large_df = large_df.set_index('category', append=True)
# Time the operation
start = time.time()
stacked_large = large_df.stack()
stack_time = time.time() - start
start = time.time()
unstacked_large = stacked_large.unstack()
unstack_time = time.time() - start
print(f"Stack time: {stack_time:.4f}s")
print(f"Unstack time: {unstack_time:.4f}s")
print(f"Original shape: {large_df.shape}")
print(f"Stacked shape: {stacked_large.shape}")
Key optimization strategies:
# dropna=True (the default) drops rows that are entirely NaN after stacking
stacked_clean = df.stack(dropna=True)
# Specify dtypes to reduce memory
df_typed = df.astype({col: 'int32' for col in df.select_dtypes(include=['int64']).columns})
# Use categorical data for repeated values
df['category'] = df['category'].astype('category')
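The categorical conversion pays off whenever a column holds few distinct values. A rough memory comparison (exact byte counts vary by platform and pandas version):

```python
import numpy as np
import pandas as pd

# A repetitive string column, like the 'category' column above
s = pd.Series(np.random.choice(['A', 'B', 'C'], 10000))

# deep=True counts the actual string storage, not just pointers
object_bytes = s.memory_usage(deep=True)
category_bytes = s.astype('category').memory_usage(deep=True)
print(object_bytes, category_bytes)
```

With only three distinct values, the categorical version stores each row as a small integer code plus one copy of each label, so it is dramatically smaller.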
Handling Edge Cases
Stack and unstack behavior with duplicate indices requires attention.
# DataFrame with duplicate indices
df_dup = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
}, index=['x', 'x', 'y'])
# Stacking with duplicates
stacked_dup = df_dup.stack()
print(stacked_dup)
# This creates a Series with non-unique index
print(f"Index is unique: {stacked_dup.index.is_unique}")
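A related trap: that non-unique index cannot be unstacked, because each (row, column) target cell would be ambiguous. Recent pandas versions raise a ValueError here (a sketch; the exact message varies by version):

```python
import pandas as pd

# Rebuild the duplicate-index example from above
df_dup = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
}, index=['x', 'x', 'y'])
stacked_dup = df_dup.stack()

# ('x', 'A') appears twice, so unstack cannot place both values in one cell
try:
    stacked_dup.unstack()
    raised = False
except ValueError:
    raised = True
print(raised)
```

Deduplicating the index (for example with reset_index or an aggregation) before unstacking avoids the error.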
For time series with irregular intervals:
# Handle irregular time series
ts_data = pd.DataFrame({
    'value': [10, 20, 30],
    'category': ['A', 'B', 'A']
}, index=pd.DatetimeIndex(['2023-01-01', '2023-01-15', '2023-02-01']))
ts_pivot = ts_data.pivot(columns='category', values='value')
print(ts_pivot)
# Reindex to fill gaps
full_range = pd.date_range('2023-01-01', '2023-02-28', freq='D')
ts_complete = ts_pivot.reindex(full_range)
print(ts_complete.head(10))
These operations form the foundation of advanced reshaping workflows, letting you move between analytical formats without losing data or breaking index relationships.