Pandas - Stack and Unstack
Key Insights
• Stack converts column labels into row index levels (wide to long), while unstack does the reverse (long to wide), making them essential for reshaping hierarchical data structures
• These operations preserve all data relationships through MultiIndex structures, enabling seamless transitions between different analytical perspectives of the same dataset
• Understanding level parameters and fill_value options prevents common pitfalls like lost data or unexpected NaN propagation during reshaping operations
Understanding Stack and Unstack Fundamentals
Stack and unstack are complementary reshaping operations in pandas that manipulate the structure of DataFrames with hierarchical indices. Stack “compresses” a DataFrame by rotating column labels into the innermost row index level. Unstack performs the inverse operation, pivoting row index levels into columns.
import pandas as pd
import numpy as np
# Create a DataFrame with MultiIndex columns
data = {
    ('Sales', 'Q1'): [100, 150, 200],
    ('Sales', 'Q2'): [110, 160, 210],
    ('Profit', 'Q1'): [20, 30, 40],
    ('Profit', 'Q2'): [25, 35, 45]
}
df = pd.DataFrame(data, index=['Store_A', 'Store_B', 'Store_C'])
print(df)
Output:
Sales Profit
Q1 Q2 Q1 Q2
Store_A 100 110 20 25
Store_B 150 160 30 35
Store_C 200 210 40 45
Applying stack:
stacked = df.stack()
print(stacked)
Output:
Sales Profit
Store_A Q1 100 20
Q2 110 25
Store_B Q1 150 30
Q2 160 35
Store_C Q1 200 40
Q2 210 45
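Because the two operations are inverses, a stack followed by an unstack recovers the original frame. A minimal sketch of the round trip, rebuilding the same example data:

```python
import pandas as pd

# Rebuild the example frame from above
data = {
    ('Sales', 'Q1'): [100, 150, 200],
    ('Sales', 'Q2'): [110, 160, 210],
    ('Profit', 'Q1'): [20, 30, 40],
    ('Profit', 'Q2'): [25, 35, 45]
}
df = pd.DataFrame(data, index=['Store_A', 'Store_B', 'Store_C'])

# stack() then unstack() restores the original wide layout;
# column order can differ after the round trip, so align before comparing
round_trip = df.stack().unstack()
print(round_trip[df.columns].equals(df))
```

No values are lost in the round trip; only the ordering of column levels may need realigning.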
Working with Level Parameters
The level parameter controls which index level gets stacked or unstacked. This becomes critical when working with DataFrames containing multiple index levels.
# Create a DataFrame with 3-level MultiIndex
arrays = [
    ['North', 'North', 'South', 'South'],
    ['Store_1', 'Store_2', 'Store_1', 'Store_2'],
    ['2023', '2023', '2023', '2023']
]
index = pd.MultiIndex.from_arrays(arrays, names=['Region', 'Store', 'Year'])
df_multi = pd.DataFrame({
    'Revenue': [1000, 1200, 800, 900],
    'Costs': [600, 700, 500, 550]
}, index=index)
print(df_multi)
Output:
Revenue Costs
Region Store Year
North Store_1 2023 1000 600
Store_2 2023 1200 700
South Store_1 2023 800 500
Store_2 2023 900 550
Unstacking different levels:
# Unstack the innermost level (Year)
unstacked_year = df_multi.unstack(level='Year')
print(unstacked_year)
# Unstack the Store level
unstacked_store = df_multi.unstack(level='Store')
print(unstacked_store)
The level parameter accepts integers (0-based position), level names, or lists for multiple levels. Negative integers count from the innermost level (-1 is the last level).
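As a quick check (a sketch, not exhaustive), unstacking by negative position and by level name should give identical results on the frame above:

```python
import pandas as pd

# Rebuild the 3-level MultiIndex frame from above
arrays = [
    ['North', 'North', 'South', 'South'],
    ['Store_1', 'Store_2', 'Store_1', 'Store_2'],
    ['2023', '2023', '2023', '2023']
]
index = pd.MultiIndex.from_arrays(arrays, names=['Region', 'Store', 'Year'])
df_multi = pd.DataFrame({
    'Revenue': [1000, 1200, 800, 900],
    'Costs': [600, 700, 500, 550]
}, index=index)

# level=-1 addresses the innermost level, which is named 'Year' here
by_position = df_multi.unstack(level=-1)
by_name = df_multi.unstack(level='Year')
print(by_position.equals(by_name))
```

Using level names rather than positions is usually safer: the code keeps working even if the index gains or loses levels.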
# Unstack multiple levels
unstacked_multi = df_multi.unstack(level=['Store', 'Year'])
print(unstacked_multi)
Handling Missing Data with fill_value
When unstacking creates new column combinations that don’t exist in the original data, pandas inserts NaN values. The fill_value parameter provides control over this behavior.
# Create sparse data
data = {
    'Product': ['Widget', 'Widget', 'Gadget', 'Gadget'],
    'Quarter': ['Q1', 'Q3', 'Q2', 'Q4'],
    'Sales': [100, 150, 200, 175]
}
df_sparse = pd.DataFrame(data).set_index(['Product', 'Quarter'])
print(df_sparse)
# Unstack without fill_value
unstacked_nan = df_sparse.unstack()
print(unstacked_nan)
# Unstack with fill_value
unstacked_filled = df_sparse.unstack(fill_value=0)
print(unstacked_filled)
Output with fill_value=0:
Sales
Quarter Q1 Q2 Q3 Q4
Product
Gadget 0 200 0 175
Widget 100 0 150 0
This is particularly useful for time series data where missing periods should be treated as zero rather than unknown values.
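A side effect worth noting: the introduced NaNs force integer columns to float, while fill_value=0 preserves the integer dtype. A quick sketch using the sparse frame above:

```python
import pandas as pd

# Rebuild the sparse frame from above
data = {
    'Product': ['Widget', 'Widget', 'Gadget', 'Gadget'],
    'Quarter': ['Q1', 'Q3', 'Q2', 'Q4'],
    'Sales': [100, 150, 200, 175]
}
df_sparse = pd.DataFrame(data).set_index(['Product', 'Quarter'])

# Without fill_value: 4 of the 8 cells are NaN, and the columns become float
unstacked_nan = df_sparse.unstack()
# With fill_value=0: the same cells hold 0 and the integer dtype survives
unstacked_filled = df_sparse.unstack(fill_value=0)

print(unstacked_nan.dtypes.unique())
print(unstacked_filled.dtypes.unique())
```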
Practical Application: Sales Data Transformation
Real-world datasets often require reshaping for different analytical perspectives. Here’s a complete workflow transforming sales data between formats.
# Create realistic sales data
sales_data = {
    # 'ME' is the month-end alias in pandas >= 2.2; use 'M' on older versions
    'Date': pd.date_range('2023-01-01', periods=12, freq='ME'),
    'Region': ['East', 'West', 'East', 'West'] * 3,
    'Product': ['A', 'A', 'B', 'B'] * 3,
    'Revenue': np.random.randint(1000, 5000, 12),
    'Units': np.random.randint(50, 200, 12)
}
sales_df = pd.DataFrame(sales_data)
# Reshape for regional comparison
regional_view = sales_df.pivot_table(
    values='Revenue',
    index='Date',
    columns=['Region', 'Product'],
    aggfunc='sum'
)
print(regional_view.head())
# Stack to get long format for modeling
long_format = regional_view.stack(level=['Region', 'Product'])
print(long_format.head(10))
# Unstack to create Product-focused view
product_view = long_format.unstack(level='Product')
print(product_view.head())
Combining Stack/Unstack with GroupBy Operations
Stack and unstack integrate seamlessly with groupby operations for complex aggregations.
# Multi-dimensional aggregation
data = {
    'Department': ['Sales', 'Sales', 'IT', 'IT', 'Sales', 'IT'],
    'Employee': ['John', 'Jane', 'Bob', 'Alice', 'John', 'Bob'],
    'Metric': ['Hours', 'Hours', 'Hours', 'Hours', 'Tasks', 'Tasks'],
    'Value': [40, 38, 42, 40, 25, 30]
}
df_emp = pd.DataFrame(data)
# Pivot and stack for analysis
pivoted = df_emp.pivot_table(
    values='Value',
    index=['Department', 'Employee'],
    columns='Metric',
    aggfunc='sum'
)
print(pivoted)
# Stack to calculate per-employee metrics
stacked_metrics = pivoted.stack()
print(stacked_metrics)
# Unstack Department for cross-department comparison
dept_comparison = stacked_metrics.unstack(level='Department')
print(dept_comparison)
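The same wide table can also be produced without pivot_table: groupby followed by unstack is an equivalent, and sometimes clearer, spelling. A sketch using the employee data above:

```python
import pandas as pd

# Rebuild the employee data from above
df_emp = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'IT', 'IT', 'Sales', 'IT'],
    'Employee': ['John', 'Jane', 'Bob', 'Alice', 'John', 'Bob'],
    'Metric': ['Hours', 'Hours', 'Hours', 'Hours', 'Tasks', 'Tasks'],
    'Value': [40, 38, 42, 40, 25, 30]
})

# groupby + unstack: aggregate over all keys, then pivot 'Metric' to columns
via_groupby = (df_emp.groupby(['Department', 'Employee', 'Metric'])['Value']
               .sum()
               .unstack(level='Metric'))

# Same result as pivot_table with aggfunc='sum'
via_pivot = df_emp.pivot_table(values='Value',
                               index=['Department', 'Employee'],
                               columns='Metric', aggfunc='sum')
print(via_groupby.equals(via_pivot))
```

Which spelling to prefer is largely taste; groupby makes the aggregation step explicit, pivot_table is more compact.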
Performance Considerations and Best Practices
Stack and unstack operations create new DataFrames, so memory usage matters with large datasets.
import time
# Create large DataFrame
large_df = pd.DataFrame(
    np.random.randn(10000, 50),
    columns=[f'col_{i}' for i in range(50)]
)
large_df['category'] = np.random.choice(['A', 'B', 'C'], 10000)
large_df = large_df.set_index('category', append=True)
# Time the operation
start = time.time()
stacked_large = large_df.stack()
stack_time = time.time() - start
start = time.time()
unstacked_large = stacked_large.unstack()
unstack_time = time.time() - start
print(f"Stack time: {stack_time:.4f}s")
print(f"Unstack time: {unstack_time:.4f}s")
print(f"Original shape: {large_df.shape}")
print(f"Stacked shape: {stacked_large.shape}")
Key optimization strategies:
# dropna=True (the default) drops rows that are entirely NaN after stacking
stacked_clean = df.stack(dropna=True)
# Specify dtypes to reduce memory
df_typed = df.astype({col: 'int32' for col in df.select_dtypes(include=['int64']).columns})
# Use categorical data for repeated values
df['category'] = df['category'].astype('category')
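The categorical conversion pays off whenever a column holds few distinct values. A rough memory comparison (exact byte counts vary by platform and pandas version):

```python
import numpy as np
import pandas as pd

# A repetitive string column, like the 'category' column above
s = pd.Series(np.random.choice(['A', 'B', 'C'], 10000))

# deep=True counts the actual string storage, not just pointers
object_bytes = s.memory_usage(deep=True)
category_bytes = s.astype('category').memory_usage(deep=True)
print(object_bytes, category_bytes)
```

With only three distinct values, the categorical version stores each row as a small integer code plus one copy of each label, so it is dramatically smaller.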
Handling Edge Cases
Stack and unstack behavior with duplicate indices requires attention.
# DataFrame with duplicate indices
df_dup = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
}, index=['x', 'x', 'y'])
# Stacking with duplicates
stacked_dup = df_dup.stack()
print(stacked_dup)
# This creates a Series with non-unique index
print(f"Index is unique: {stacked_dup.index.is_unique}")
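A related trap: that non-unique index cannot be unstacked, because each (row, column) target cell would be ambiguous. Recent pandas versions raise a ValueError here (a sketch; the exact message varies by version):

```python
import pandas as pd

# Rebuild the duplicate-index example from above
df_dup = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
}, index=['x', 'x', 'y'])
stacked_dup = df_dup.stack()

# ('x', 'A') appears twice, so unstack cannot place both values in one cell
try:
    stacked_dup.unstack()
    raised = False
except ValueError:
    raised = True
print(raised)
```

Deduplicating the index (for example with reset_index or an aggregation) before unstacking avoids the error.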
For time series with irregular intervals:
# Handle irregular time series
ts_data = pd.DataFrame({
    'value': [10, 20, 30],
    'category': ['A', 'B', 'A']
}, index=pd.DatetimeIndex(['2023-01-01', '2023-01-15', '2023-02-01']))
ts_pivot = ts_data.pivot(columns='category', values='value')
print(ts_pivot)
# Reindex to fill gaps
full_range = pd.date_range('2023-01-01', '2023-02-28', freq='D')
ts_complete = ts_pivot.reindex(full_range)
print(ts_complete.head(10))
These operations form the foundation of advanced reshaping workflows, letting you move between analytical formats without losing data or breaking index relationships.