Pandas - Melt (Reshape Wide to Long)
Key Insights
• The melt operation transforms wide-format data into long-format by unpivoting columns into rows, making it easier to analyze categorical data and perform group-based operations
• Understanding id_vars (columns to preserve) and value_vars (columns to unpivot) is essential for controlling which columns get reshaped and which remain as identifiers
• Melt is the inverse of pivot operations and forms the foundation for tidy data principles where each variable is a column and each observation is a row
Understanding the Melt Operation
Pandas melt reshapes data from wide format (multiple columns representing different categories) to long format (categorical values stacked in rows). This transformation is critical when working with data that violates tidy data principles or when preparing data for visualization libraries and statistical models that expect long-format input.
import pandas as pd
# Wide format data - each product has its own column
df_wide = pd.DataFrame({
'date': ['2024-01-01', '2024-01-02', '2024-01-03'],
'product_a': [100, 150, 120],
'product_b': [200, 180, 210],
'product_c': [150, 160, 155]
})
print("Wide format:")
print(df_wide)
# Convert to long format
df_long = pd.melt(df_wide, id_vars=['date'], var_name='product', value_name='sales')
print("\nLong format:")
print(df_long)
Output:
Wide format:
date product_a product_b product_c
0 2024-01-01 100 200 150
1 2024-01-02 150 180 160
2 2024-01-03 120 210 155
Long format:
date product sales
0 2024-01-01 product_a 100
1 2024-01-02 product_a 150
2 2024-01-03 product_a 120
3 2024-01-01 product_b 200
4 2024-01-02 product_b 180
5 2024-01-03 product_b 210
6 2024-01-01 product_c 150
7 2024-01-02 product_c 160
8 2024-01-03 product_c 155
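The Key Insights above describe melt as the inverse of pivot; a minimal round-trip sketch (rebuilding the same wide frame so the snippet is self-contained) makes that concrete:

```python
import pandas as pd

# Same wide frame as above, rebuilt so this snippet stands alone
df_wide = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-02', '2024-01-03'],
    'product_a': [100, 150, 120],
    'product_b': [200, 180, 210],
    'product_c': [150, 160, 155]
})
df_long = pd.melt(df_wide, id_vars=['date'], var_name='product', value_name='sales')

# pivot reverses the melt: one row per date, one column per product
df_back = df_long.pivot(index='date', columns='product', values='sales').reset_index()
df_back.columns.name = None  # drop the leftover 'product' axis label
print(df_back.equals(df_wide))  # True
```

This round trip only recovers the original exactly because every date/product combination is present; with missing combinations, pivot would introduce NaN cells.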
Core Parameters and Control
The melt function provides precise control over which columns to preserve and which to unpivot through its parameters.
# Complex dataset with multiple identifier columns
df = pd.DataFrame({
'store_id': [1, 2, 3],
'region': ['North', 'South', 'East'],
'q1_sales': [50000, 62000, 48000],
'q2_sales': [55000, 59000, 51000],
'q1_profit': [12000, 15000, 11000],
'q2_profit': [13500, 14200, 12300]
})
# Melt only sales columns, preserve store_id and region
df_sales = pd.melt(
df,
id_vars=['store_id', 'region'],
value_vars=['q1_sales', 'q2_sales'],
var_name='quarter',
value_name='sales_amount'
)
print(df_sales)
Output:
store_id region quarter sales_amount
0 1 North q1_sales 50000
1 2 South q1_sales 62000
2 3 East q1_sales 48000
3 1 North q2_sales 55000
4 2 South q2_sales 59000
5 3 East q2_sales 51000
When value_vars is omitted, melt automatically unpivots all columns except those specified in id_vars:
# Automatic value_vars selection
df_auto = pd.melt(
df,
id_vars=['store_id', 'region'],
var_name='metric',
value_name='value'
)
print(df_auto.head(8))
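One related option worth knowing: melt also accepts an ignore_index flag (pandas 1.1+). A small sketch, using a trimmed version of the store data above, shows how keeping the original index lets you trace each long row back to its source row:

```python
import pandas as pd

# Trimmed version of the store dataset above
df = pd.DataFrame({
    'store_id': [1, 2, 3],
    'q1_sales': [50000, 62000, 48000],
    'q2_sales': [55000, 59000, 51000]
})

# ignore_index=False repeats the original row index instead of
# resetting it, so each long row maps back to its source row
melted = df.melt(id_vars=['store_id'], var_name='quarter',
                 value_name='sales', ignore_index=False)
print(melted.index.tolist())  # [0, 1, 2, 0, 1, 2]
```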
Handling Multi-Level Data
Real-world scenarios often require melting data with multiple measurement types simultaneously. Here’s how to handle complex reshaping tasks.
# Dataset with temperature and humidity readings
weather_df = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=3),
'location': ['Station_A', 'Station_B', 'Station_C'],
'temp_morning': [15.2, 16.1, 14.8],
'temp_evening': [22.3, 21.7, 23.1],
'humidity_morning': [65, 70, 68],
'humidity_evening': [45, 48, 42]
})
# First melt
df_melted = pd.melt(
weather_df,
id_vars=['date', 'location'],
var_name='measurement',
value_name='value'
)
# Extract metric type and time period
df_melted[['metric', 'period']] = df_melted['measurement'].str.split('_', expand=True)
# Pivot to separate metrics
df_final = df_melted.pivot_table(
index=['date', 'location', 'period'],
columns='metric',
values='value',
aggfunc='first'
).reset_index()
print(df_final)
Output:
metric date location period humidity temp
0 2024-01-01 Station_A evening 45.0 22.3
1 2024-01-01 Station_A morning 65.0 15.2
2 2024-01-02 Station_B evening 48.0 21.7
3 2024-01-02 Station_B morning 70.0 16.1
4 2024-01-03 Station_C evening 42.0 23.1
5 2024-01-03 Station_C morning 68.0 14.8
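For patterned column names like temp_morning, pd.wide_to_long can replace the melt + split + pivot_table chain in a single call. A sketch on the same weather data, rebuilt here so the snippet stands alone:

```python
import pandas as pd

# Same weather dataset as above
weather_df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=3),
    'location': ['Station_A', 'Station_B', 'Station_C'],
    'temp_morning': [15.2, 16.1, 14.8],
    'temp_evening': [22.3, 21.7, 23.1],
    'humidity_morning': [65, 70, 68],
    'humidity_evening': [45, 48, 42]
})

tidy = pd.wide_to_long(
    weather_df,
    stubnames=['temp', 'humidity'],  # column prefixes to collect
    i=['date', 'location'],          # identifier columns (must be unique per row)
    j='period',                      # name for the new suffix column
    sep='_',
    suffix=r'\w+'                    # suffixes are words here, not digits
).reset_index()
print(tidy)
```

Note that wide_to_long requires the suffix regex because its default only matches numeric suffixes; the result carries the same information as df_final above, without the intermediate split and pivot steps.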
Practical Applications
Time Series Analysis
Melt excels at preparing time series data for analysis and visualization:
# Monthly revenue across different channels
revenue_df = pd.DataFrame({
'month': ['Jan', 'Feb', 'Mar'],
'online': [45000, 48000, 52000],
'retail': [78000, 75000, 81000],
'wholesale': [120000, 125000, 118000]
})
# Prepare for time series plotting
revenue_long = pd.melt(
revenue_df,
id_vars=['month'],
var_name='channel',
value_name='revenue'
)
# Now easy to calculate channel-wise statistics
channel_stats = revenue_long.groupby('channel')['revenue'].agg(['mean', 'std', 'min', 'max'])
print(channel_stats)
Output:
mean std min max
channel
online 48333.33 3511.885 45000 52000
retail 78000.00 3000.000 75000 81000
wholesale 121000.00 3605.551 118000 125000
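The long format also makes per-group time series transforms one-liners. A sketch, reconstructing the melted revenue frame from above, computes month-over-month growth within each channel:

```python
import pandas as pd

# Reconstruct the melted revenue data from the section above
revenue_long = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar'] * 3,
    'channel': ['online'] * 3 + ['retail'] * 3 + ['wholesale'] * 3,
    'revenue': [45000, 48000, 52000,
                78000, 75000, 81000,
                120000, 125000, 118000]
})

# pct_change within each channel; grouping keeps growth from
# leaking across channel boundaries
revenue_long['mom_growth'] = (
    revenue_long.groupby('channel')['revenue'].pct_change()
)
print(revenue_long)
```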
Data Validation and Quality Checks
Melt simplifies identifying missing or anomalous values across multiple columns:
# Sensor data with potential missing values
sensor_df = pd.DataFrame({
'timestamp': pd.date_range('2024-01-01', periods=4, freq='h'),
'sensor_1': [23.5, None, 24.1, 23.8],
'sensor_2': [45.2, 44.8, None, 45.5],
'sensor_3': [67.1, 66.9, 67.3, None]
})
# Melt to analyze missing data patterns
sensor_long = pd.melt(
sensor_df,
id_vars=['timestamp'],
var_name='sensor',
value_name='reading'
)
# Identify missing readings
missing_data = sensor_long[sensor_long['reading'].isna()]
print("Missing readings:")
print(missing_data)
# Calculate completeness by sensor
completeness = sensor_long.groupby('sensor')['reading'].apply(
lambda x: (x.notna().sum() / len(x)) * 100
)
print("\nData completeness by sensor:")
print(completeness)
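Beyond missing values, the long format makes per-sensor range checks straightforward. A sketch with illustrative thresholds (the valid_range bounds are assumptions, not part of the original data):

```python
import pandas as pd

# Valid operating ranges per sensor -- illustrative assumptions
valid_range = {'sensor_1': (20, 30), 'sensor_2': (40, 50), 'sensor_3': (60, 70)}

sensor_long = pd.DataFrame({
    'sensor': ['sensor_1', 'sensor_1', 'sensor_2', 'sensor_3'],
    'reading': [23.5, 35.0, 45.2, 58.9]
})

# Map each row to its sensor's bounds, then test the reading
bounds = sensor_long['sensor'].map(valid_range)
sensor_long['in_range'] = sensor_long['reading'].between(bounds.str[0], bounds.str[1])
print(sensor_long[~sensor_long['in_range']])
```

Because every reading sits in one column, a single vectorized between call covers all sensors; the wide layout would need one comparison per sensor column.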
Performance Considerations
For large datasets, melt operations can be memory-intensive. Consider these optimization strategies:
import numpy as np
# Large dataset simulation
large_df = pd.DataFrame({
'id': range(100000),
**{f'col_{i}': np.random.randn(100000) for i in range(50)}
})
# Memory-efficient approach: process in chunks
def chunked_melt(df, id_vars, chunk_size=10000):
    chunks = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        melted_chunk = pd.melt(chunk, id_vars=id_vars)
        chunks.append(melted_chunk)
    return pd.concat(chunks, ignore_index=True)
# For specific columns only
df_subset = large_df[['id', 'col_0', 'col_1', 'col_2']]
result = pd.melt(df_subset, id_vars=['id'])
print(f"Original shape: {large_df.shape}")
print(f"Melted shape: {result.shape}")
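Another memory lever: the melted variable column repeats a handful of labels many times, so converting it to a categorical dtype can shrink it substantially. A sketch on a smaller simulated frame (sizes here are assumptions for a quick demo):

```python
import pandas as pd
import numpy as np

# Smaller simulated frame to keep the demo quick
df = pd.DataFrame({
    'id': range(10000),
    **{f'col_{i}': np.random.randn(10000) for i in range(5)}
})
melted = pd.melt(df, id_vars=['id'])

# The 'variable' column holds only 5 distinct strings repeated
# 10000 times each -- ideal for a categorical dtype
obj_bytes = melted['variable'].memory_usage(deep=True)
melted['variable'] = melted['variable'].astype('category')
cat_bytes = melted['variable'].memory_usage(deep=True)
print(f"object: {obj_bytes} bytes, category: {cat_bytes} bytes")
```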
Combining Melt with GroupBy Operations
The true power of melt emerges when combined with groupby operations for advanced analytics:
# Sales data across regions and products
sales_df = pd.DataFrame({
'region': ['North', 'South', 'North', 'South'],
'manager': ['Alice', 'Bob', 'Charlie', 'David'],
'product_x': [1000, 1200, 900, 1100],
'product_y': [800, 950, 850, 1000],
'product_z': [1500, 1400, 1600, 1450]
})
# Melt and analyze
sales_long = pd.melt(
sales_df,
id_vars=['region', 'manager'],
var_name='product',
value_name='sales'
)
# Regional performance by product
regional_analysis = sales_long.groupby(['region', 'product'])['sales'].agg([
('total_sales', 'sum'),
('avg_sales', 'mean'),
('num_managers', 'count')
]).round(2)
print(regional_analysis)
Output:
total_sales avg_sales num_managers
region product
North product_x 1900.0 950.0 2
product_y 1650.0 825.0 2
product_z 3100.0 1550.0 2
South product_x 2300.0 1150.0 2
product_y 1950.0 975.0 2
product_z 2850.0 1425.0 2
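To present such a long-format summary as a wide report, unstack moves one index level back into columns. A minimal sketch on totals rebuilt from the output above:

```python
import pandas as pd

# Regional totals rebuilt from the summary above
sales_long = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'product': ['product_x', 'product_y', 'product_x', 'product_y'],
    'sales': [1900, 1650, 2300, 1950]
})
totals = sales_long.groupby(['region', 'product'])['sales'].sum()

# unstack lifts the 'product' level into columns: one row per region
report = totals.unstack('product')
print(report)
```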
Melt transforms rigid wide-format data into flexible long-format structures, enabling sophisticated analysis patterns that would be cumbersome or impossible with wide data. Master this operation to unlock efficient data manipulation workflows.