Pandas - Melt (Reshape Wide to Long)

Key Insights

• The melt operation transforms wide-format data into long-format by unpivoting columns into rows, making it easier to analyze categorical data and perform group-based operations
• Understanding id_vars (columns to preserve) and value_vars (columns to unpivot) is essential for controlling which columns get reshaped and which remain as identifiers
• Melt is the inverse of pivot operations and forms the foundation for tidy data principles where each variable is a column and each observation is a row
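Since melt and pivot are inverses, a quick round trip illustrates the relationship. This is a minimal sketch with illustrative data; the frame and column names are invented for the demonstration:

```python
import pandas as pd

# Tiny wide-format frame: one row per date, one column per product
wide = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-02'],
    'a': [1, 2],
    'b': [3, 4],
})

# Wide -> long: unpivot the product columns into rows
tidy = wide.melt(id_vars='date', var_name='product', value_name='sales')

# Long -> wide: pivot restores the original shape
restored = (tidy.pivot(index='date', columns='product', values='sales')
                .reset_index()
                .rename_axis(columns=None))

print(restored.equals(wide))  # the round trip recovers the original frame
```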

Understanding the Melt Operation

Pandas melt reshapes data from wide format (multiple columns representing different categories) to long format (categorical values stacked in rows). This transformation is critical when working with data that violates tidy data principles or when preparing data for visualization libraries and statistical models that expect long-format input.

import pandas as pd

# Wide format data - each product has its own column
df_wide = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-02', '2024-01-03'],
    'product_a': [100, 150, 120],
    'product_b': [200, 180, 210],
    'product_c': [150, 160, 155]
})

print("Wide format:")
print(df_wide)

# Convert to long format
df_long = pd.melt(df_wide, id_vars=['date'], var_name='product', value_name='sales')

print("\nLong format:")
print(df_long)

Output:

Wide format:
         date  product_a  product_b  product_c
0  2024-01-01        100        200        150
1  2024-01-02        150        180        160
2  2024-01-03        120        210        155

Long format:
         date    product  sales
0  2024-01-01  product_a    100
1  2024-01-02  product_a    150
2  2024-01-03  product_a    120
3  2024-01-01  product_b    200
4  2024-01-02  product_b    180
5  2024-01-03  product_b    210
6  2024-01-01  product_c    150
7  2024-01-02  product_c    160
8  2024-01-03  product_c    155
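One detail worth knowing: melt resets the index by default. Passing ignore_index=False keeps the original row labels, repeating them once per unpivoted column. A minimal sketch with made-up labels:

```python
import pandas as pd

# Wide frame with meaningful row labels
wide = pd.DataFrame(
    {'date': ['2024-01-01', '2024-01-02'], 'a': [1, 2], 'b': [3, 4]},
    index=['r1', 'r2'],
)

# ignore_index=False preserves the index instead of replacing it with 0..n-1
kept = wide.melt(id_vars='date', ignore_index=False)
print(kept.index.tolist())  # ['r1', 'r2', 'r1', 'r2']
```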

Core Parameters and Control

The melt function provides precise control over which columns to preserve and which to unpivot through its parameters.

# Complex dataset with multiple identifier columns
df = pd.DataFrame({
    'store_id': [1, 2, 3],
    'region': ['North', 'South', 'East'],
    'q1_sales': [50000, 62000, 48000],
    'q2_sales': [55000, 59000, 51000],
    'q1_profit': [12000, 15000, 11000],
    'q2_profit': [13500, 14200, 12300]
})

# Melt only sales columns, preserve store_id and region
df_sales = pd.melt(
    df,
    id_vars=['store_id', 'region'],
    value_vars=['q1_sales', 'q2_sales'],
    var_name='quarter',
    value_name='sales_amount'
)

print(df_sales)

Output:

   store_id region    quarter  sales_amount
0         1  North  q1_sales         50000
1         2  South  q1_sales         62000
2         3   East  q1_sales         48000
3         1  North  q2_sales         55000
4         2  South  q2_sales         59000
5         3   East  q2_sales         51000

When value_vars is omitted, melt automatically unpivots all columns except those specified in id_vars:

# Automatic value_vars selection
df_auto = pd.melt(
    df,
    id_vars=['store_id', 'region'],
    var_name='metric',
    value_name='value'
)

print(df_auto.head(8))
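As a quick check of the automatic selection: with four non-identifier columns and three stores, the melted frame should have twelve rows, and every non-id column should appear as a metric. A self-contained sketch using the same illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    'store_id': [1, 2, 3],
    'region': ['North', 'South', 'East'],
    'q1_sales': [50000, 62000, 48000],
    'q2_sales': [55000, 59000, 51000],
    'q1_profit': [12000, 15000, 11000],
    'q2_profit': [13500, 14200, 12300],
})

# No value_vars: everything except id_vars gets unpivoted
df_auto = pd.melt(df, id_vars=['store_id', 'region'],
                  var_name='metric', value_name='value')

# 4 melted columns x 3 stores = 12 rows
print(df_auto.shape)                       # (12, 4)
print(sorted(df_auto['metric'].unique()))  # ['q1_profit', 'q1_sales', 'q2_profit', 'q2_sales']
```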

Handling Multi-Level Data

Real-world scenarios often require melting data with multiple measurement types simultaneously. Here’s how to handle complex reshaping tasks.

# Dataset with temperature and humidity readings
weather_df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=3),
    'location': ['Station_A', 'Station_B', 'Station_C'],
    'temp_morning': [15.2, 16.1, 14.8],
    'temp_evening': [22.3, 21.7, 23.1],
    'humidity_morning': [65, 70, 68],
    'humidity_evening': [45, 48, 42]
})

# First melt
df_melted = pd.melt(
    weather_df,
    id_vars=['date', 'location'],
    var_name='measurement',
    value_name='value'
)

# Extract metric type and time period
df_melted[['metric', 'period']] = df_melted['measurement'].str.split('_', expand=True)

# Pivot to separate metrics
df_final = df_melted.pivot_table(
    index=['date', 'location', 'period'],
    columns='metric',
    values='value',
    aggfunc='first'
).reset_index()

print(df_final)

Output:

metric        date   location    period  humidity  temp
0       2024-01-01  Station_A   evening      45.0  22.3
1       2024-01-01  Station_A   morning      65.0  15.2
2       2024-01-02  Station_B   evening      48.0  21.7
3       2024-01-02  Station_B   morning      70.0  16.1
4       2024-01-03  Station_C   evening      42.0  23.1
5       2024-01-03  Station_C   morning      68.0  14.8
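One caveat with the split step: str.split('_', expand=True) assumes exactly one underscore per column name. If a metric name itself contains an underscore (say, a hypothetical wind_speed_morning), splitting from the right with rsplit keeps the metric intact. A sketch:

```python
import pandas as pd

# 'wind_speed_morning' is a hypothetical name with an underscore inside the metric
s = pd.Series(['temp_morning', 'humidity_evening', 'wind_speed_morning'])

# n=1 splits only at the last underscore, so multi-word metrics survive
parts = s.str.rsplit('_', n=1, expand=True)
parts.columns = ['metric', 'period']
print(parts)
```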

Practical Applications

Time Series Analysis

Melt excels at preparing time series data for analysis and visualization:

# Monthly revenue across different channels
revenue_df = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar'],
    'online': [45000, 48000, 52000],
    'retail': [78000, 75000, 81000],
    'wholesale': [120000, 125000, 118000]
})

# Prepare for time series plotting
revenue_long = pd.melt(
    revenue_df,
    id_vars=['month'],
    var_name='channel',
    value_name='revenue'
)

# Now easy to calculate channel-wise statistics
channel_stats = revenue_long.groupby('channel')['revenue'].agg(['mean', 'std', 'min', 'max'])
print(channel_stats)

Output:

                mean         std    min     max
channel                                        
online      48333.33  3511.885   45000   52000
retail      78000.00  3000.000   75000   81000
wholesale  121000.00  3605.551  118000  125000
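Long format also makes per-channel derived series straightforward. For example, month-over-month growth falls out of a grouped shift; a sketch on the same illustrative revenue numbers:

```python
import pandas as pd

# Long-format revenue, rows already in month order within each channel
revenue_long = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar'] * 3,
    'channel': ['online'] * 3 + ['retail'] * 3 + ['wholesale'] * 3,
    'revenue': [45000, 48000, 52000, 78000, 75000, 81000,
                120000, 125000, 118000],
})

# Previous month's revenue within each channel; first month of each is NaN
prev = revenue_long.groupby('channel')['revenue'].shift()
revenue_long['mom_growth'] = revenue_long['revenue'] / prev - 1

print(revenue_long)
```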

Data Validation and Quality Checks

Melt simplifies identifying missing or anomalous values across multiple columns:

# Sensor data with potential missing values
sensor_df = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=4, freq='h'),  # lowercase 'h' (hourly); 'H' is deprecated in recent pandas
    'sensor_1': [23.5, None, 24.1, 23.8],
    'sensor_2': [45.2, 44.8, None, 45.5],
    'sensor_3': [67.1, 66.9, 67.3, None]
})

# Melt to analyze missing data patterns
sensor_long = pd.melt(
    sensor_df,
    id_vars=['timestamp'],
    var_name='sensor',
    value_name='reading'
)

# Identify missing readings
missing_data = sensor_long[sensor_long['reading'].isna()]
print("Missing readings:")
print(missing_data)

# Calculate completeness by sensor
completeness = sensor_long.groupby('sensor')['reading'].apply(
    lambda x: x.notna().mean() * 100  # share of non-null readings, as a percentage
)
print("\nData completeness by sensor:")
print(completeness)

Performance Considerations

For large datasets, melt operations can be memory-intensive. Consider these optimization strategies:

import numpy as np

# Large dataset simulation
large_df = pd.DataFrame({
    'id': range(100000),
    **{f'col_{i}': np.random.randn(100000) for i in range(50)}
})

# Memory-efficient approach: melt in chunks and concatenate
# (note: rows come out grouped by chunk, so the order differs from a
#  single pd.melt call; sort afterwards if row order matters)
def chunked_melt(df, id_vars, chunk_size=10000):
    chunks = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        melted_chunk = pd.melt(chunk, id_vars=id_vars)
        chunks.append(melted_chunk)
    return pd.concat(chunks, ignore_index=True)

# For specific columns only
df_subset = large_df[['id', 'col_0', 'col_1', 'col_2']]
result = pd.melt(df_subset, id_vars=['id'])

print(f"Original shape: {large_df.shape}")
print(f"Melted shape: {result.shape}")
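The chunked helper can be sanity-checked on a small frame: after sorting into a canonical order, its result should match a direct melt. The function is restated here so the sketch is self-contained:

```python
import pandas as pd

def chunked_melt(df, id_vars, chunk_size=10000):
    chunks = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        chunks.append(pd.melt(chunk, id_vars=id_vars))
    return pd.concat(chunks, ignore_index=True)

small = pd.DataFrame({'id': range(6), 'x': range(6), 'y': range(6)})

direct = pd.melt(small, id_vars=['id'])
chunked = chunked_melt(small, id_vars=['id'], chunk_size=2)

# Chunking interleaves the variables, so compare after sorting
canon = lambda d: d.sort_values(['id', 'variable']).reset_index(drop=True)
print(canon(direct).equals(canon(chunked)))  # True
```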

Combining Melt with GroupBy Operations

The true power of melt emerges when combined with groupby operations for advanced analytics:

# Sales data across regions and products
sales_df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'South'],
    'manager': ['Alice', 'Bob', 'Charlie', 'David'],
    'product_x': [1000, 1200, 900, 1100],
    'product_y': [800, 950, 850, 1000],
    'product_z': [1500, 1400, 1600, 1450]
})

# Melt and analyze
sales_long = pd.melt(
    sales_df,
    id_vars=['region', 'manager'],
    var_name='product',
    value_name='sales'
)

# Regional performance by product
regional_analysis = sales_long.groupby(['region', 'product'])['sales'].agg([
    ('total_sales', 'sum'),
    ('avg_sales', 'mean'),
    ('num_managers', 'count')
]).round(2)

print(regional_analysis)

Output:

                  total_sales  avg_sales  num_managers
region product                                        
North  product_x       1900.0      950.0             2
       product_y       1650.0      825.0             2
       product_z       3100.0     1550.0             2
South  product_x       2300.0     1150.0             2
       product_y       1950.0      975.0             2
       product_z       2850.0     1425.0             2
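For a report-friendly layout, one level of the grouped result can be rotated back into columns with unstack. A sketch using the per-region totals from the example above:

```python
import pandas as pd

# Long-format totals matching the regional analysis above
sales_long = pd.DataFrame({
    'region': ['North', 'South'] * 3,
    'product': ['product_x', 'product_x', 'product_y',
                'product_y', 'product_z', 'product_z'],
    'sales': [1900, 2300, 1650, 1950, 3100, 2850],
})

totals = sales_long.groupby(['region', 'product'])['sales'].sum()

# unstack rotates the 'product' index level into columns: one row per region
report = totals.unstack('product')
print(report)
```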

Melt transforms rigid wide-format data into flexible long-format structures, enabling sophisticated analysis patterns that would be cumbersome or impossible with wide data. Master this operation to unlock efficient data manipulation workflows.
