Pivot/Unpivot in PySpark vs Pandas vs SQL

Key Insights

  • Pivot transforms rows into columns (long to wide format), while unpivot does the reverse—mastering both is essential for reshaping data in ETL pipelines and reporting workflows.
  • Syntax varies dramatically: Pandas uses pivot_table() and melt(), PySpark relies on groupBy().pivot() and the stack() function, while SQL uses PIVOT/UNPIVOT clauses or CASE WHEN workarounds.
  • Choose your tool based on data scale: Pandas for datasets under 1GB, PySpark for distributed processing of larger datasets, and SQL when the transformation should happen at the database layer.

Introduction & Use Cases

Data rarely arrives in the shape you need. Pivot and unpivot operations are fundamental transformations that reshape your data between wide and long formats. A pivot takes distinct values from one column and spreads them across multiple columns—useful when you need quarterly revenue as separate columns for a dashboard. An unpivot does the opposite, collapsing multiple columns into key-value pairs—essential when loading denormalized spreadsheets into normalized database tables.

You’ll encounter these transformations constantly: preparing data for visualization tools that expect wide format, normalizing exported Excel reports, reshaping time-series data for machine learning models, and building aggregation layers in data warehouses. The challenge is that each tool handles these operations differently, and picking the wrong approach can tank performance or produce incorrect results.

Sample Dataset Setup

Let’s establish a consistent dataset across all three environments. We’re working with quarterly sales data containing product, region, quarter, and revenue.

Pandas:

import pandas as pd

data = {
    'product': ['Widget', 'Widget', 'Widget', 'Widget', 'Gadget', 'Gadget', 'Gadget', 'Gadget'],
    'region': ['North', 'North', 'South', 'South', 'North', 'North', 'South', 'South'],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q2'],
    'revenue': [1000, 1200, 800, 950, 1500, 1800, 1100, 1300]
}
df = pd.DataFrame(data)

PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("PivotDemo").getOrCreate()

schema = StructType([
    StructField("product", StringType(), True),
    StructField("region", StringType(), True),
    StructField("quarter", StringType(), True),
    StructField("revenue", IntegerType(), True)
])

data = [
    ("Widget", "North", "Q1", 1000), ("Widget", "North", "Q2", 1200),
    ("Widget", "South", "Q1", 800), ("Widget", "South", "Q2", 950),
    ("Gadget", "North", "Q1", 1500), ("Gadget", "North", "Q2", 1800),
    ("Gadget", "South", "Q1", 1100), ("Gadget", "South", "Q2", 1300)
]
df = spark.createDataFrame(data, schema)

SQL (table creation):

CREATE TABLE sales (
    product VARCHAR(50),
    region VARCHAR(50),
    quarter VARCHAR(10),
    revenue INT
);

INSERT INTO sales VALUES
('Widget', 'North', 'Q1', 1000), ('Widget', 'North', 'Q2', 1200),
('Widget', 'South', 'Q1', 800), ('Widget', 'South', 'Q2', 950),
('Gadget', 'North', 'Q1', 1500), ('Gadget', 'North', 'Q2', 1800),
('Gadget', 'South', 'Q1', 1100), ('Gadget', 'South', 'Q2', 1300);

Pivot Operations Compared

Pivoting converts your long-format data into wide format, spreading values from one column across multiple new columns. Here’s how each tool handles pivoting revenue by quarter for each product-region combination.

Pandas pivot_table():

pivoted = df.pivot_table(
    index=['product', 'region'],
    columns='quarter',
    values='revenue',
    aggfunc='sum'
).reset_index()

# Result columns: product, region, Q1, Q2
print(pivoted)

Pandas requires an aggregation function even when there’s only one value per cell. Use aggfunc='first' if you’re certain there are no duplicates and want to avoid summing.
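When every (index, columns) pair is unique, plain pivot() is an alternative that skips aggregation entirely and raises a ValueError if it encounters duplicates, which makes silent double-counting impossible. A minimal sketch using the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['Widget', 'Widget', 'Gadget', 'Gadget'],
    'region': ['North', 'North', 'North', 'North'],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'revenue': [1000, 1200, 1500, 1800]
})

# pivot() performs no aggregation; it raises ValueError if any
# (index, columns) combination appears more than once
wide = df.pivot(index=['product', 'region'], columns='quarter', values='revenue')
```

Prefer pivot() when duplicates would indicate a data quality bug you want surfaced loudly rather than summed away.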

PySpark groupBy().pivot():

from pyspark.sql.functions import sum as spark_sum

pivoted = df.groupBy("product", "region") \
    .pivot("quarter") \
    .agg(spark_sum("revenue"))

pivoted.show()

Like other DataFrame transformations, the pivot itself won’t execute until you call an action—but calling pivot() without a value list triggers an eager job just to collect the distinct values of the pivot column. For better performance with known pivot values, pass them explicitly:

pivoted = df.groupBy("product", "region") \
    .pivot("quarter", ["Q1", "Q2", "Q3", "Q4"]) \
    .agg(spark_sum("revenue"))

This avoids an extra job to discover distinct values.

SQL PIVOT clause (SQL Server syntax shown; Oracle’s PIVOT is similar but quotes the values, e.g. FOR quarter IN ('Q1' AS Q1, 'Q2' AS Q2)):

SELECT product, region, [Q1], [Q2]
FROM sales
PIVOT (
    SUM(revenue)
    FOR quarter IN ([Q1], [Q2])
) AS pivot_table;

SQL CASE WHEN alternative (universal):

SELECT 
    product,
    region,
    SUM(CASE WHEN quarter = 'Q1' THEN revenue END) AS Q1,
    SUM(CASE WHEN quarter = 'Q2' THEN revenue END) AS Q2
FROM sales
GROUP BY product, region;

The CASE WHEN approach works on every SQL database and is often more readable for complex pivots. Use it as your default unless you’re certain your database supports PIVOT.

Unpivot Operations Compared

Unpivoting reverses the process, converting columns back into rows. This is trickier because you’re essentially deconstructing the table structure.

Pandas melt():

# Assuming we have the pivoted dataframe with Q1, Q2 columns
melted = pivoted.melt(
    id_vars=['product', 'region'],
    value_vars=['Q1', 'Q2'],
    var_name='quarter',
    value_name='revenue'
)

The id_vars parameter specifies columns to keep as identifiers, while value_vars lists the columns to unpivot. If you omit value_vars, Pandas unpivots all columns not in id_vars.

PySpark stack() function:

from pyspark.sql.functions import expr

unpivoted = pivoted.select(
    "product",
    "region",
    expr("stack(2, 'Q1', Q1, 'Q2', Q2) as (quarter, revenue)")
).where("revenue IS NOT NULL")

unpivoted.show()

The stack() function takes the number of rows to create per input row, followed by pairs of (column_name_literal, column_value). It’s verbose but powerful. The where clause filters out nulls that appear when a column had no value.

SQL UNPIVOT clause:

SELECT product, region, quarter, revenue
FROM pivoted_sales
UNPIVOT (
    revenue FOR quarter IN ([Q1], [Q2])
) AS unpivot_table;

SQL UNION ALL alternative:

SELECT product, region, 'Q1' AS quarter, Q1 AS revenue FROM pivoted_sales WHERE Q1 IS NOT NULL
UNION ALL
SELECT product, region, 'Q2' AS quarter, Q2 AS revenue FROM pivoted_sales WHERE Q2 IS NOT NULL;

The UNION ALL approach is universally supported and explicit about what’s happening. For more than a few columns, it becomes unwieldy—consider generating the SQL dynamically.
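Generating that SQL is straightforward. Here’s one way to sketch it in Python—the table and column names are placeholders, and you should validate them against your schema rather than interpolating untrusted input:

```python
def build_unpivot_sql(table, id_cols, value_cols, var_name, value_name):
    """Generate a UNION ALL unpivot query: one SELECT branch per value column."""
    ids = ", ".join(id_cols)
    branches = [
        f"SELECT {ids}, '{col}' AS {var_name}, {col} AS {value_name} "
        f"FROM {table} WHERE {col} IS NOT NULL"
        for col in value_cols
    ]
    return "\nUNION ALL\n".join(branches)

sql = build_unpivot_sql(
    "pivoted_sales", ["product", "region"],
    ["Q1", "Q2", "Q3", "Q4"], "quarter", "revenue",
)
```

Four value columns produce four branches joined by three UNION ALLs; adding a column is a one-element change instead of a hand-edited query.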

Handling Edge Cases

Real data brings complications. Here’s how to handle common edge cases.

Null values during pivot:

# Pandas: fill nulls with 0
pivoted = df.pivot_table(
    index=['product', 'region'],
    columns='quarter',
    values='revenue',
    aggfunc='sum',
    fill_value=0
)

# PySpark: fill after pivot
pivoted = df.groupBy("product", "region") \
    .pivot("quarter") \
    .agg(spark_sum("revenue")) \
    .fillna(0)

Multiple value columns:

# Pandas: pivot multiple values
df['units'] = [100, 120, 80, 95, 150, 180, 110, 130]
multi_pivot = df.pivot_table(
    index=['product', 'region'],
    columns='quarter',
    values=['revenue', 'units'],
    aggfunc='sum'
)
# Creates hierarchical columns: (revenue, Q1), (revenue, Q2), (units, Q1), etc.

# Flatten column names
multi_pivot.columns = ['_'.join(col).strip() for col in multi_pivot.columns.values]

Dynamic column names in PySpark:

# Get distinct values first
quarters = sorted(row["quarter"] for row in df.select("quarter").distinct().collect())

pivoted = df.groupBy("product", "region") \
    .pivot("quarter", quarters) \
    .agg(spark_sum("revenue"))

Performance Considerations

Pandas loads everything into memory. Pivot operations create a new dataframe, potentially doubling memory usage temporarily. For datasets over 1GB, you’ll hit memory limits. Consider chunking or switching to PySpark.
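Chunking works cleanly for unpivot, because melt() operates row-wise—each chunk can be reshaped independently and the results concatenated or appended to an output file. (Pivot is harder to chunk, since groups span chunks and need a second aggregation pass.) A sketch using an in-memory CSV as a stand-in for a large file on disk:

```python
import io
import pandas as pd

# Stand-in for a large CSV on disk; in practice pass a file path
csv_data = io.StringIO(
    "product,region,Q1,Q2\n"
    "Widget,North,1000,1200\n"
    "Gadget,North,1500,1800\n"
)

# melt() is row-wise, so each chunk is reshaped independently;
# chunksize=1 here only to demonstrate, use e.g. 100_000 in practice
chunks = [
    chunk.melt(id_vars=["product", "region"],
               value_vars=["Q1", "Q2"],
               var_name="quarter", value_name="revenue")
    for chunk in pd.read_csv(csv_data, chunksize=1)
]
long_df = pd.concat(chunks, ignore_index=True)
```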

PySpark distributes the work but pivot operations cause shuffles—data moves between executors based on grouping keys. This is expensive. Mitigate by:

  • Specifying pivot values explicitly to avoid a discovery job
  • Filtering data before pivoting to reduce shuffle size
  • Increasing spark.sql.shuffle.partitions for large datasets

SQL performance depends entirely on your database engine. Most optimizers handle PIVOT/CASE WHEN efficiently, but wide pivots (50+ columns) can stress query planners. Check your execution plan and consider materializing intermediate results for complex transformations.

Scale guidelines:

  • Under 1GB: Pandas for simplicity
  • 1GB to 10GB: PySpark or database SQL
  • Over 10GB: PySpark with explicit pivot values and appropriate partitioning

Quick Reference Cheat Sheet

Operation     | Pandas                                           | PySpark                                  | SQL
--------------|--------------------------------------------------|------------------------------------------|--------------------------------------------
Pivot         | pivot_table(index, columns, values, aggfunc)     | groupBy().pivot().agg()                  | PIVOT (agg FOR col IN (...)) or CASE WHEN
Unpivot       | melt(id_vars, value_vars, var_name, value_name)  | select(expr("stack(n, ...)"))            | UNPIVOT (val FOR col IN (...)) or UNION ALL
Fill nulls    | fill_value=0 in pivot_table                      | .fillna(0) after pivot                   | COALESCE(col, 0)
Multiple aggs | aggfunc={'col1': 'sum', 'col2': 'mean'}          | several expressions in one .agg()        | multiple CASE WHEN aggregates

Copy-paste templates:

# Pandas pivot template
df.pivot_table(index=['key_cols'], columns='spread_col', values='value_col', aggfunc='sum', fill_value=0)

# Pandas unpivot template  
df.melt(id_vars=['key_cols'], value_vars=['col1', 'col2'], var_name='category', value_name='value')

# PySpark pivot template
df.groupBy("key_cols").pivot("spread_col", ["val1", "val2"]).agg(spark_sum("value_col")).fillna(0)

# PySpark unpivot template
df.select("key_cols", expr("stack(2, 'col1', col1, 'col2', col2) as (category, value)"))

Master these patterns and you’ll handle any data reshaping task thrown at you. Start with the simplest syntax that works, then optimize when performance becomes an issue.
