How to Unpivot a DataFrame in PySpark

Key Insights

  • PySpark 3.4+ includes a native unpivot() function that simplifies wide-to-long transformations, but the stack() SQL function remains the reliable approach for older versions and offers more control.
  • Dynamic unpivoting using Python list comprehensions to build stack() expressions handles variable column counts gracefully and makes your code maintainable as schemas evolve.
  • Unpivoting explodes row counts proportionally to the number of pivoted columns—plan your partitioning strategy accordingly and filter nulls early to minimize downstream processing costs.

Introduction to Unpivoting

Unpivoting transforms data from wide format to long format. You take multiple columns and collapse them into key-value pairs, creating more rows but fewer columns. This is the inverse of the pivot operation.

Consider a sales report where each month has its own column:

| product | jan | feb | mar |
|---------|-----|-----|-----|
| Widget  | 100 | 150 | 200 |
| Gadget  | 80  | 90  | 110 |

After unpivoting, you get:

| product | month | sales |
|---------|-------|-------|
| Widget  | jan   | 100   |
| Widget  | feb   | 150   |
| Widget  | mar   | 200   |
| Gadget  | jan   | 80    |
| Gadget  | feb   | 90    |
| Gadget  | mar   | 110   |

This transformation is essential for data normalization, time series analysis, feeding BI tools that expect long format, and preparing data for machine learning pipelines. Most analytical databases and visualization tools work better with normalized data structures.

Understanding the Source Data

Before unpivoting, you need to identify two column groups: the identifier columns that remain fixed, and the value columns that will be transformed into rows.

Let’s create a sample DataFrame representing quarterly sales across regions:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("UnpivotDemo").getOrCreate()

data = [
    ("North", "Electronics", 1200, 1500, 1800),
    ("South", "Electronics", 900, 1100, 1300),
    ("North", "Clothing", 500, 600, 750),
    ("South", "Clothing", 400, 450, 550),
]

schema = StructType([
    StructField("region", StringType(), True),
    StructField("category", StringType(), True),
    StructField("jan", IntegerType(), True),
    StructField("feb", IntegerType(), True),
    StructField("mar", IntegerType(), True),
])

df = spark.createDataFrame(data, schema)
df.show()

Output:

+------+-----------+----+----+----+
|region|   category| jan| feb| mar|
+------+-----------+----+----+----+
| North|Electronics|1200|1500|1800|
| South|Electronics| 900|1100|1300|
| North|   Clothing| 500| 600| 750|
| South|   Clothing| 400| 450| 550|
+------+-----------+----+----+----+

Here, region and category are identifier columns. The month columns (jan, feb, mar) are the values to unpivot. This distinction drives how you structure your unpivot operation.

Using the unpivot() Function (Spark 3.4+)

Spark 3.4 introduced a native unpivot() method on DataFrames. This is now the cleanest approach if you’re running a recent Spark version.

# Define identifier columns and value columns
id_cols = ["region", "category"]
value_cols = ["jan", "feb", "mar"]

# Unpivot the DataFrame
unpivoted_df = df.unpivot(
    ids=id_cols,
    values=value_cols,
    variableColumnName="month",
    valueColumnName="sales"
)

unpivoted_df.show()

Output:

+------+-----------+-----+-----+
|region|   category|month|sales|
+------+-----------+-----+-----+
| North|Electronics|  jan| 1200|
| North|Electronics|  feb| 1500|
| North|Electronics|  mar| 1800|
| South|Electronics|  jan|  900|
| South|Electronics|  feb| 1100|
| South|Electronics|  mar| 1300|
| North|   Clothing|  jan|  500|
| North|   Clothing|  feb|  600|
| North|   Clothing|  mar|  750|
| South|   Clothing|  jan|  400|
| South|   Clothing|  feb|  450|
| South|   Clothing|  mar|  550|
+------+-----------+-----+-----+

The unpivot() function takes four parameters:

  • ids: Columns to keep as identifiers (list of column names or Column objects)
  • values: Columns to unpivot (list of column names)
  • variableColumnName: Name for the new column containing original column names
  • valueColumnName: Name for the new column containing the values

If you omit values (or pass None), Spark unpivots every column not listed in ids. This is convenient, but listing the columns explicitly is usually safer for production code.

Using stack() for Older Spark Versions

For Spark versions before 3.4, the stack() SQL function is your tool. It’s a generator function that creates multiple output rows from a single input row.

unpivoted_df = df.selectExpr(
    "region",
    "category",
    "stack(3, 'jan', jan, 'feb', feb, 'mar', mar) as (month, sales)"
)

unpivoted_df.show()

The stack() function takes these arguments:

  1. The number of rows to generate per input row (3 in this case)
  2. Alternating pairs of literal values (column names as strings) and column references

The as (month, sales) clause names the two output columns. The first element of each pair becomes the month value, the second becomes the sales value.

You can also use spark.sql() if you prefer pure SQL:

df.createOrReplaceTempView("sales_wide")

unpivoted_df = spark.sql("""
    SELECT 
        region,
        category,
        stack(3, 'jan', jan, 'feb', feb, 'mar', mar) as (month, sales)
    FROM sales_wide
""")

Both approaches produce identical results. Choose based on your team’s preference for DataFrame API versus SQL.

Dynamic Unpivoting with Column Lists

Hardcoding column names works for static schemas, but real-world data often has variable columns. Building the stack() expression dynamically handles this elegantly.

from pyspark.sql import functions as F

# Identify columns programmatically
id_columns = ["region", "category"]
value_columns = [c for c in df.columns if c not in id_columns]

# Build the stack expression
num_cols = len(value_columns)
stack_pairs = ", ".join([f"'{col}', {col}" for col in value_columns])
stack_expr = f"stack({num_cols}, {stack_pairs}) as (month, sales)"

# Apply the transformation
unpivoted_df = df.selectExpr(*id_columns, stack_expr)
unpivoted_df.show()

This approach scales to any number of columns. If your schema adds apr, may, jun columns later, the code adapts automatically.

For more complex scenarios, wrap this in a reusable function:

def unpivot_dataframe(df, id_cols, value_cols, var_name, value_name):
    """
    Unpivot a DataFrame from wide to long format.
    
    Works with all Spark versions using stack().
    """
    num_cols = len(value_cols)
    stack_pairs = ", ".join([f"'{col}', `{col}`" for col in value_cols])
    stack_expr = f"stack({num_cols}, {stack_pairs}) as ({var_name}, {value_name})"
    
    return df.selectExpr(*[f"`{c}`" for c in id_cols], stack_expr)


# Usage
result = unpivot_dataframe(
    df,
    id_cols=["region", "category"],
    value_cols=["jan", "feb", "mar"],
    var_name="month",
    value_name="sales"
)

Note the backticks around column names. This handles columns with spaces or special characters that would otherwise break the SQL expression.
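To see why the backticks matter, here is the expression-building step in isolation, with a hypothetical column name containing a space. Without the backticks, the bare reference would break the SQL parser:

```python
def build_stack_expr(value_cols, var_name, value_name):
    # Backtick-quote each column reference so names with spaces
    # or special characters survive in the SQL expression
    pairs = ", ".join(f"'{c}', `{c}`" for c in value_cols)
    return f"stack({len(value_cols)}, {pairs}) as ({var_name}, {value_name})"

expr = build_stack_expr(["jan 2024", "feb 2024"], "month", "sales")
print(expr)
# stack(2, 'jan 2024', `jan 2024`, 'feb 2024', `feb 2024`) as (month, sales)
```

The string literal (first element of each pair) needs no quoting; only the column reference does.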

Handling Null Values and Data Types

Unpivoting often surfaces null values that were hidden in the wide format. Decide early whether to keep or filter them.

# Create sample data with nulls
data_with_nulls = [
    ("North", "Electronics", 1200, None, 1800),
    ("South", "Electronics", None, 1100, 1300),
]

df_nulls = spark.createDataFrame(data_with_nulls, schema)

# Unpivot and filter nulls
unpivoted_clean = (
    unpivot_dataframe(df_nulls, ["region", "category"], ["jan", "feb", "mar"], "month", "sales")
    .filter(F.col("sales").isNotNull())
)

unpivoted_clean.show()

For data type consistency, ensure all value columns share the same type before unpivoting. Mixed types cause runtime errors or unexpected casting:

# Cast columns to consistent type before unpivoting
value_cols = ["jan", "feb", "mar"]
df_typed = df.select(
    *[F.col(c) for c in id_columns],
    *[F.col(c).cast("double").alias(c) for c in value_cols]
)

If your value columns have different types by design (mixing integers and strings), you’ll need separate unpivot operations or a common type like string.
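One option, sketched here, is to push the cast into the stack expression itself rather than reshaping the DataFrame first, so every value lands as a string:

```python
# Build a stack expression that casts each value column to string inline
value_cols = ["jan", "feb", "mar"]
pairs = ", ".join(f"'{c}', cast(`{c}` as string)" for c in value_cols)
stack_expr = f"stack({len(value_cols)}, {pairs}) as (month, sales)"
print(stack_expr)
```

The resulting sales column is a string, so cast it back if downstream code needs numerics.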

Performance Considerations

Unpivoting multiplies your row count. If you have 1 million rows and unpivot 12 monthly columns, you now have 12 million rows. Plan accordingly.

Partition strategy matters. If your downstream operations filter by the unpivoted column (e.g., filtering by month), consider repartitioning after the unpivot:

unpivoted_df = (
    unpivot_dataframe(df, id_cols, value_cols, "month", "sales")
    .repartition("month")
)

Filter early. If you only need certain months or non-null values, apply filters immediately after unpivoting to reduce data volume before shuffles:

unpivoted_df = (
    unpivot_dataframe(df, id_cols, value_cols, "month", "sales")
    .filter(F.col("sales").isNotNull())
    .filter(F.col("month").isin(["jan", "feb"]))
)

Consider when to unpivot. If you’re joining with other tables, it’s often cheaper to join in wide format first, then unpivot the result. Joining on fewer rows reduces shuffle costs.

Cache strategically. If you’ll reuse the unpivoted DataFrame multiple times, cache it after filtering nulls but before downstream transformations:

unpivoted_df = (
    unpivot_dataframe(df, id_cols, value_cols, "month", "sales")
    .filter(F.col("sales").isNotNull())
    .cache()
)

# Force materialization
unpivoted_df.count()

The stack() function is a SQL generator, meaning Spark handles it efficiently within the Catalyst optimizer. It doesn’t create intermediate DataFrames, making it performant for most use cases. The native unpivot() function in Spark 3.4+ offers similar optimization.

For extremely wide tables (hundreds of columns to unpivot), consider batching the operation or evaluating whether the wide format might actually serve your use case better. Sometimes the “correct” normalized form isn’t the most practical one.
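One way to batch is to split the value-column list into chunks, unpivot each chunk separately, and union the results. The chunking step is plain Python (batch size here is an arbitrary illustration):

```python
def chunk_columns(cols, size):
    # Split a wide column list into fixed-size batches; each batch
    # can be unpivoted separately and the results unioned
    return [cols[i:i + size] for i in range(0, len(cols), size)]

print(chunk_columns([f"m{i}" for i in range(7)], 3))
# [['m0', 'm1', 'm2'], ['m3', 'm4', 'm5'], ['m6']]
```

Each batch produces a long DataFrame with the same (id, variable, value) schema, so `unionByName` stitches them back together cleanly.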
