PySpark - Explode Array Column to Rows

Key Insights

  • Use explode() to transform array columns into separate rows, duplicating non-array column values for each element—essential for denormalizing nested data structures in PySpark
  • Choose explode_outer() over explode() when you need to preserve rows with null or empty arrays, preventing data loss during transformation
  • For multiple array columns with correlated elements, use arrays_zip() before exploding to maintain element-to-element correspondence rather than creating an unintended Cartesian product

Understanding Array Columns and When to Explode Them

PySpark DataFrames frequently contain array columns when working with semi-structured data sources like JSON, Parquet files with nested schemas, or aggregated datasets. While arrays are efficient for storage and certain operations, many analytical tasks require flattening these structures into individual rows.

Common scenarios where you’ll need to explode arrays include parsing event logs where each record contains multiple actions, denormalizing e-commerce data where orders contain product lists, processing IoT sensor data with batched readings, or preparing data for machine learning algorithms that expect flat feature vectors.

Let’s start with a practical example—a customer orders dataset:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, explode_outer, arrays_zip, col
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, IntegerType

spark = SparkSession.builder.appName("ExplodeExample").getOrCreate()

# Sample data with array column
data = [
    ("ORD001", "Alice", ["Laptop", "Mouse", "Keyboard"]),
    ("ORD002", "Bob", ["Monitor"]),
    ("ORD003", "Charlie", ["Headphones", "Webcam"]),
    ("ORD004", "Diana", [])
]

df = spark.createDataFrame(data, ["order_id", "customer", "products"])
df.show(truncate=False)

Output:

+--------+--------+---------------------------+
|order_id|customer|products                   |
+--------+--------+---------------------------+
|ORD001  |Alice   |[Laptop, Mouse, Keyboard]  |
|ORD002  |Bob     |[Monitor]                  |
|ORD003  |Charlie |[Headphones, Webcam]       |
|ORD004  |Diana   |[]                         |
+--------+--------+---------------------------+

The explode() Function Fundamentals

The explode() function is your primary tool for transforming array elements into separate rows. Each element in the array becomes its own row, while values in other columns are duplicated across all resulting rows.

# Basic explode operation
exploded_df = df.select(
    col("order_id"),
    col("customer"),
    explode(col("products")).alias("product")
)

exploded_df.show(truncate=False)

Output:

+--------+--------+----------+
|order_id|customer|product   |
+--------+--------+----------+
|ORD001  |Alice   |Laptop    |
|ORD001  |Alice   |Mouse     |
|ORD001  |Alice   |Keyboard  |
|ORD002  |Bob     |Monitor   |
|ORD003  |Charlie |Headphones|
|ORD003  |Charlie |Webcam    |
+--------+--------+----------+

Notice that Diana’s order (ORD004) disappeared entirely. The explode() function drops rows where the array is null or empty. This behavior is often undesirable when you need to maintain referential integrity or track all records.

Preserving Rows with explode_outer()

When you need to retain rows with null or empty arrays, explode_outer() is the solution. It behaves exactly like explode(), except that a row whose array is null or empty is kept as a single row with null in the exploded column.

# Using explode_outer to preserve empty arrays
exploded_outer_df = df.select(
    col("order_id"),
    col("customer"),
    explode_outer(col("products")).alias("product")
)

exploded_outer_df.show(truncate=False)

Output:

+--------+--------+----------+
|order_id|customer|product   |
+--------+--------+----------+
|ORD001  |Alice   |Laptop    |
|ORD001  |Alice   |Mouse     |
|ORD001  |Alice   |Keyboard  |
|ORD002  |Bob     |Monitor   |
|ORD003  |Charlie |Headphones|
|ORD003  |Charlie |Webcam    |
|ORD004  |Diana   |null      |
+--------+--------+----------+

Now Diana’s order appears with a null product value. This is critical for maintaining accurate counts, ensuring joins don’t lose records, and preserving audit trails.

Handling Multiple Array Columns

When your DataFrame contains multiple array columns, you must decide whether elements are correlated or independent. Sequential exploding creates a Cartesian product—every element in the first array pairs with every element in the second array.

# Data with multiple array columns
multi_array_data = [
    ("ORD001", ["Laptop", "Mouse"], [1299, 25]),
    ("ORD002", ["Monitor"], [399])
]

multi_df = spark.createDataFrame(multi_array_data, ["order_id", "products", "prices"])

# Sequential explode - creates Cartesian product (usually NOT what you want)
cartesian = multi_df.select(
    col("order_id"),
    explode(col("products")).alias("product")
).select(
    col("order_id"),
    col("product"),
    explode(col("prices")).alias("price")
)

cartesian.show()

Output:

+--------+-------+-----+
|order_id|product|price|
+--------+-------+-----+
|ORD001  |Laptop |1299 |
|ORD001  |Laptop |25   |
|ORD001  |Mouse  |1299 |
|ORD001  |Mouse  |25   |
|ORD002  |Monitor|399  |
+--------+-------+-----+

This is wrong—the Laptop is now associated with both prices. For correlated arrays, use arrays_zip():

# Correct approach - zip arrays first, then explode
zipped_df = multi_df.select(
    col("order_id"),
    explode(arrays_zip(col("products"), col("prices"))).alias("item")
).select(
    col("order_id"),
    col("item.products").alias("product"),
    col("item.prices").alias("price")
)

zipped_df.show()

Output:

+--------+-------+-----+
|order_id|product|price|
+--------+-------+-----+
|ORD001  |Laptop |1299 |
|ORD001  |Mouse  |25   |
|ORD002  |Monitor|399  |
+--------+-------+-----+

Now each product correctly pairs with its corresponding price. The arrays_zip() function creates an array of structs, maintaining element correspondence.

Exploding Arrays of Structs

Real-world data often contains arrays of complex objects (structs). Exploding these requires accessing nested fields after the explosion.

from pyspark.sql.types import DoubleType

# Data with array of structs
struct_data = [
    ("ORD001", [
        {"product": "Laptop", "price": 1299.99, "quantity": 1},
        {"product": "Mouse", "price": 24.99, "quantity": 2}
    ]),
    ("ORD002", [
        {"product": "Monitor", "price": 399.99, "quantity": 1}
    ])
]

schema = StructType([
    StructField("order_id", StringType()),
    StructField("items", ArrayType(StructType([
        StructField("product", StringType()),
        StructField("price", DoubleType()),
        StructField("quantity", IntegerType())
    ])))
])

struct_df = spark.createDataFrame(struct_data, schema)

# Explode and access struct fields
result = struct_df.select(
    col("order_id"),
    explode(col("items")).alias("item")
).select(
    col("order_id"),
    col("item.product").alias("product"),
    col("item.price").alias("price"),
    col("item.quantity").alias("quantity")
)

result.show()

Output:

+--------+-------+-------+--------+
|order_id|product|  price|quantity|
+--------+-------+-------+--------+
|ORD001  |Laptop |1299.99|       1|
|ORD001  |Mouse  |  24.99|       2|
|ORD002  |Monitor| 399.99|       1|
+--------+-------+-------+--------+

Use dot notation to access struct fields after exploding. You can also expand all struct fields at once with a star expression, e.g. select("item.*").

Performance Considerations

Exploding arrays significantly increases row count, which impacts memory, shuffle operations, and processing time. Consider these optimization strategies:

Partition Management: After exploding, your data distribution changes dramatically. A single row with 1,000 array elements becomes 1,000 rows, and all of them land in the partition that held the original row. This creates data skew.

# Repartition after explode to spread the data back out
# (200 is an illustrative partition count - tune it to your cluster)
optimized_df = exploded_df.repartition(200, "product")

Filter Before Exploding: If you only need specific array elements, filter them before exploding to reduce the explosion size.

# filter() here is the higher-order array function (Spark 3.1+),
# aliased to avoid shadowing Python's built-in filter
from pyspark.sql.functions import filter as array_filter

# Filter array elements before exploding
filtered_explode = df.select(
    col("order_id"),
    col("customer"),
    explode(
        array_filter(col("products"), lambda x: x.startswith("M"))
    ).alias("product")
)

Memory Implications: Exploding creates data duplication. A row with 10 string columns and a 100-element array will replicate those 10 strings 100 times. Monitor executor memory and adjust spark.executor.memory if you encounter OOM errors.

Alternatives for Very Large Arrays: If individual arrays contain thousands of elements, consider whether you truly need row-level granularity. Sometimes aggregating within the array using higher-order functions (transform, aggregate) is more efficient than exploding.

Real-World Example: E-Commerce Analytics

Let’s combine everything in a practical scenario—analyzing product performance from order data:

from pyspark.sql.functions import sum as _sum, count

# Realistic e-commerce data
ecommerce_data = [
    ("ORD001", "2024-01-15", "Alice", [
        {"product": "Laptop", "category": "Electronics", "price": 1299.99, "quantity": 1},
        {"product": "Mouse", "category": "Electronics", "price": 24.99, "quantity": 2}
    ]),
    ("ORD002", "2024-01-16", "Bob", [
        {"product": "Desk", "category": "Furniture", "price": 399.99, "quantity": 1}
    ]),
    ("ORD003", "2024-01-16", "Alice", [
        {"product": "Laptop", "category": "Electronics", "price": 1299.99, "quantity": 1},
        {"product": "Keyboard", "category": "Electronics", "price": 79.99, "quantity": 1}
    ])
]

schema = StructType([
    StructField("order_id", StringType()),
    StructField("order_date", StringType()),
    StructField("customer", StringType()),
    StructField("items", ArrayType(StructType([
        StructField("product", StringType()),
        StructField("category", StringType()),
        StructField("price", DoubleType()),
        StructField("quantity", IntegerType())
    ])))
])

orders_df = spark.createDataFrame(ecommerce_data, schema)

# Explode and calculate per-product metrics
product_metrics = orders_df.select(
    col("order_id"),
    col("order_date"),
    col("customer"),
    explode(col("items")).alias("item")
).select(
    col("item.product").alias("product"),
    col("item.category").alias("category"),
    (col("item.price") * col("item.quantity")).alias("revenue"),
    col("item.quantity").alias("quantity")
).groupBy("product", "category").agg(
    _sum("revenue").alias("total_revenue"),
    _sum("quantity").alias("total_quantity"),
    count("*").alias("order_count")
).orderBy(col("total_revenue").desc())

product_metrics.show()

Output:

+--------+-----------+-------------+--------------+-----------+
| product|   category|total_revenue|total_quantity|order_count|
+--------+-----------+-------------+--------------+-----------+
|  Laptop|Electronics|      2599.98|             2|          2|
|    Desk|  Furniture|       399.99|             1|          1|
|Keyboard|Electronics|        79.99|             1|          1|
|   Mouse|Electronics|        49.98|             2|          1|
+--------+-----------+-------------+--------------+-----------+

This pattern—explode, flatten, aggregate—is fundamental to PySpark analytics. You’ve transformed nested order data into actionable product-level insights that can drive business decisions.

Mastering explode() and its variants is essential for working with real-world PySpark data. Whether you’re parsing API responses, denormalizing for BI tools, or preparing features for ML pipelines, these techniques will be in constant use. Remember to choose explode_outer() when data completeness matters, use arrays_zip() for correlated arrays, and always consider the performance implications of dramatically increasing your row count.
