How to Explode Arrays in PySpark

Key Insights

  • Use explode() to flatten array columns into individual rows, but switch to explode_outer() when you need to preserve rows with null or empty arrays
  • posexplode() gives you both the element and its position index, which is essential for maintaining order-dependent logic after flattening
  • Exploding large arrays can cause significant data skew and memory pressure—filter or limit array sizes before exploding when working with production-scale data

Introduction

Array columns are everywhere in PySpark. Whether you’re parsing JSON from an API, processing log files with repeated fields, or working with denormalized data from a NoSQL database, you’ll eventually encounter nested arrays that need flattening.

The problem is straightforward: you have a DataFrame where one column contains arrays, and you need each array element to become its own row. PySpark’s explode() family of functions handles this, but choosing the wrong variant can silently drop data or leave you without positional information you need downstream.

This article covers the four main explode functions, when to use each, and the performance gotchas that will bite you in production.

Understanding the explode() Function

The explode() function takes an array column and creates one output row for each element in the array. All other columns in the row are duplicated for each new row.

Here’s the basic pattern:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col

spark = SparkSession.builder.appName("ExplodeDemo").getOrCreate()

# Sample data with array column
data = [
    (1, "alice", ["python", "java", "scala"]),
    (2, "bob", ["rust", "go"]),
    (3, "carol", ["javascript"])
]

df = spark.createDataFrame(data, ["id", "name", "languages"])

df.show(truncate=False)
# +---+-----+---------------------+
# |id |name |languages            |
# +---+-----+---------------------+
# |1  |alice|[python, java, scala]|
# |2  |bob  |[rust, go]           |
# |3  |carol|[javascript]         |
# +---+-----+---------------------+

# Explode the array column
exploded_df = df.select("id", "name", explode("languages").alias("language"))

exploded_df.show()
# +---+-----+----------+
# | id| name|  language|
# +---+-----+----------+
# |  1|alice|    python|
# |  1|alice|      java|
# |  1|alice|     scala|
# |  2|  bob|      rust|
# |  2|  bob|        go|
# |  3|carol|javascript|
# +---+-----+----------+

Notice that Alice’s single row became three rows—one for each language in her array. The id and name columns are duplicated across all resulting rows.

You can also use explode() within a withColumn() call if you want to keep the original array alongside the exploded values, though this is less common:

df.withColumn("language", explode("languages")).show()

Handling Null and Empty Arrays with explode_outer()

Here’s where many developers get burned: explode() silently drops rows where the array is null or empty. If you’re not expecting this, you’ll lose data without any warning.

from pyspark.sql.functions import explode, explode_outer

data_with_nulls = [
    (1, "alice", ["python", "java"]),
    (2, "bob", []),           # Empty array
    (3, "carol", None),       # Null array
    (4, "dave", ["rust"])
]

df_nulls = spark.createDataFrame(data_with_nulls, ["id", "name", "languages"])

# Regular explode - drops rows with null/empty arrays
print("Using explode():")
df_nulls.select("id", "name", explode("languages").alias("language")).show()
# +---+-----+--------+
# | id| name|language|
# +---+-----+--------+
# |  1|alice|  python|
# |  1|alice|    java|
# |  4| dave|    rust|
# +---+-----+--------+
# Bob and Carol are gone!

# explode_outer preserves rows with null values
print("Using explode_outer():")
df_nulls.select("id", "name", explode_outer("languages").alias("language")).show()
# +---+-----+--------+
# | id| name|language|
# +---+-----+--------+
# |  1|alice|  python|
# |  1|alice|    java|
# |  2|  bob|    null|
# |  3|carol|    null|
# |  4| dave|    rust|
# +---+-----+--------+
# Bob and Carol are preserved with null language values

Use explode_outer() when:

  • You need to preserve all original entities for reporting or auditing
  • Downstream joins depend on row existence
  • You’re debugging and need to see which records have missing data

Use explode() when:

  • You genuinely want to filter out records with no array elements
  • You’re aggregating and null/empty arrays are meaningless for your analysis

Exploding with Position Index Using posexplode()

Sometimes the position of an element within the array matters. Maybe you’re processing ordered events, ranked preferences, or sequential steps. posexplode() returns both the element and its zero-based index.

from pyspark.sql.functions import posexplode, posexplode_outer

data = [
    (1, "alice", ["first_choice", "second_choice", "third_choice"]),
    (2, "bob", ["only_option"])
]

df = spark.createDataFrame(data, ["id", "name", "preferences"])

# posexplode returns position and element
df.select("id", "name", posexplode("preferences").alias("rank", "preference")).show()
# +---+-----+----+-------------+
# | id| name|rank|   preference|
# +---+-----+----+-------------+
# |  1|alice|   0| first_choice|
# |  1|alice|   1|second_choice|
# |  1|alice|   2| third_choice|
# |  2|  bob|   0|  only_option|
# +---+-----+----+-------------+

The position column lets you reconstruct the original order or filter by position:

# Get only the top preference for each user
from pyspark.sql.functions import col

df.select("id", "name", posexplode("preferences").alias("rank", "preference")) \
  .filter(col("rank") == 0) \
  .show()

There’s also posexplode_outer(), which combines position tracking with null preservation; use it when you need both behaviors.

Exploding Map Types

The explode() function also works on map (dictionary) columns, producing key and value columns in the output:


data = [
    (1, "alice", {"python": 5, "java": 3, "scala": 2}),
    (2, "bob", {"rust": 4, "go": 4})
]

df = spark.createDataFrame(data, ["id", "name", "skill_levels"])

# Explode map into key-value pairs
df.select("id", "name", explode("skill_levels").alias("skill", "level")).show()
# +---+-----+------+-----+
# | id| name| skill|level|
# +---+-----+------+-----+
# |  1|alice|python|    5|
# |  1|alice|  java|    3|
# |  1|alice| scala|    2|
# |  2|  bob|  rust|    4|
# |  2|  bob|    go|    4|
# +---+-----+------+-----+

This is particularly useful when parsing JSON objects where keys aren’t known ahead of time, or when you need to pivot dynamic attributes into a normalized format.

Practical Patterns and Performance Tips

Exploding Multiple Columns

If you need to explode multiple array columns, you have two options. Chaining explodes creates a Cartesian product—every combination of elements:

data = [
    (1, ["a", "b"], [1, 2])
]
df = spark.createDataFrame(data, ["id", "letters", "numbers"])

# Chained explode = Cartesian product (4 rows from 2x2)
df.select("id", "numbers", explode("letters").alias("letter")) \
  .select("id", "letter", explode("numbers").alias("number")) \
  .show()

If you need parallel explosion (element 0 from both arrays together), use arrays_zip() first:

from pyspark.sql.functions import arrays_zip

df.select("id", explode(arrays_zip("letters", "numbers")).alias("zipped")) \
  .select("id", col("zipped.letters").alias("letter"), col("zipped.numbers").alias("number")) \
  .show()

Performance Considerations

Exploding arrays can dramatically increase your row count and cause problems:

  1. Data skew: If some arrays have thousands of elements while others have one, partitions become unbalanced. Consider filtering extreme outliers before exploding.

  2. Memory pressure: Each exploded row duplicates all non-array columns. Wide DataFrames with large arrays can exhaust executor memory.

  3. Shuffle costs: Operations after explode (joins, aggregations) operate on the larger dataset.

# Real-world example: parsing nested API response
from pyspark.sql.functions import from_json, explode, col
from pyspark.sql.types import StructType, ArrayType, StringType, StructField

# Assume raw_df has a JSON string column "response" and an identifier column "request_id"
schema = StructType([
    StructField("status", StringType()),
    StructField("results", ArrayType(StructType([
        StructField("id", StringType()),
        StructField("tags", ArrayType(StringType()))
    ])))
])

# Parse JSON, explode results, then explode tags
parsed_df = raw_df.withColumn("parsed", from_json("response", schema))

final_df = parsed_df \
    .select("request_id", explode_outer("parsed.results").alias("result")) \
    .select("request_id", "result.id", explode_outer("result.tags").alias("tag"))

Conclusion

The explode family covers most array-flattening scenarios you’ll encounter:

  • explode(): Basic flattening, drops nulls and empties
  • explode_outer(): Preserves rows with null/empty arrays
  • posexplode(): Includes element position index
  • posexplode_outer(): Position index with null preservation

Default to explode_outer() unless you explicitly want to drop empty records—it’s easier to filter nulls later than to debug why rows disappeared. Use posexplode() whenever element order has meaning in your domain.

For the official documentation and additional examples, see the Spark SQL Functions reference.
