How to Explode a Column in Polars

Key Insights

  • The explode() method transforms list columns into individual rows, duplicating values in other columns for each list element—essential for normalizing nested data structures
  • You can explode multiple columns simultaneously, but they must have equal list lengths in each row or Polars will raise an error
  • Combining explode() with lazy evaluation prevents unnecessary memory allocation when working with large datasets that expand significantly after explosion

Introduction

Data rarely arrives in the clean, normalized format you need. JSON APIs return nested arrays. Aggregation operations produce list columns. CSV files contain comma-separated values stuffed into single cells. When you need to analyze this data at the element level, you need to “explode” these list columns into individual rows.

Exploding—sometimes called “unnesting” or “flattening”—takes each element in a list column and creates a separate row for it, duplicating the values in all other columns. A single row with a list of five items becomes five rows, each containing one item from that list.

Polars handles this operation efficiently through its explode() method. Let’s examine how to use it effectively.

Understanding List Columns in Polars

Polars supports nested data types natively. The List type can contain elements of any other Polars dtype—integers, strings, floats, even nested lists. This differs from pandas, where list columns are typically stored as Python objects with significant performance penalties.

List columns commonly appear when:

  • Parsing JSON data with array fields
  • Running group_by().agg() operations that collect values into lists
  • Reading data from sources like Parquet that support nested types
  • Splitting string columns on delimiters

Here’s how list columns look in practice:

import polars as pl

# Creating a DataFrame with list columns
df = pl.DataFrame({
    "order_id": [1, 2, 3],
    "customer": ["Alice", "Bob", "Charlie"],
    "items": [
        ["laptop", "mouse"],
        ["keyboard", "monitor", "webcam"],
        ["headphones"]
    ],
    "quantities": [
        [1, 2],
        [1, 1, 1],
        [3]
    ]
})

print(df)

Output:

shape: (3, 4)
┌──────────┬─────────┬─────────────────────────────┬───────────┐
│ order_id ┆ customer┆ items                       ┆ quantities│
│ ---      ┆ ---     ┆ ---                         ┆ ---       │
│ i64      ┆ str     ┆ list[str]                   ┆ list[i64] │
╞══════════╪═════════╪═════════════════════════════╪═══════════╡
│ 1        ┆ Alice   ┆ ["laptop", "mouse"]         ┆ [1, 2]    │
│ 2        ┆ Bob     ┆ ["keyboard", "monitor", …]  ┆ [1, 1, 1] │
│ 3        ┆ Charlie ┆ ["headphones"]              ┆ [3]       │
└──────────┴─────────┴─────────────────────────────┴───────────┘

Notice the list[str] and list[i64] dtypes. Polars knows exactly what’s inside these lists and can optimize operations accordingly.

Basic Column Explosion with explode()

The explode() method takes one or more column names and expands each list element into its own row. All other columns get their values duplicated.

# Explode the items column
exploded_df = df.explode("items")
print(exploded_df)

Output:

shape: (6, 4)
┌──────────┬─────────┬────────────┬───────────┐
│ order_id ┆ customer┆ items      ┆ quantities│
│ ---      ┆ ---     ┆ ---        ┆ ---       │
│ i64      ┆ str     ┆ str        ┆ list[i64] │
╞══════════╪═════════╪════════════╪═══════════╡
│ 1        ┆ Alice   ┆ laptop     ┆ [1, 2]    │
│ 1        ┆ Alice   ┆ mouse      ┆ [1, 2]    │
│ 2        ┆ Bob     ┆ keyboard   ┆ [1, 1, 1] │
│ 2        ┆ Bob     ┆ monitor    ┆ [1, 1, 1] │
│ 2        ┆ Bob     ┆ webcam     ┆ [1, 1, 1] │
│ 3        ┆ Charlie ┆ headphones ┆ [3]       │
└──────────┴─────────┴────────────┴───────────┘

The items column changed from list[str] to str. Each item now has its own row, with order_id and customer duplicated appropriately. The quantities column remains a list—we only exploded items.

You can also use the expression API for more flexibility:

# Using explode() as an expression
result = df.select(pl.col("items").explode())

Every column produced by a select() must have the same length, so you can't mix an exploded column with the original, shorter columns in one selection—that raises a ShapeError. The expression form is still useful when you want to explode and rename or transform a column in a single step.

Exploding Multiple Columns

When list columns are related—like items and their quantities—you often need to explode them together. Polars supports this, but with a critical requirement: the lists must have the same length in each row.

# Explode both items and quantities together
exploded_full = df.explode("items", "quantities")
print(exploded_full)

Output:

shape: (6, 4)
┌──────────┬─────────┬────────────┬────────────┐
│ order_id ┆ customer┆ items      ┆ quantities │
│ ---      ┆ ---     ┆ ---        ┆ ---        │
│ i64      ┆ str     ┆ str        ┆ i64        │
╞══════════╪═════════╪════════════╪════════════╡
│ 1        ┆ Alice   ┆ laptop     ┆ 1          │
│ 1        ┆ Alice   ┆ mouse      ┆ 2          │
│ 2        ┆ Bob     ┆ keyboard   ┆ 1          │
│ 2        ┆ Bob     ┆ monitor    ┆ 1          │
│ 2        ┆ Bob     ┆ webcam     ┆ 1          │
│ 3        ┆ Charlie ┆ headphones ┆ 3          │
└──────────┴─────────┴────────────┴────────────┘

Now each item is paired with its corresponding quantity. Both columns are now scalar types.

If list lengths don’t match, Polars raises a ShapeError:

# This will fail - mismatched list lengths
bad_df = pl.DataFrame({
    "id": [1],
    "col_a": [["x", "y", "z"]],
    "col_b": [[1, 2]]  # Only 2 elements vs 3
})

try:
    bad_df.explode("col_a", "col_b")
except pl.exceptions.ShapeError as e:
    print(f"Error: {e}")

Before exploding multiple columns, verify they have matching lengths:

# Check list lengths match before exploding
df_with_check = df.with_columns(
    (pl.col("items").list.len() == pl.col("quantities").list.len()).alias("lengths_match")
)

Combining Explode with Other Operations

Real-world data pipelines rarely use explode() in isolation. Here’s a practical example that filters, explodes, and aggregates:

# Sample e-commerce data
orders = pl.DataFrame({
    "order_id": [101, 102, 103, 104],
    "region": ["US", "EU", "US", "EU"],
    "products": [
        ["widget_a", "widget_b"],
        ["widget_a", "widget_c", "widget_d"],
        ["widget_b"],
        ["widget_a", "widget_b", "widget_c"]
    ],
    "prices": [
        [29.99, 49.99],
        [29.99, 19.99, 39.99],
        [49.99],
        [29.99, 49.99, 19.99]
    ]
})

# Pipeline: Filter to US, explode, calculate totals by product
us_product_revenue = (
    orders
    .filter(pl.col("region") == "US")
    .explode("products", "prices")
    .group_by("products")
    .agg(
        pl.col("prices").sum().alias("total_revenue"),
        pl.col("prices").count().alias("units_sold")
    )
    .sort("total_revenue", descending=True)
)

print(us_product_revenue)

Output:

shape: (2, 3)
┌──────────┬───────────────┬────────────┐
│ products ┆ total_revenue ┆ units_sold │
│ ---      ┆ ---           ┆ ---        │
│ str      ┆ f64           ┆ u32        │
╞══════════╪═══════════════╪════════════╡
│ widget_b ┆ 99.98         ┆ 2          │
│ widget_a ┆ 29.99         ┆ 1          │
└──────────┴───────────────┴────────────┘

You can also use with_columns() to add computed columns after explosion:

# Add row numbers within each original order
exploded_with_index = (
    orders
    .explode("products", "prices")
    .with_columns(
        pl.col("products").cum_count().over("order_id").alias("item_number")
    )
)

Performance Considerations

Explosion can dramatically increase your row count. A DataFrame with 1 million rows where each row contains a list of 100 elements becomes 100 million rows after explosion. This has real memory and performance implications.

Polars’ lazy evaluation helps manage this:

import time

# Create a larger dataset
large_df = pl.DataFrame({
    "id": range(100_000),
    "values": [[i, i+1, i+2, i+3, i+4] for i in range(100_000)]
})

# Eager execution - materializes the full exploded DataFrame
start = time.perf_counter()
eager_result = (
    large_df
    .explode("values")
    .filter(pl.col("values") > 50_000)
    .group_by(pl.col("values") % 100)
    .agg(pl.len())
)
eager_time = time.perf_counter() - start

# Lazy execution - optimizes the query plan
start = time.perf_counter()
lazy_result = (
    large_df
    .lazy()
    .explode("values")
    .filter(pl.col("values") > 50_000)
    .group_by(pl.col("values") % 100)
    .agg(pl.len())
    .collect()
)
lazy_time = time.perf_counter() - start

print(f"Eager: {eager_time:.3f}s")
print(f"Lazy:  {lazy_time:.3f}s")

The lazy version often performs better because Polars can optimize the entire query plan. It might push filters closer to the data source or combine operations.

Additional performance tips:

  1. Filter before exploding when possible. Removing rows before explosion means fewer rows to create.

  2. Select only needed columns. Don’t carry unnecessary columns through the explosion.

  3. Use streaming for very large datasets. Polars’ streaming engine processes data in chunks:

# Streaming execution for memory-constrained environments
result = (
    pl.scan_parquet("large_file.parquet")
    .explode("list_column")
    .filter(pl.col("value") > threshold)
    .collect(engine="streaming")  # collect(streaming=True) on older Polars versions
)
  4. Consider whether you actually need to explode. Sometimes list operations (list.sum(), list.mean(), list.contains()) can answer your question without creating additional rows.

Conclusion

The explode() method is your primary tool for normalizing list columns in Polars. Use it on single columns for simple unnesting, or multiple columns simultaneously when they contain related data of equal length. Combine it with filtering, grouping, and other operations to build complete data transformation pipelines.

For datasets where explosion would create billions of rows, leverage lazy evaluation and streaming to keep memory usage manageable. And always ask whether you truly need row-level data or if list operations would suffice.

The Polars documentation on nested data types covers additional operations like list.eval() for applying expressions within lists without exploding them—useful when you need to transform nested data but keep the list structure intact.
