How to Explode a Column in Pandas
Key Insights
- The explode() method transforms list-like elements into separate rows, preserving the original index by default; use ignore_index=True when you need a clean sequential index.
- Since Pandas 1.3, you can explode multiple columns simultaneously, but the lists must have matching lengths in each row or you'll get a ValueError.
- Combine str.split() with explode() to handle comma-separated strings, one of the most common real-world data cleaning patterns.
What Does “Exploding” Mean in Pandas?
When working with real-world data, you’ll frequently encounter columns containing list-like values. Maybe you’re parsing JSON from an API, dealing with multi-select form fields, or processing denormalized database exports. These nested structures are convenient for storage but terrible for analysis.
The explode() method solves this by transforming each element in a list-like column into its own row. If you have a row with a list of three items, you get three rows—each with one item. The rest of the columns duplicate accordingly.
This operation is the inverse of aggregation. Where groupby().agg(list) combines rows into lists, explode() breaks lists back into rows.
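To make that inverse relationship concrete, here is a small round trip (the column names are illustrative):

```python
import pandas as pd

# Exploded ("long") data: one row per order item
long_df = pd.DataFrame({
    'order_id': [1, 1, 2],
    'item': ['apple', 'banana', 'orange']
})

# Aggregate items back into lists, one row per order
wide_df = long_df.groupby('order_id', as_index=False).agg({'item': list})

# Exploding the lists recovers the original long shape
round_trip = wide_df.explode('item', ignore_index=True)
print(round_trip.equals(long_df))  # True if the round trip is lossless
```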
Understanding the explode() Method
The syntax is straightforward:
DataFrame.explode(column, ignore_index=False)
The column parameter accepts a single column label or a list of labels (for multi-column explosion). The ignore_index parameter controls whether to preserve the original index or create a new sequential one.
Here’s the basic behavior:
import pandas as pd
df = pd.DataFrame({
'order_id': [1, 2, 3],
'items': [['apple', 'banana'], ['orange'], ['grape', 'melon', 'kiwi']]
})
print("Original DataFrame:")
print(df)
print()
exploded = df.explode('items')
print("After explode():")
print(exploded)
Output:
Original DataFrame:
order_id items
0 1 [apple, banana]
1 2 [orange]
2 3 [grape, melon, kiwi]
After explode():
order_id items
0 1 apple
0 1 banana
1 2 orange
2 3 grape
2 3 melon
2 3 kiwi
Notice how the index preserves the original row numbers. Order 1 appears twice with index 0, order 3 appears three times with index 2. This index preservation is intentional—it lets you trace exploded rows back to their source.
The method handles lists, tuples, and sets. However, sets lose their order since they’re unordered collections. Strings are treated as scalars, not iterables, so a string value stays as-is rather than exploding into individual characters.
df_mixed = pd.DataFrame({
'id': [1, 2, 3],
'values': [[1, 2], (3, 4), {5, 6}] # list, tuple, set
})
print(df_mixed.explode('values'))
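The scalar treatment of strings is easy to verify with a column that mixes a plain string and a list:

```python
import pandas as pd

df_str = pd.DataFrame({'id': [1, 2], 'values': ['abc', ['x', 'y']]})

# The string 'abc' stays whole; only the list ['x', 'y'] splits into rows
res = df_str.explode('values')
print(res)
```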
Handling Missing and Empty Values
Real data is messy. You’ll encounter NaN values, None, and empty lists. Here’s how explode() handles each:
df_messy = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'tags': [
['a', 'b'], # normal list
[], # empty list
None, # None value
float('nan'), # NaN
['c'] # single-item list
]
})
print("Original:")
print(df_messy)
print()
print("After explode():")
print(df_messy.explode('tags'))
Output:
Original:
id tags
0 1 [a, b]
1 2 []
2 3 None
3 4 NaN
4 5 [c]
After explode():
id tags
0 1 a
0 1 b
1 2 NaN
2 3 NaN
3 4 NaN
4 5 c
Empty lists become NaN in the exploded result. None and NaN pass through unchanged. This behavior makes sense—an empty list has no elements to create rows from, so you get one row with a missing value.
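If those placeholder NaN rows are unwanted downstream, one option is to drop them right after exploding:

```python
import pandas as pd

df_messy = pd.DataFrame({
    'id': [1, 2, 3],
    'tags': [['a', 'b'], [], None]
})

# Empty lists and None both explode to NaN; dropna removes them in one pass
cleaned = df_messy.explode('tags').dropna(subset=['tags'])
print(cleaned)
```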
When you need a clean index for subsequent operations, use ignore_index=True:
exploded_clean = df_messy.explode('tags', ignore_index=True)
print(exploded_clean)
Output:
id tags
0 1 a
1 1 b
2 2 NaN
3 3 NaN
4 4 NaN
5 5 c
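If you want both a clean sequential index and a record of which source row each value came from, one approach is to explode first and then call reset_index(), which promotes the preserved index to a regular column (named 'index' by pandas' default):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'tags': [['a', 'b'], ['c']]})

# Explode preserves the original index; reset_index turns it into a column
traced = df.explode('tags').reset_index()
print(traced)
#    index  id tags
# 0      0   1    a
# 1      0   1    b
# 2      1   2    c
```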
If empty lists shouldn’t produce rows at all, filter them first:
df_filtered = df_messy[df_messy['tags'].apply(lambda x: isinstance(x, list) and len(x) > 0)]
print(df_filtered.explode('tags', ignore_index=True))
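An alternative to the lambda is Series.str.len(), which also reports the length of list values in an object column; None and NaN yield NaN, which compares as False and is filtered out too (worth verifying against your pandas version):

```python
import pandas as pd

df_messy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'tags': [['a', 'b'], [], None, ['c']]
})

# str.len() works element-wise on lists in object columns;
# None yields NaN, and NaN > 0 evaluates to False, so both get dropped
mask = df_messy['tags'].str.len() > 0
filtered = df_messy[mask].explode('tags', ignore_index=True)
print(filtered)
```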
Exploding Multiple Columns
Pandas 1.3 introduced the ability to explode multiple columns simultaneously. This is essential when you have parallel lists that must stay aligned.
df_multi = pd.DataFrame({
'order_id': [1, 2],
'products': [['laptop', 'mouse'], ['keyboard', 'monitor', 'cable']],
'quantities': [[1, 2], [1, 1, 3]]
})
print("Original:")
print(df_multi)
print()
exploded_multi = df_multi.explode(['products', 'quantities'])
print("After multi-column explode:")
print(exploded_multi)
Output:
Original:
order_id products quantities
0 1 [laptop, mouse] [1, 2]
1 2 [keyboard, monitor, cable] [1, 1, 3]
After multi-column explode:
order_id products quantities
0 1 laptop 1
0 1 mouse 2
1 2 keyboard 1
1 2 monitor 1
1 2 cable 3
The lists must have matching lengths within each row. If order 1 had two products but three quantities, you’d get a ValueError. Validate your data before exploding:
def validate_list_lengths(row, cols):
    lengths = [len(row[col]) if isinstance(row[col], (list, tuple)) else 1 for col in cols]
    return len(set(lengths)) == 1

# Check for mismatched rows
mask = df_multi.apply(lambda row: validate_list_lengths(row, ['products', 'quantities']), axis=1)
if not mask.all():
    print("Warning: Found rows with mismatched list lengths")
    print(df_multi[~mask])
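If mismatches are expected and you'd rather repair than reject, one approach is to pad the shorter lists with None so every row lines up before exploding (the padding value is a design choice, not a pandas requirement):

```python
import pandas as pd

df_bad = pd.DataFrame({
    'order_id': [1],
    'products': [['laptop', 'mouse']],
    'quantities': [[1]]  # one quantity short
})

cols = ['products', 'quantities']

def pad_row(row):
    # Pad every list in the row out to the longest list's length
    longest = max(len(row[c]) for c in cols)
    for c in cols:
        row[c] = list(row[c]) + [None] * (longest - len(row[c]))
    return row

df_fixed = df_bad.apply(pad_row, axis=1)
result = df_fixed.explode(cols, ignore_index=True)
print(result)
```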
Practical Use Cases
Parsing JSON API Responses
APIs often return nested data. Here’s a realistic example with user permissions:
import json
# Simulated API response
api_data = [
{'user_id': 101, 'name': 'Alice', 'roles': ['admin', 'editor']},
{'user_id': 102, 'name': 'Bob', 'roles': ['viewer']},
{'user_id': 103, 'name': 'Charlie', 'roles': ['editor', 'viewer', 'commenter']}
]
df_users = pd.DataFrame(api_data)
df_permissions = df_users.explode('roles', ignore_index=True)
df_permissions.rename(columns={'roles': 'role'}, inplace=True)
print(df_permissions)
Output:
user_id name role
0 101 Alice admin
1 101 Alice editor
2 102 Bob viewer
3 103 Charlie editor
4 103 Charlie viewer
5 103 Charlie commenter
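With one row per user-role pair, building a role membership matrix becomes a one-liner with pd.crosstab:

```python
import pandas as pd

# Exploded user-role pairs, as produced above
df_permissions = pd.DataFrame({
    'name': ['Alice', 'Alice', 'Bob'],
    'role': ['admin', 'editor', 'viewer']
})

# Rows are users, columns are roles, 1 marks membership
role_matrix = pd.crosstab(df_permissions['name'], df_permissions['role'])
print(role_matrix)
```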
Splitting Comma-Separated Values
This is one of the most common real-world patterns. Combine str.split() with explode():
df_products = pd.DataFrame({
'product_id': [1, 2, 3],
'name': ['Widget', 'Gadget', 'Gizmo'],
'tags': ['electronics, sale, featured', 'electronics', 'home, kitchen, sale']
})
# Split the string into a list, then explode
df_products['tags'] = df_products['tags'].str.split(', ')
df_tags = df_products.explode('tags', ignore_index=True)
print(df_tags)
Output:
product_id name tags
0 1 Widget electronics
1 1 Widget sale
2 1 Widget featured
3 2 Gadget electronics
4 3 Gizmo home
5 3 Gizmo kitchen
6 3 Gizmo sale
Now you can easily filter products by tag or count tag frequencies:
# Products with 'sale' tag
sale_products = df_tags[df_tags['tags'] == 'sale']['product_id'].unique()
# Tag frequency
tag_counts = df_tags['tags'].value_counts()
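If a product can appear more than once in the exploded rows, counting distinct products per tag with nunique() avoids double-counting where value_counts() would not:

```python
import pandas as pd

# Exploded product-tag pairs; product 1 carries two tags
df_tags = pd.DataFrame({
    'product_id': [1, 1, 2, 3],
    'tags': ['sale', 'featured', 'electronics', 'sale']
})

# Count distinct products carrying each tag
products_per_tag = df_tags.groupby('tags')['product_id'].nunique()
print(products_per_tag)
```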
Flattening Survey Data
Multi-select survey questions often store responses as lists:
df_survey = pd.DataFrame({
'respondent_id': [1, 2, 3],
'age_group': ['18-24', '25-34', '35-44'],
'preferred_channels': [
['email', 'sms'],
['email', 'app', 'social'],
['phone']
]
})
df_channels = df_survey.explode('preferred_channels', ignore_index=True)
channel_by_age = pd.crosstab(df_channels['age_group'], df_channels['preferred_channels'])
print(channel_by_age)
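crosstab can also normalize the counts, giving the share of each channel within an age group (normalize='index' makes each row sum to 1):

```python
import pandas as pd

# Exploded survey responses (a small illustrative subset)
df_channels = pd.DataFrame({
    'age_group': ['18-24', '18-24', '25-34'],
    'preferred_channels': ['email', 'sms', 'email']
})

# Each row sums to 1.0: per-age-group channel proportions
shares = pd.crosstab(df_channels['age_group'],
                     df_channels['preferred_channels'],
                     normalize='index')
print(shares)
```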
Performance Considerations
Exploding creates rows. A DataFrame with 1 million rows where each list averages 10 elements becomes 10 million rows. Memory usage scales accordingly.
For large datasets, consider these strategies:
Process in chunks:
def explode_chunked(df, column, chunk_size=100000):
    chunks = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size].explode(column)
        chunks.append(chunk)
    return pd.concat(chunks, ignore_index=True)
Filter before exploding: If you only need rows matching certain criteria, filter the original DataFrame first to reduce the explosion factor.
Use appropriate dtypes: After exploding, convert columns to categorical if they have limited unique values. This dramatically reduces memory usage.
df_exploded['tags'] = df_exploded['tags'].astype('category')
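You can quantify the saving with memory_usage(deep=True); the exact numbers depend on your data and pandas version, so treat this as a sketch of the measurement rather than a benchmark:

```python
import pandas as pd

# A low-cardinality column exploded to many rows
tags = pd.Series(['electronics', 'sale', 'featured'] * 100_000)

before = tags.memory_usage(deep=True)
after = tags.astype('category').memory_usage(deep=True)
print(f"object: {before:,} bytes, category: {after:,} bytes")
```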
Conclusion
The explode() method is essential for normalizing nested data structures. Remember these key points: it preserves the original index by default, treats empty lists as NaN, and requires matching list lengths when exploding multiple columns.
For related reshaping operations, explore melt() for unpivoting wide data into long format, and stack() for pivoting column levels into row indices. Together with explode(), these methods form the core toolkit for transforming data between different structural representations.