How to Use json_normalize in Pandas
Key Insights
- json_normalize transforms nested JSON structures into flat DataFrames, eliminating the need for manual dictionary traversal and list comprehensions
- The record_path and meta parameters work together to extract nested arrays while preserving parent-level context; master these two and you'll handle 90% of real-world JSON flattening tasks
- Always use errors='ignore' when working with inconsistent API responses to prevent your pipeline from crashing on missing fields
Nested JSON is everywhere. APIs return it, NoSQL databases store it, and configuration files depend on it. But pandas DataFrames expect flat, tabular data. The gap between these two worlds causes endless frustration—developers write brittle loops, nested list comprehensions, and recursive functions just to extract the data they need.
json_normalize bridges this gap. It’s a single function that flattens hierarchical JSON into clean DataFrames, handling nested dictionaries and arrays with minimal code. Yet many pandas users don’t know it exists, or they underutilize its powerful parameters.
Basic Usage
The simplest use case involves flattening a JSON object with nested dictionaries. Consider user profile data where addresses are stored as nested objects:
import pandas as pd
from pandas import json_normalize
data = [
    {
        "user_id": 1,
        "name": "Alice Chen",
        "email": "alice@example.com",
        "address": {
            "street": "123 Main St",
            "city": "Seattle",
            "state": "WA",
            "zip": "98101"
        }
    },
    {
        "user_id": 2,
        "name": "Bob Martinez",
        "email": "bob@example.com",
        "address": {
            "street": "456 Oak Ave",
            "city": "Portland",
            "state": "OR",
            "zip": "97201"
        }
    }
]
df = json_normalize(data)
print(df.columns.tolist())
Output:
['user_id', 'name', 'email', 'address.street', 'address.city', 'address.state', 'address.zip']
Without any parameters, json_normalize automatically flattens nested dictionaries, creating dot-separated column names. The nested address object becomes four separate columns. This works recursively—if address contained another nested object, those fields would be flattened too.
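To see the recursion in action, here is a minimal sketch with a made-up second level of nesting (a geo object inside address):

```python
import pandas as pd

# Hypothetical payload with two levels of nesting: 'geo' inside 'address'
deep = [{
    "user_id": 1,
    "address": {
        "city": "Seattle",
        "geo": {"lat": 47.6, "lon": -122.3},
    },
}]

# Default behavior flattens every level, joining keys with dots
df = pd.json_normalize(deep)
print(df.columns.tolist())
# ['user_id', 'address.city', 'address.geo.lat', 'address.geo.lon']
```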
Compare this to the manual approach:
# The painful way
rows = []
for item in data:
    row = {
        'user_id': item['user_id'],
        'name': item['name'],
        'email': item['email'],
        'street': item['address']['street'],
        'city': item['address']['city'],
        'state': item['address']['state'],
        'zip': item['address']['zip']
    }
    rows.append(row)

df_manual = pd.DataFrame(rows)
The manual version requires explicit field mapping and breaks when the JSON structure changes. json_normalize discovers fields automatically, so a new nested field simply becomes a new column with no code changes.
Handling Deeply Nested Data with record_path
Real-world JSON often contains arrays of objects nested within other objects. The record_path parameter tells json_normalize which nested array to extract as rows:
customers = [
    {
        "customer_id": "C001",
        "name": "Acme Corp",
        "orders": [
            {"order_id": "O101", "product": "Widget", "quantity": 50, "price": 9.99},
            {"order_id": "O102", "product": "Gadget", "quantity": 25, "price": 19.99}
        ]
    },
    {
        "customer_id": "C002",
        "name": "TechStart Inc",
        "orders": [
            {"order_id": "O201", "product": "Widget", "quantity": 100, "price": 9.99},
            {"order_id": "O202", "product": "Sprocket", "quantity": 75, "price": 4.99},
            {"order_id": "O203", "product": "Gadget", "quantity": 30, "price": 19.99}
        ]
    }
]
orders_df = json_normalize(customers, record_path='orders')
print(orders_df)
Output:
order_id product quantity price
0 O101 Widget 50 9.99
1 O102 Gadget 25 19.99
2 O201 Widget 100 9.99
3 O202 Sprocket 75 4.99
4 O203 Gadget 30 19.99
Each order becomes its own row. The function extracted 5 orders from 2 customers, flattening the nested arrays into a single DataFrame.
For deeply nested paths, use a list:
# If orders were nested under customer.transactions.orders
df = json_normalize(data, record_path=['transactions', 'orders'])
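Here is a runnable sketch of that deeper shape; the transactions wrapper is invented for illustration:

```python
import pandas as pd

# Hypothetical structure: orders nested two levels down
data = [{
    "customer_id": "C001",
    "transactions": {
        "orders": [
            {"order_id": "O101", "amount": 500},
            {"order_id": "O102", "amount": 250},
        ]
    },
}]

# The list walks one key per level down to the target array
df = pd.json_normalize(data, record_path=['transactions', 'orders'],
                       meta=['customer_id'])
print(df)
```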
Preserving Parent Data with meta
The previous example lost the customer information. We have orders but don’t know which customer placed them. The meta parameter solves this by carrying parent-level fields into each extracted row:
orders_df = json_normalize(
    customers,
    record_path='orders',
    meta=['customer_id', 'name']
)
print(orders_df)
Output:
order_id product quantity price customer_id name
0 O101 Widget 50 9.99 C001 Acme Corp
1 O102 Gadget 25 19.99 C001 Acme Corp
2 O201 Widget 100 9.99 C002 TechStart Inc
3 O202 Sprocket 75 4.99 C002 TechStart Inc
4 O203 Gadget 30 19.99 C002 TechStart Inc
Now each order row includes the customer context. This is essential for any analysis that needs to join order data back to customer attributes.
For nested meta fields, use a list path:
# Extract nested parent fields
complex_data = [
    {
        "company": {
            "id": "C001",
            "info": {"name": "Acme Corp", "industry": "Manufacturing"}
        },
        "orders": [{"order_id": "O101", "amount": 500}]
    }
]

df = json_normalize(
    complex_data,
    record_path='orders',
    meta=[
        ['company', 'id'],
        ['company', 'info', 'name'],
        ['company', 'info', 'industry']
    ]
)
Customizing Column Names with sep and record_prefix
The default dot separator in column names can cause issues with some tools and makes attribute access impossible. The sep parameter lets you choose a different separator:
df = json_normalize(data, sep='_')
print(df.columns.tolist())
# ['user_id', 'name', 'email', 'address_street', 'address_city', 'address_state', 'address_zip']
With underscores, you can use df.address_city instead of df['address.city'].
When extracting nested arrays, the record_prefix parameter adds a prefix to the extracted fields, distinguishing them from meta fields:
orders_df = json_normalize(
    customers,
    record_path='orders',
    meta=['customer_id', 'name'],
    record_prefix='order_'
)
print(orders_df.columns.tolist())
# ['order_order_id', 'order_product', 'order_quantity', 'order_price', 'customer_id', 'name']
This becomes valuable when meta fields might share names with record fields, or when you want explicit namespacing in your DataFrame.
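For example, with the made-up data below, omitting record_prefix would fail because the record field name collides with the meta field name (pandas raises a ValueError about conflicting metadata names):

```python
import pandas as pd

# Hypothetical data: both the team and each member have a 'name' field
teams = [{
    "name": "Platform",
    "members": [
        {"name": "Alice", "role": "lead"},
        {"name": "Bob", "role": "engineer"},
    ],
}]

# record_prefix keeps member fields distinct from the team-level meta field
df = pd.json_normalize(teams, record_path='members',
                       meta=['name'], record_prefix='member_')
print(df.columns.tolist())
# ['member_name', 'member_role', 'name']
```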
Handling Errors and Missing Data
Real-world JSON is messy. Fields appear in some records but not others. The errors parameter controls how json_normalize handles missing meta fields:
inconsistent_data = [
    {
        "id": 1,
        "name": "Complete Record",
        "metadata": {"source": "api", "version": "2.0"},
        "items": [{"sku": "A1", "qty": 10}]
    },
    {
        "id": 2,
        "name": "Missing Metadata",
        # No metadata field
        "items": [{"sku": "B2", "qty": 5}]
    },
    {
        "id": 3,
        "name": "Partial Metadata",
        "metadata": {"source": "manual"},  # Missing version
        "items": [{"sku": "C3", "qty": 15}]
    }
]
# This would raise KeyError without errors='ignore'
df = json_normalize(
    inconsistent_data,
    record_path='items',
    meta=['id', 'name', ['metadata', 'source'], ['metadata', 'version']],
    errors='ignore'
)
print(df)
Output:
sku qty id name metadata.source metadata.version
0 A1 10 1 Complete Record api 2.0
1 B2 5 2 Missing Metadata NaN NaN
2 C3 15 3 Partial Metadata manual NaN
Missing fields become NaN values instead of crashing your pipeline. Always use errors='ignore' when processing data from external sources where you can’t guarantee schema consistency.
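One caveat worth knowing: errors='ignore' only covers missing meta fields. If the record_path key itself is absent from a record, json_normalize still raises a KeyError. A simple defensive pattern, sketched here with made-up data, is to default the array to an empty list first:

```python
import pandas as pd

records = [
    {"id": 1, "items": [{"sku": "A1", "qty": 10}]},
    {"id": 2},  # no 'items' key at all
]

# Default the missing array so record_path never hits a KeyError;
# records without items simply contribute zero rows
for rec in records:
    rec.setdefault("items", [])

df = pd.json_normalize(records, record_path='items', meta=['id'])
print(df)
```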
Real-World Example: Working with API Responses
Let’s tie everything together with a practical example. We’ll normalize data from the GitHub API, which returns deeply nested JSON:
import requests
import pandas as pd
from pandas import json_normalize
def fetch_github_repos(username):
    """Fetch and normalize GitHub repository data."""
    url = f"https://api.github.com/users/{username}/repos"
    response = requests.get(url)
    response.raise_for_status()
    repos = response.json()

    # Normalize with nested owner information
    df = json_normalize(
        repos,
        sep='_',
        max_level=1  # Limit nesting depth to avoid explosion of columns
    )

    # Select relevant columns
    columns = [
        'name', 'full_name', 'description', 'html_url',
        'stargazers_count', 'forks_count', 'language',
        'created_at', 'updated_at', 'owner_login', 'owner_type'
    ]

    # Filter to existing columns (some may be missing)
    existing_cols = [c for c in columns if c in df.columns]
    return df[existing_cols]
# Fetch pandas repository contributors with nested data
def fetch_repo_contributors(owner, repo):
    """Fetch contributor data with commit statistics."""
    url = f"https://api.github.com/repos/{owner}/{repo}/stats/contributors"
    response = requests.get(url)
    if response.status_code != 200:
        return pd.DataFrame()

    contributors = response.json()
    if not contributors:
        return pd.DataFrame()

    # Each contributor has nested 'author' info and a 'weeks' array.
    # One option: one row per contributor-week
    weeks_df = json_normalize(
        contributors,
        record_path='weeks',
        meta=['total'],
        record_prefix='week_',
        errors='ignore'
    )

    # Alternative: just flatten author info without weeks
    author_df = json_normalize(
        contributors,
        sep='_',
        max_level=2
    )
    return author_df[['author_login', 'author_type', 'total']].drop_duplicates()
# Example usage
repos_df = fetch_github_repos('pandas-dev')
print(repos_df.head())
This example demonstrates several key patterns:
- Using sep='_' for cleaner column names from nested owner data
- Using max_level to prevent deeply nested structures from creating hundreds of columns
- Defensive column selection that handles missing fields
- Combining record_path and meta for complex nested arrays
Conclusion
json_normalize eliminates the tedious work of manually flattening JSON. The key parameters to remember:
- record_path: Which nested array to extract as rows
- meta: Which parent fields to preserve in each row
- sep: Character to join nested field names
- errors='ignore': Handle missing fields gracefully
Start with basic flattening, then add record_path when you need to extract arrays, and meta when you need parent context. For production code processing external APIs, always include errors='ignore' to handle schema inconsistencies without crashing.
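As a parting reference, here is a compact sketch, with an invented payload, that combines all four parameters in one call:

```python
import pandas as pd

# Hypothetical payload combining every pattern from this article
payload = [{
    "customer_id": "C001",
    "profile": {"tier": "gold"},
    "orders": [{"order_id": "O101", "total": 99.5}],
}]

df = pd.json_normalize(
    payload,
    record_path='orders',                        # nested array -> one row per order
    meta=['customer_id', ['profile', 'tier']],   # carry parent context along
    sep='_',                                     # underscore-joined column names
    errors='ignore',                             # tolerate missing meta fields
)
print(df.columns.tolist())
# ['order_id', 'total', 'customer_id', 'profile_tier']
```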