How to Use json_normalize in Pandas

Key Insights

  • json_normalize transforms nested JSON structures into flat DataFrames, eliminating the need for manual dictionary traversal and list comprehensions
  • The record_path and meta parameters work together to extract nested arrays while preserving parent-level context—master these two and you’ll handle 90% of real-world JSON flattening tasks
  • Pass errors='ignore' when meta fields may be missing from some records, so inconsistent API responses yield NaN instead of crashing your pipeline with a KeyError

Nested JSON is everywhere. APIs return it, NoSQL databases store it, and configuration files depend on it. But pandas DataFrames expect flat, tabular data. The gap between these two worlds causes endless frustration—developers write brittle loops, nested list comprehensions, and recursive functions just to extract the data they need.

json_normalize bridges this gap. It’s a single function that flattens hierarchical JSON into clean DataFrames, handling nested dictionaries and arrays with minimal code. Yet many pandas users don’t know it exists, or they underutilize its powerful parameters.

Basic Usage

The simplest use case involves flattening a JSON object with nested dictionaries. Consider user profile data where addresses are stored as nested objects:

import pandas as pd
from pandas import json_normalize

data = [
    {
        "user_id": 1,
        "name": "Alice Chen",
        "email": "alice@example.com",
        "address": {
            "street": "123 Main St",
            "city": "Seattle",
            "state": "WA",
            "zip": "98101"
        }
    },
    {
        "user_id": 2,
        "name": "Bob Martinez",
        "email": "bob@example.com",
        "address": {
            "street": "456 Oak Ave",
            "city": "Portland",
            "state": "OR",
            "zip": "97201"
        }
    }
]

df = json_normalize(data)
print(df.columns.tolist())

Output:

['user_id', 'name', 'email', 'address.street', 'address.city', 'address.state', 'address.zip']

Without any parameters, json_normalize automatically flattens nested dictionaries, creating dot-separated column names. The nested address object becomes four separate columns. This works recursively—if address contained another nested object, those fields would be flattened too.

Compare this to the manual approach:

# The painful way
rows = []
for item in data:
    row = {
        'user_id': item['user_id'],
        'name': item['name'],
        'email': item['email'],
        'street': item['address']['street'],
        'city': item['address']['city'],
        'state': item['address']['state'],
        'zip': item['address']['zip']
    }
    rows.append(row)
df_manual = pd.DataFrame(rows)

The manual version requires explicit field mapping and breaks when the JSON structure changes. json_normalize picks up new nested fields automatically and requires no field enumeration.
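The recursive flattening mentioned above can be capped with the max_level parameter. A minimal sketch with hypothetical field names:

```python
import pandas as pd
from pandas import json_normalize

# Hypothetical doubly nested record: address contains a nested geo object
deep = [
    {"user_id": 1,
     "address": {"city": "Seattle",
                 "geo": {"lat": 47.6, "lon": -122.3}}}
]

# Default behavior flattens every level
print(json_normalize(deep).columns.tolist())
# ['user_id', 'address.city', 'address.geo.lat', 'address.geo.lon']

# max_level caps the recursion; anything deeper stays as a dict object
print(json_normalize(deep, max_level=1).columns.tolist())
```

With max_level=1, the geo dictionary survives intact in a single address.geo column, which is useful when you only need the first level of structure.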

Handling Deeply Nested Data with record_path

Real-world JSON often contains arrays of objects nested within other objects. The record_path parameter tells json_normalize which nested array to extract as rows:

customers = [
    {
        "customer_id": "C001",
        "name": "Acme Corp",
        "orders": [
            {"order_id": "O101", "product": "Widget", "quantity": 50, "price": 9.99},
            {"order_id": "O102", "product": "Gadget", "quantity": 25, "price": 19.99}
        ]
    },
    {
        "customer_id": "C002",
        "name": "TechStart Inc",
        "orders": [
            {"order_id": "O201", "product": "Widget", "quantity": 100, "price": 9.99},
            {"order_id": "O202", "product": "Sprocket", "quantity": 75, "price": 4.99},
            {"order_id": "O203", "product": "Gadget", "quantity": 30, "price": 19.99}
        ]
    }
]

orders_df = json_normalize(customers, record_path='orders')
print(orders_df)

Output:

  order_id   product  quantity  price
0     O101    Widget        50   9.99
1     O102    Gadget        25  19.99
2     O201    Widget       100   9.99
3     O202  Sprocket        75   4.99
4     O203    Gadget        30  19.99

Each order becomes its own row. The function extracted 5 orders from 2 customers, flattening the nested arrays into a single DataFrame.

For deeply nested paths, use a list:

# If orders were nested under customer.transactions.orders
df = json_normalize(data, record_path=['transactions', 'orders'])
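A complete runnable version of that list-path pattern, using a hypothetical structure where each customer's orders sit under a transactions dictionary:

```python
import pandas as pd
from pandas import json_normalize

# Hypothetical shape: orders nested two levels down
nested = [
    {"customer_id": "C001",
     "transactions": {
         "orders": [
             {"order_id": "O101", "amount": 500},
             {"order_id": "O102", "amount": 250},
         ]
     }}
]

# Each element of the list walks one level deeper into the structure
df = json_normalize(nested, record_path=['transactions', 'orders'])
print(df)
# Two rows, one per order, with columns order_id and amount
```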

Preserving Parent Data with meta

The previous example lost the customer information. We have orders but don’t know which customer placed them. The meta parameter solves this by carrying parent-level fields into each extracted row:

orders_df = json_normalize(
    customers,
    record_path='orders',
    meta=['customer_id', 'name']
)
print(orders_df)

Output:

  order_id   product  quantity  price customer_id           name
0     O101    Widget        50   9.99        C001       Acme Corp
1     O102    Gadget        25  19.99        C001       Acme Corp
2     O201    Widget       100   9.99        C002   TechStart Inc
3     O202  Sprocket        75   4.99        C002   TechStart Inc
4     O203    Gadget        30  19.99        C002   TechStart Inc

Now each order row includes the customer context. This is essential for any analysis that needs to join order data back to customer attributes.

For nested meta fields, use a list path:

# Extract nested parent fields
complex_data = [
    {
        "company": {
            "id": "C001",
            "info": {"name": "Acme Corp", "industry": "Manufacturing"}
        },
        "orders": [{"order_id": "O101", "amount": 500}]
    }
]

df = json_normalize(
    complex_data,
    record_path='orders',
    meta=[
        ['company', 'id'],
        ['company', 'info', 'name'],
        ['company', 'info', 'industry']
    ]
)

Customizing Column Names with sep and record_prefix

The default dot separator in column names can cause issues with some tools and makes attribute access impossible. The sep parameter lets you choose a different separator:

df = json_normalize(data, sep='_')
print(df.columns.tolist())
# ['user_id', 'name', 'email', 'address_street', 'address_city', 'address_state', 'address_zip']

With underscores, you can use df.address_city instead of df['address.city'].

When extracting nested arrays, the record_prefix parameter adds a prefix to the extracted fields, distinguishing them from meta fields:

orders_df = json_normalize(
    customers,
    record_path='orders',
    meta=['customer_id', 'name'],
    record_prefix='order_'
)
print(orders_df.columns.tolist())
# ['order_order_id', 'order_product', 'order_quantity', 'order_price', 'customer_id', 'name']

This becomes valuable when meta fields might share names with record fields, or when you want explicit namespacing in your DataFrame.
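A sketch of the name-collision case, with hypothetical data where both the parent and the nested records carry a name field:

```python
import pandas as pd
from pandas import json_normalize

# Hypothetical data: the team and each member both have a 'name' field
teams = [
    {"name": "Platform",
     "members": [{"name": "Alice"}, {"name": "Bob"}]}
]

# Without record_prefix this raises a "conflicting metadata" error;
# the prefix keeps the member names distinct from the team name
df = json_normalize(teams, record_path='members',
                    meta=['name'], record_prefix='member_')
print(df.columns.tolist())
# ['member_name', 'name']
```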

Handling Errors and Missing Data

Real-world JSON is messy. Fields appear in some records but not others. The errors parameter controls how json_normalize handles missing meta fields:

inconsistent_data = [
    {
        "id": 1,
        "name": "Complete Record",
        "metadata": {"source": "api", "version": "2.0"},
        "items": [{"sku": "A1", "qty": 10}]
    },
    {
        "id": 2,
        "name": "Missing Metadata",
        # No metadata field
        "items": [{"sku": "B2", "qty": 5}]
    },
    {
        "id": 3,
        "name": "Partial Metadata",
        "metadata": {"source": "manual"},  # Missing version
        "items": [{"sku": "C3", "qty": 15}]
    }
]

# This would raise KeyError without errors='ignore'
df = json_normalize(
    inconsistent_data,
    record_path='items',
    meta=['id', 'name', ['metadata', 'source'], ['metadata', 'version']],
    errors='ignore'
)
print(df)

Output:

  sku  qty  id               name metadata.source metadata.version
0  A1   10   1    Complete Record             api              2.0
1  B2    5   2   Missing Metadata             NaN              NaN
2  C3   15   3   Partial Metadata          manual              NaN

Missing meta fields become NaN values instead of raising a KeyError. Use errors='ignore' whenever you process data from external sources where you can’t guarantee schema consistency.
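To see the contrast with the default errors='raise', here is a minimal sketch with hypothetical records:

```python
import pandas as pd
from pandas import json_normalize

records = [
    {"id": 1, "details": {"source": "api"}, "items": [{"sku": "A1"}]},
    {"id": 2, "items": [{"sku": "B2"}]},  # no 'details' key at all
]

# Default errors='raise' fails on the record missing 'details'
try:
    json_normalize(records, record_path='items',
                   meta=[['details', 'source']])
except KeyError as exc:
    print("KeyError:", exc)

# errors='ignore' fills the gap with NaN instead
df = json_normalize(records, record_path='items',
                    meta=[['details', 'source']], errors='ignore')
print(df)
```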

Real-World Example: Working with API Responses

Let’s tie everything together with a practical example. We’ll normalize data from the GitHub API, which returns deeply nested JSON:

import requests
import pandas as pd
from pandas import json_normalize

def fetch_github_repos(username):
    """Fetch and normalize GitHub repository data."""
    url = f"https://api.github.com/users/{username}/repos"
    response = requests.get(url)
    response.raise_for_status()
    repos = response.json()
    
    # Normalize with nested owner information
    df = json_normalize(
        repos,
        sep='_',
        max_level=1  # Limit nesting depth to avoid explosion of columns
    )
    
    # Select relevant columns
    columns = [
        'name', 'full_name', 'description', 'html_url',
        'stargazers_count', 'forks_count', 'language',
        'created_at', 'updated_at', 'owner_login', 'owner_type'
    ]
    
    # Filter to existing columns (some may be missing)
    existing_cols = [c for c in columns if c in df.columns]
    return df[existing_cols]

# Fetch contributors for a repository, combining record_path and meta
def fetch_repo_contributors(owner, repo):
    """Fetch contributor data with commit statistics."""
    url = f"https://api.github.com/repos/{owner}/{repo}/stats/contributors"
    response = requests.get(url)
    
    if response.status_code != 200:
        return pd.DataFrame()
    
    contributors = response.json()
    
    if not contributors:
        return pd.DataFrame()
    
    # Each contributor has a nested 'author' dict and a 'weeks' array.
    # Combining record_path and meta yields one row per contributor-week,
    # useful for commit-activity time series (shown here for illustration)
    weekly_df = json_normalize(
        contributors,
        record_path='weeks',
        meta=['total'],
        record_prefix='week_',
        errors='ignore'
    )
    
    # For a per-author summary, flatten the nested author info instead
    author_df = json_normalize(
        contributors,
        sep='_',
        max_level=2
    )
    
    return author_df[['author_login', 'author_type', 'total']].drop_duplicates()

# Example usage
repos_df = fetch_github_repos('pandas-dev')
print(repos_df.head())

This example demonstrates several key patterns:

  1. Using sep='_' for cleaner column names from nested owner data
  2. Using max_level to prevent deeply nested structures from creating hundreds of columns
  3. Defensive column selection that handles missing fields
  4. Combining record_path and meta for complex nested arrays

Conclusion

json_normalize eliminates the tedious work of manually flattening JSON. The key parameters to remember:

  • record_path: Which nested array to extract as rows
  • meta: Which parent fields to preserve in each row
  • sep: Character to join nested field names
  • max_level: How many levels of nesting to flatten
  • errors='ignore': Turn missing meta fields into NaN instead of a KeyError

Start with basic flattening, then add record_path when you need to extract arrays, and meta when you need parent context. For production code processing external APIs, always include errors='ignore' to handle schema inconsistencies without crashing.
