How to Read JSON Files in Pandas

Key Insights

  • The orient parameter is the key to correctly parsing JSON files—choosing the wrong orientation will produce garbled DataFrames or outright errors.
  • Nested JSON is the norm in real-world APIs, and pd.json_normalize() is your primary tool for flattening hierarchical data into tabular format.
  • JSON Lines (.jsonl) format with lines=True is often the better choice for large datasets, enabling chunked processing and avoiding memory issues.

Why JSON and Pandas?

JSON has become the lingua franca of web APIs and configuration files. It’s human-readable, flexible, and ubiquitous. But flexibility comes at a cost—JSON’s nested, hierarchical structure doesn’t map cleanly to the flat, tabular format that data analysis requires.

Pandas bridges this gap. It provides robust tools for reading JSON data and transforming it into DataFrames you can actually work with. However, the variety of JSON structures means there’s no one-size-fits-all approach. Understanding your options is essential.

Basic JSON Reading with pd.read_json()

Let’s start with the simplest case. You have a JSON file, and you want it in a DataFrame.

[
  {"name": "Alice", "age": 30, "city": "New York"},
  {"name": "Bob", "age": 25, "city": "Los Angeles"},
  {"name": "Charlie", "age": 35, "city": "Chicago"}
]

Reading this is straightforward:

import pandas as pd

df = pd.read_json('data.json')
print(df)

Output:

      name  age         city
0    Alice   30     New York
1      Bob   25  Los Angeles
2  Charlie   35      Chicago

This works because the JSON structure—an array of objects with consistent keys—maps naturally to rows and columns. Each object becomes a row, each key becomes a column.
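If you want to experiment without creating a file first, a quick sketch is to feed pandas the same records from memory. Note that recent pandas versions (2.x) expect literal JSON strings to be wrapped in a file-like StringIO object rather than passed directly:

```python
import io

import pandas as pd

# The same kind of records as above, held in memory rather than a file
raw = '[{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]'

# pandas 2.x deprecates passing a literal JSON string directly;
# wrap it in StringIO so read_json treats it as a file-like source
df = pd.read_json(io.StringIO(raw))
print(df)
```

This is handy for testing parsing logic before pointing it at real files.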

But JSON files rarely stay this simple.

Handling Different JSON Structures

JSON can represent the same data in multiple ways. The orient parameter tells pandas how to interpret the structure. Get this wrong, and you’ll either get an error or a DataFrame that looks nothing like what you expected.

Records Orientation

This is the most common format—an array of row objects:

[
  {"name": "Alice", "age": 30},
  {"name": "Bob", "age": 25}
]

df = pd.read_json('data.json', orient='records')

Columns Orientation

Here, the outer keys are column names, and values are arrays of column data:

{
  "name": ["Alice", "Bob"],
  "age": [30, 25]
}

df = pd.read_json('data.json', orient='columns')

Index Orientation

The outer keys are row indices:

{
  "0": {"name": "Alice", "age": 30},
  "1": {"name": "Bob", "age": 25}
}

df = pd.read_json('data.json', orient='index')

Split Orientation

Explicit separation of columns, index, and data:

{
  "columns": ["name", "age"],
  "index": [0, 1],
  "data": [["Alice", 30], ["Bob", 25]]
}

df = pd.read_json('data.json', orient='split')

My recommendation: when you control the JSON format, use records orientation. It’s the most intuitive and widely supported. When consuming external data, inspect the structure first and choose accordingly.
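One way to build intuition for how the orientations relate is to round-trip a DataFrame through to_json and back with each one; a small sketch:

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})

# Serialize with each orientation, then read it back with the same one
results = {}
for orient in ["records", "columns", "index", "split"]:
    payload = df.to_json(orient=orient)
    restored = pd.read_json(io.StringIO(payload), orient=orient)
    results[orient] = restored.equals(df)  # the round trip preserves the data

print(results)
```

Printing `payload` for each orientation is also a quick way to see exactly what shape of JSON each orient name refers to.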

Reading Nested JSON Data

Real-world JSON is messy. APIs return nested objects, arrays within arrays, and deeply hierarchical structures. pd.read_json() won’t flatten these automatically—you need pd.json_normalize().

Consider this API response:

[
  {
    "id": 1,
    "name": "Alice",
    "address": {
      "street": "123 Main St",
      "city": "New York",
      "zip": "10001"
    },
    "orders": [
      {"order_id": 101, "amount": 50.00},
      {"order_id": 102, "amount": 75.50}
    ]
  },
  {
    "id": 2,
    "name": "Bob",
    "address": {
      "street": "456 Oak Ave",
      "city": "Los Angeles",
      "zip": "90001"
    },
    "orders": [
      {"order_id": 103, "amount": 120.00}
    ]
  }
]

Using pd.read_json() directly creates columns containing dictionaries and lists—not useful for analysis.
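To see the problem concretely, here is a small sketch using an in-memory, abbreviated version of the response above. The nested fields survive as raw Python objects inside cells:

```python
import io

import pandas as pd

# Abbreviated version of the API response shown above, held in memory
raw = '''[
  {"id": 1, "name": "Alice",
   "address": {"street": "123 Main St", "city": "New York", "zip": "10001"},
   "orders": [{"order_id": 101, "amount": 50.0}]}
]'''

df = pd.read_json(io.StringIO(raw))

# The nested structures are stored as plain dicts and lists inside cells
print(type(df.loc[0, 'address']))  # a dict, not separate columns
print(type(df.loc[0, 'orders']))   # a list of dicts
```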

import json

# Load JSON data
with open('nested_data.json') as f:
    data = json.load(f)

# Basic flattening of nested objects
df = pd.json_normalize(data)
print(df.columns.tolist())

Output:

['id', 'name', 'orders', 'address.street', 'address.city', 'address.zip']

The nested address object is flattened into dot-notation columns. But orders remains a list. To handle nested arrays, use the record_path parameter:

# Flatten the orders array, preserving parent data
df_orders = pd.json_normalize(
    data,
    record_path='orders',
    meta=['id', 'name', ['address', 'city']],
    meta_prefix='customer_'
)
print(df_orders)

Output:

   order_id  amount  customer_id customer_name customer_address.city
0       101   50.00            1         Alice              New York
1       102   75.50            1         Alice              New York
2       103  120.00            2           Bob           Los Angeles

This is powerful but requires understanding your data structure. Always inspect your JSON before writing normalization code.

Reading JSON from URLs and APIs

Pandas can read JSON directly from URLs, which is convenient for quick API exploration:

# Read directly from a URL
df = pd.read_json('https://api.github.com/users/pandas-dev/repos')
print(df[['name', 'stargazers_count', 'language']].head())

For APIs requiring authentication or custom headers, use requests first:

import requests

# API with authentication
headers = {'Authorization': 'Bearer YOUR_TOKEN'}
response = requests.get('https://api.example.com/data', headers=headers)
response.raise_for_status()  # Raise exception for bad status codes

# Parse the response
data = response.json()
df = pd.json_normalize(data['results'])

This approach gives you more control over error handling and lets you inspect the response before parsing. For production code, always prefer explicit HTTP handling over pandas’ built-in URL reading.

Common Parameters and Options

JSON Lines Format

JSON Lines (.jsonl) stores one JSON object per line. It’s excellent for large datasets because you can process them incrementally:

{"name": "Alice", "age": 30}
{"name": "Bob", "age": 25}
{"name": "Charlie", "age": 35}
df = pd.read_json('data.jsonl', lines=True)

For very large files, combine lines=True with chunksize:

# Process in chunks of 10,000 rows
chunks = pd.read_json('large_data.jsonl', lines=True, chunksize=10000)

for chunk in chunks:
    # Process each chunk
    process(chunk)
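A self-contained sketch of the pattern, writing a small throwaway .jsonl file first so it runs anywhere (the filename and field names are just for illustration). The key idea is to aggregate per chunk rather than hold all rows in memory:

```python
import json
import os
import tempfile

import pandas as pd

# Create a small JSON Lines file to stream through
path = os.path.join(tempfile.mkdtemp(), 'events.jsonl')
with open(path, 'w') as f:
    for i in range(25):
        f.write(json.dumps({'user': f'u{i % 3}', 'value': i}) + '\n')

# Read in chunks; each iteration yields a DataFrame of up to 10 rows
totals = []
reader = pd.read_json(path, lines=True, chunksize=10)
for chunk in reader:
    # Aggregate per chunk instead of accumulating raw rows
    totals.append(chunk.groupby('user')['value'].sum())

# Combine the per-chunk partial sums into the final answer
result = pd.concat(totals).groupby(level=0).sum()
print(result)
```

Partial aggregation like this works for any operation that can be merged across chunks (sums, counts, min/max); medians and the like need a different strategy.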

Data Type Control

Pandas infers types, but sometimes incorrectly. Use dtype to override:

df = pd.read_json('data.json', dtype={'zip_code': str, 'id': 'Int64'})
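A common trap is leading zeros: ZIP codes inferred as integers silently lose them. A quick in-memory sketch (column names are illustrative):

```python
import io

import pandas as pd

raw = '[{"id": 1, "zip_code": "01002"}, {"id": 2, "zip_code": "90210"}]'

# Forcing str keeps the leading zero; 'Int64' is pandas' nullable integer type
df = pd.read_json(io.StringIO(raw), dtype={'zip_code': str, 'id': 'Int64'})
print(df['zip_code'].tolist())  # leading zero preserved
```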

Date Parsing

Control date conversion with convert_dates:

# Disable automatic date conversion
df = pd.read_json('data.json', convert_dates=False)

# Or specify which columns to convert
df = pd.read_json('data.json', convert_dates=['created_at', 'updated_at'])
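The difference is easy to verify by checking the resulting dtypes; a small sketch with an illustrative timestamp column:

```python
import io

import pandas as pd

raw = '[{"event": "signup", "created_at": "2024-01-15T10:30:00"}]'

# With conversion disabled, the timestamp stays a plain string
df_raw = pd.read_json(io.StringIO(raw), convert_dates=False)

# Explicitly listing the column parses it into a datetime64 column
df_dates = pd.read_json(io.StringIO(raw), convert_dates=['created_at'])

print(df_raw['created_at'].dtype)    # object (string)
print(df_dates['created_at'].dtype)  # datetime64[ns]
```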

Encoding

Specify encoding for non-UTF-8 files:

df = pd.read_json('data.json', encoding='latin-1')

Troubleshooting Common Errors

Malformed JSON

The most common error. Validate your JSON first:

import json

def safe_read_json(filepath):
    try:
        # First, validate the JSON
        with open(filepath, 'r') as f:
            data = json.load(f)
        
        # Then create DataFrame
        return pd.json_normalize(data)
    
    except json.JSONDecodeError as e:
        print(f"Invalid JSON at line {e.lineno}, column {e.colno}: {e.msg}")
        return None
    except Exception as e:
        print(f"Error reading file: {e}")
        return None

df = safe_read_json('data.json')

Memory Errors

Large JSON files can exhaust memory. Solutions:

  1. Use JSON Lines with chunking (shown above)
  2. Filter data during loading if your source supports it
  3. Use ijson for streaming large files:

import ijson
import pandas as pd

def stream_json(filepath, batch_size=10000):
    records = []
    with open(filepath, 'rb') as f:
        for record in ijson.items(f, 'item'):
            records.append(record)
            if len(records) >= batch_size:
                yield pd.DataFrame(records)
                records = []
    if records:
        yield pd.DataFrame(records)

Inconsistent Schemas

Real-world JSON often has inconsistent fields across records:

# Handle missing keys gracefully
df = pd.json_normalize(data)
df = df.reindex(columns=['required_col1', 'required_col2', 'optional_col'])  # missing columns are filled with NaN
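A minimal illustration of how json_normalize itself handles records with differing keys: it takes the union of all keys, and missing fields become NaN:

```python
import pandas as pd

records = [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "Bob"},  # no email field
]

# json_normalize unions the keys across records; missing values become NaN
df = pd.json_normalize(records)
print(df)
```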

Practical Recommendations

After years of working with JSON in pandas, here’s my advice:

  1. Always inspect first. Print a sample of your JSON before writing parsing code. Structure assumptions will burn you.

  2. Prefer JSON Lines for large data. It’s streamable, appendable, and recoverable if corruption occurs.

  3. Use json_normalize for nested data. Don’t try to manually flatten with apply functions—json_normalize handles edge cases you haven’t thought of.

  4. Validate before parsing. A quick json.load() check catches malformed data before it causes cryptic pandas errors.

  5. Specify dtypes explicitly. Inferred types cause subtle bugs. Be explicit about what you expect.

JSON’s flexibility is both its strength and its challenge. Pandas provides the tools to tame it—you just need to know which tool fits your specific structure.
