How to Read JSON Files in Pandas
Key Insights
- The orient parameter is the key to correctly parsing JSON files—choosing the wrong orientation will produce garbled DataFrames or outright errors.
- Nested JSON is the norm in real-world APIs, and pd.json_normalize() is your primary tool for flattening hierarchical data into tabular format.
- JSON Lines (.jsonl) format with lines=True is often the better choice for large datasets, enabling chunked processing and avoiding memory issues.
Why JSON and Pandas?
JSON has become the lingua franca of web APIs and configuration files. It’s human-readable, flexible, and ubiquitous. But flexibility comes at a cost—JSON’s nested, hierarchical structure doesn’t map cleanly to the flat, tabular format that data analysis requires.
Pandas bridges this gap. It provides robust tools for reading JSON data and transforming it into DataFrames you can actually work with. However, the variety of JSON structures means there’s no one-size-fits-all approach. Understanding your options is essential.
Basic JSON Reading with pd.read_json()
Let’s start with the simplest case. You have a JSON file, and you want it in a DataFrame.
[
  {"name": "Alice", "age": 30, "city": "New York"},
  {"name": "Bob", "age": 25, "city": "Los Angeles"},
  {"name": "Charlie", "age": 35, "city": "Chicago"}
]
Reading this is straightforward:
import pandas as pd
df = pd.read_json('data.json')
print(df)
Output:
      name  age         city
0    Alice   30     New York
1      Bob   25  Los Angeles
2  Charlie   35      Chicago
This works because the JSON structure—an array of objects with consistent keys—maps naturally to rows and columns. Each object becomes a row, each key becomes a column.
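The same call also works on JSON you already have in memory, with one caveat: recent pandas versions expect a path or file-like object rather than a literal string, so wrap in-memory JSON in io.StringIO. A minimal sketch:

```python
import io

import pandas as pd

raw = '[{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]'

# Recent pandas versions treat a bare string as a path, so use a
# file-like wrapper for in-memory JSON
df = pd.read_json(io.StringIO(raw))
print(df.shape)  # (2, 2)
```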
But JSON files rarely stay this simple.
Handling Different JSON Structures
JSON can represent the same data in multiple ways. The orient parameter tells pandas how to interpret the structure. Get this wrong, and you’ll either get an error or a DataFrame that looks nothing like what you expected.
Records Orientation
This is the most common format—an array of row objects:
[
  {"name": "Alice", "age": 30},
  {"name": "Bob", "age": 25}
]
df = pd.read_json('data.json', orient='records')
Columns Orientation
Here, the outer keys are column names, and values are arrays of column data:
{
  "name": ["Alice", "Bob"],
  "age": [30, 25]
}
df = pd.read_json('data.json', orient='columns')
Index Orientation
The outer keys are row indices:
{
  "0": {"name": "Alice", "age": 30},
  "1": {"name": "Bob", "age": 25}
}
df = pd.read_json('data.json', orient='index')
Split Orientation
Explicit separation of columns, index, and data:
{
  "columns": ["name", "age"],
  "index": [0, 1],
  "data": [["Alice", 30], ["Bob", 25]]
}
df = pd.read_json('data.json', orient='split')
My recommendation: when you control the JSON format, use records orientation. It’s the most intuitive and widely supported. When consuming external data, inspect the structure first and choose accordingly.
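One way to "inspect the structure first" is to look at the top-level Python type after a plain json.load(); it usually narrows down the orientation. A rough heuristic sketch (the checks are illustrative, not exhaustive):

```python
import json

def guess_orient(filepath):
    """Peek at the top-level JSON structure; heuristic only."""
    with open(filepath) as f:
        data = json.load(f)
    if isinstance(data, list):
        return 'records'          # array of row objects
    if isinstance(data, dict) and data:
        if set(data) >= {'columns', 'index', 'data'}:
            return 'split'        # explicit columns/index/data keys
        first = next(iter(data.values()))
        if isinstance(first, list):
            return 'columns'      # outer keys map to lists of column values
        if isinstance(first, dict):
            return 'index'        # outer keys map to row objects
    return None                   # empty or scalar: inspect by hand
```

Once you've confirmed the guess matches your data, pass it straight to pd.read_json(filepath, orient=guess_orient(filepath)).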
Reading Nested JSON Data
Real-world JSON is messy. APIs return nested objects, arrays within arrays, and deeply hierarchical structures. pd.read_json() won’t flatten these automatically—you need pd.json_normalize().
Consider this API response:
[
  {
    "id": 1,
    "name": "Alice",
    "address": {
      "street": "123 Main St",
      "city": "New York",
      "zip": "10001"
    },
    "orders": [
      {"order_id": 101, "amount": 50.00},
      {"order_id": 102, "amount": 75.50}
    ]
  },
  {
    "id": 2,
    "name": "Bob",
    "address": {
      "street": "456 Oak Ave",
      "city": "Los Angeles",
      "zip": "90001"
    },
    "orders": [
      {"order_id": 103, "amount": 120.00}
    ]
  }
]
Using pd.read_json() directly creates columns containing dictionaries and lists—not useful for analysis.
import json

# Load JSON data
with open('nested_data.json') as f:
    data = json.load(f)

# Basic flattening of nested objects
df = pd.json_normalize(data)
print(df.columns.tolist())
Output:
['id', 'name', 'orders', 'address.street', 'address.city', 'address.zip']
The nested address object is flattened into dot-notation columns. But orders remains a list. To handle nested arrays, use the record_path parameter:
# Flatten the orders array, preserving parent data
df_orders = pd.json_normalize(
    data,
    record_path='orders',
    meta=['id', 'name', ['address', 'city']],
    meta_prefix='customer_'
)
print(df_orders)
Output:
   order_id  amount  customer_id customer_name customer_address.city
0       101   50.00            1         Alice              New York
1       102   75.50            1         Alice              New York
2       103  120.00            2           Bob           Los Angeles
This is powerful but requires understanding your data structure. Always inspect your JSON before writing normalization code.
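Two optional parameters are worth knowing when flattening: sep swaps the dot separator in the generated column names, and max_level caps how deep the flattening goes. A small sketch:

```python
import pandas as pd

data = [{"id": 1, "address": {"city": "New York", "geo": {"lat": 40.7}}}]

# sep swaps the dot separator in flattened column names
df = pd.json_normalize(data, sep='_')
print(sorted(df.columns))  # ['address_city', 'address_geo_lat', 'id']

# max_level caps the flattening depth; deeper objects stay as dicts
df_shallow = pd.json_normalize(data, max_level=1)
print(sorted(df_shallow.columns))  # ['address.city', 'address.geo', 'id']
```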
Reading JSON from URLs and APIs
Pandas can read JSON directly from URLs, which is convenient for quick API exploration:
# Read directly from a URL
df = pd.read_json('https://api.github.com/users/pandas-dev/repos')
print(df[['name', 'stargazers_count', 'language']].head())
For APIs requiring authentication or custom headers, use requests first:
import requests
# API with authentication
headers = {'Authorization': 'Bearer YOUR_TOKEN'}
response = requests.get('https://api.example.com/data', headers=headers)
response.raise_for_status() # Raise exception for bad status codes
# Parse the response
data = response.json()
df = pd.json_normalize(data['results'])
This approach gives you more control over error handling and lets you inspect the response before parsing. For production code, always prefer explicit HTTP handling over pandas’ built-in URL reading.
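Many APIs also paginate their results. Here is a hedged sketch of collecting every page into one DataFrame; the endpoint, the 'page' query parameter, and the 'results' key are all assumptions about a hypothetical API:

```python
import pandas as pd
import requests

def fetch_all_pages(base_url, token, max_pages=100):
    # The 'page' parameter and 'results' key are hypothetical;
    # adjust them to match the API you are actually calling
    headers = {'Authorization': f'Bearer {token}'}
    frames = []
    for page in range(1, max_pages + 1):
        response = requests.get(base_url, headers=headers, params={'page': page})
        response.raise_for_status()
        results = response.json().get('results', [])
        if not results:
            break  # an empty page means we are done
        frames.append(pd.json_normalize(results))
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

Bounding the loop with max_pages guards against an API that never returns an empty page.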
Common Parameters and Options
JSON Lines Format
JSON Lines (.jsonl) stores one JSON object per line. It’s excellent for large datasets because you can process them incrementally:
{"name": "Alice", "age": 30}
{"name": "Bob", "age": 25}
{"name": "Charlie", "age": 35}
df = pd.read_json('data.jsonl', lines=True)
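The format is just as easy to produce from pandas: DataFrame.to_json with orient='records' and lines=True writes one object per line (the file name here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [30, 25]})

# One JSON object per line, no enclosing array
df.to_json('data.jsonl', orient='records', lines=True)

# Round-trips cleanly
df2 = pd.read_json('data.jsonl', lines=True)
print(df2.equals(df))  # True
```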
For very large files, combine lines=True with chunksize:
# Process in chunks of 10,000 rows
chunks = pd.read_json('large_data.jsonl', lines=True, chunksize=10000)
for chunk in chunks:
    # Process each chunk
    process(chunk)
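A common chunked pattern is filter-then-combine: keep only the rows you need from each chunk so the concatenated result stays small. A self-contained sketch that writes its own sample file first (file name and column names are illustrative):

```python
import pandas as pd

# Stand-in for a genuinely large file
sample = pd.DataFrame({'user': [f'u{i}' for i in range(100)],
                       'score': range(100)})
sample.to_json('large_data.jsonl', orient='records', lines=True)

# Filter each chunk before combining, keeping memory use bounded
reader = pd.read_json('large_data.jsonl', lines=True, chunksize=25)
filtered = [chunk[chunk['score'] >= 90] for chunk in reader]
df = pd.concat(filtered, ignore_index=True)
print(len(df))  # 10
```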
Data Type Control
Pandas infers types, but sometimes incorrectly. Use dtype to override:
df = pd.read_json('data.json', dtype={'zip_code': str, 'id': 'Int64'})
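The classic example is ZIP codes: without an explicit dtype, a value like "01234" is parsed as a number and the leading zero disappears. A quick demonstration using an in-memory string (wrapped in StringIO for recent pandas versions):

```python
import io

import pandas as pd

raw = '[{"zip_code": "01234"}, {"zip_code": "90210"}]'

# Default inference turns the numeric-looking string into a number
df_inferred = pd.read_json(io.StringIO(raw))
print(df_inferred['zip_code'].tolist())  # leading zero lost

# Forcing str keeps the value intact
df_typed = pd.read_json(io.StringIO(raw), dtype={'zip_code': str})
print(df_typed['zip_code'].tolist())  # ['01234', '90210']
```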
Date Parsing
Control date conversion with convert_dates:
# Disable automatic date conversion
df = pd.read_json('data.json', convert_dates=False)
# Or specify which columns to convert
df = pd.read_json('data.json', convert_dates=['created_at', 'updated_at'])
Encoding
Specify encoding for non-UTF-8 files:
df = pd.read_json('data.json', encoding='latin-1')
Troubleshooting Common Errors
Malformed JSON
The most common error. Validate your JSON first:
import json
def safe_read_json(filepath):
    try:
        # First, validate the JSON
        with open(filepath, 'r') as f:
            data = json.load(f)
        # Then create DataFrame
        return pd.json_normalize(data)
    except json.JSONDecodeError as e:
        print(f"Invalid JSON at line {e.lineno}, column {e.colno}: {e.msg}")
        return None
    except Exception as e:
        print(f"Error reading file: {e}")
        return None
df = safe_read_json('data.json')
Memory Errors
Large JSON files can exhaust memory. Solutions:
- Use JSON Lines with chunking (shown above)
- Filter data during loading if your source supports it
- Use ijson for streaming large files:
import ijson
import pandas as pd
def stream_json(filepath, batch_size=10000):
    records = []
    with open(filepath, 'rb') as f:
        for record in ijson.items(f, 'item'):
            records.append(record)
            if len(records) >= batch_size:
                yield pd.DataFrame(records)
                records = []
    if records:
        yield pd.DataFrame(records)
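If the large file is already in JSON Lines format, you can get similar streaming behavior with just the standard library, batching json.loads line by line (a sketch; the file name is illustrative):

```python
import json

import pandas as pd

def stream_jsonl(filepath, batch_size=10000):
    """Yield DataFrames built from batches of JSON Lines records."""
    records = []
    with open(filepath) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            records.append(json.loads(line))
            if len(records) >= batch_size:
                yield pd.DataFrame(records)
                records = []
    if records:
        yield pd.DataFrame(records)

# Example: aggregate without holding the whole file in memory
with open('events.jsonl', 'w') as f:
    for i in range(7):
        f.write(json.dumps({'value': i}) + '\n')

total = sum(batch['value'].sum() for batch in stream_jsonl('events.jsonl', batch_size=3))
print(total)  # 21
```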
Inconsistent Schemas
Real-world JSON often has inconsistent fields across records:
# Handle missing keys gracefully
df = pd.json_normalize(data)
df = df.reindex(columns=['required_col1', 'required_col2', 'optional_col'], fill_value=None)
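To see both behaviors together: json_normalize already fills keys that are missing from individual records with NaN, and reindex then pins down the exact column set you want downstream (the sample records are illustrative):

```python
import pandas as pd

data = [
    {'id': 1, 'email': 'a@example.com'},
    {'id': 2},                       # missing 'email'
    {'id': 3, 'phone': '555-0100'},  # extra 'phone'
]

df = pd.json_normalize(data)
print(int(df['email'].isna().sum()))  # 2

# Guarantee a fixed column set regardless of what the records contain
df = df.reindex(columns=['id', 'email', 'phone'])
print(df.columns.tolist())  # ['id', 'email', 'phone']
```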
Practical Recommendations
After years of working with JSON in pandas, here’s my advice:
- Always inspect first. Print a sample of your JSON before writing parsing code. Structure assumptions will burn you.
- Prefer JSON Lines for large data. It’s streamable, appendable, and recoverable if corruption occurs.
- Use json_normalize for nested data. Don’t try to manually flatten with apply functions—json_normalize handles edge cases you haven’t thought of.
- Validate before parsing. A quick json.load() check catches malformed data before it causes cryptic pandas errors.
- Specify dtypes explicitly. Inferred types cause subtle bugs. Be explicit about what you expect.
JSON’s flexibility is both its strength and its challenge. Pandas provides the tools to tame it—you just need to know which tool fits your specific structure.