Pandas - Read JSON File (read_json)
Key Insights
• Pandas read_json() handles multiple JSON structures including records, split, index, columns, and values orientations, with automatic type inference and nested data flattening capabilities
• Line-delimited JSON (JSONL/NDJSON) enables efficient processing of large datasets via the lines=True parameter, with chunked reading for memory-constrained environments
• Complex nested JSON requires json_normalize() for flattening hierarchical structures, with precise control over separator characters and metadata column prefixes
Basic JSON File Reading
The read_json() function provides straightforward JSON file ingestion with automatic type detection. The simplest use case reads a JSON file directly into a DataFrame:
import pandas as pd
# Read basic JSON file
df = pd.read_json('data.json')
print(df.head())
print(df.dtypes)
For a JSON file structured as an array of objects:
[
    {"name": "Alice", "age": 30, "city": "New York"},
    {"name": "Bob", "age": 25, "city": "San Francisco"},
    {"name": "Charlie", "age": 35, "city": "Chicago"}
]
Pandas automatically creates columns from keys and rows from each object. The orient parameter controls how Pandas interprets the JSON structure:
# Explicit orientation specification
df = pd.read_json('data.json', orient='records')
# Different orientations for different structures
df_split = pd.read_json('data.json', orient='split')
df_index = pd.read_json('data.json', orient='index')
df_columns = pd.read_json('data.json', orient='columns')
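A quick way to confirm the automatic type inference is to round-trip the array-of-objects sample through a file (the temporary directory here just keeps the example self-contained):

```python
import json
import os
import tempfile

import pandas as pd

# The array-of-objects sample from above
records = [
    {"name": "Alice", "age": 30, "city": "New York"},
    {"name": "Bob", "age": 25, "city": "San Francisco"},
    {"name": "Charlie", "age": 35, "city": "Chicago"},
]

# Write it to a JSON file and read it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.json")
    with open(path, "w") as f:
        json.dump(records, f)
    df = pd.read_json(path)

print(df.shape)         # (3, 3)
print(df["age"].dtype)  # int64 - integers are inferred automatically
```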
Working with Different JSON Orientations
JSON data comes in various formats. Understanding orientations prevents parsing errors and ensures correct data interpretation:
import pandas as pd
from io import StringIO  # literal JSON strings must be wrapped; passing them directly is deprecated
# Records orientation (array of objects) - most common
records_json = '''
[
    {"product": "Laptop", "price": 999, "stock": 15},
    {"product": "Mouse", "price": 25, "stock": 150}
]
'''
df_records = pd.read_json(StringIO(records_json), orient='records')
# Columns orientation (object of arrays)
columns_json = '''
{
    "product": ["Laptop", "Mouse"],
    "price": [999, 25],
    "stock": [15, 150]
}
'''
df_columns = pd.read_json(StringIO(columns_json), orient='columns')
# Index orientation (object of objects with row keys)
index_json = '''
{
    "0": {"product": "Laptop", "price": 999, "stock": 15},
    "1": {"product": "Mouse", "price": 25, "stock": 150}
}
'''
df_index = pd.read_json(StringIO(index_json), orient='index')
# Split orientation (separate index, columns, and data)
split_json = '''
{
    "columns": ["product", "price", "stock"],
    "index": [0, 1],
    "data": [["Laptop", 999, 15], ["Mouse", 25, 150]]
}
'''
df_split = pd.read_json(StringIO(split_json), orient='split')
print(df_records.equals(df_columns))  # True
print(df_records.equals(df_index))    # True
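The inverse operation, DataFrame.to_json(), accepts the same orient values, which makes it easy to see what each layout looks like as text:

```python
import pandas as pd

df = pd.DataFrame(
    {"product": ["Laptop", "Mouse"], "price": [999, 25], "stock": [15, 150]}
)

# Each orient serializes the same data in a different layout
print(df.to_json(orient="records"))
# [{"product":"Laptop","price":999,"stock":15},{"product":"Mouse","price":25,"stock":150}]
print(df.to_json(orient="split"))
# {"columns":["product","price","stock"],"index":[0,1],"data":[["Laptop",999,15],["Mouse",25,150]]}
print(df.to_json(orient="columns"))
```

Note that to_json(orient="columns") emits an object of objects ({column: {index: value}}), which read_json also accepts for the columns orientation.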
Line-Delimited JSON (JSONL) Processing
Line-delimited JSON stores one JSON object per line, ideal for streaming data and large files. The lines=True parameter enables this format:
# Create sample JSONL file
jsonl_data = '''\
{"timestamp": "2024-01-01T10:00:00", "user_id": 101, "action": "login", "duration": 2.5}
{"timestamp": "2024-01-01T10:05:00", "user_id": 102, "action": "purchase", "duration": 45.3}
{"timestamp": "2024-01-01T10:10:00", "user_id": 101, "action": "logout", "duration": 1.1}
'''
with open('events.jsonl', 'w') as f:
    f.write(jsonl_data)
# Read JSONL file
df = pd.read_json('events.jsonl', lines=True)
print(df)
# Parse timestamps during read
df = pd.read_json('events.jsonl', lines=True, convert_dates=['timestamp'])
print(df.dtypes)
For large JSONL files, process in chunks to manage memory:
# Read JSONL in chunks
chunk_size = 1000
chunks = []
for chunk in pd.read_json('large_events.jsonl', lines=True, chunksize=chunk_size):
    # Process each chunk
    chunk['processed'] = chunk['duration'] * 2
    chunks.append(chunk)
# Combine all chunks
df_combined = pd.concat(chunks, ignore_index=True)
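When only an aggregate is needed, a running total per chunk avoids materializing every chunk at once. A minimal sketch (the small file written here stands in for a genuinely large one):

```python
import pandas as pd

# Illustrative stand-in for a large JSONL file
with open("agg_events.jsonl", "w") as f:
    f.write('{"user_id": 101, "duration": 2.5}\n')
    f.write('{"user_id": 102, "duration": 45.3}\n')
    f.write('{"user_id": 101, "duration": 1.1}\n')

# Accumulate scalars instead of collecting chunks
total = 0.0
rows = 0
for chunk in pd.read_json("agg_events.jsonl", lines=True, chunksize=2):
    total += chunk["duration"].sum()
    rows += len(chunk)

print(rows, round(total, 1))  # 3 48.9
```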
Handling Nested JSON Structures
Real-world JSON often contains nested objects and arrays. The json_normalize() function flattens hierarchical data:
import pandas as pd
from pandas import json_normalize
# Nested JSON example
nested_json = '''
[
    {
        "id": 1,
        "name": "Alice",
        "address": {
            "street": "123 Main St",
            "city": "New York",
            "zip": "10001"
        },
        "orders": [
            {"order_id": "A1", "amount": 100},
            {"order_id": "A2", "amount": 150}
        ]
    },
    {
        "id": 2,
        "name": "Bob",
        "address": {
            "street": "456 Oak Ave",
            "city": "Boston",
            "zip": "02101"
        },
        "orders": [
            {"order_id": "B1", "amount": 200}
        ]
    }
]
'''
# Load JSON first
import json
data = json.loads(nested_json)
# Flatten nested objects
df = json_normalize(data)
print(df.columns)
# Output: ['id', 'name', 'address.street', 'address.city', 'address.zip', 'orders']
# Flatten with custom separator
df = json_normalize(data, sep='_')
print(df.columns)
# Output: ['id', 'name', 'address_street', 'address_city', 'address_zip', 'orders']
# Flatten nested arrays into separate rows
df = json_normalize(
    data,
    record_path='orders',
    meta=['id', 'name', ['address', 'city']],
    meta_prefix='user_'
)
print(df)
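When full flattening is not needed, json_normalize() also takes a max_level argument (available in modern pandas) that caps how deep it descends; deeper dictionaries are kept intact as column values:

```python
from pandas import json_normalize

# Hypothetical two-level nested records
data = [
    {"id": 1, "address": {"city": "New York", "geo": {"lat": 40.7, "lon": -74.0}}},
    {"id": 2, "address": {"city": "Boston", "geo": {"lat": 42.4, "lon": -71.1}}},
]

# max_level=1 flattens one level; 'geo' stays a dict
df = json_normalize(data, max_level=1)
print(list(df.columns))
# ['id', 'address.city', 'address.geo']
```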
Advanced Reading Options
Control parsing behavior with additional parameters for precision and performance:
# Specify data types explicitly
dtype_spec = {
    'user_id': 'int32',
    'amount': 'float64',
    'category': 'category'
}
df = pd.read_json('data.json', dtype=dtype_spec)
# Disable automatic conversion of date-like columns
df = pd.read_json('data.json', keep_default_dates=False)
# Set specific date columns and format
df = pd.read_json(
    'data.json',
    convert_dates=['created_at', 'updated_at'],
    date_unit='ms'  # milliseconds since epoch
)
# Precise numeric handling
df = pd.read_json('data.json', precise_float=True)
# Custom encoding for non-UTF8 files
df = pd.read_json('data.json', encoding='latin-1')
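A minimal check that these options actually take effect, using a small file written on the spot (the file name is illustrative):

```python
import pandas as pd

# Two records with a numeric column and an ISO-format date column
with open("typed.jsonl", "w") as f:
    f.write('{"user_id": 1, "amount": 9.5, "created_at": "2024-01-01T10:00:00"}\n')
    f.write('{"user_id": 2, "amount": 3.25, "created_at": "2024-01-02T11:30:00"}\n')

df = pd.read_json(
    "typed.jsonl",
    lines=True,
    dtype={"user_id": "int32"},
    convert_dates=["created_at"],
)
print(df.dtypes)  # user_id int32, amount float64, created_at datetime64[ns]
```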
Reading JSON from URLs and Strings
Read JSON directly from web APIs or string variables without intermediate file storage:
# Read from URL
url = 'https://api.example.com/data.json'
df = pd.read_json(url)
# Read from a string variable (wrap it in StringIO; passing literal
# strings directly is deprecated in recent pandas versions)
from io import StringIO
json_string = '{"col1": [1, 2], "col2": [3, 4]}'
df = pd.read_json(StringIO(json_string))
# Read from compressed files
df = pd.read_json('data.json.gz', compression='gzip')
df = pd.read_json('data.json.bz2', compression='bz2')
# Auto-detect compression
df = pd.read_json('data.json.gz', compression='infer')
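Because to_json() also infers compression from the file extension, a round trip makes a handy sanity check (the file name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# The .gz suffix triggers gzip compression on write...
df.to_json("roundtrip.json.gz", orient="records")

# ...and compression='infer' (the default) detects it on read
df2 = pd.read_json("roundtrip.json.gz", orient="records")
print(df.equals(df2))  # True
```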
Error Handling and Validation
Implement robust error handling for production environments:
import pandas as pd

def safe_read_json(filepath, expected_columns=None, **kwargs):
    """Safely read JSON with comprehensive error handling.

    expected_columns is taken as an explicit parameter so it is not
    forwarded to pd.read_json(), which would reject it as an unknown
    keyword argument.
    """
    try:
        df = pd.read_json(filepath, **kwargs)
        # Validate DataFrame isn't empty
        if df.empty:
            raise ValueError(f"JSON file {filepath} produced empty DataFrame")
        # Check for expected columns
        if expected_columns:
            missing = set(expected_columns) - set(df.columns)
            if missing:
                raise ValueError(f"Missing expected columns: {missing}")
        return df
    except FileNotFoundError:
        print(f"File not found: {filepath}")
        raise
    except ValueError as e:
        # pandas raises ValueError for malformed JSON
        # (json.JSONDecodeError is a ValueError subclass)
        print(f"Invalid JSON or validation error in {filepath}: {e}")
        raise
# Usage
df = safe_read_json(
    'data.json',
    lines=True,
    expected_columns=['id', 'name', 'value']
)
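One detail worth knowing: pandas reports malformed JSON as a ValueError (json.JSONDecodeError is itself a ValueError subclass), so a single except ValueError clause covers both cases:

```python
from io import StringIO

import pandas as pd

caught = None
try:
    pd.read_json(StringIO('{"broken": '))  # truncated, invalid JSON
except ValueError as exc:
    caught = exc

print(type(caught).__name__)
```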
Performance Optimization
Optimize JSON reading for large datasets:
# Use dtype specification to reduce memory
df = pd.read_json(
    'large_data.jsonl',
    lines=True,
    dtype={
        'id': 'int32',
        'category': 'category',
        'value': 'float32'
    }
)
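The 'category' dtype pays off whenever a string column has few distinct values. A quick comparison (absolute sizes vary by platform, so only the direction matters):

```python
import pandas as pd

# Repeating string values compress well as 'category'
values = ["alpha", "beta", "gamma"] * 10_000
as_object = pd.Series(values, dtype="object")
as_category = pd.Series(values, dtype="category")

obj_bytes = as_object.memory_usage(deep=True)
cat_bytes = as_category.memory_usage(deep=True)
print(obj_bytes > cat_bytes)  # True: category stores codes plus a small lookup table
```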
# Process in chunks for memory efficiency
def process_large_jsonl(filepath, chunk_size=10000):
    """Process large JSONL files in chunks"""
    results = []
    for chunk in pd.read_json(filepath, lines=True, chunksize=chunk_size):
        # Apply transformations
        processed = chunk[chunk['value'] > 0].copy()
        results.append(processed)
    return pd.concat(results, ignore_index=True)
# Compressed files are decompressed transparently; smaller files mean faster I/O
df = pd.read_json('data.json.gz', compression='gzip', lines=True)
The read_json() function provides comprehensive JSON ingestion capabilities. Select the appropriate orientation, leverage json_normalize() for nested structures, and implement chunked reading for large datasets to build robust data pipelines.