Pandas - Read JSON File (read_json)
Key Insights
• Pandas read_json() handles multiple JSON structures including records, split, index, columns, and values orientations, with automatic type inference and nested data flattening capabilities
• Line-delimited JSON (JSONL/NDJSON) enables efficient processing of large datasets via the lines=True parameter, with chunked reading for memory-constrained environments
• Complex nested JSON requires json_normalize() for flattening hierarchical structures, with precise control over separator characters and metadata column prefixes
Basic JSON File Reading
The read_json() function provides straightforward JSON file ingestion with automatic type detection. The simplest use case reads a JSON file directly into a DataFrame:
import pandas as pd
# Read basic JSON file
df = pd.read_json('data.json')
print(df.head())
print(df.dtypes)
For a JSON file structured as an array of objects:
[
    {"name": "Alice", "age": 30, "city": "New York"},
    {"name": "Bob", "age": 25, "city": "San Francisco"},
    {"name": "Charlie", "age": 35, "city": "Chicago"}
]
Pandas automatically creates columns from keys and rows from each object. The orient parameter controls how Pandas interprets the JSON structure:
# Explicit orientation specification
df = pd.read_json('data.json', orient='records')
# Different orientations for different structures
df_split = pd.read_json('data.json', orient='split')
df_index = pd.read_json('data.json', orient='index')
df_columns = pd.read_json('data.json', orient='columns')
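A quick way to confirm the automatic type inference is to round-trip the array-of-objects sample through a file (the temporary directory here just keeps the example self-contained):

```python
import json
import os
import tempfile

import pandas as pd

# The array-of-objects sample from above
records = [
    {"name": "Alice", "age": 30, "city": "New York"},
    {"name": "Bob", "age": 25, "city": "San Francisco"},
    {"name": "Charlie", "age": 35, "city": "Chicago"},
]

# Write it to a JSON file and read it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.json")
    with open(path, "w") as f:
        json.dump(records, f)
    df = pd.read_json(path)

print(df.shape)         # (3, 3)
print(df["age"].dtype)  # int64 - integers are inferred automatically
```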
Working with Different JSON Orientations
JSON data comes in various formats. Understanding orientations prevents parsing errors and ensures correct data interpretation:
import pandas as pd
from io import StringIO  # literal JSON strings must be wrapped; passing them directly is deprecated
# Records orientation (array of objects) - most common
records_json = '''
[
    {"product": "Laptop", "price": 999, "stock": 15},
    {"product": "Mouse", "price": 25, "stock": 150}
]
'''
df_records = pd.read_json(StringIO(records_json), orient='records')
# Columns orientation (object of arrays)
columns_json = '''
{
    "product": ["Laptop", "Mouse"],
    "price": [999, 25],
    "stock": [15, 150]
}
'''
df_columns = pd.read_json(StringIO(columns_json), orient='columns')
# Index orientation (object of objects with row keys)
index_json = '''
{
    "0": {"product": "Laptop", "price": 999, "stock": 15},
    "1": {"product": "Mouse", "price": 25, "stock": 150}
}
'''
df_index = pd.read_json(StringIO(index_json), orient='index')
# Split orientation (separate index, columns, and data)
split_json = '''
{
    "columns": ["product", "price", "stock"],
    "index": [0, 1],
    "data": [["Laptop", 999, 15], ["Mouse", 25, 150]]
}
'''
df_split = pd.read_json(StringIO(split_json), orient='split')
print(df_records.equals(df_columns))  # True
print(df_records.equals(df_index))    # True
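The inverse operation, DataFrame.to_json(), accepts the same orient values, which makes it easy to see what each layout looks like as text:

```python
import pandas as pd

df = pd.DataFrame(
    {"product": ["Laptop", "Mouse"], "price": [999, 25], "stock": [15, 150]}
)

# Each orient serializes the same data in a different layout
print(df.to_json(orient="records"))
# [{"product":"Laptop","price":999,"stock":15},{"product":"Mouse","price":25,"stock":150}]
print(df.to_json(orient="split"))
# {"columns":["product","price","stock"],"index":[0,1],"data":[["Laptop",999,15],["Mouse",25,150]]}
print(df.to_json(orient="columns"))
```

Note that to_json(orient="columns") emits an object of objects ({column: {index: value}}), which read_json also accepts for the columns orientation.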
Line-Delimited JSON (JSONL) Processing
Line-delimited JSON stores one JSON object per line, ideal for streaming data and large files. The lines=True parameter enables this format:
# Create sample JSONL file
jsonl_data = '''\
{"timestamp": "2024-01-01T10:00:00", "user_id": 101, "action": "login", "duration": 2.5}
{"timestamp": "2024-01-01T10:05:00", "user_id": 102, "action": "purchase", "duration": 45.3}
{"timestamp": "2024-01-01T10:10:00", "user_id": 101, "action": "logout", "duration": 1.1}
'''
with open('events.jsonl', 'w') as f:
    f.write(jsonl_data)
# Read JSONL file
df = pd.read_json('events.jsonl', lines=True)
print(df)
# Parse timestamps during read
df = pd.read_json('events.jsonl', lines=True, convert_dates=['timestamp'])
print(df.dtypes)
For large JSONL files, process in chunks to manage memory:
# Read JSONL in chunks
chunk_size = 1000
chunks = []
for chunk in pd.read_json('large_events.jsonl', lines=True, chunksize=chunk_size):
    # Process each chunk
    chunk['processed'] = chunk['duration'] * 2
    chunks.append(chunk)
# Combine all chunks
df_combined = pd.concat(chunks, ignore_index=True)
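When only an aggregate is needed, a running total per chunk avoids materializing every chunk at once. A minimal sketch (the small file written here stands in for a genuinely large one):

```python
import pandas as pd

# Illustrative stand-in for a large JSONL file
with open("agg_events.jsonl", "w") as f:
    f.write('{"user_id": 101, "duration": 2.5}\n')
    f.write('{"user_id": 102, "duration": 45.3}\n')
    f.write('{"user_id": 101, "duration": 1.1}\n')

# Accumulate scalars instead of collecting chunks
total = 0.0
rows = 0
for chunk in pd.read_json("agg_events.jsonl", lines=True, chunksize=2):
    total += chunk["duration"].sum()
    rows += len(chunk)

print(rows, round(total, 1))  # 3 48.9
```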
Handling Nested JSON Structures
Real-world JSON often contains nested objects and arrays. The json_normalize() function flattens hierarchical data:
import pandas as pd
from pandas import json_normalize
# Nested JSON example
nested_json = '''
[
    {
        "id": 1,
        "name": "Alice",
        "address": {
            "street": "123 Main St",
            "city": "New York",
            "zip": "10001"
        },
        "orders": [
            {"order_id": "A1", "amount": 100},
            {"order_id": "A2", "amount": 150}
        ]
    },
    {
        "id": 2,
        "name": "Bob",
        "address": {
            "street": "456 Oak Ave",
            "city": "Boston",
            "zip": "02101"
        },
        "orders": [
            {"order_id": "B1", "amount": 200}
        ]
    }
]
'''
# Load JSON first
import json
data = json.loads(nested_json)
# Flatten nested objects
df = json_normalize(data)
print(df.columns)
# Output: ['id', 'name', 'address.street', 'address.city', 'address.zip', 'orders']
# Flatten with custom separator
df = json_normalize(data, sep='_')
print(df.columns)
# Output: ['id', 'name', 'address_street', 'address_city', 'address_zip', 'orders']
# Flatten nested arrays into separate rows
df = json_normalize(
    data,
    record_path='orders',
    meta=['id', 'name', ['address', 'city']],
    meta_prefix='user_'
)
print(df)
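When full flattening is not needed, json_normalize() also takes a max_level argument (available in modern pandas) that caps how deep it descends; deeper dictionaries are kept intact as column values:

```python
from pandas import json_normalize

# Hypothetical two-level nested records
data = [
    {"id": 1, "address": {"city": "New York", "geo": {"lat": 40.7, "lon": -74.0}}},
    {"id": 2, "address": {"city": "Boston", "geo": {"lat": 42.4, "lon": -71.1}}},
]

# max_level=1 flattens one level; 'geo' stays a dict
df = json_normalize(data, max_level=1)
print(list(df.columns))
# ['id', 'address.city', 'address.geo']
```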
Advanced Reading Options
Control parsing behavior with additional parameters for precision and performance:
# Specify data types explicitly
dtype_spec = {
    'user_id': 'int32',
    'amount': 'float64',
    'category': 'category'
}
df = pd.read_json('data.json', dtype=dtype_spec)
# Disable automatic conversion of date-like columns
df = pd.read_json('data.json', keep_default_dates=False)
# Set specific date columns and format
df = pd.read_json(
    'data.json',
    convert_dates=['created_at', 'updated_at'],
    date_unit='ms'  # milliseconds since epoch
)
# Precise numeric handling
df = pd.read_json('data.json', precise_float=True)
# Custom encoding for non-UTF8 files
df = pd.read_json('data.json', encoding='latin-1')
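A minimal check that these options actually take effect, using a small file written on the spot (the file name is illustrative):

```python
import pandas as pd

# Two records with a numeric column and an ISO-format date column
with open("typed.jsonl", "w") as f:
    f.write('{"user_id": 1, "amount": 9.5, "created_at": "2024-01-01T10:00:00"}\n')
    f.write('{"user_id": 2, "amount": 3.25, "created_at": "2024-01-02T11:30:00"}\n')

df = pd.read_json(
    "typed.jsonl",
    lines=True,
    dtype={"user_id": "int32"},
    convert_dates=["created_at"],
)
print(df.dtypes)  # user_id int32, amount float64, created_at datetime64[ns]
```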
Reading JSON from URLs and Strings
Read JSON directly from web APIs or string variables without intermediate file storage:
# Read from URL
url = 'https://api.example.com/data.json'
df = pd.read_json(url)
# Read from a string variable (wrap it in StringIO; passing literal
# strings directly is deprecated in recent pandas versions)
from io import StringIO
json_string = '{"col1": [1, 2], "col2": [3, 4]}'
df = pd.read_json(StringIO(json_string))
# Read from compressed files
df = pd.read_json('data.json.gz', compression='gzip')
df = pd.read_json('data.json.bz2', compression='bz2')
# Auto-detect compression
df = pd.read_json('data.json.gz', compression='infer')
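Because to_json() also infers compression from the file extension, a round trip makes a handy sanity check (the file name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# The .gz suffix triggers gzip compression on write...
df.to_json("roundtrip.json.gz", orient="records")

# ...and compression='infer' (the default) detects it on read
df2 = pd.read_json("roundtrip.json.gz", orient="records")
print(df.equals(df2))  # True
```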
Error Handling and Validation
Implement robust error handling for production environments:
import pandas as pd

def safe_read_json(filepath, expected_columns=None, **kwargs):
    """Safely read JSON with comprehensive error handling.

    expected_columns is taken as an explicit parameter so it is not
    forwarded to pd.read_json(), which would reject it as an unknown
    keyword argument.
    """
    try:
        df = pd.read_json(filepath, **kwargs)
        # Validate DataFrame isn't empty
        if df.empty:
            raise ValueError(f"JSON file {filepath} produced empty DataFrame")
        # Check for expected columns
        if expected_columns:
            missing = set(expected_columns) - set(df.columns)
            if missing:
                raise ValueError(f"Missing expected columns: {missing}")
        return df
    except FileNotFoundError:
        print(f"File not found: {filepath}")
        raise
    except ValueError as e:
        # pandas raises ValueError for malformed JSON
        # (json.JSONDecodeError is a ValueError subclass)
        print(f"Invalid JSON or validation error in {filepath}: {e}")
        raise
# Usage
df = safe_read_json(
    'data.json',
    lines=True,
    expected_columns=['id', 'name', 'value']
)
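One detail worth knowing: pandas reports malformed JSON as a ValueError (json.JSONDecodeError is itself a ValueError subclass), so a single except ValueError clause covers both cases:

```python
from io import StringIO

import pandas as pd

caught = None
try:
    pd.read_json(StringIO('{"broken": '))  # truncated, invalid JSON
except ValueError as exc:
    caught = exc

print(type(caught).__name__)
```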
Performance Optimization
Optimize JSON reading for large datasets:
# Use dtype specification to reduce memory
df = pd.read_json(
    'large_data.jsonl',
    lines=True,
    dtype={
        'id': 'int32',
        'category': 'category',
        'value': 'float32'
    }
)
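The 'category' dtype pays off whenever a string column has few distinct values. A quick comparison (absolute sizes vary by platform, so only the direction matters):

```python
import pandas as pd

# Repeating string values compress well as 'category'
values = ["alpha", "beta", "gamma"] * 10_000
as_object = pd.Series(values, dtype="object")
as_category = pd.Series(values, dtype="category")

obj_bytes = as_object.memory_usage(deep=True)
cat_bytes = as_category.memory_usage(deep=True)
print(obj_bytes > cat_bytes)  # True: category stores codes plus a small lookup table
```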
# Process in chunks for memory efficiency
def process_large_jsonl(filepath, chunk_size=10000):
    """Process large JSONL files in chunks"""
    results = []
    for chunk in pd.read_json(filepath, lines=True, chunksize=chunk_size):
        # Apply transformations
        processed = chunk[chunk['value'] > 0].copy()
        results.append(processed)
    return pd.concat(results, ignore_index=True)
# Compressed files are decompressed transparently; smaller files mean faster I/O
df = pd.read_json('data.json.gz', compression='gzip', lines=True)
The read_json() function provides comprehensive JSON ingestion capabilities. Select the appropriate orientation, leverage json_normalize() for nested structures, and implement chunked reading for large datasets to build robust data pipelines.