Pandas - Write DataFrame to JSON (to_json)
The `to_json()` method converts a pandas DataFrame to a JSON string or file. The simplest usage writes the entire DataFrame with default settings.
Key Insights
- The `to_json()` method offers multiple orientation formats (split, records, index, columns, values, table) that structure your data differently for various use cases like APIs, databases, or web applications
- Control JSON output precision with parameters like `double_precision`, `date_format`, and `indent` to balance file size against readability and compatibility requirements
- For large datasets, use `lines=True` with `orient='records'` to generate newline-delimited JSON (NDJSON) that enables streaming and parallel processing
Basic DataFrame to JSON Conversion
By default, to_json() returns a JSON string; pass a file path to write the output to disk instead.
import pandas as pd
df = pd.DataFrame({
'product_id': [101, 102, 103],
'name': ['Laptop', 'Mouse', 'Keyboard'],
'price': [999.99, 29.99, 79.99],
'in_stock': [True, False, True]
})
# Convert to JSON string
json_string = df.to_json()
print(json_string)
# Write directly to file
df.to_json('products.json')
The default output uses column-oriented format where each column becomes a key with row indices as nested keys. This format is compact but not always intuitive for consumption by other systems.
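To see that structure concretely, the sketch below serializes a small illustrative frame and parses the string back with the standard json module:

```python
import json

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# Default orient for a DataFrame is 'columns':
# each column maps to {index_label: value} pairs
raw = df.to_json()
parsed = json.loads(raw)

print(parsed)
# {'a': {'0': 1, '1': 2}, 'b': {'0': 'x', '1': 'y'}}
```

Note that the row indices become string keys ("0", "1") once the data passes through JSON, which is one reason this default is awkward for downstream consumers.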
Understanding Orient Parameter
The orient parameter fundamentally changes the JSON structure. Each orientation serves specific use cases.
import pandas as pd
df = pd.DataFrame({
'user_id': [1, 2, 3],
'username': ['alice', 'bob', 'charlie'],
'score': [95, 87, 92]
})
# Records: List of objects (ideal for APIs)
records = df.to_json(orient='records')
# [{"user_id":1,"username":"alice","score":95},...]
# Index: Nested objects with index as keys
index = df.to_json(orient='index')
# {"0":{"user_id":1,"username":"alice","score":95},...}
# Columns: Nested objects with columns as keys (default)
columns = df.to_json(orient='columns')
# {"user_id":{"0":1,"1":2,"2":3},...}
# Values: 2D array format
values = df.to_json(orient='values')
# [[1,"alice",95],[2,"bob",87],[3,"charlie",92]]
# Split: Separate arrays for columns, index, and data
split = df.to_json(orient='split')
# {"columns":["user_id","username","score"],"index":[0,1,2],"data":[[1,"alice",95],...]}
# Table: JSON Table Schema format
table = df.to_json(orient='table')
# Includes schema metadata with field types
For REST APIs, orient='records' is typically the best choice. For database exports, orient='split' preserves complete structure including index information.
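A quick way to confirm that orient='split' preserves the complete structure is a round trip through read_json. This is a sketch; the StringIO wrapper is needed because recent pandas versions expect a path or file-like object rather than a literal JSON string:

```python
from io import StringIO

import pandas as pd

# Non-default index to prove it survives the round trip
df = pd.DataFrame({'user_id': [1, 2], 'score': [95, 87]},
                  index=[10, 20])

payload = df.to_json(orient='split')
restored = pd.read_json(StringIO(payload), orient='split')

# Index, columns, and values all come back intact
print(restored.equals(df))  # True
```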
Handling DateTime and Date Formatting
DateTime objects require special attention during JSON serialization. The date_format and date_unit parameters control how temporal data is represented.
import pandas as pd
from datetime import datetime
df = pd.DataFrame({
'event_id': [1, 2, 3],
'timestamp': pd.to_datetime([
'2024-01-15 10:30:00',
'2024-01-16 14:45:00',
'2024-01-17 09:15:00'
]),
'date_only': pd.to_datetime(['2024-01-15', '2024-01-16', '2024-01-17'])
})
# ISO 8601 format (pandas defaults to 'epoch', except for orient='table')
iso_format = df.to_json(orient='records', date_format='iso')
# "timestamp":"2024-01-15T10:30:00.000"
# Unix epoch timestamp in milliseconds (naive datetimes are treated as UTC)
epoch_ms = df.to_json(orient='records', date_format='epoch', date_unit='ms')
# "timestamp":1705314600000
# Unix epoch in seconds
epoch_s = df.to_json(orient='records', date_format='epoch', date_unit='s')
# "timestamp":1705314600
# Custom formatting requires converting before to_json
df['formatted_date'] = df['timestamp'].dt.strftime('%Y-%m-%d %H:%M:%S')
custom = df[['event_id', 'formatted_date']].to_json(orient='records')
ISO format provides human readability and timezone support. Epoch timestamps are more compact and avoid timezone ambiguity but require client-side conversion.
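On the consumer side, epoch output converts back to datetimes with pd.to_datetime and a matching unit. The timestamp below is illustrative (it corresponds to the first row of the example above):

```python
import pandas as pd

# An epoch value in milliseconds, as emitted by
# date_format='epoch', date_unit='ms'
epoch_ms = 1705314600000

ts = pd.to_datetime(epoch_ms, unit='ms')
print(ts)  # 2024-01-15 10:30:00
```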
Controlling Precision and Formatting
Numerical precision and indentation affect both file size and readability. Adjust these based on your requirements.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'measurement_id': [1, 2, 3],
'temperature': [23.456789, 24.123456, 22.987654],
'pressure': [1013.25678, 1012.34567, 1014.45678]
})
# Default precision (10 decimal places)
default = df.to_json(orient='records')
# Limit to 2 decimal places
precise = df.to_json(orient='records', double_precision=2)
# "temperature":23.46,"pressure":1013.26
# Pretty-print with indentation
readable = df.to_json(orient='records', indent=2, double_precision=2)
"""
[
{
"measurement_id": 1,
"temperature": 23.46,
"pressure": 1013.26
},
...
]
"""
# Compact output with no whitespace (indent=None is the default)
compact = df.to_json(orient='records', indent=None)
For production APIs, use minimal precision and no indentation to reduce bandwidth. For configuration files or debugging, use indentation for readability.
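The bandwidth impact is easy to measure by comparing payload lengths. The data here is illustrative; exact sizes depend on your values:

```python
import pandas as pd

df = pd.DataFrame({'id': range(100),
                   'value': [i / 7 for i in range(100)]})

# Reduced precision, no indentation
compact = df.to_json(orient='records', double_precision=3)
# Full default precision plus pretty-printing
pretty = df.to_json(orient='records', indent=2)

# Indentation and full precision both inflate the payload
print(len(compact) < len(pretty))  # True
```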
Newline-Delimited JSON for Large Datasets
NDJSON (newline-delimited JSON) writes each record as a separate JSON object on its own line. This format enables streaming, parallel processing, and append operations.
import pandas as pd
import numpy as np
# Generate large dataset
df = pd.DataFrame({
'transaction_id': range(1, 100001),
'amount': np.random.uniform(10, 1000, 100000),
'customer_id': np.random.randint(1000, 9999, 100000)
})
# Write as NDJSON
df.to_json('transactions.jsonl', orient='records', lines=True)
# Read NDJSON back
df_loaded = pd.read_json('transactions.jsonl', orient='records', lines=True)
# Process in chunks for memory efficiency
chunk_size = 10000
for chunk in pd.read_json('transactions.jsonl', orient='records',
lines=True, chunksize=chunk_size):
# Process each chunk
print(f"Processing {len(chunk)} records")
NDJSON is the standard format for log files, data pipelines, and systems like Apache Kafka or Elasticsearch bulk operations.
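Because each line is an independent JSON document, plain-text tools and streaming consumers can process an NDJSON file record by record without loading pandas at all. A minimal sketch (the file path is illustrative):

```python
import json
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'transaction_id': [1, 2, 3],
                   'amount': [10.5, 20.0, 30.25]})

path = os.path.join(tempfile.mkdtemp(), 'tx.jsonl')
df.to_json(path, orient='records', lines=True)

# Consume the file one record at a time, no DataFrame required
with open(path) as fh:
    records = [json.loads(line) for line in fh]

print(records[0])  # {'transaction_id': 1, 'amount': 10.5}
```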
Handling Missing Values and Data Types
Control how null values and special data types are serialized using additional parameters.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'product_id': [1, 2, 3, 4],
'name': ['Widget', 'Gadget', None, 'Tool'],
'price': [19.99, np.nan, 29.99, 39.99],
'quantity': [10, 0, 5, None]
})
# Default: NaN and None become null
default = df.to_json(orient='records')
# {"name":null,"price":null,...}
# force_ascii=True is the default: non-ASCII characters are escaped
ascii_only = df.to_json(orient='records', force_ascii=True)
# Include the index (index=True is only valid for orients that
# carry index information, such as 'split' or 'table')
with_index = df.to_json(orient='split', index=True)
# Write UTF-8 output with Unicode characters preserved
df.to_json('output.json', orient='records', force_ascii=False,
           indent=2, double_precision=2)
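Both NaN and None serialize as JSON null, which parses back as Python None on the consumer side. A small sketch with illustrative data:

```python
import json

import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Widget', None],
                   'price': [19.99, np.nan]})

# NaN (numeric) and None (object) both become null
parsed = json.loads(df.to_json(orient='records'))
print(parsed)
# [{'name': 'Widget', 'price': 19.99}, {'name': None, 'price': None}]
```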
Compression and Performance Optimization
For large files, enable compression to reduce storage and transfer costs.
import pandas as pd
import numpy as np
# Create large dataset
df = pd.DataFrame({
'id': range(1000000),
'value': np.random.randn(1000000),
'category': np.random.choice(['A', 'B', 'C', 'D'], 1000000)
})
# Write with gzip compression
df.to_json('data.json.gz', orient='records', lines=True, compression='gzip')
# Write with other compression formats
df.to_json('data.json.bz2', orient='records', lines=True, compression='bz2')
df.to_json('data.json.xz', orient='records', lines=True, compression='xz')
# Read compressed file
df_loaded = pd.read_json('data.json.gz', orient='records', lines=True,
compression='gzip')
# Benchmark different approaches
import time
start = time.time()
df.to_json('uncompressed.json', orient='records', lines=True)
uncompressed_time = time.time() - start
start = time.time()
df.to_json('compressed.json.gz', orient='records', lines=True,
compression='gzip')
compressed_time = time.time() - start
print(f"Uncompressed: {uncompressed_time:.2f}s")
print(f"Compressed: {compressed_time:.2f}s")
Gzip compression typically reduces JSON file sizes by 70-90% with minimal CPU overhead. For archival storage, use xz compression for maximum compression ratio.
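The savings can be checked directly with os.path.getsize. The data and paths below are illustrative, and the exact ratio depends on how redundant your data is:

```python
import os
import tempfile

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'id': range(10_000),
                   'category': rng.choice(['A', 'B', 'C', 'D'], 10_000)})

tmp = tempfile.mkdtemp()
plain = os.path.join(tmp, 'data.jsonl')
gz = os.path.join(tmp, 'data.jsonl.gz')

df.to_json(plain, orient='records', lines=True)
df.to_json(gz, orient='records', lines=True, compression='gzip')

# NDJSON repeats every key on every line, so gzip compresses it well
print(os.path.getsize(gz) < os.path.getsize(plain))  # True
```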
Working with MultiIndex DataFrames
MultiIndex DataFrames require careful handling to preserve hierarchical structure.
import pandas as pd
# Create MultiIndex DataFrame
arrays = [
['2024-01', '2024-01', '2024-02', '2024-02'],
['Product A', 'Product B', 'Product A', 'Product B']
]
index = pd.MultiIndex.from_arrays(arrays, names=['month', 'product'])
df = pd.DataFrame({
'sales': [1000, 1500, 1200, 1600],
'units': [50, 75, 60, 80]
}, index=index)
# Split format preserves MultiIndex structure
split = df.to_json(orient='split')
# Table format includes index metadata
table = df.to_json(orient='table', indent=2)
# Reset index to flatten before export
flat = df.reset_index().to_json(orient='records')
# Reliable round trip: export the flattened frame, then rebuild the
# MultiIndex from the former index columns after loading
from io import StringIO
df_loaded = pd.read_json(StringIO(flat), orient='records')
df_loaded = df_loaded.set_index(['month', 'product'])
For most use cases, resetting the index before export and reconstructing it after import provides the most reliable serialization approach.
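If you would rather keep the index serialized, orient='table' embeds it in the Table Schema metadata, and read_json should rebuild the MultiIndex from the schema's primary key. A sketch with illustrative data (the StringIO wrapper is required for literal strings in recent pandas):

```python
from io import StringIO

import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [['2024-01', '2024-02'], ['Product A', 'Product A']],
    names=['month', 'product'])
df = pd.DataFrame({'sales': [1000, 1200]}, index=idx)

# The table schema records the index levels as the primary key
payload = df.to_json(orient='table')
restored = pd.read_json(StringIO(payload), orient='table')

print(restored.index.names)  # ['month', 'product']
```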