How to Read JSON Files in PySpark

Key Insights

  • Always define explicit schemas in production—schema inference reads your entire dataset twice and fails silently on inconsistent data
  • Use multiLine=true only when necessary; single-line JSON (NDJSON) is significantly faster to parse in distributed environments
  • Set mode to FAILFAST during development to catch data quality issues early, then switch to PERMISSIVE with corrupt record logging in production

Introduction

JSON has become the lingua franca of data interchange. Whether you’re processing API responses, application logs, configuration dumps, or event streams, you’ll inevitably encounter JSON files that need to land in your data lake or warehouse. PySpark handles JSON natively, but the default settings often aren’t what you want in production.

This article covers the practical aspects of reading JSON in PySpark—from basic file loading to handling malformed records and deeply nested structures. I’ll focus on the decisions that matter: when to define schemas explicitly, how to handle different JSON formats, and what options actually affect your job performance.

Basic JSON Reading with spark.read.json()

The simplest way to read JSON in PySpark uses the DataFrameReader interface. Here’s the minimal approach:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JSONReader").getOrCreate()

# Read a single JSON file
df = spark.read.json("data/events.json")

# Inspect the results
df.show(5, truncate=False)
df.printSchema()

You can also read multiple files at once by passing a list of paths or using glob patterns:

# Multiple specific files
df = spark.read.json(["data/events_2024_01.json", "data/events_2024_02.json"])

# Glob pattern for all JSON files in a directory
df = spark.read.json("data/events_*.json")

# Entire directory
df = spark.read.json("data/events/")

This works, but it’s doing more than you might expect. PySpark scans your data to infer the schema (all of it, by default), which means it reads through your files once before actually loading them. For small files, this is fine. For terabytes of JSON logs, you’re doubling your I/O costs.

Handling Different JSON Formats

JSON files come in two flavors that PySpark treats very differently: single-line (JSON Lines/NDJSON) and multi-line.

JSON Lines format puts one complete JSON object per line:

{"user_id": 1, "event": "login", "timestamp": "2024-01-15T10:30:00Z"}
{"user_id": 2, "event": "purchase", "timestamp": "2024-01-15T10:31:00Z"}
{"user_id": 1, "event": "logout", "timestamp": "2024-01-15T11:00:00Z"}

This is PySpark’s default expectation and the format you should prefer. It’s splittable—meaning Spark can distribute different lines to different executors without coordination.

Multi-line JSON is what you get from most APIs and from pretty-printed serialization such as JSON.stringify(data, null, 2):

[
  {
    "user_id": 1,
    "event": "login",
    "timestamp": "2024-01-15T10:30:00Z"
  },
  {
    "user_id": 2,
    "event": "purchase",
    "timestamp": "2024-01-15T10:31:00Z"
  }
]

To read multi-line JSON, you must set the multiLine option:

# Multi-line JSON (standard JSON array or pretty-printed)
df = spark.read.option("multiLine", "true").json("data/api_response.json")

# JSON Lines (default behavior, explicit for clarity)
df = spark.read.option("multiLine", "false").json("data/events.jsonl")

Here’s the critical point: multi-line JSON files cannot be split across executors. Each file must be read by a single task. If you have a 10GB multi-line JSON file, one executor handles all of it while others sit idle. Convert to JSON Lines format during ingestion whenever possible.
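If a multi-line file is small enough to preprocess on a single machine, you can convert it to JSON Lines before Spark ever sees it. A minimal sketch using only the standard json module (the function name and paths are mine, not a PySpark API):

```python
import json

def array_to_jsonl(src_path: str, dst_path: str) -> int:
    """Rewrite a file containing one JSON array as JSON Lines.

    Loads the whole array into memory, so this only suits files that
    fit on one machine. Returns the number of records written.
    """
    with open(src_path) as src:
        records = json.load(src)
    with open(dst_path, "w") as dst:
        for record in records:
            dst.write(json.dumps(record) + "\n")
    return len(records)
```

For files too large for memory, a streaming parser or Spark’s own multiLine reader followed by df.write.json(...) can do the same conversion in a distributed way.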

Schema Definition and Inference

Schema inference is convenient but dangerous. It scans your data (all of it by default; the samplingRatio option controls the fraction), infers types, and hopes the result matches what downstream code expects. When it doesn’t, you get nulls or runtime errors.

Define your schema explicitly:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType, DoubleType

# Define schema explicitly
event_schema = StructType([
    StructField("user_id", IntegerType(), nullable=False),
    StructField("event", StringType(), nullable=False),
    StructField("timestamp", TimestampType(), nullable=True),
    StructField("metadata", StructType([
        StructField("ip_address", StringType(), nullable=True),
        StructField("user_agent", StringType(), nullable=True),
        StructField("session_id", StringType(), nullable=True)
    ]), nullable=True),
    StructField("amount", DoubleType(), nullable=True)
])

# Read with explicit schema
df = spark.read.schema(event_schema).json("data/events/")

Compare the behavior:

# Inference: reads data twice, may get types wrong
df_inferred = spark.read.json("data/events/")
print("Inferred schema:")
df_inferred.printSchema()

# Explicit: reads data once, fails fast on type mismatches
df_explicit = spark.read.schema(event_schema).json("data/events/")
print("Explicit schema:")
df_explicit.printSchema()

With explicit schemas, you get:

  • Faster reads: No sampling pass required
  • Consistent types: A field is always IntegerType, never sometimes LongType and sometimes StringType
  • Documentation: Your schema serves as a contract for what the data should contain
  • Earlier failures: Type mismatches surface immediately, not three transformations later

The only time I use schema inference is during initial exploration of unfamiliar data. For anything running in production, define your schema.

Common Options and Configuration

PySpark’s JSON reader has several options that affect parsing behavior. Here are the ones that matter:

df = spark.read \
    .option("multiLine", "false") \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .option("dateFormat", "yyyy-MM-dd") \
    .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]") \
    .option("primitivesAsString", "false") \
    .option("allowComments", "true") \
    .schema(event_schema) \
    .json("data/events/")

Error handling modes determine what happens with malformed records:

# PERMISSIVE (default): Nulls for corrupt fields, optionally capture original.
# Copy the schema rather than calling event_schema.add(), which mutates
# (and returns) the original and would leak _corrupt_record into later reads.
permissive_schema = StructType(
    event_schema.fields + [StructField("_corrupt_record", StringType(), True)]
)
df_permissive = spark.read \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .schema(permissive_schema) \
    .json("data/events/")

# Check for corrupt records. Cache first: Spark rejects queries that
# reference only the corrupt record column on an uncached JSON scan.
df_permissive.cache()
corrupt_count = df_permissive.filter("_corrupt_record IS NOT NULL").count()
print(f"Found {corrupt_count} corrupt records")

# DROPMALFORMED: Silently skip bad records
df_dropped = spark.read \
    .option("mode", "DROPMALFORMED") \
    .schema(event_schema) \
    .json("data/events/")

# FAILFAST: Throw exception on first bad record
df_strict = spark.read \
    .option("mode", "FAILFAST") \
    .schema(event_schema) \
    .json("data/events/")

My recommendation: use FAILFAST during development and testing. You want to know immediately when your schema doesn’t match reality. In production, switch to PERMISSIVE with corrupt record capture so you can monitor data quality without failing entire jobs.

Other useful options:

# Allow JavaScript-style comments in JSON
.option("allowComments", "true")

# Allow single quotes instead of double quotes
.option("allowSingleQuotes", "true")

# Allow unquoted field names
.option("allowUnquotedFieldNames", "true")

# Read all values as strings (useful for initial exploration)
.option("primitivesAsString", "true")

Reading Nested JSON Structures

Real-world JSON is rarely flat. Here’s how to work with nested structures:

# Sample nested JSON structure
# {"order_id": 1, "customer": {"name": "Alice", "email": "alice@example.com"}, 
#  "items": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}]}

from pyspark.sql.functions import col, explode, explode_outer

# Read nested JSON
df = spark.read.json("data/orders.json")

# Access nested fields with dot notation
df.select(
    col("order_id"),
    col("customer.name").alias("customer_name"),
    col("customer.email").alias("customer_email")
).show()

# Explode arrays to create one row per array element
df_items = df.select(
    col("order_id"),
    explode(col("items")).alias("item")
).select(
    col("order_id"),
    col("item.sku"),
    col("item.qty")
)

df_items.show()

# Use explode_outer to keep rows even when array is null/empty
df_items_all = df.select(
    col("order_id"),
    explode_outer(col("items")).alias("item")
)

For deeply nested structures, you can chain operations or use getItem() for array access:

from pyspark.sql.functions import col

# Access specific array element by index
df.select(
    col("order_id"),
    col("items").getItem(0).alias("first_item")
).show()

# Flatten everything in one go
df_flat = df.select(
    col("order_id"),
    col("customer.name").alias("customer_name"),
    col("customer.email").alias("customer_email"),
    explode("items").alias("item")
).select(
    col("order_id"),
    col("customer_name"),
    col("customer_email"),
    col("item.sku").alias("item_sku"),
    col("item.qty").alias("item_qty")
)

Conclusion

Reading JSON in PySpark is straightforward once you understand the tradeoffs. Here’s what matters:

Use explicit schemas. The performance gain from skipping inference is real, and the type safety prevents silent data corruption. Define your schemas in a shared module and version them alongside your code.

Prefer JSON Lines format. If you control the data source, emit one JSON object per line. Your Spark jobs will parallelize properly and run faster.

Choose your error handling deliberately. FAILFAST for development, PERMISSIVE with corrupt record capture for production. Never use DROPMALFORMED unless you genuinely don’t care about data loss.

Partition large datasets. If you’re reading JSON files repeatedly, consider converting them to Parquet after initial ingestion. The columnar format is dramatically more efficient for analytical queries.

JSON is everywhere, and PySpark handles it well—but only if you configure it correctly for your use case.
