PySpark - Write DataFrame to JSON File

Key Insights

  • PySpark provides multiple methods to write DataFrames to JSON with control over file structure, partitioning, and compression options
  • Understanding the difference between write.json() and toJSON() is critical—the former writes to disk while the latter returns RDD strings
  • Proper partition management and coalesce operations prevent excessive small file creation that degrades performance in distributed systems

Basic JSON Write Operations

Writing a PySpark DataFrame to JSON requires the DataFrameWriter API. The simplest approach uses the write.json() method with a target path.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("JSONWriter").getOrCreate()

# Create sample DataFrame
data = [
    ("John", 28, "Engineering"),
    ("Sarah", 34, "Marketing"),
    ("Mike", 31, "Sales")
]

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("department", StringType(), True)
])

df = spark.createDataFrame(data, schema)

# Write to JSON
df.write.json("output/employees")

This creates a directory at output/employees containing multiple JSON files, one per partition. Each line in these files is a single JSON object (JSON Lines, or JSONL, format).

Controlling Write Modes

PySpark supports four write modes that determine behavior when the target path exists.

# Overwrite existing data
df.write.mode("overwrite").json("output/employees")

# Append to existing data
df.write.mode("append").json("output/employees")

# Error if path exists (default; "errorifexists" is an accepted alias)
df.write.mode("error").json("output/employees")

# Ignore write if path exists
df.write.mode("ignore").json("output/employees")

The overwrite mode deletes existing data before writing. Use append for incremental loads, but ensure schema compatibility between existing and new data.

Single File Output with Coalesce

By default, PySpark writes one file per partition. To generate a single JSON file, reduce the DataFrame to one partition.

# Write to single file
df.coalesce(1).write.json("output/employees_single")

# The output directory contains one JSON file plus metadata
# Access the actual file: output/employees_single/part-00000-*.json

Use coalesce(1) cautiously with large datasets—it forces all data through a single executor, creating a performance bottleneck. For datasets exceeding memory capacity of a single node, this approach fails.

# Alternative: keep upstream stages parallel
df.repartition(1).write.json("output/employees_single")

While repartition(1) triggers a full shuffle (more expensive than coalesce), it keeps the upstream transformations running in parallel and only funnels data into one task for the final write. Because coalesce(1) is a narrow transformation, it can collapse those upstream stages into a single task as well.

Partitioning Output Files

Partition output by specific columns to organize data hierarchically, enabling efficient filtering in downstream processes.

# Partition by department
df.write.partitionBy("department").json("output/employees_partitioned")

# Creates structure: output/employees_partitioned/department=Engineering/part-*.json
#                    output/employees_partitioned/department=Marketing/part-*.json

Multiple partition columns create nested directories:

# Add year column for demonstration
from pyspark.sql.functions import lit

df_with_year = df.withColumn("year", lit(2024))

# Partition by multiple columns
df_with_year.write.partitionBy("year", "department").json("output/employees_multi_partition")

# Structure: output/employees_multi_partition/year=2024/department=Engineering/part-*.json

Partitioning works best with columns having low cardinality. High cardinality creates excessive directories, degrading filesystem performance.

Compression Options

Compress JSON output to reduce storage costs and I/O overhead.

# Gzip compression
df.write.option("compression", "gzip").json("output/employees_compressed")

# Snappy compression (faster, less compression)
df.write.option("compression", "snappy").json("output/employees_snappy")

# Bzip2 compression (slower, better compression)
df.write.option("compression", "bzip2").json("output/employees_bzip2")

# No compression
df.write.option("compression", "none").json("output/employees_uncompressed")

Gzip provides the best balance for most use cases. Snappy excels when decompression speed matters more than file size. The output files have corresponding extensions: .json.gz, .json.snappy, .json.bz2.

Handling Nested Structures

PySpark preserves complex data types when writing JSON, including nested structures and arrays.

from pyspark.sql.types import ArrayType
from pyspark.sql.functions import array, struct

# Create DataFrame with nested data
complex_data = [
    ("John", ["Python", "Scala"], {"city": "NYC", "country": "USA"}),
    ("Sarah", ["Java", "SQL"], {"city": "SF", "country": "USA"})
]

complex_schema = StructType([
    StructField("name", StringType(), True),
    StructField("skills", ArrayType(StringType()), True),
    StructField("location", StructType([
        StructField("city", StringType(), True),
        StructField("country", StringType(), True)
    ]), True)
])

df_complex = spark.createDataFrame(complex_data, complex_schema)
df_complex.write.json("output/employees_complex")

Output JSON preserves the hierarchy:

{"name":"John","skills":["Python","Scala"],"location":{"city":"NYC","country":"USA"}}

Date and Timestamp Formatting

Control date and timestamp format in JSON output using options.

from pyspark.sql.functions import current_date, current_timestamp

df_dates = df.withColumn("hire_date", current_date()) \
             .withColumn("last_update", current_timestamp())

# Default ISO 8601 format
df_dates.write.json("output/employees_dates_default")

# Custom date format
df_dates.write \
    .option("dateFormat", "yyyy-MM-dd") \
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") \
    .json("output/employees_dates_custom")

Without explicit formatting, dates are written as "2024-01-15" and timestamps in ISO 8601 form, e.g. "2024-01-15T10:30:45.123Z" (the offset depends on the session timezone).

Converting to JSON Strings

The toJSON() method returns an RDD of JSON strings rather than writing to disk—useful for further processing or custom output logic.

# Get RDD of JSON strings
json_rdd = df.toJSON()

# Collect to driver (small datasets only)
json_list = json_rdd.collect()
for json_str in json_list:
    print(json_str)

# Write with custom logic
json_rdd.coalesce(1).saveAsTextFile("output/custom_json")

This approach provides flexibility but requires manual file management and loses DataFrame write optimizations.

Performance Optimization

Optimize JSON writes by controlling parallelism and file sizes.

# Check current partition count
print(f"Partitions: {df.rdd.getNumPartitions()}")

# Optimize partition count based on data size
# Rule of thumb: 128MB - 256MB per partition
df_optimized = df.repartition(4)
df_optimized.write.json("output/employees_optimized")

# For very large datasets with partitioning
df.repartition("department") \
  .write \
  .partitionBy("department") \
  .json("output/employees_repartitioned")

Repartitioning by the same column used in partitionBy() routes all rows for a given partition value into the same task, so each output directory receives a single file rather than one file per task. The extra shuffle usually pays for itself by avoiding a flood of small files.

Handling Special Characters

Configure how PySpark handles special characters and null values in JSON output.

# Data with special characters
special_data = [
    ("John \"Johnny\" Doe", None),
    ("Sarah O'Brien", "Engineering")
]

df_special = spark.createDataFrame(special_data, ["name", "dept"])

# Write with null handling
df_special.write \
    .option("ignoreNullFields", "false") \
    .json("output/employees_nulls")

# ignoreNullFields=true omits null fields from output
# ignoreNullFields=false includes them as {"field": null}

PySpark automatically escapes quotes and special characters according to the JSON specification; no additional configuration is needed for standard special character handling.

Complete Example with Best Practices

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_date

spark = SparkSession.builder \
    .appName("JSONWriteOptimized") \
    .config("spark.sql.files.maxRecordsPerFile", 100000) \
    .getOrCreate()

# Read source data
df = spark.read.parquet("input/large_dataset")

# Filter and transform
df_filtered = df.filter(col("status") == "active") \
                .withColumn("export_date", current_date())

# Optimize partitions (target 128MB per file)
df_optimized = df_filtered.repartition(20, "region")

# Write with compression and partitioning
df_optimized.write \
    .mode("overwrite") \
    .option("compression", "gzip") \
    .option("dateFormat", "yyyy-MM-dd") \
    .partitionBy("region", "export_date") \
    .json("output/active_users")

spark.stop()

This pattern handles large-scale JSON exports efficiently, balancing file count, size, and write performance. The maxRecordsPerFile configuration prevents individual files from growing too large, even with reduced partition counts.
