Spark Scala - Write DataFrame to CSV/Parquet/JSON

Key Insights

  • Parquet should be your default choice for Spark workloads—it offers columnar storage, built-in compression, and schema preservation that CSV and JSON simply cannot match.
  • Use coalesce(1) or repartition(1) before writing when you need a single output file, but understand this forces all data through a single task and eliminates write parallelism.
  • Save modes (Overwrite, Append, ErrorIfExists, Ignore) prevent accidental data loss—always set them explicitly rather than relying on defaults.

Introduction

Every Spark job eventually needs to persist data somewhere. Whether you’re building ETL pipelines, generating reports, or feeding downstream systems, choosing the right output format matters more than most developers realize.

CSV works when you need human-readable files or compatibility with legacy systems. Parquet delivers performance gains that compound across your entire data platform. JSON fits API integrations and semi-structured data scenarios. Each format has trade-offs, and understanding them saves you from rewriting pipelines later.

This guide covers the practical mechanics of writing DataFrames to all three formats in Spark Scala, with real code you can adapt for production use.

Setting Up the Environment

Before writing data, you need a SparkSession and some data to work with. Here’s a complete setup:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SaveMode

// Initialize SparkSession
val spark = SparkSession.builder()
  .appName("DataFrame Write Examples")
  .master("local[*]")  // Remove this for cluster deployment
  .config("spark.sql.parquet.compression.codec", "snappy")
  .getOrCreate()

import spark.implicits._

// Define a case class for type-safe DataFrame creation
case class Employee(
  id: Int,
  name: String,
  department: String,
  salary: Double,
  hireDate: String
)

// Create sample data
val employees = Seq(
  Employee(1, "Alice Chen", "Engineering", 95000.0, "2021-03-15"),
  Employee(2, "Bob Smith", "Marketing", 72000.0, "2020-07-22"),
  Employee(3, "Carol Jones", "Engineering", 105000.0, "2019-11-08"),
  Employee(4, "David Lee", "Sales", 68000.0, "2022-01-10"),
  Employee(5, "Eva Martinez", "Engineering", 88000.0, "2021-09-01")
)

val df = employees.toDF()
df.show()

This gives you a typed DataFrame with mixed data types—integers, strings, and doubles—which helps demonstrate how each format handles schema information differently.

Writing DataFrames to CSV

CSV remains the universal interchange format. Every tool reads it, every analyst understands it, and every legacy system accepts it. The trade-off is that you lose type information and pay a performance penalty on large datasets.

Basic CSV Write

// Simple CSV write with headers
df.write
  .option("header", "true")
  .csv("/output/employees_csv")

This creates a directory with multiple part files. Spark writes in parallel by default, so you’ll see files like part-00000-*.csv, part-00001-*.csv, and so on.

CSV with Full Options

df.write
  .mode(SaveMode.Overwrite)
  .option("header", "true")
  .option("delimiter", ",")
  .option("quote", "\"")
  .option("escape", "\\")
  .option("nullValue", "NULL")
  .option("emptyValue", "")
  .option("compression", "gzip")
  .option("dateFormat", "yyyy-MM-dd")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
  .csv("/output/employees_csv_full")

Key options explained:

  • header: Include column names as the first row
  • delimiter: Field separator (use \t for TSV)
  • nullValue: String representation for null values
  • compression: Supports none, gzip, bzip2, lz4, snappy
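To make the quote and escape options concrete, here is a small pure-Scala sketch of the quoting rule they control (illustrative only—Spark's CSV writer applies this logic internally; `csvField` is a hypothetical helper, not part of the Spark API):

```scala
// Hypothetical helper mirroring the effect of the quote/escape options:
// a field is wrapped in quotes when it contains the delimiter, the quote
// character, or a newline, and embedded quote characters are escaped.
def csvField(value: String,
             delimiter: String = ",",
             quote: String = "\"",
             escape: String = "\\"): String =
  if (value.contains(delimiter) || value.contains(quote) || value.contains("\n"))
    quote + value.replace(quote, escape + quote) + quote
  else
    value

csvField("Alice Chen")   // unquoted: Alice Chen
csvField("Sales, EMEA")  // quoted:   "Sales, EMEA"
```

Changing the `quote` or `escape` option in the writer changes exactly this per-field behavior.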

Single File Output

When downstream systems expect one file, not a directory of parts:

df.coalesce(1)
  .write
  .mode(SaveMode.Overwrite)
  .option("header", "true")
  .csv("/output/employees_single")

// Rename the part file if needed (post-processing step)

Warning: coalesce(1) funnels all data through a single partition. For large datasets, this destroys parallelism and can cause out-of-memory errors. Use it only for small outputs or when you genuinely need a single file.
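If you do need that single named file, the usual follow-up is to locate the part file and rename it. A minimal sketch for the local filesystem (on HDFS or S3 you would use Hadoop's FileSystem API instead; `promotePartFile` is a hypothetical helper):

```scala
import java.nio.file.{Files, Paths, StandardCopyOption}
import scala.jdk.CollectionConverters._

// Finds the single part-*.csv file in a Spark output directory
// (skipping _SUCCESS and .crc files) and moves it to a stable name.
def promotePartFile(dir: String, target: String): Unit = {
  val partFile = Files.list(Paths.get(dir)).iterator().asScala
    .find { p =>
      val name = p.getFileName.toString
      name.startsWith("part-") && name.endsWith(".csv")
    }
    .getOrElse(sys.error(s"no part file found in $dir"))
  Files.move(partFile, Paths.get(target), StandardCopyOption.REPLACE_EXISTING)
}
```

Run this after the `coalesce(1)` write completes, e.g. `promotePartFile("/output/employees_single", "/output/employees.csv")`.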

Writing DataFrames to Parquet

Parquet is the right choice for most Spark workloads. It’s columnar, compressed by default, and preserves your schema exactly. Reading Parquet back into Spark is dramatically faster than CSV because Spark can push down predicates and skip irrelevant columns entirely.

Basic Parquet Write

df.write
  .mode(SaveMode.Overwrite)
  .parquet("/output/employees_parquet")

That’s it. Parquet handles compression and schema automatically. The simplicity here is a feature—Parquet’s defaults are production-ready.

Parquet with Partitioning

Partitioning organizes data into subdirectories based on column values. This enables partition pruning, where Spark skips entire directories that don’t match your query filters:

df.write
  .mode(SaveMode.Overwrite)
  .partitionBy("department")
  .parquet("/output/employees_partitioned")

// Creates structure:
// /output/employees_partitioned/department=Engineering/
// /output/employees_partitioned/department=Marketing/
// /output/employees_partitioned/department=Sales/

Choose partition columns carefully. Good candidates have low cardinality (tens to hundreds of values) and appear frequently in WHERE clauses. High-cardinality columns like user IDs create millions of tiny files—a performance disaster known as the “small files problem.”
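As a back-of-envelope check before choosing a partition column, the worst-case output file count is roughly the number of write tasks times the number of distinct partition values, since each task can emit one file per value it sees. The figures below are illustrative, not from any specific cluster:

```scala
// Worst case: every write task holds rows for every partition value,
// so each task emits one output file per value it sees.
def worstCaseFiles(writeTasks: Int, distinctPartitionValues: Long): Long =
  writeTasks.toLong * distinctPartitionValues

worstCaseFiles(200, 3)         // 600 files for a department column
worstCaseFiles(200, 1000000L)  // 200,000,000 files for a user-ID column
```

Running this arithmetic before you deploy is far cheaper than cleaning up millions of tiny files afterwards.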

Compression Options

df.write
  .mode(SaveMode.Overwrite)
  .option("compression", "snappy")  // or "gzip", "lz4", "zstd", "none"
  .parquet("/output/employees_compressed")

Snappy offers the best balance of compression ratio and speed for most workloads. Use gzip when storage cost matters more than read performance. ZSTD provides excellent compression with reasonable speed—consider it for archival data.
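For intuition about why columnar data compresses so well—and why codec choice is mostly a ratio-versus-speed trade-off—here is a tiny dependency-free experiment. It gzips a repetitive "column" of values; Parquet compresses column chunks internally rather than whole files, but the effect is analogous:

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream

// Compress a byte array with gzip and report the compressed size.
def gzipSize(data: Array[Byte]): Int = {
  val bos = new ByteArrayOutputStream()
  val gz  = new GZIPOutputStream(bos)
  gz.write(data)
  gz.close()
  bos.size()
}

// A low-cardinality column stored contiguously, as Parquet stores it.
val column = ("Engineering\n" * 1000).getBytes("UTF-8")  // 12,000 bytes
// gzipSize(column) shrinks this by orders of magnitude
```

Low-cardinality columns stored contiguously are exactly what columnar codecs exploit.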

Bucketing for Join Optimization

When you repeatedly join on the same columns, bucketing pre-sorts data to eliminate shuffle operations:

df.write
  .mode(SaveMode.Overwrite)
  .bucketBy(4, "department")
  .sortBy("salary")
  .saveAsTable("employees_bucketed")

Bucketing requires saving as a managed table. It’s powerful but adds complexity—use it when join performance is a proven bottleneck.

Writing DataFrames to JSON

JSON output makes sense when feeding web APIs, working with document databases, or preserving nested structures. Spark writes one JSON object per line by default (JSON Lines format), which is easier to process in parallel than a single JSON array.

Basic JSON Write

df.write
  .mode(SaveMode.Overwrite)
  .json("/output/employees_json")

Each line in the output contains a complete JSON object:

{"id":1,"name":"Alice Chen","department":"Engineering","salary":95000.0,"hireDate":"2021-03-15"}
{"id":2,"name":"Bob Smith","department":"Marketing","salary":72000.0,"hireDate":"2020-07-22"}

JSON with Options

df.write
  .mode(SaveMode.Overwrite)
  .option("compression", "gzip")
  .option("dateFormat", "yyyy-MM-dd")
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSZ")
  .option("ignoreNullFields", "false")  // Include null fields in output
  .json("/output/employees_json_options")

The ignoreNullFields option controls whether null values appear in the output. Set it to false when consuming systems expect consistent field presence.
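The difference is easiest to see on a single record. This sketch mimics what the writer emits for flat string fields (illustrative only—Spark applies this logic internally; `toJsonLine` is a hypothetical helper):

```scala
// Renders one record as a JSON Line, dropping or keeping null fields
// the way the ignoreNullFields option does.
def toJsonLine(fields: Seq[(String, Option[String])],
               ignoreNullFields: Boolean): String =
  fields.flatMap {
    case (k, Some(v)) => Some("\"" + k + "\":\"" + v + "\"")
    case (k, None)    =>
      if (ignoreNullFields) None else Some("\"" + k + "\":null")
  }.mkString("{", ",", "}")

val rec = Seq("id" -> Some("6"), "name" -> Some("Frank"), "department" -> None)
toJsonLine(rec, ignoreNullFields = true)   // {"id":"6","name":"Frank"}
toJsonLine(rec, ignoreNullFields = false)  // {"id":"6","name":"Frank","department":null}
```

Consumers that validate against a fixed schema generally want the second form.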

Pretty-Printed JSON (Single File)

For human-readable output or configuration files:

df.coalesce(1)
  .write
  .mode(SaveMode.Overwrite)
  .json("/output/employees_pretty")

Note that Spark’s JSON writer has no built-in pretty-print option—it always emits compact JSON Lines. For truly formatted JSON, you’ll need post-processing or a custom solution.
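One lightweight post-processing approach is to wrap the JSON Lines output into a single readable array, one object per line (a dependency-free sketch; for per-object indentation you would reach for a JSON library such as Jackson or circe):

```scala
// Wraps JSON Lines into one JSON array, one object per line.
// The input lines are assumed to each hold a complete JSON object,
// which is what Spark's JSON writer produces.
def jsonLinesToArray(lines: Seq[String]): String =
  lines.filter(_.nonEmpty).mkString("[\n  ", ",\n  ", "\n]")

val lines = Seq(
  """{"id":1,"name":"Alice Chen"}""",
  """{"id":2,"name":"Bob Smith"}"""
)
// println(jsonLinesToArray(lines))
```

The result is valid JSON and readable enough for configuration files or small exports.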

Common Write Options and Save Modes

Save Modes

Every write operation should specify a save mode explicitly:

// Overwrite: Delete existing data, write new data
df.write.mode(SaveMode.Overwrite).parquet("/output/path")

// Append: Add to existing data
df.write.mode(SaveMode.Append).parquet("/output/path")

// ErrorIfExists: Fail if path exists (default behavior)
df.write.mode(SaveMode.ErrorIfExists).parquet("/output/path")

// Ignore: Do nothing if path exists
df.write.mode(SaveMode.Ignore).parquet("/output/path")

Use Overwrite for idempotent pipelines that can safely rerun. Use Append for incremental loads. Avoid ErrorIfExists in production—it makes pipelines fragile.

Controlling File Count

// Reduce file count (use for small outputs)
df.coalesce(4).write.parquet("/output/fewer_files")

// Increase file count with shuffle (use for large outputs)
df.repartition(100).write.parquet("/output/more_files")

// Repartition by column (combines partitioning benefits)
df.repartition($"department").write.partitionBy("department").parquet("/output/dept_files")

The difference: coalesce reduces partitions without shuffle (faster but can create uneven files). repartition shuffles data for even distribution (slower but balanced output).
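That skew is easy to picture with a rough model (not Spark's exact algorithm—coalesce merges whole partitions locally, while repartition redistributes individual rows):

```scala
// coalesce: adjacent partitions are merged without moving rows,
// so output sizes inherit any existing skew.
def coalesceSizes(partitionSizes: Seq[Int], target: Int): Seq[Int] =
  partitionSizes
    .grouped(math.ceil(partitionSizes.size.toDouble / target).toInt)
    .map(_.sum)
    .toSeq

// repartition: rows are shuffled and spread as evenly as possible.
def repartitionSizes(totalRows: Int, target: Int): Seq[Int] = {
  val base  = totalRows / target
  val extra = totalRows % target
  Seq.tabulate(target)(i => base + (if (i < extra) 1 else 0))
}

coalesceSizes(Seq(100, 100, 100, 5, 5), 2)  // skewed: Seq(300, 10)
repartitionSizes(310, 2)                    // even:   Seq(155, 155)
```

When output file sizes matter (for example, to keep Parquet files near a target size), the shuffle cost of repartition is usually worth it.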

Summary and Best Practices

Feature               CSV          Parquet          JSON
Schema Preservation   No           Yes              Partial
Compression           External     Built-in         External
Read Speed            Slow         Fast             Medium
Column Pruning        No           Yes              No
Human Readable        Yes          No               Yes
Nested Data           No           Yes              Yes
Tool Compatibility    Universal    Big data tools   APIs/Web

Recommendations:

  1. Default to Parquet for any data that stays within your Spark ecosystem. The performance benefits compound across every downstream job.

  2. Use CSV only when external systems require it or for small, human-inspected outputs. Always enable headers and consider gzip compression.

  3. Choose JSON for API integrations, configuration data, or when preserving complex nested structures matters more than query performance.

  4. Always set SaveMode explicitly. Silent failures from Ignore or unexpected overwrites cause production incidents.

  5. Partition thoughtfully. Good partitioning accelerates queries by orders of magnitude. Bad partitioning creates small file nightmares that slow everything down.

  6. Test file counts in development. Run your write logic on sample data and verify the output structure before deploying to production.
