How to Write to CSV in PySpark


Key Insights

  • PySpark’s write.csv() method distributes CSV writing across your cluster, but this means you’ll get multiple part files by default—use coalesce(1) when you need a single file, but understand the memory tradeoffs.
  • Always specify header=True and choose your nullValue representation explicitly; implicit defaults cause silent data corruption when downstream systems interpret empty strings differently.
  • Partitioned writes with partitionBy() are your best friend for large datasets, enabling parallel writes and letting downstream readers scan only the partitions they need instead of entire files.

Why PySpark for CSV Output?

CSV remains the lingua franca of data exchange. Despite its limitations—no schema enforcement, no compression by default, verbose storage—it’s universally readable. When you’re processing terabytes of data in PySpark and need to hand off results to analysts, legacy systems, or external partners, CSV is often the required format.

PySpark handles CSV writing differently than pandas or single-machine tools. It distributes the write operation across executors, which means faster writes for large datasets but also introduces complexity around file management. Understanding these mechanics prevents the “why do I have 200 part files?” confusion that trips up newcomers.

Basic CSV Writing with write.csv()

The simplest approach uses the DataFrame’s write.csv() method:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

spark = SparkSession.builder.appName("CSVWriteExample").getOrCreate()

# Create a sample DataFrame
data = [
    ("Alice", 28, 75000.50),
    ("Bob", 35, 82000.00),
    ("Charlie", 42, 95000.75),
    ("Diana", 31, 68000.25)
]

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("salary", DoubleType(), True)
])

df = spark.createDataFrame(data, schema)

# Basic CSV write
df.write.csv("output/employees")

This creates a directory called output/employees containing multiple part files (like part-00000-*.csv, part-00001-*.csv). Each Spark partition writes to a separate file. The directory also includes a _SUCCESS marker file indicating the write completed.

Essential Write Options

The basic write rarely suffices for production use. You’ll need to configure options for downstream compatibility:

df.write.csv(
    "output/employees_configured",
    header=True,           # Include column names as first row
    sep=",",               # Field delimiter (default is comma)
    quote='"',             # Character for quoting fields
    escape="\\",           # Escape character within quoted fields
    nullValue="NULL",      # String representation of null values
    emptyValue="",         # String representation of empty strings
    encoding="UTF-8",      # Character encoding
    dateFormat="yyyy-MM-dd",
    timestampFormat="yyyy-MM-dd HH:mm:ss"
)

Let’s break down when each option matters:

header: Almost always set to True. Without it, consumers must know column order beforehand—a recipe for bugs when schemas evolve.

sep (delimiter): Use | or \t when your data contains commas. Tab-delimited files are particularly robust for text-heavy datasets.

nullValue: Critical for data integrity. If you leave this as empty string (default), you can’t distinguish between “this field is null” and “this field is an empty string.” Choose something unambiguous like NULL, \N, or NA.

quote and escape: Handle fields containing your delimiter or newlines. The defaults work for most cases, but verify when processing user-generated content.

Here’s a more realistic example with pipe-delimited output:

# Data with potential CSV hazards
problematic_data = [
    ("O'Brien", "Sales, Marketing", 55000.00, None),
    ("Smith", "Engineering", 72000.00, "Senior"),
    ("Lee", 'Said "Hello"', 48000.00, "Junior")
]

schema = StructType([
    StructField("name", StringType(), True),
    StructField("department", StringType(), True),
    StructField("salary", DoubleType(), True),
    StructField("level", StringType(), True)
])

df_problematic = spark.createDataFrame(problematic_data, schema)

df_problematic.write.csv(
    "output/employees_safe",
    header=True,
    sep="|",
    nullValue="\\N",
    quote='"',
    escape="\\"
)

Controlling Output Files

PySpark’s distributed nature means one file per partition. For a DataFrame with 200 partitions, you get 200 files. This is often undesirable when sharing data externally.

Using coalesce() for fewer files:

# Write to a single CSV file
df.coalesce(1).write.csv(
    "output/employees_single",
    header=True,
    mode="overwrite"
)

coalesce(1) reduces partitions to one without a full shuffle—it combines partitions on existing executors. However, this forces all data through a single executor, which can cause memory issues with large datasets.

Using repartition() for controlled distribution:

# Write to exactly 4 files with even data distribution
df.repartition(4).write.csv(
    "output/employees_four_parts",
    header=True,
    mode="overwrite"
)

repartition() performs a full shuffle, distributing data evenly. Use it when you need balanced file sizes, not just fewer files.

Practical guidance: For files under 100MB total, coalesce(1) works fine. For larger datasets, target file sizes of 100MB-1GB each by calculating the appropriate partition count:

# Estimate dataset size (rough guess: ~1KB per row) and target ~500MB files
estimated_size_mb = df.count() * 0.001
target_file_size_mb = 500
num_files = max(1, int(estimated_size_mb / target_file_size_mb))

df.repartition(num_files).write.csv("output/sized_output", header=True)

Write Modes: Overwrite, Append, and Error Handling

The mode parameter controls behavior when the output path exists:

# Overwrite: Delete existing data and write new
df.write.csv("output/data", mode="overwrite", header=True)

# Append: Add to existing files (careful with headers!)
df.write.csv("output/data", mode="append", header=True)

# Ignore: Silently skip if path exists
df.write.csv("output/data", mode="ignore", header=True)

# Error (default): Throw exception if path exists
df.write.csv("output/data", mode="error", header=True)

Append mode warning: When appending with header=True, each append operation adds a new header row to its part files. This creates malformed CSVs if you later concatenate them. For append workflows, either omit headers or handle them in post-processing.

A safer append pattern:

from datetime import datetime

# Use timestamped directories instead of true append
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
df.write.csv(f"output/data/batch_{timestamp}", header=True, mode="error")

Partitioned Writes for Large Datasets

For datasets that will be queried by specific columns (date, region, category), partitioned writes dramatically improve downstream performance:

# Sample data with date and region
sales_data = [
    ("2024-01-15", "North", "Widget", 150.00),
    ("2024-01-15", "South", "Gadget", 200.00),
    ("2024-01-16", "North", "Widget", 175.00),
    ("2024-01-16", "East", "Gadget", 225.00),
    ("2024-02-01", "North", "Widget", 180.00),
]

schema = StructType([
    StructField("sale_date", StringType(), True),
    StructField("region", StringType(), True),
    StructField("product", StringType(), True),
    StructField("amount", DoubleType(), True)
])

df_sales = spark.createDataFrame(sales_data, schema)

# Partition by date and region
df_sales.write.partitionBy("sale_date", "region").csv(
    "output/sales_partitioned",
    header=True,
    mode="overwrite"
)

This creates a directory structure like:

output/sales_partitioned/
├── sale_date=2024-01-15/
│   ├── region=North/
│   │   └── part-00000-*.csv
│   └── region=South/
│       └── part-00000-*.csv
├── sale_date=2024-01-16/
│   ├── region=North/
│   │   └── part-00000-*.csv
│   └── region=East/
│       └── part-00000-*.csv

Note that partition columns are removed from the CSV files themselves—they’re encoded in the directory path. When reading back with PySpark, the partition columns are automatically reconstructed.

Common Issues and Best Practices

Problem: Part files scattered everywhere

Use Hadoop’s getmerge or a post-processing step to combine files:

import glob
import os

# After writing a single part file...
df.coalesce(1).write.csv("output/temp", header=True, mode="overwrite")

# ...rename it to a clean name (in driver code).
# This is filesystem-dependent; shown for the local filesystem
output_dir = "output/temp"
part_files = glob.glob(f"{output_dir}/part-*.csv")
if part_files:
    os.rename(part_files[0], "output/final_output.csv")
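If coalesce(1) is too heavy for the dataset, the parts can instead be concatenated on the driver after a normal multi-file write. A minimal local-filesystem sketch, where the part files are fabricated stand-ins for Spark's output:

```python
import glob
import os
import shutil

# Stand-in part files for demonstration (hypothetical paths and content)
os.makedirs("output/temp_parts", exist_ok=True)
for i, chunk in enumerate(["a,1\n", "b,2\n"]):
    with open(f"output/temp_parts/part-0000{i}-demo.csv", "w") as f:
        f.write(chunk)

# Concatenate every part into one file, in sorted order.
# Fine for headerless writes; with header=True you would need to
# drop the header line from every part after the first.
with open("output/merged_output.csv", "wb") as out:
    for part in sorted(glob.glob("output/temp_parts/part-*.csv")):
        with open(part, "rb") as src:
            shutil.copyfileobj(src, out)
```

On HDFS the equivalent is hadoop fs -getmerge; on cloud storage, a server-side compose or copy operation avoids pulling data through the driver.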

Problem: Schema drift between writes

Enforce schema consistency:

expected_columns = ["name", "age", "salary"]
df_validated = df.select(expected_columns)  # Fails fast if columns missing
df_validated.write.csv("output/validated", header=True)

Problem: Special characters corrupting output

Test with edge cases before production:

# Validate your configuration handles problematic data
test_data = [
    ("Contains, comma", 'Has "quotes"', "Line\nbreak"),
    ("Normal", "Also normal", "Fine"),
]
df_test = spark.createDataFrame(test_data, ["col_a", "col_b", "col_c"])
df_test.write.csv("output/edge_cases", header=True, mode="overwrite")

# Read back and verify integrity; multiLine=True is needed for embedded newlines
df_back = spark.read.csv("output/edge_cases", header=True, multiLine=True)
assert df_back.count() == df_test.count()

Performance tips:

  1. Write to cloud storage (S3, GCS, ADLS) directly rather than local disk when running on clusters
  2. Avoid coalesce(1) for datasets over 1GB—it creates a bottleneck
  3. Use Parquet for intermediate storage, CSV only for final delivery
  4. Set spark.sql.shuffle.partitions appropriately before operations that precede writes:

# For a 10GB dataset targeting ~500MB files: 10,000 MB / 500 MB = 20 partitions
spark.conf.set("spark.sql.shuffle.partitions", 20)

CSV writing in PySpark is straightforward once you understand the distributed model. Specify your options explicitly, control your file count intentionally, and choose write modes that match your data pipeline’s semantics. The defaults optimize for distributed performance, not human convenience—override them accordingly.
