PySpark - Convert DataFrame to CSV
Key Insights
- PySpark writes CSV files as partitioned directories by default, not single files—use coalesce(1) to generate a single CSV when needed, but understand the performance implications
- The write.csv() method offers critical options like header, delimiter, mode, and nullValue that directly impact data integrity and compatibility with downstream systems
- For datasets under 10GB that fit in driver memory, converting to Pandas with toPandas().to_csv() often provides simpler syntax and better control, while larger datasets require distributed PySpark writing with compression
Introduction
PySpark DataFrames are the backbone of distributed data processing, but eventually you need to export results for reporting, data sharing, or integration with systems that expect CSV format. Unlike pandas, PySpark operates in a distributed environment where data lives across multiple executors, which fundamentally changes how CSV export works.
CSV export is appropriate for final results, data handoffs to non-Spark systems, and archival purposes. However, CSV is inefficient for intermediate processing—use Parquet or ORC for that. The key challenge with PySpark CSV export is understanding that you’re not writing a single file but coordinating multiple executors to write portions of your data simultaneously.
Let’s create a sample DataFrame we’ll use throughout this article:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType
from datetime import date
spark = SparkSession.builder.appName("CSVExport").getOrCreate()
data = [
    ("John Doe", 28, 75000.50, date(2023, 1, 15), "Engineering"),
    ("Jane Smith", 34, 92000.00, date(2023, 2, 20), "Marketing"),
    ("Bob Johnson", 45, 65000.75, date(2023, 3, 10), "Sales"),
    ("Alice Williams", None, 88000.00, date(2023, 4, 5), "Engineering"),
    ("Charlie Brown", 29, None, date(2023, 5, 12), "HR")
]
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("salary", DoubleType(), True),
    StructField("hire_date", DateType(), True),
    StructField("department", StringType(), True)
])
df = spark.createDataFrame(data, schema)
Basic DataFrame to CSV Conversion
The fundamental method for CSV export is write.csv(). Here’s the most basic usage:
df.write.csv("output/employees")
This creates a directory structure, not a single file. Inside output/employees/, you’ll find:
- part-00000-xxx.csv (and possibly part-00001-xxx.csv, etc.)
- _SUCCESS (a zero-byte marker file indicating successful completion)
The number of part files corresponds to the number of partitions in your DataFrame. If your DataFrame has 200 partitions, you get 200 CSV files. This distributed writing is what makes PySpark fast for large datasets—multiple executors write simultaneously.
By default, PySpark writes CSV files without headers, leaving downstream consumers to guess the column names. Always set the header option in production code.
Essential Write Options
The option() method configures CSV output format. Here are the critical parameters:
df.write \
.option("header", "true") \
.option("delimiter", ",") \
.option("quote", '"') \
.option("escape", "\\") \
.mode("overwrite") \
.csv("output/employees_formatted")
Header: Set to "true" to include column names in the first row. Without it, the column names are lost and anyone reading the file must supply them manually.
Delimiter: Defaults to comma, but you can use pipes (|), tabs (\t), or any character. For tab-delimited files:
df.write.option("header", "true").option("delimiter", "\t").csv("output/employees_tsv")
Mode: Controls behavior when the output path exists:
"overwrite": Deletes existing data and writes fresh (most common)"append": Adds new files to existing directory"ignore": Silently skips if path exists"error": Throws exception if path exists (default)
Quote and Escape: Control how special characters are handled. The default quote character is " and escape is \. These matter when your data contains delimiters or newlines.
Single File Output with Coalesce and Repartition
Most downstream systems expect a single CSV file, not a directory of parts. Use coalesce(1) to merge all partitions into one:
df.coalesce(1) \
.write \
.option("header", "true") \
.mode("overwrite") \
.csv("output/employees_single")
This produces one part-00000-xxx.csv file. The downside? All data funnels through a single task on one executor, creating a bottleneck. For datasets over 1GB, this can be painfully slow.
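The part file keeps Spark's generated name. If a downstream job expects a fixed filename, the file can be renamed after the write. A minimal sketch for local filesystems (promote_single_csv is an illustrative helper, not a Spark API; on HDFS or cloud object stores you would use the Hadoop FileSystem API instead):

```python
import glob
import shutil

def promote_single_csv(output_dir: str, target_path: str) -> str:
    """Move the lone part-*.csv file that coalesce(1) produced
    out of Spark's output directory to a stable filename."""
    parts = glob.glob(f"{output_dir}/part-*.csv")
    if len(parts) != 1:
        raise ValueError(f"expected one part file, found {len(parts)}")
    shutil.move(parts[0], target_path)
    return target_path
```

The _SUCCESS marker is left behind untouched; delete the directory afterwards if you only need the renamed file.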
Coalesce vs Repartition: These are not interchangeable.
# Coalesce: reduces partitions without shuffling (efficient)
df.coalesce(1).write.csv("output/coalesced")
# Repartition: full shuffle, evenly distributes data (slower)
df.repartition(1).write.csv("output/repartitioned")
Use coalesce(1) for final output. Use repartition(n) when you need exactly n evenly-sized files, or repartition by a column to group related rows into the same partition (for a true one-directory-per-department layout, use write.partitionBy("department") instead):
# Group each department's rows into the same partition
df.repartition("department") \
.write \
.option("header", "true") \
.mode("overwrite") \
.csv("output/by_department")
Handling Data Quality Issues
Real-world data has nulls, special characters, and formatting quirks. Configure these explicitly:
df.write \
.option("header", "true") \
.option("nullValue", "NULL") \
.option("emptyValue", "") \
.option("dateFormat", "yyyy-MM-dd") \
.option("timestampFormat", "yyyy-MM-dd HH:mm:ss") \
.option("quoteAll", "false") \
.option("escapeQuotes", "true") \
.mode("overwrite") \
.csv("output/employees_clean")
nullValue: String representation for null values. The default is an empty string, which makes nulls indistinguishable from genuinely empty strings. Use "NULL", "NA", or "\\N" for clarity.
dateFormat: Controls date serialization. The default format may not match your requirements. ISO format (yyyy-MM-dd) is safest for portability.
quoteAll: When "true", wraps every field in quotes. This prevents parsing issues but increases file size. Set to "false" and let PySpark quote only fields containing special characters.
escapeQuotes: When "true", escapes quote characters within quoted fields. Essential when your data contains quotes.
For data with embedded commas or newlines:
complex_data = [
("Smith, John", "Says \"Hello, World\"", "Department: R&D\nBuilding: North")
]
complex_df = spark.createDataFrame(complex_data, ["name", "quote", "multiline"])
complex_df.write \
.option("header", "true") \
.option("quote", '"') \
.option("escape", '"') \
.option("quoteAll", "true") \
.mode("overwrite") \
.csv("output/complex_data")
Converting to Pandas for Local CSV
For smaller datasets (under 10GB), converting to Pandas often simplifies the process:
# Collect to driver and write as single file
pandas_df = df.toPandas()
pandas_df.to_csv("output/employees_pandas.csv", index=False)
This approach gives you:
- Single file output without the coalesce(1) bottleneck
- Pandas’ rich CSV options (compression, chunking, custom formatting)
- Simpler debugging with familiar Pandas syntax
The trade-off: all data must fit in driver memory. Use this for final results after filtering and aggregation, not for raw data dumps.
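When a result is slightly too large to materialize with toPandas() all at once but you still need one local file, rows can be streamed through the driver instead. A sketch using toLocalIterator(), which pulls one partition at a time (stream_to_csv is an illustrative helper, not a built-in):

```python
import csv

def stream_to_csv(df, path):
    """Write a DataFrame to one local CSV without holding the whole
    dataset in driver memory: toLocalIterator() fetches rows one
    partition at a time."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(df.columns)        # header row
        for row in df.toLocalIterator():   # lazy, partition by partition
            writer.writerow(row)
```

This is slower than a distributed write because everything serializes through the driver, but it avoids both the coalesce(1) executor bottleneck and the toPandas() memory spike.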
You can also specify Pandas CSV options:
import csv

pandas_df.to_csv(
    "output/employees_custom.csv",
    index=False,
    sep="|",
    na_rep="NULL",
    date_format="%Y-%m-%d",
    quoting=csv.QUOTE_ALL
)
Performance Best Practices
For large CSV exports, performance matters. Here are optimization strategies:
Compression: Reduce I/O and storage with compression. GZIP offers good compression ratios:
df.write \
.option("header", "true") \
.option("compression", "gzip") \
.mode("overwrite") \
.csv("output/employees_compressed")
This produces .csv.gz files. Other options: "bzip2", "lz4", "snappy". GZIP is widely supported; Snappy is faster but less portable.
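On the read side, Spark picks the codec from the file extension, so compressed output round-trips with no compression option. A sketch (read_compressed_csv is just an illustrative wrapper; inferSchema is an optional addition):

```python
def read_compressed_csv(spark, path):
    """Read gzipped CSV back into a DataFrame. Spark infers the
    codec from the .gz extension, so only the usual header and
    schema options are needed."""
    return (spark.read
            .option("header", "true")
            .option("inferSchema", "true")
            .csv(path))
```

Passing the output directory itself as the path lets Spark read all part files at once.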
Partition Management: Don’t use coalesce(1) for huge datasets. Instead, use a reasonable number of partitions:
# For ~100GB of raw data, a few hundred partitions keeps files near 512MB
df.coalesce(200) \
.write \
.option("header", "true") \
.option("compression", "gzip") \
.mode("overwrite") \
.csv("output/employees_optimized")
Memory Configuration: When a large export follows joins or aggregations, tune shuffle behavior (executor memory itself, e.g. spark.executor.memory, must be set before the SparkSession starts):
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.sql.adaptive.enabled", "true")
Selective Columns: Export only needed columns to reduce data volume:
df.select("name", "department", "salary") \
.write \
.option("header", "true") \
.mode("overwrite") \
.csv("output/employees_subset")
Avoid Multiple Small Files: If you have many small partitions, consolidate before writing:
# Check current partition count
print(f"Current partitions: {df.rdd.getNumPartitions()}")
# Consolidate if too many
if df.rdd.getNumPartitions() > 100:
    df = df.coalesce(20)
For production pipelines, always monitor the size and count of output files. Thousands of tiny CSV files create filesystem overhead and slow down downstream processing. Aim for files between 128MB and 1GB each—adjust coalesce() accordingly based on your total data size.
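That target can be turned into a partition count with simple ceiling division. A sketch (target_partitions is an illustrative helper; sizes are pre-compression estimates):

```python
def target_partitions(total_bytes: int, target_file_mb: int = 256) -> int:
    """Rough partition count so each uncompressed output file lands
    near the 128MB-1GB sweet spot."""
    target = target_file_mb * 1024 * 1024
    return max(1, -(-total_bytes // target))  # ceiling division

# A ~100GB dataset at ~256MB per file:
n = target_partitions(100 * 1024**3)  # 400 partitions
```

Apply the result with df.coalesce(n), or repartition(n) if the existing partitions are badly skewed.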
Remember that CSV is a row-oriented format optimized for human readability, not analytical queries. For data you’ll read back into Spark, use Parquet with compression. Reserve CSV for final exports to external systems or stakeholders who need Excel-compatible formats.