How to Write to CSV in Polars
Key Insights
- Polars provides write_csv() for eager DataFrames and sink_csv() for memory-efficient streaming writes with LazyFrames—choose based on your data size
- Customizing CSV output through parameters like separator, null_value, and datetime_format gives you precise control over the output format without post-processing
- Writing directly to cloud storage (S3, GCS) requires only the fsspec library and the appropriate cloud SDK—no code changes needed beyond the path
Why CSV Export in Polars Matters
Polars has rapidly become the go-to DataFrame library for Python developers who need speed. Built in Rust with a lazy evaluation engine, it consistently outperforms pandas by 10-100x on common operations. But raw speed means nothing if you can’t get your data out efficiently.
CSV remains the lingua franca of data exchange. Whether you’re feeding data to a legacy system, sharing results with non-technical stakeholders, or creating portable backups, you’ll need to write CSVs. Polars makes this straightforward while maintaining its performance edge.
Let’s dive into the practical details of writing CSV files with Polars.
Basic CSV Writing with write_csv()
The write_csv() method handles most CSV export needs. It’s available on any eager DataFrame and works exactly as you’d expect.
import polars as pl
# Create a sample DataFrame
df = pl.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "score": [85.5, 92.3, 78.1, 95.7, 88.4],
    "active": [True, True, False, True, False]
})
# Write to CSV - that's it
df.write_csv("users.csv")
This produces a standard CSV with headers:
user_id,name,score,active
1,Alice,85.5,true
2,Bob,92.3,true
3,Charlie,78.1,false
4,Diana,95.7,true
5,Eve,88.4,false
Notice that Polars writes boolean values as lowercase true/false. This differs from pandas’ True/False output, which can matter for downstream systems expecting specific formats.
The method returns None and writes directly to disk. For small to medium datasets (up to a few million rows), this approach works perfectly.
Customizing CSV Output
Real-world CSV requirements rarely match defaults. European systems often expect semicolons as delimiters. Legacy databases might need specific null representations. Polars handles all of this through write_csv() parameters.
import polars as pl
from datetime import datetime, date
df = pl.DataFrame({
    "product_id": [101, 102, 103, 104],
    "product_name": ["Widget A", "Gadget B", None, "Tool D"],
    "price": [29.99, None, 15.50, 42.00],
    "created_at": [
        datetime(2024, 1, 15, 10, 30),
        datetime(2024, 2, 20, 14, 45),
        datetime(2024, 3, 10, 9, 0),
        datetime(2024, 4, 5, 16, 15)
    ],
    "release_date": [
        date(2024, 2, 1),
        date(2024, 3, 15),
        date(2024, 4, 1),
        date(2024, 5, 10)
    ]
})
# European-style CSV with custom null handling
df.write_csv(
    "products_eu.csv",
    separator=";",
    null_value="N/A",
    datetime_format="%d/%m/%Y %H:%M",
    date_format="%d.%m.%Y",
    include_header=True,
    quote_style="necessary"  # Only quote when needed
)
Output:
product_id;product_name;price;created_at;release_date
101;Widget A;29.99;15/01/2024 10:30;01.02.2024
102;Gadget B;N/A;20/02/2024 14:45;15.03.2024
103;N/A;15.5;10/03/2024 09:00;01.04.2024
104;Tool D;42.0;05/04/2024 16:15;10.05.2024
Here’s a breakdown of the key parameters:
| Parameter | Purpose | Default |
|---|---|---|
| separator | Field delimiter | , |
| null_value | String representation of nulls | empty string |
| datetime_format | Format string for datetime columns | ISO 8601 |
| date_format | Format string for date columns | ISO 8601 |
| include_header | Write column names as first row | True |
| quote_style | When to quote fields | "necessary" |
| quote_char | Character for quoting | " |
For headerless output (common when appending to existing files):
df.write_csv("products_no_header.csv", include_header=False)
Writing to Different Destinations
Sometimes you don’t want a file on disk. You might need to return CSV data from an API, store it in a database blob, or write directly to cloud storage.
Writing to a Buffer
For in-memory operations, write to a StringIO or BytesIO buffer:
import polars as pl
from io import StringIO, BytesIO
df = pl.DataFrame({
    "id": [1, 2, 3],
    "value": ["a", "b", "c"]
})
# Write to StringIO for string operations
buffer = StringIO()
df.write_csv(buffer)
csv_string = buffer.getvalue()
print(csv_string)
# Write to BytesIO for binary operations (useful for HTTP responses)
byte_buffer = BytesIO()
df.write_csv(byte_buffer)
csv_bytes = byte_buffer.getvalue()
This pattern works well in web frameworks when you need to return CSV data:
from io import BytesIO

from fastapi import FastAPI, Response

app = FastAPI()

@app.get("/export")
def export_data():
    df = get_data_as_dataframe()  # your data-loading function
    buffer = BytesIO()
    df.write_csv(buffer)
    return Response(
        content=buffer.getvalue(),
        media_type="text/csv",
        headers={"Content-Disposition": "attachment; filename=export.csv"}
    )
Writing to Cloud Storage
Polars integrates with fsspec for cloud storage. Install the required backends:
pip install s3fs # For AWS S3
pip install gcsfs # For Google Cloud Storage
pip install adlfs # For Azure Blob Storage
Then write using cloud URIs:
import polars as pl
df = pl.DataFrame({
    "transaction_id": [1001, 1002, 1003],
    "amount": [150.00, 275.50, 89.99]
})
# Write to S3
df.write_csv("s3://my-bucket/exports/transactions.csv")
# Write to GCS
df.write_csv("gs://my-bucket/exports/transactions.csv")
# Write to Azure Blob Storage
df.write_csv("abfs://container@account.dfs.core.windows.net/transactions.csv")
Authentication uses your environment’s default credentials (AWS credentials file, GOOGLE_APPLICATION_CREDENTIALS, Azure CLI login). For explicit credentials, use storage options:
df.write_csv(
    "s3://my-bucket/data.csv",
    storage_options={
        "key": "your-access-key",
        "secret": "your-secret-key",
        "endpoint_url": "https://custom-endpoint.com"  # For MinIO, etc.
    }
)
Handling Large Datasets with LazyFrames
Here’s where Polars truly shines. For datasets that push memory limits, sink_csv() streams data directly to disk without loading everything into memory.
The standard approach loads all data before writing:
# Memory-intensive: loads entire result into RAM
(
    pl.scan_parquet("huge_dataset/*.parquet")
    .filter(pl.col("status") == "active")
    .select(["id", "name", "value"])
    .collect()  # Everything loads here
    .write_csv("filtered_output.csv")
)
With sink_csv(), Polars streams results directly:
# Memory-efficient: streams data to disk
(
    pl.scan_parquet("huge_dataset/*.parquet")
    .filter(pl.col("status") == "active")
    .select(["id", "name", "value"])
    .sink_csv("filtered_output.csv")  # Streams without full materialization
)
The difference matters enormously at scale. I’ve processed 50GB datasets on machines with 8GB RAM using sink_csv(). The same operation with collect().write_csv() would crash.
sink_csv() supports most of the same parameters as write_csv():
(
    pl.scan_csv("raw_data/*.csv")
    .filter(pl.col("amount") > 100)
    .with_columns(
        pl.col("timestamp").str.to_datetime()
    )
    .sink_csv(
        "processed_output.csv",
        separator="\t",
        null_value="NULL",
        datetime_format="%Y-%m-%d %H:%M:%S"
    )
)
One caveat: sink_csv() doesn’t support all query operations. Operations requiring global state (like sort() on the full dataset or certain window functions) force Polars to materialize intermediate results. The engine handles this automatically, but peak memory usage may increase.
Common Gotchas and Best Practices
Float Precision
Polars writes floats with full precision by default. This can create unexpectedly large files:
df = pl.DataFrame({"value": [1/3, 2/3, 1/7]})
df.write_csv("precise.csv")
# Output (one row per value):
# value
# 0.3333333333333333
# 0.6666666666666666
# 0.14285714285714285
Round before export if you need cleaner output:
df.with_columns(pl.col("value").round(4)).write_csv("rounded.csv")
# Output (one row per value): 0.3333, 0.6667, 0.1429
Datetime Handling
Always specify datetime_format when interoperability matters:
from datetime import datetime
df = pl.DataFrame({
    "event_time": [datetime(2024, 6, 15, 14, 30, 45, 123456)]
})
# Default ISO format with microseconds
df.write_csv("default.csv")
# Output: 2024-06-15T14:30:45.123456
# MySQL-compatible format
df.write_csv("mysql.csv", datetime_format="%Y-%m-%d %H:%M:%S")
# Output: 2024-06-15 14:30:45
Special Characters and Encoding
Polars writes UTF-8 by default. For systems requiring different encodings, write to a buffer and encode:
buffer = BytesIO()
df.write_csv(buffer)
csv_content = buffer.getvalue().decode('utf-8').encode('latin-1')
with open("latin1_output.csv", "wb") as f:
    f.write(csv_content)
For fields containing delimiters or quotes, Polars automatically handles quoting:
df = pl.DataFrame({
    "description": ['Contains, comma', 'Has "quotes"', 'Normal text']
})
df.write_csv("special_chars.csv")
# Output quotes fields containing delimiters and doubles embedded quotes:
# description
# "Contains, comma"
# "Has ""quotes"""
# Normal text
Performance Tips
- Avoid unnecessary collects: Chain operations in lazy mode and call sink_csv() once
- Write partitioned files: For very large outputs, write multiple smaller files
- Use appropriate compression: Polars doesn’t compress CSV directly, but you can write to .csv.gz with some additional handling
Conclusion
Polars gives you two primary methods for CSV export: write_csv() for eager DataFrames and sink_csv() for streaming LazyFrame results.
Use write_csv() when your data fits comfortably in memory and you need maximum parameter flexibility. Use sink_csv() when processing large datasets where memory efficiency matters more than convenience.
Both methods support custom delimiters, null handling, datetime formatting, and cloud storage destinations. The API is clean and predictable—once you learn the parameters for one method, they transfer directly to the other.
For most workflows, start with write_csv(). When you hit memory limits or need to process files larger than RAM, switch to a lazy pipeline with sink_csv(). That’s the Polars way: start simple, scale when needed.