How to Read CSV Files in Polars

Key Insights

  • Polars’ read_csv() function is significantly faster than pandas and offers fine-grained control over parsing, schema inference, and memory usage
  • Use scan_csv() for lazy evaluation when working with large files—it pushes filters and column selections down to the read operation, minimizing memory consumption
  • Explicit schema definition prevents type inference overhead and ensures consistent data types across different CSV files

Polars has rapidly become the go-to DataFrame library for Python developers who need speed without sacrificing usability. Built in Rust with a Python API, it consistently outperforms pandas on CSV operations—often by 5-10x on large files. But raw speed isn’t the only reason to switch. Polars provides a cleaner API, better memory efficiency, and lazy evaluation capabilities that fundamentally change how you approach data loading.

This guide covers everything you need to read CSV files effectively in Polars, from basic operations to production-ready patterns for handling massive datasets.

Basic CSV Reading with pl.read_csv()

The simplest way to load a CSV file into a Polars DataFrame mirrors what you’d expect from pandas:

import polars as pl

df = pl.read_csv("sales_data.csv")
print(df)

This returns an eager DataFrame—the entire file is read into memory immediately. For a typical CSV with headers in the first row and comma delimiters, this single line handles everything.

# View the schema Polars inferred
print(df.schema)
# Output: {'date': String, 'product': String, 'quantity': Int64, 'price': Float64}

# Check dimensions
print(f"Rows: {df.height}, Columns: {df.width}")

Polars automatically infers column types by sampling rows from the file. By default, it examines the first 100 rows to determine types, which works well for most datasets but can cause issues with heterogeneous data—more on that later.

Handling Common CSV Variations

Real-world CSV files rarely follow the ideal format. European datasets often use semicolons as delimiters, legacy systems export files without headers, and different applications represent null values in countless ways.

The read_csv() function provides parameters for all these scenarios:

# Semicolon-delimited file without headers
df = pl.read_csv(
    "european_data.csv",
    separator=";",
    has_header=False,
    new_columns=["id", "name", "value", "timestamp"]
)

# Skip metadata rows at the top of the file
df = pl.read_csv(
    "report_export.csv",
    skip_rows=3,  # Skip first 3 rows before header
    skip_rows_after_header=1  # Skip row immediately after header
)

# Handle multiple null representations
df = pl.read_csv(
    "legacy_data.csv",
    null_values=["NA", "N/A", "", "NULL", "-999"],
    ignore_errors=True  # Don't fail on malformed rows
)

Encoding issues frequently cause headaches when reading CSV files from different sources. Polars defaults to UTF-8, but you can specify alternatives:

# Read a file with Latin-1 encoding
df = pl.read_csv(
    "old_database_export.csv",
    encoding="latin1"
)

# Handle files with BOM (Byte Order Mark)
df = pl.read_csv(
    "windows_export.csv",
    encoding="utf-8-sig"
)

The comment_prefix parameter lets you skip lines that begin with specific characters—useful for CSV files that include metadata or documentation:

df = pl.read_csv(
    "annotated_data.csv",
    comment_prefix="#"  # Skip lines starting with #
)

Schema Control and Data Types

Automatic type inference is convenient but comes with costs. Polars must scan rows to determine types, which adds overhead. Worse, inference can produce inconsistent results if your data contains edge cases—a column that looks like integers in the first 100 rows might contain floats later.

Explicit schema definition solves both problems:

# Define schema explicitly
schema = {
    "order_id": pl.Int64,
    "customer_id": pl.String,
    "order_date": pl.Date,
    "total": pl.Float64,
    "is_fulfilled": pl.Boolean
}

df = pl.read_csv("orders.csv", schema=schema)

When you provide a schema, Polars skips inference entirely and parses values directly into the specified types. This is faster and guarantees consistent types across runs.

For partial control, override specific columns while letting Polars infer the rest. In current Polars releases the parameter is schema_overrides (older versions called it dtypes):

# Override only specific columns
df = pl.read_csv(
    "mixed_data.csv",
    schema_overrides={"zip_code": pl.String, "phone": pl.String}
)

This pattern is particularly useful for columns that look numeric but should remain strings—ZIP codes, phone numbers, and product codes are classic examples.

If you trust inference but want more accuracy, increase the sample size:

# Sample more rows for type inference
df = pl.read_csv(
    "variable_data.csv",
    infer_schema_length=10000  # Sample first 10,000 rows
)

# Or scan the entire file (slower but most accurate)
df = pl.read_csv(
    "critical_data.csv",
    infer_schema_length=None  # Scan all rows
)

Date parsing deserves special attention. Polars can attempt to parse dates automatically during the read, or you can cast with an explicit format after loading:

df = pl.read_csv(
    "events.csv",
    try_parse_dates=True  # Attempt automatic date parsing
)

# For specific formats, cast after reading
df = pl.read_csv("events.csv").with_columns(
    pl.col("event_date").str.strptime(pl.Date, "%Y-%m-%d"),
    pl.col("event_time").str.strptime(pl.Time, "%H:%M:%S")
)

Reading Large Files Efficiently

When working with files that don’t fit comfortably in memory, or when you only need a subset of the data, Polars provides several optimization strategies.

Select only the columns you need:

# Read only specific columns
df = pl.read_csv(
    "wide_table.csv",
    columns=["user_id", "event_type", "timestamp"]
)

# Or by index position
df = pl.read_csv(
    "wide_table.csv",
    columns=[0, 3, 7]  # First, fourth, and eighth columns
)

This dramatically reduces memory usage and speeds up parsing—Polars doesn’t waste time parsing columns you’ll discard anyway.

For exploratory work, limit the number of rows:

# Read first 1000 rows for exploration
sample_df = pl.read_csv(
    "massive_dataset.csv",
    n_rows=1000
)

# Combine with column selection for minimal memory footprint
sample_df = pl.read_csv(
    "massive_dataset.csv",
    columns=["id", "value"],
    n_rows=1000
)

Polars also supports reading CSV files in chunks using the batch_size parameter with the streaming reader:

# Process file in batches
reader = pl.read_csv_batched("huge_file.csv", batch_size=100_000)

results = []
while True:
    batches = reader.next_batches(1)
    if not batches:  # None (or an empty list) once the file is exhausted
        break
    # Process each batch
    processed = batches[0].filter(pl.col("status") == "active")
    results.append(processed)

final_df = pl.concat(results)

This approach lets you process files larger than available memory by handling chunks sequentially.

Lazy Reading with scan_csv()

The real power of Polars emerges with lazy evaluation. Instead of read_csv(), use scan_csv() to create a lazy frame that defers execution until you explicitly request results:

# Create a lazy frame (no data loaded yet)
lf = pl.scan_csv("transactions.csv")

# Build a query
query = (
    lf
    .filter(pl.col("amount") > 1000)
    .filter(pl.col("date") >= "2024-01-01")
    .select(["transaction_id", "customer_id", "amount"])
    .group_by("customer_id")
    .agg(pl.col("amount").sum().alias("total_amount"))
)

# Execute and collect results
result = query.collect()

The magic here is query optimization. Polars analyzes your entire query before execution and pushes operations like filtering and column selection down to the file reading stage. If you only need three columns and rows matching certain criteria, Polars reads only those columns and skips rows that fail the filter—without loading the entire file into memory.

You can inspect the optimized query plan:

print(query.explain())

This shows exactly what Polars will do, including predicate pushdown and projection pushdown optimizations.

For large files, scan_csv() with lazy evaluation often uses a fraction of the memory that read_csv() would require:

# Memory-efficient aggregation on a large file
result = (
    pl.scan_csv("10gb_logfile.csv")
    .filter(pl.col("level") == "ERROR")
    .group_by("service")
    .agg(pl.len().alias("error_count"))  # pl.len() supersedes the deprecated pl.count()
    .sort("error_count", descending=True)
    .head(10)
    .collect()
)

This query processes a 10GB file while keeping memory usage minimal—only the filtered, aggregated results need to fit in memory.

Use read_csv() when you need the entire dataset in memory for multiple operations. Use scan_csv() when you’re filtering, aggregating, or selecting subsets—especially with large files.

Conclusion

Polars provides a comprehensive toolkit for CSV operations that scales from quick exploration to production data pipelines. Start with read_csv() for simple cases, add schema definitions for consistency and performance, and graduate to scan_csv() when file sizes grow.

The key patterns to remember: specify schemas explicitly for production code, use column selection to minimize memory usage, and leverage lazy evaluation with scan_csv() for large files. These practices will serve you well as your data grows from megabytes to gigabytes.

For advanced options like reading from cloud storage, handling compressed files, or parallel reading of multiple CSVs, consult the Polars documentation—the library continues to evolve rapidly with new capabilities appearing in each release.
