How to Cast Data Types in Polars

Key Insights

  • Polars’ cast() method is explicit and strict by default, catching type conversion errors early rather than silently producing incorrect data, as pandas often does.
  • Choosing the right numeric type (Int8 vs Int64, Float32 vs Float64) can reduce memory usage by 75% or more on large datasets without sacrificing precision for your use case.
  • Always cast data types at schema definition time when possible—it’s faster than converting after loading and makes your data pipeline’s expectations explicit.

Introduction to Type Casting in Polars

Data type casting is one of those operations you’ll perform constantly but rarely think about until something breaks. In Polars, getting your types right matters for two reasons: memory efficiency and computational correctness.

Polars takes a stricter approach to types than pandas. Where pandas will silently coerce your integers to floats when nulls appear, Polars maintains distinct nullable integer types. Where pandas might let you perform string operations on a mixed-type column, Polars will complain loudly. This strictness is a feature, not a bug—it catches data quality issues early.

The core method for type conversion in Polars is cast(). Unlike pandas’ scattered approach (astype(), to_numeric(), to_datetime(), etc.), Polars consolidates most conversions into this single, predictable method.

Basic Casting with cast()

The cast() method lives on expressions and takes a Polars data type as its argument. Here’s the fundamental pattern:

import polars as pl

df = pl.DataFrame({
    "integers": [1, 2, 3, 4, 5],
    "floats": [1.5, 2.7, 3.2, 4.8, 5.1],
    "numeric_strings": ["100", "200", "300", "400", "500"],
    "booleans": [True, False, True, False, True],
})

# Basic type conversions
result = df.select(
    pl.col("integers").cast(pl.Float64).alias("int_to_float"),
    pl.col("floats").cast(pl.Int64).alias("float_to_int"),  # Truncates decimals
    pl.col("numeric_strings").cast(pl.Int64).alias("str_to_int"),
    pl.col("booleans").cast(pl.Int8).alias("bool_to_int"),  # True=1, False=0
)

print(result)

Output:

shape: (5, 4)
┌──────────────┬──────────────┬────────────┬─────────────┐
│ int_to_float ┆ float_to_int ┆ str_to_int ┆ bool_to_int │
│ ---          ┆ ---          ┆ ---        ┆ ---         │
│ f64          ┆ i64          ┆ i64        ┆ i8          │
╞══════════════╪══════════════╪════════════╪═════════════╡
│ 1.0          ┆ 1            ┆ 100        ┆ 1           │
│ 2.0          ┆ 2            ┆ 200        ┆ 0           │
│ 3.0          ┆ 3            ┆ 300        ┆ 1           │
│ 4.0          ┆ 4            ┆ 400        ┆ 0           │
│ 5.0          ┆ 5            ┆ 500        ┆ 1           │
└──────────────┴──────────────┴────────────┴─────────────┘

Notice that float-to-integer conversion truncates toward zero rather than rounding. If you need rounding, apply round() before casting.

Casting to Numeric Types

Polars offers a full range of numeric types with explicit bit widths. Understanding when to use each saves memory and prevents overflow bugs.

Type     Range                 Memory
Int8     -128 to 127           1 byte
Int16    -32,768 to 32,767     2 bytes
Int32    -2.1B to 2.1B         4 bytes
Int64    ±9.2 quintillion      8 bytes
UInt8    0 to 255              1 byte
UInt32   0 to 4.3B             4 bytes
Float32  ~7 decimal digits     4 bytes
Float64  ~15 decimal digits    8 bytes

Here’s how to downcast for memory efficiency:

# Simulating a large dataset with small values
df = pl.DataFrame({
    "user_age": [25, 34, 45, 28, 52] * 1_000_000,  # Values 0-120 fit in UInt8
    "rating": [4.5, 3.2, 5.0, 4.8, 3.9] * 1_000_000,  # Float32 is plenty
    "year": [2020, 2021, 2022, 2023, 2024] * 1_000_000,  # Int16 works
})

print(f"Original memory: {df.estimated_size('mb'):.2f} MB")

df_optimized = df.select(
    pl.col("user_age").cast(pl.UInt8),
    pl.col("rating").cast(pl.Float32),
    pl.col("year").cast(pl.Int16),
)

print(f"Optimized memory: {df_optimized.estimated_size('mb'):.2f} MB")

Output:

Original memory: 114.44 MB
Optimized memory: 33.57 MB

That’s a 70% memory reduction. On datasets with billions of rows, this difference determines whether your data fits in RAM.

String and Categorical Conversions

Categorical types are Polars’ secret weapon for string columns with repeated values. Instead of storing the full string for each row, categoricals store integer indices into a lookup table.

df = pl.DataFrame({
    "country": ["USA", "Canada", "USA", "Mexico", "Canada"] * 100_000,
    "status": ["active", "inactive", "pending", "active", "active"] * 100_000,
    "amount": ["1234.56", "789.01", "456.78", "999.99", "123.45"] * 100_000,
})

print(f"Original size: {df.estimated_size('mb'):.2f} MB")

# Convert strings to categoricals and parse numeric strings
df_converted = df.select(
    pl.col("country").cast(pl.Categorical),
    pl.col("status").cast(pl.Categorical),
    pl.col("amount").cast(pl.Float64),
)

print(f"Converted size: {df_converted.estimated_size('mb'):.2f} MB")

# Convert categorical back to string when needed
df_back = df_converted.select(
    pl.col("country").cast(pl.String),
)

For string-to-numeric parsing with more control, use the str namespace methods:

df = pl.DataFrame({
    "messy_numbers": ["  123  ", "456.789", "1,234", "N/A", "789"],
})

result = df.select(
    # Strip whitespace and thousands separators before converting;
    # "N/A" still fails to parse and becomes null with strict=False
    pl.col("messy_numbers")
    .str.strip_chars()
    .str.replace_all(",", "")
    .cast(pl.Float64, strict=False),
)

Date and Time Type Casting

Temporal types require special handling because dates come in countless string formats. Polars provides dedicated parsing methods in the str namespace:

df = pl.DataFrame({
    "date_iso": ["2024-01-15", "2024-02-20", "2024-03-25"],
    "date_us": ["01/15/2024", "02/20/2024", "03/25/2024"],
    "datetime_str": ["2024-01-15 14:30:00", "2024-02-20 09:15:00", "2024-03-25 18:45:00"],
    "timestamp_ms": [1705312200000, 1708416900000, 1711392300000],
})

result = df.select(
    # ISO format parses automatically
    pl.col("date_iso").str.to_date(),
    
    # Custom format requires explicit pattern
    pl.col("date_us").str.to_date("%m/%d/%Y").alias("date_us_parsed"),
    
    # Datetime parsing
    pl.col("datetime_str").str.to_datetime("%Y-%m-%d %H:%M:%S"),
    
    # Unix timestamp to datetime
    pl.col("timestamp_ms").cast(pl.Datetime("ms")),
)

print(result)

For timezone handling:

df = pl.DataFrame({
    "utc_time": ["2024-01-15T14:30:00Z", "2024-02-20T09:15:00Z"],
})

result = df.select(
    pl.col("utc_time")
    .str.to_datetime("%Y-%m-%dT%H:%M:%SZ")
    .dt.replace_time_zone("UTC")
    .dt.convert_time_zone("America/New_York")
    .alias("eastern_time"),
)

Handling Cast Errors and Edge Cases

By default, cast() is strict—it raises an error when conversion fails. Use strict=False to convert failures to null instead:

df = pl.DataFrame({
    "mixed_numbers": ["123", "456", "not_a_number", "789", None],
    "large_values": [100, 200, 50000, 150, 75],  # 200, 50000, and 150 won't fit in Int8
})

# Strict mode raises an error
try:
    df.select(pl.col("mixed_numbers").cast(pl.Int64))
except Exception as e:
    print(f"Strict error: {e}")

# Lenient mode converts failures to null
result = df.select(
    pl.col("mixed_numbers").cast(pl.Int64, strict=False).alias("parsed"),
    pl.col("large_values").cast(pl.Int8, strict=False).alias("small_int"),
)

print(result)

Output:

shape: (5, 2)
┌────────┬───────────┐
│ parsed ┆ small_int │
│ ---    ┆ ---       │
│ i64    ┆ i8        │
╞════════╪═══════════╡
│ 123    ┆ 100       │
│ 456    ┆ null      │
│ null   ┆ null      │
│ 789    ┆ null      │
│ null   ┆ 75        │
└────────┴───────────┘

The overflowing values (200, 50000, and 150) become null because they exceed Int8’s maximum of 127.

Best Practices and Performance Tips

Define schemas upfront. The fastest cast is the one you don’t have to do. When reading files, specify types at load time:

# Instead of loading then casting
df = pl.read_csv("data.csv")
df = df.with_columns(pl.col("id").cast(pl.Int32))

# Define schema at read time
df = pl.read_csv(
    "data.csv",
    schema_overrides={
        "id": pl.Int32,
        "price": pl.Float32,
        "category": pl.Categorical,
        "date": pl.Date,
    },
)

Cast multiple columns efficiently using selectors:

import polars.selectors as cs

df = pl.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "c": [7, 8, 9],
    "name": ["x", "y", "z"],
})

# Cast all integer columns to Float32
result = df.cast({cs.integer(): pl.Float32})

# Or cast specific columns by name pattern
result = df.cast({cs.by_name("a", "b"): pl.Float64})

Leverage lazy mode for complex pipelines. In lazy mode, Polars can optimize cast operations:

# Lazy evaluation allows Polars to optimize the query plan
result = (
    pl.scan_csv("large_file.csv")
    .with_columns(
        pl.col("amount").cast(pl.Float32),
        pl.col("category").cast(pl.Categorical),
    )
    .filter(pl.col("amount") > 100)
    .collect()
)

Polars may push the cast operation closer to the data source or combine it with other operations for better performance.

Validate after lenient casting. When using strict=False, always check for unexpected nulls:

original_nulls = df.select(pl.col("value").null_count()).item()
casted = df.with_columns(pl.col("value").cast(pl.Int32, strict=False))
new_nulls = casted.select(pl.col("value").null_count()).item()

if new_nulls > original_nulls:
    failed_count = new_nulls - original_nulls
    print(f"Warning: {failed_count} values failed to cast")

Type casting in Polars is straightforward once you internalize the cast() pattern. The key is being intentional about your types from the start—define schemas explicitly, choose appropriate numeric widths, and use categoricals for repeated strings. Your memory usage and query performance will thank you.
