How to Cast Data Types in Polars
Key Insights
- Polars’ cast() method is explicit and strict by default, catching type conversion errors early rather than silently producing incorrect data as pandas often does.
- Choosing the right numeric type (Int8 vs Int64, Float32 vs Float64) can reduce memory usage by 75% or more on large datasets without sacrificing precision for your use case.
- Always cast data types at schema definition time when possible—it’s faster than converting after loading and makes your data pipeline’s expectations explicit.
Introduction to Type Casting in Polars
Data type casting is one of those operations you’ll perform constantly but rarely think about until something breaks. In Polars, getting your types right matters for two reasons: memory efficiency and computational correctness.
Polars takes a stricter approach to types than pandas. Where pandas will silently coerce your integers to floats when nulls appear, Polars maintains distinct nullable integer types. Where pandas might let you perform string operations on a mixed-type column, Polars will complain loudly. This strictness is a feature, not a bug—it catches data quality issues early.
The core method for type conversion in Polars is cast(). Unlike pandas’ scattered approach (astype(), to_numeric(), to_datetime(), etc.), Polars consolidates most conversions into this single, predictable method.
Basic Casting with cast()
The cast() method lives on expressions and takes a Polars data type as its argument. Here’s the fundamental pattern:
```python
import polars as pl

df = pl.DataFrame({
    "integers": [1, 2, 3, 4, 5],
    "floats": [1.5, 2.7, 3.2, 4.8, 5.1],
    "numeric_strings": ["100", "200", "300", "400", "500"],
    "booleans": [True, False, True, False, True],
})

# Basic type conversions
result = df.select(
    pl.col("integers").cast(pl.Float64).alias("int_to_float"),
    pl.col("floats").cast(pl.Int64).alias("float_to_int"),  # Truncates decimals
    pl.col("numeric_strings").cast(pl.Int64).alias("str_to_int"),
    pl.col("booleans").cast(pl.Int8).alias("bool_to_int"),  # True=1, False=0
)
print(result)
```
Output:
```
shape: (5, 4)
┌──────────────┬──────────────┬────────────┬─────────────┐
│ int_to_float ┆ float_to_int ┆ str_to_int ┆ bool_to_int │
│ ---          ┆ ---          ┆ ---        ┆ ---         │
│ f64          ┆ i64          ┆ i64        ┆ i8          │
╞══════════════╪══════════════╪════════════╪═════════════╡
│ 1.0          ┆ 1            ┆ 100        ┆ 1           │
│ 2.0          ┆ 2            ┆ 200        ┆ 0           │
│ 3.0          ┆ 3            ┆ 300        ┆ 1           │
│ 4.0          ┆ 4            ┆ 400        ┆ 0           │
│ 5.0          ┆ 5            ┆ 500        ┆ 1           │
└──────────────┴──────────────┴────────────┴─────────────┘
```
Notice that float-to-integer conversion truncates toward zero rather than rounding. If you need rounding, apply round() before casting.
Casting to Numeric Types
Polars offers a full range of numeric types with explicit bit widths. Understanding when to use each saves memory and prevents overflow bugs.
| Type | Range | Memory |
|---|---|---|
| Int8 | -128 to 127 | 1 byte |
| Int16 | -32,768 to 32,767 | 2 bytes |
| Int32 | -2.1B to 2.1B | 4 bytes |
| Int64 | ±9.2 quintillion | 8 bytes |
| UInt8 | 0 to 255 | 1 byte |
| UInt32 | 0 to 4.3B | 4 bytes |
| Float32 | ~7 decimal digits | 4 bytes |
| Float64 | ~15 decimal digits | 8 bytes |
Here’s how to downcast for memory efficiency:
```python
# Simulating a large dataset with small values
df = pl.DataFrame({
    "user_age": [25, 34, 45, 28, 52] * 1_000_000,      # Values 0-120 fit in UInt8
    "rating": [4.5, 3.2, 5.0, 4.8, 3.9] * 1_000_000,   # Float32 is plenty
    "year": [2020, 2021, 2022, 2023, 2024] * 1_000_000,  # Int16 works
})
print(f"Original memory: {df.estimated_size('mb'):.2f} MB")

df_optimized = df.select(
    pl.col("user_age").cast(pl.UInt8),
    pl.col("rating").cast(pl.Float32),
    pl.col("year").cast(pl.Int16),
)
print(f"Optimized memory: {df_optimized.estimated_size('mb'):.2f} MB")
```
Output:
```
Original memory: 114.44 MB
Optimized memory: 33.57 MB
```
That’s a 70% memory reduction. On datasets with billions of rows, this difference determines whether your data fits in RAM.
String and Categorical Conversions
Categorical types are Polars’ secret weapon for string columns with repeated values. Instead of storing the full string for each row, categoricals store integer indices into a lookup table.
```python
df = pl.DataFrame({
    "country": ["USA", "Canada", "USA", "Mexico", "Canada"] * 100_000,
    "status": ["active", "inactive", "pending", "active", "active"] * 100_000,
    "amount": ["1234.56", "789.01", "456.78", "999.99", "123.45"] * 100_000,
})
print(f"Original size: {df.estimated_size('mb'):.2f} MB")

# Convert strings to categoricals and parse numeric strings
df_converted = df.select(
    pl.col("country").cast(pl.Categorical),
    pl.col("status").cast(pl.Categorical),
    pl.col("amount").cast(pl.Float64),
)
print(f"Converted size: {df_converted.estimated_size('mb'):.2f} MB")

# Convert categorical back to string when needed
df_back = df_converted.select(
    pl.col("country").cast(pl.String),
)
```
For string-to-numeric parsing with more control, use the str namespace methods:
```python
df = pl.DataFrame({
    "messy_numbers": [" 123 ", "456.789", "1,234", "N/A", "789"],
})
result = df.select(
    # Strip whitespace and convert
    pl.col("messy_numbers").str.strip_chars().cast(pl.Float64, strict=False),
)
```
Date and Time Type Casting
Temporal types require special handling because dates come in countless string formats. Polars provides dedicated parsing methods in the str namespace:
```python
df = pl.DataFrame({
    "date_iso": ["2024-01-15", "2024-02-20", "2024-03-25"],
    "date_us": ["01/15/2024", "02/20/2024", "03/25/2024"],
    "datetime_str": ["2024-01-15 14:30:00", "2024-02-20 09:15:00", "2024-03-25 18:45:00"],
    "timestamp_ms": [1705312200000, 1708416900000, 1711392300000],
})
result = df.select(
    # ISO format parses automatically
    pl.col("date_iso").str.to_date(),
    # Custom format requires explicit pattern
    pl.col("date_us").str.to_date("%m/%d/%Y").alias("date_us_parsed"),
    # Datetime parsing
    pl.col("datetime_str").str.to_datetime("%Y-%m-%d %H:%M:%S"),
    # Unix timestamp to datetime
    pl.col("timestamp_ms").cast(pl.Datetime("ms")),
)
print(result)
```
For timezone handling:
```python
df = pl.DataFrame({
    "utc_time": ["2024-01-15T14:30:00Z", "2024-02-20T09:15:00Z"],
})
result = df.select(
    pl.col("utc_time")
    .str.to_datetime("%Y-%m-%dT%H:%M:%SZ")
    .dt.replace_time_zone("UTC")
    .dt.convert_time_zone("America/New_York")
    .alias("eastern_time"),
)
```
Handling Cast Errors and Edge Cases
By default, cast() is strict—it raises an error when conversion fails. Use strict=False to convert failures to null instead:
```python
df = pl.DataFrame({
    "mixed_numbers": ["123", "456", "not_a_number", "789", None],
    "large_values": [100, 200, 50000, 150, 75],  # 50000 won't fit in Int8
})

# Strict mode raises an error
try:
    df.select(pl.col("mixed_numbers").cast(pl.Int64))
except Exception as e:
    print(f"Strict error: {e}")

# Lenient mode converts failures to null
result = df.select(
    pl.col("mixed_numbers").cast(pl.Int64, strict=False).alias("parsed"),
    pl.col("large_values").cast(pl.Int8, strict=False).alias("small_int"),
)
print(result)
```
Output:
```
shape: (5, 2)
┌────────┬───────────┐
│ parsed ┆ small_int │
│ ---    ┆ ---       │
│ i64    ┆ i8        │
╞════════╪═══════════╡
│ 123    ┆ 100       │
│ 456    ┆ null      │
│ null   ┆ null      │
│ 789    ┆ null      │
│ null   ┆ 75        │
└────────┴───────────┘
```
Every value that overflows Int8's maximum of 127 becomes null: 200, 50000, and 150 all exceed it, so only 100 and 75 survive.
Best Practices and Performance Tips
Define schemas upfront. The fastest cast is the one you don’t have to do. When reading files, specify types at load time:
```python
# Instead of loading then casting
df = pl.read_csv("data.csv")
df = df.with_columns(pl.col("id").cast(pl.Int32))

# Define schema at read time
df = pl.read_csv(
    "data.csv",
    schema_overrides={
        "id": pl.Int32,
        "price": pl.Float32,
        "category": pl.Categorical,
        "date": pl.Date,
    },
)
```
Cast multiple columns efficiently using selectors:
```python
import polars.selectors as cs

df = pl.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "c": [7, 8, 9],
    "name": ["x", "y", "z"],
})

# Cast all integer columns to Float32
result = df.cast({cs.integer(): pl.Float32})

# Or cast specific columns by name pattern
result = df.cast({cs.by_name("a", "b"): pl.Float64})
```
Leverage lazy mode for complex pipelines. In lazy mode, Polars can optimize cast operations:
```python
# Lazy evaluation allows Polars to optimize the query plan
result = (
    pl.scan_csv("large_file.csv")
    .with_columns(
        pl.col("amount").cast(pl.Float32),
        pl.col("category").cast(pl.Categorical),
    )
    .filter(pl.col("amount") > 100)
    .collect()
)
```
Polars may push the cast operation closer to the data source or combine it with other operations for better performance.
Validate after lenient casting. When using strict=False, always check for unexpected nulls:
```python
original_nulls = df.select(pl.col("value").null_count()).item()
casted = df.with_columns(pl.col("value").cast(pl.Int32, strict=False))
new_nulls = casted.select(pl.col("value").null_count()).item()

if new_nulls > original_nulls:
    failed_count = new_nulls - original_nulls
    print(f"Warning: {failed_count} values failed to cast")
```
Type casting in Polars is straightforward once you internalize the cast() pattern. The key is being intentional about your types from the start—define schemas explicitly, choose appropriate numeric widths, and use categoricals for repeated strings. Your memory usage and query performance will thank you.