How to Handle Null Values in Polars
Key Insights
- Polars uses a native null representation (not Python’s None or NumPy’s NaN), which enables consistent behavior across all data types and better performance through optimized memory layouts.
- The fill_null() method supports multiple strategies—literal values, forward/backward fill, and computed expressions—making it flexible enough for most imputation scenarios without writing complex conditional logic.
- Polars aggregations ignore nulls by default, which is usually what you want, but understanding this behavior prevents subtle bugs when null counts matter for your analysis.
Introduction to Null Handling in Polars
Missing data is inevitable. Whether you’re parsing CSV files with empty cells, joining datasets with mismatched keys, or processing API responses with optional fields, you’ll encounter null values. How you handle them determines whether your analysis is trustworthy or riddled with silent errors.
Polars takes a different approach to nulls than pandas. Instead of overloading NaN (which is technically a valid floating-point value) or relying on Python’s None, Polars uses a dedicated null representation built into its Arrow-based memory format. This means nulls work consistently across integers, strings, dates, and every other data type—not just floats.
This design choice has practical implications. You won’t accidentally compare NaN != NaN and get True. You won’t see integers mysteriously convert to floats just because one value is missing. And you’ll get better performance because Polars can use bitmasks to track nulls without touching the actual data.
Let’s explore the complete toolkit for handling nulls in Polars.
Detecting Null Values
Before you can handle nulls, you need to find them. Polars provides several methods for null detection, each suited to different use cases.
The is_null() and is_not_null() expressions return boolean masks:
import polars as pl
df = pl.DataFrame({
    "name": ["Alice", None, "Charlie", "Diana"],
    "age": [25, 30, None, 28],
    "email": ["alice@example.com", None, None, "diana@example.com"]
})
# Check which values are null
print(df.select(pl.all().is_null()))
This outputs a DataFrame of booleans showing null positions. More useful for exploration is counting nulls per column:
# Count nulls in each column
null_counts = df.select(pl.all().is_null().sum())
print(null_counts)
# shape: (1, 3)
# ┌──────┬─────┬───────┐
# │ name ┆ age ┆ email │
# │ --- ┆ --- ┆ --- │
# │ u32 ┆ u32 ┆ u32 │
# ╞══════╪═════╪═══════╡
# │ 1 ┆ 1 ┆ 2 │
# └──────┴─────┴───────┘
For a quick summary, use null_count() directly:
# Alternative: null_count() method
print(df.null_count())
Filtering rows based on null presence is straightforward:
# Rows where email is missing
missing_email = df.filter(pl.col("email").is_null())
# Rows where ALL specified columns have values
complete_rows = df.filter(
    pl.col("name").is_not_null() & pl.col("age").is_not_null()
)
Dropping Null Values
Sometimes the cleanest solution is removing incomplete records. The drop_nulls() method handles this at both DataFrame and expression levels.
# Drop rows with ANY null value
clean_df = df.drop_nulls()
print(clean_df)
# shape: (2, 3)
# ┌───────┬─────┬───────────────────┐
# │ name  ┆ age ┆ email             │
# │ ---   ┆ --- ┆ ---               │
# │ str   ┆ i64 ┆ str               │
# ╞═══════╪═════╪═══════════════════╡
# │ Alice ┆ 25  ┆ alice@example.com │
# │ Diana ┆ 28  ┆ diana@example.com │
# └───────┴─────┴───────────────────┘
That’s aggressive—you lost two rows. Usually you want to target specific columns:
# Drop rows only where 'name' is null
df_subset_clean = df.drop_nulls(subset=["name"])
print(df_subset_clean)
# shape: (3, 3)
# ┌─────────┬──────┬───────────────────┐
# │ name    ┆ age  ┆ email             │
# │ ---     ┆ ---  ┆ ---               │
# │ str     ┆ i64  ┆ str               │
# ╞═════════╪══════╪═══════════════════╡
# │ Alice   ┆ 25   ┆ alice@example.com │
# │ Charlie ┆ null ┆ null              │
# │ Diana   ┆ 28   ┆ diana@example.com │
# └─────────┴──────┴───────────────────┘
Within an expression, drop_nulls() operates on a single column, which is useful when you want null exclusion to be explicit in an aggregation:
# Calculate mean of non-null ages explicitly
mean_age = df.select(pl.col("age").drop_nulls().mean())
Filling Null Values
Dropping data isn’t always acceptable. When you need to impute missing values, fill_null() offers multiple strategies.
Literal values work for simple cases:
# Fill with a constant
df_filled = df.with_columns(
    pl.col("email").fill_null("no-email@placeholder.com")
)
Strategy-based filling handles sequential data well:
# Time series example
ts_df = pl.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"],
    "price": [100.0, None, None, 105.0]
})
# Forward fill: carry last known value forward
forward_filled = ts_df.with_columns(
    pl.col("price").fill_null(strategy="forward")
)
print(forward_filled)
# price: [100.0, 100.0, 100.0, 105.0]
# Backward fill: use next known value
backward_filled = ts_df.with_columns(
    pl.col("price").fill_null(strategy="backward")
)
print(backward_filled)
# price: [100.0, 105.0, 105.0, 105.0]
Computed values let you fill with statistics:
# Fill with column mean
df_mean_filled = df.with_columns(
    pl.col("age").fill_null(pl.col("age").mean())
)
# Fill with median (more robust to outliers)
df_median_filled = df.with_columns(
    pl.col("age").fill_null(pl.col("age").median())
)
You can also fill with values from another column by passing an expression:
# Fill missing age from a backup column
backup_df = pl.DataFrame({
    "age": [25, None, 30],
    "age_estimate": [24, 29, 31]
})
backup_df = backup_df.with_columns(
    pl.col("age").fill_null(pl.col("age_estimate"))
)
# age: [25, 29, 30]
Coalescing and Conditional Replacement
When you have fallback options, coalesce() returns the first non-null value from a list of expressions:
contact_df = pl.DataFrame({
    "user_id": [1, 2, 3, 4],
    "primary_email": ["a@test.com", None, None, "d@test.com"],
    "secondary_email": [None, "b2@test.com", None, "d2@test.com"],
    "phone": ["111", "222", "333", None]
})
# Get best available contact method
result = contact_df.with_columns(
    pl.coalesce(
        pl.col("primary_email"),
        pl.col("secondary_email"),
        pl.lit("unknown")
    ).alias("contact_email")
)
print(result.select("user_id", "contact_email"))
# user_id: [1, 2, 3, 4]
# contact_email: ["a@test.com", "b2@test.com", "unknown", "d@test.com"]
For complex conditional logic, use when().then().otherwise():
# Conditional filling based on other columns
df_conditional = df.with_columns(
    pl.when(pl.col("age").is_null())
    .then(pl.lit(18))  # Default age for missing values
    .otherwise(pl.col("age"))
    .alias("age_filled")
)
# More complex: different defaults based on conditions
df_smart_fill = df.with_columns(
    pl.when(pl.col("age").is_null() & pl.col("name").str.starts_with("A"))
    .then(pl.lit(25))
    .when(pl.col("age").is_null())
    .then(pl.lit(30))
    .otherwise(pl.col("age"))
    .alias("age_imputed")
)
Null-Safe Operations and Aggregations
Understanding how Polars handles nulls in operations prevents surprises.
Arithmetic with nulls propagates nulls:
calc_df = pl.DataFrame({
    "a": [1, 2, None, 4],
    "b": [10, None, 30, 40]
})
result = calc_df.with_columns(
    (pl.col("a") + pl.col("b")).alias("sum")
)
# sum: [11, null, null, 44]
Aggregations ignore nulls by default:
agg_df = pl.DataFrame({"values": [1, 2, None, 4, None]})
print(agg_df.select(
    pl.col("values").sum().alias("sum"),      # 7 (not null)
    pl.col("values").mean().alias("mean"),    # 2.33 (divides by 3, not 5)
    pl.col("values").count().alias("count"),  # 3 (non-null count)
    pl.col("values").len().alias("len")       # 5 (total rows)
))
This is usually correct behavior, but be explicit when null counts matter:
# If you need nulls to count as zeros in a mean
adjusted_mean = agg_df.select(
    pl.col("values").fill_null(0).mean()  # 1.4 instead of 2.33
)
Null-safe equality requires attention. Standard equality returns null when comparing with null:
# This might not do what you expect
df.filter(pl.col("name") == None) # Returns empty—use is_null() instead
# Correct approach
df.filter(pl.col("name").is_null())
Best Practices and Performance Considerations
Handle nulls lazily when possible. In lazy mode, Polars can optimize null handling operations:
# Lazy mode allows optimization
result = (
    pl.scan_csv("large_file.csv")
    .filter(pl.col("important_col").is_not_null())
    .with_columns(pl.col("optional_col").fill_null(0))
    .collect()
)
Be explicit about schema implications. Polars treats every column as nullable and the schema records only dtypes, so fill_null changes the data, not the schema; downstream code can’t assume a filled column will never contain nulls again.
Avoid repeated null checks. Instead of checking and then filling, just fill:
# Unnecessary
df.with_columns(
    pl.when(pl.col("x").is_null()).then(0).otherwise(pl.col("x"))
)
# Better
df.with_columns(pl.col("x").fill_null(0))
Document your null handling strategy. Whether you drop, fill, or propagate nulls affects downstream analysis. Make these decisions explicit in your code and documentation.
Null handling isn’t glamorous, but getting it right is the difference between trustworthy analysis and garbage-in-garbage-out. Polars gives you the tools—use them deliberately.