How to Fill Null Values in Polars
Key Insights
- Polars distinguishes between null (missing data) and NaN (invalid floating-point results), requiring different handling strategies for each
- The fill_null() method supports both static values and built-in strategies like forward fill, backward fill, and statistical aggregations; choose based on your data's nature
- For complex fallback logic across multiple columns, pl.coalesce() provides a cleaner, more performant solution than chained conditional expressions
Introduction
Null values are inevitable in real-world data. Whether you’re processing user submissions, merging datasets, or ingesting external APIs, you’ll encounter missing values that need handling before analysis or model training.
Polars takes a fundamentally different approach to null handling than pandas. While pandas conflates missing values with NaN for numeric columns and uses None for objects, Polars maintains a clear distinction. This separation isn’t just semantic—it affects how you write transformations and can catch subtle bugs that would slip through in pandas.
Understanding Polars’ null handling methods will make your data pipelines more robust and your code more expressive. Let’s dig into the practical techniques.
Understanding Nulls in Polars
Polars represents missing data as null, which is distinct from NaN (Not a Number). This matters because NaN is a valid IEEE 754 floating-point value representing undefined mathematical results, while null represents the absence of data entirely.
import polars as pl
df = pl.DataFrame({
    "product": ["Widget", "Gadget", None, "Tool", None],
    "price": [29.99, None, 45.00, None, 19.99],
    "quantity": [100, 50, None, 75, 200],
    "rating": [4.5, float("nan"), 3.8, 4.2, None]
})
print(df)
shape: (5, 4)
┌─────────┬───────┬──────────┬────────┐
│ product ┆ price ┆ quantity ┆ rating │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ i64 ┆ f64 │
╞═════════╪═══════╪══════════╪════════╡
│ Widget ┆ 29.99 ┆ 100 ┆ 4.5 │
│ Gadget ┆ null ┆ 50 ┆ NaN │
│ null ┆ 45.0 ┆ null ┆ 3.8 │
│ Tool ┆ null ┆ 75 ┆ 4.2 │
│ null ┆ 19.99 ┆ 200 ┆ null │
└─────────┴───────┴──────────┴────────┘
Notice how the rating column contains both NaN and null. These are not interchangeable.
To detect nulls, use null_count() for a quick summary or is_null() for row-level inspection:
# Get null counts per column
print(df.null_count())
shape: (1, 4)
┌─────────┬───────┬──────────┬────────┐
│ product ┆ price ┆ quantity ┆ rating │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 ┆ u32 │
╞═════════╪═══════╪══════════╪════════╡
│ 2 ┆ 2 ┆ 1 ┆ 1 │
└─────────┴───────┴──────────┴────────┘
# Check which rows have null prices
print(df.select(
    pl.col("product"),
    pl.col("price"),
    pl.col("price").is_null().alias("price_is_null")
))
shape: (5, 3)
┌─────────┬───────┬───────────────┐
│ product ┆ price ┆ price_is_null │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ bool │
╞═════════╪═══════╪═══════════════╡
│ Widget ┆ 29.99 ┆ false │
│ Gadget ┆ null ┆ true │
│ null ┆ 45.0 ┆ false │
│ Tool ┆ null ┆ true │
│ null ┆ 19.99 ┆ false │
└─────────┴───────┴───────────────┘
If you need to handle NaN values specifically, use is_nan() instead. To treat NaN as null for filling purposes, first convert with fill_nan(None).
Fill with Static Values
The most straightforward approach is replacing nulls with a constant value. The fill_null() method accepts literals directly:
df_filled = df.with_columns(
    pl.col("price").fill_null(0.0).alias("price_filled"),
    pl.col("product").fill_null("Unknown").alias("product_filled"),
    pl.col("quantity").fill_null(-1).alias("quantity_filled")
)
print(df_filled.select("price", "price_filled", "product", "product_filled"))
shape: (5, 4)
┌───────┬──────────────┬─────────┬────────────────┐
│ price ┆ price_filled ┆ product ┆ product_filled │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ str ┆ str │
╞═══════╪══════════════╪═════════╪════════════════╡
│ 29.99 ┆ 29.99 ┆ Widget ┆ Widget │
│ null ┆ 0.0 ┆ Gadget ┆ Gadget │
│ 45.0 ┆ 45.0 ┆ null ┆ Unknown │
│ null ┆ 0.0 ┆ Tool ┆ Tool │
│ 19.99 ┆ 19.99 ┆ null ┆ Unknown │
└───────┴──────────────┴─────────┴────────────────┘
The fill value must be compatible with the column’s data type. Polars will raise an error if you try to fill a numeric column with a string.
For bulk operations across multiple columns of the same type, use selectors:
import polars.selectors as cs
df_numeric_filled = df.with_columns(
    cs.numeric().fill_null(0)
)
Fill with Strategies
Polars provides built-in strategies for common fill patterns. These are particularly useful for time-series data or when you want statistically-informed replacements.
Available strategies: "forward", "backward", "mean", "min", "max", "zero", "one".
# Time-series example
ts_df = pl.DataFrame({
    "date": pl.date_range(pl.date(2024, 1, 1), pl.date(2024, 1, 7), eager=True),
    "temperature": [72.0, None, None, 68.0, 71.0, None, 69.0],
    "humidity": [45, 48, None, None, 52, 50, None]
})
print(ts_df)
shape: (7, 3)
┌────────────┬─────────────┬──────────┐
│ date ┆ temperature ┆ humidity │
│ --- ┆ --- ┆ --- │
│ date ┆ f64 ┆ i64 │
╞════════════╪═════════════╪══════════╡
│ 2024-01-01 ┆ 72.0 ┆ 45 │
│ 2024-01-02 ┆ null ┆ 48 │
│ 2024-01-03 ┆ null ┆ null │
│ 2024-01-04 ┆ 68.0 ┆ null │
│ 2024-01-05 ┆ 71.0 ┆ 52 │
│ 2024-01-06 ┆ null ┆ 50 │
│ 2024-01-07 ┆ 69.0 ┆ null │
└────────────┴─────────────┴──────────┘
# Forward fill carries the last known value forward
ts_filled = ts_df.with_columns(
    pl.col("temperature").fill_null(strategy="forward").alias("temp_ffill"),
    pl.col("humidity").fill_null(strategy="backward").alias("humidity_bfill"),
    pl.col("temperature").fill_null(strategy="mean").alias("temp_mean")
)
print(ts_filled)
shape: (7, 6)
┌────────────┬─────────────┬──────────┬────────────┬────────────────┬───────────┐
│ date       ┆ temperature ┆ humidity ┆ temp_ffill ┆ humidity_bfill ┆ temp_mean │
│ ---        ┆ ---         ┆ ---      ┆ ---        ┆ ---            ┆ ---       │
│ date       ┆ f64         ┆ i64      ┆ f64        ┆ i64            ┆ f64       │
╞════════════╪═════════════╪══════════╪════════════╪════════════════╪═══════════╡
│ 2024-01-01 ┆ 72.0        ┆ 45       ┆ 72.0       ┆ 45             ┆ 72.0      │
│ 2024-01-02 ┆ null        ┆ 48       ┆ 72.0       ┆ 48             ┆ 70.0      │
│ 2024-01-03 ┆ null        ┆ null     ┆ 72.0       ┆ 52             ┆ 70.0      │
│ 2024-01-04 ┆ 68.0        ┆ null     ┆ 68.0       ┆ 52             ┆ 68.0      │
│ 2024-01-05 ┆ 71.0        ┆ 52       ┆ 71.0       ┆ 52             ┆ 71.0      │
│ 2024-01-06 ┆ null        ┆ 50       ┆ 71.0       ┆ 50             ┆ 70.0      │
│ 2024-01-07 ┆ 69.0        ┆ null     ┆ 69.0       ┆ 50             ┆ 69.0      │
└────────────┴─────────────┴──────────┴────────────┴────────────────┴───────────┘
Forward fill is ideal for sensor data where the last reading remains valid until updated. Backward fill works well when you have future data and want to propagate it backward. Mean fill keeps the column's average intact, but be aware it shrinks the variance, which can distort downstream statistics.
Be cautious with forward fill at the start of your data and backward fill at the end—nulls in those positions will remain null.
Fill with Expressions and Computed Values
Static values and strategies cover common cases, but real-world data often requires dynamic fills. Polars expressions let you compute fill values on the fly.
employee_df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "department": ["Engineering", "Engineering", "Sales", "Sales", "Engineering"],
    "salary": [95000, None, 75000, None, 88000],
    "bonus": [5000, 4500, None, 3000, None]
})
# Fill salary nulls with department mean
filled = employee_df.with_columns(
    pl.col("salary").fill_null(
        pl.col("salary").mean().over("department")
    ).alias("salary_filled")
)
print(filled)
shape: (5, 5)
┌─────────┬─────────────┬────────┬───────┬───────────────┐
│ name ┆ department ┆ salary ┆ bonus ┆ salary_filled │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 ┆ f64 │
╞═════════╪═════════════╪════════╪═══════╪═══════════════╡
│ Alice ┆ Engineering ┆ 95000 ┆ 5000 ┆ 95000.0 │
│ Bob ┆ Engineering ┆ null ┆ 4500 ┆ 91500.0 │
│ Charlie ┆ Sales ┆ 75000 ┆ null ┆ 75000.0 │
│ Diana ┆ Sales ┆ null ┆ 3000 ┆ 75000.0 │
│ Eve ┆ Engineering ┆ 88000 ┆ null ┆ 88000.0 │
└─────────┴─────────────┴────────┴───────┴───────────────┘
You can also fill from another column entirely:
contact_df = pl.DataFrame({
    "user": ["u1", "u2", "u3"],
    "mobile": ["555-1234", None, "555-9999"],
    "landline": ["555-0001", "555-0002", None]
})
filled_contact = contact_df.with_columns(
    pl.col("mobile").fill_null(pl.col("landline")).alias("primary_phone")
)
print(filled_contact)
Handling Nulls with Coalesce
When you need fallback logic across multiple columns, pl.coalesce() is cleaner than nested fill_null() calls. It returns the first non-null value from the provided expressions.
user_df = pl.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "primary_email": ["alice@work.com", None, None, "diana@work.com", None],
    "secondary_email": [None, "bob@personal.com", None, None, None],
    "recovery_email": ["alice@backup.com", "bob@backup.com", "charlie@backup.com", None, None]
})
result = user_df.with_columns(
    pl.coalesce(
        pl.col("primary_email"),
        pl.col("secondary_email"),
        pl.col("recovery_email"),
        pl.lit("no-email@placeholder.com")
    ).alias("contact_email")
)
print(result)
shape: (5, 5)
┌─────────┬────────────────┬──────────────────┬────────────────────┬──────────────────────────┐
│ user_id ┆ primary_email ┆ secondary_email ┆ recovery_email ┆ contact_email │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ str │
╞═════════╪════════════════╪══════════════════╪════════════════════╪══════════════════════════╡
│ 1 ┆ alice@work.com ┆ null ┆ alice@backup.com ┆ alice@work.com │
│ 2 ┆ null ┆ bob@personal.com ┆ bob@backup.com ┆ bob@personal.com │
│ 3 ┆ null ┆ null ┆ charlie@backup.com ┆ charlie@backup.com │
│ 4 ┆ diana@work.com ┆ null ┆ null ┆ diana@work.com │
│ 5 ┆ null ┆ null ┆ null ┆ no-email@placeholder.com │
└─────────┴────────────────┴──────────────────┴────────────────────┴──────────────────────────┘
coalesce() evaluates its arguments left to right and takes the first non-null value per row, which keeps long fallback chains both readable and efficient.
Performance Considerations and Best Practices
Use lazy execution for large datasets. When working with substantial data, chain your null-handling operations in a lazy frame:
result = (
    pl.scan_parquet("large_dataset.parquet")
    .with_columns(
        pl.col("value").fill_null(strategy="forward"),
        pl.col("category").fill_null("Unknown")
    )
    .filter(pl.col("value").is_not_null())
    .collect()
)
Lazy execution lets Polars optimize the query plan, potentially pushing filters before fills or parallelizing operations.
Drop versus fill. Filling nulls isn’t always the right choice. If nulls represent genuinely missing observations that would skew analysis, dropping rows with drop_nulls() may be more appropriate. Consider your downstream use case.
Handle NaN separately. Remember that fill_null() doesn’t touch NaN values. If your data contains both, chain the operations:
df.with_columns(
    pl.col("value").fill_nan(None).fill_null(0)
)
Be explicit about types. When filling with literals, ensure type compatibility. Use pl.lit() with explicit casting if needed:
pl.col("int_column").fill_null(pl.lit(0).cast(pl.Int64))
Document your fill strategy. Null handling decisions affect downstream analysis. Whether you choose forward fill, mean imputation, or static values, document why. Future you (or your teammates) will appreciate it.