How to Handle Missing Data in Polars

Missing data is inevitable. Sensors fail, users skip form fields, and joins produce unmatched rows. How you handle these gaps determines whether your analysis is trustworthy or garbage.

Key Insights

  • Polars distinguishes between null (missing data) and NaN (undefined floating-point results)—handling them correctly requires different methods and understanding when each appears in your data.
  • The fill_null() method with strategies like forward, backward, and mean handles most common imputation patterns, but complex business logic often requires when().then().otherwise() expressions.
  • Polars’ lazy evaluation shines with null handling operations—chaining multiple null checks and fills in a lazy frame lets the query optimizer eliminate redundant scans.

Introduction to Null Values in Polars

Polars takes a clear stance on missing data: it uses null to represent missing values, distinct from NaN (Not a Number), which represents undefined floating-point operations like 0/0. This distinction matters. A null means “we don’t have this value.” A NaN means “we computed something mathematically undefined.”

import polars as pl
import numpy as np

df = pl.DataFrame({
    "name": ["Alice", "Bob", None, "Diana"],
    "age": [25, None, 35, 40],
    "score": [85.5, float("nan"), None, 92.0],
})

print(df)
shape: (4, 3)
┌───────┬──────┬───────┐
│ name  ┆ age  ┆ score │
│ ---   ┆ ---  ┆ ---   │
│ str   ┆ i64  ┆ f64   │
╞═══════╪══════╪═══════╡
│ Alice ┆ 25   ┆ 85.5  │
│ Bob   ┆ null ┆ NaN   │
│ null  ┆ 35   ┆ null  │
│ Diana ┆ 40   ┆ 92.0  │
└───────┴──────┴───────┘

Notice Bob’s score shows NaN while the third row shows null. These require different handling approaches, which we’ll cover throughout this article.

Detecting Missing Data

Before you can handle missing data, you need to find it. Polars provides several methods for null detection that work efficiently across large datasets.

The is_null() and is_not_null() methods return boolean masks you can use for filtering or counting:

# Find rows where age is missing
missing_age = df.filter(pl.col("age").is_null())
print(missing_age)

# Count nulls per column
null_counts = df.select(pl.all().null_count())
print(null_counts)
shape: (1, 3)
┌──────┬──────┬───────┐
│ name ┆ age  ┆ score │
│ ---  ┆ ---  ┆ ---   │
│ str  ┆ i64  ┆ f64   │
╞══════╪══════╪═══════╡
│ Bob  ┆ null ┆ NaN   │
└──────┴──────┴───────┘

shape: (1, 3)
┌──────┬─────┬───────┐
│ name ┆ age ┆ score │
│ ---  ┆ --- ┆ ---   │
│ u32  ┆ u32 ┆ u32   │
╞══════╪═════╪═══════╡
│ 1    ┆ 1   ┆ 1     │
└──────┴─────┴───────┘

For a quick overview of your data including null counts, use describe():

print(df.describe())

For NaN detection specifically, use is_nan():

# Find rows with NaN in score column
nan_scores = df.filter(pl.col("score").is_nan())
print(nan_scores)

# Check for either null OR NaN
problematic = df.filter(
    pl.col("score").is_null() | pl.col("score").is_nan()
)

Dropping Null Values

Sometimes the cleanest solution is removing incomplete records. The drop_nulls() method handles this, but use it carefully—you might be throwing away valuable data.

# Drop rows with ANY null values
clean_df = df.drop_nulls()
print(clean_df)
shape: (2, 3)
┌───────┬─────┬───────┐
│ name  ┆ age ┆ score │
│ ---   ┆ --- ┆ ---   │
│ str   ┆ i64 ┆ f64   │
╞═══════╪═════╪═══════╡
│ Alice ┆ 25  ┆ 85.5  │
│ Diana ┆ 40  ┆ 92.0  │
└───────┴─────┴───────┘

More often, you want to drop nulls only in specific columns:

# Drop rows only where 'name' is null
df_with_names = df.drop_nulls(subset=["name"])
print(df_with_names)

# Drop rows where EITHER age OR score is null
df_complete_metrics = df.drop_nulls(subset=["age", "score"])

When is dropping appropriate? When missing values represent truly unusable records—like a transaction without an amount or a user without an ID. When is it risky? When nulls aren’t random. If all your missing ages come from users who signed up before you added the age field, dropping them biases your analysis toward newer users.

Filling Null Values

Imputation—filling missing values with substitutes—is often preferable to dropping data. Polars offers fill_null() with multiple strategies.

Literal Values

The simplest approach fills nulls with a constant:

# Fill missing names with "Unknown"
df_filled = df.with_columns(
    pl.col("name").fill_null("Unknown")
)
print(df_filled)

Built-in Strategies

For numeric data, Polars provides statistical strategies:

# Fill with column mean
df_mean_filled = df.with_columns(
    pl.col("age").fill_null(strategy="mean")
)

# Forward fill (use previous value)
df_forward = df.with_columns(
    pl.col("age").fill_null(strategy="forward")
)

# Backward fill (use next value)
df_backward = df.with_columns(
    pl.col("age").fill_null(strategy="backward")
)

Forward and backward fills excel for time series data where adjacent values are related:

time_series = pl.DataFrame({
    "timestamp": pl.date_range(
        pl.date(2024, 1, 1), 
        pl.date(2024, 1, 5), 
        eager=True
    ),
    "temperature": [20.0, None, None, 23.0, 24.0],
})

# Forward fill carries the last known temperature
filled_temps = time_series.with_columns(
    pl.col("temperature").fill_null(strategy="forward")
)
print(filled_temps)
shape: (5, 2)
┌────────────┬─────────────┐
│ timestamp  ┆ temperature │
│ ---        ┆ ---         │
│ date       ┆ f64         │
╞════════════╪═════════════╡
│ 2024-01-01 ┆ 20.0        │
│ 2024-01-02 ┆ 20.0        │
│ 2024-01-03 ┆ 20.0        │
│ 2024-01-04 ┆ 23.0        │
│ 2024-01-05 ┆ 24.0        │
└────────────┴─────────────┘

Handling NaN Separately

Remember, fill_null() doesn’t touch NaN values. Use fill_nan() for those:

df_no_nan = df.with_columns(
    pl.col("score").fill_nan(0.0)
)

# Often you want to handle both
df_clean = df.with_columns(
    pl.col("score").fill_nan(None).fill_null(strategy="mean")
)

The pattern fill_nan(None) converts NaN to null, letting you handle all missing data uniformly with fill_null().

Interpolation for Missing Values

When your data has a natural ordering and missing values should fall between known points, interpolation beats simple filling strategies.

sensor_data = pl.DataFrame({
    "minute": list(range(10)),
    "reading": [1.0, 2.0, None, None, 5.0, None, 7.0, 8.0, None, 10.0],
})

interpolated = sensor_data.with_columns(
    pl.col("reading").interpolate()
)
print(interpolated)
shape: (10, 2)
┌────────┬─────────┐
│ minute ┆ reading │
│ ---    ┆ ---     │
│ i64    ┆ f64     │
╞════════╪═════════╡
│ 0      ┆ 1.0     │
│ 1      ┆ 2.0     │
│ 2      ┆ 3.0     │
│ 3      ┆ 4.0     │
│ 4      ┆ 5.0     │
│ 5      ┆ 6.0     │
│ 6      ┆ 7.0     │
│ 7      ┆ 8.0     │
│ 8      ┆ 9.0     │
│ 9      ┆ 10.0    │
└────────┴─────────┘

Linear interpolation assumes a straight line between known points. This works well for sensor data, stock prices between trades, or any continuous measurement. It fails for categorical data or values that don’t follow linear patterns.

Conditional Null Handling with Expressions

Real-world null handling often requires business logic. Polars expressions give you full control.

sales_data = pl.DataFrame({
    "product": ["Widget", "Gadget", "Gizmo", "Thingamajig"],
    "price": [10.0, None, 30.0, None],
    "category": ["electronics", "electronics", "home", "home"],
    "default_price": [15.0, 20.0, 25.0, 35.0],
})

# Fill null prices based on category
filled = sales_data.with_columns(
    pl.when(pl.col("price").is_null() & (pl.col("category") == "electronics"))
    .then(pl.lit(19.99))
    .when(pl.col("price").is_null() & (pl.col("category") == "home"))
    .then(pl.lit(29.99))
    .otherwise(pl.col("price"))
    .alias("price")
)
print(filled)

For simpler cases where you want to fall back to another column, use coalesce():

# Use price if available, otherwise default_price
coalesced = sales_data.with_columns(
    pl.coalesce(["price", "default_price"]).alias("final_price")
)
print(coalesced)
shape: (4, 5)
┌─────────────┬───────┬─────────────┬───────────────┬─────────────┐
│ product     ┆ price ┆ category    ┆ default_price ┆ final_price │
│ ---         ┆ ---   ┆ ---         ┆ ---           ┆ ---         │
│ str         ┆ f64   ┆ str         ┆ f64           ┆ f64         │
╞═════════════╪═══════╪═════════════╪═══════════════╪═════════════╡
│ Widget      ┆ 10.0  ┆ electronics ┆ 15.0          ┆ 10.0        │
│ Gadget      ┆ null  ┆ electronics ┆ 20.0          ┆ 20.0        │
│ Gizmo       ┆ 30.0  ┆ home        ┆ 25.0          ┆ 30.0        │
│ Thingamajig ┆ null  ┆ home        ┆ 35.0          ┆ 35.0        │
└─────────────┴───────┴─────────────┴───────────────┴─────────────┘

coalesce() takes the first non-null value from the list of columns—perfect for fallback chains.

Best Practices and Performance Tips

Use Lazy Evaluation

Chain your null operations in a lazy frame for better performance:

result = (
    pl.scan_csv("large_file.csv")
    .with_columns(
        pl.col("value").fill_nan(None).fill_null(strategy="forward"),
        pl.col("category").fill_null("unknown"),
    )
    .filter(pl.col("id").is_not_null())
    .collect()
)

The query optimizer can combine these operations and avoid materializing intermediate results.

Choose Strategies by Data Type

  • Numeric continuous data: Mean, median, or interpolation
  • Numeric discrete data: Mode or forward/backward fill
  • Categorical data: Mode, “Unknown” literal, or domain-specific defaults
  • Time series: Forward fill or interpolation
  • IDs and keys: Don’t fill—drop or investigate why they’re missing

Versus Pandas

If you’re coming from pandas, here’s what changes:

Pandas                Polars
isna() / isnull()     is_null()
notna() / notnull()   is_not_null()
fillna()              fill_null() (and fill_nan() separately)
dropna()              drop_nulls()
interpolate()         interpolate()

The key difference: Polars forces you to think about null versus NaN explicitly. This catches bugs that pandas silently propagates.

Missing data handling isn’t glamorous, but it’s where data quality lives or dies. Polars gives you the tools—use them deliberately, document your choices, and your downstream analysis will thank you.
