How to Fill Null Values in PySpark

Key Insights

  • PySpark’s fillna() and na.fill() are interchangeable methods that handle simple static replacements, but complex scenarios require coalesce() or window functions for conditional and sequential fills.
  • Compute aggregate statistics (mean, median) once, in a single pass, and reuse them—every extra .collect() triggers another full evaluation of the DataFrame, which dominates fill time on large datasets.
  • Distinguish between null, NaN, and empty strings early in your pipeline; they require different handling strategies and mixing them up corrupts downstream analytics silently.

Introduction

Null values are inevitable in real-world data pipelines. Whether you’re processing clickstream data, IoT sensor readings, or financial transactions, you’ll encounter missing values that can break aggregations, skew machine learning models, and produce misleading analytics.

In distributed systems like PySpark, null handling becomes more complex. Operations that work fine locally can fail silently across partitions, and inefficient null-filling strategies can trigger expensive shuffles. Understanding your options—and when to use each—separates production-ready pipelines from fragile prototypes.

This guide covers every practical approach to filling null values in PySpark, from simple static replacements to sophisticated window-based fills for time-series data.

Understanding Nulls in PySpark

PySpark distinguishes between three types of “missing” data that developers often conflate:

  • Null/None: True absence of a value. This is what isNull() detects.
  • NaN (Not a Number): A specific floating-point value representing undefined mathematical results. Common after division by zero or invalid casts.
  • Empty strings: Valid string values with zero length. Not null, but often semantically meaningless.

This distinction matters because different methods handle each type differently. fillna() replaces true nulls, and in float and double columns it also replaces NaN—Spark’s na functions treat NaN like null in numeric columns—but it never touches empty strings, which must be handled explicitly.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, isnan, when, count
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

spark = SparkSession.builder.appName("NullHandling").getOrCreate()

# Create DataFrame with various null types
data = [
    ("Alice", 34, 50000.0, "Engineering"),
    ("Bob", None, 60000.0, "Sales"),
    ("Charlie", 28, None, "Engineering"),
    ("Diana", 45, 75000.0, None),
    ("Eve", 31, float("nan"), "Marketing"),
    ("Frank", None, None, ""),
]

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("salary", DoubleType(), True),
    StructField("department", StringType(), True),
])

df = spark.createDataFrame(data, schema)

# Count nulls, NaNs, and empty strings per column
df.select([
    count(when(col("age").isNull(), 1)).alias("age_nulls"),
    count(when(col("salary").isNull(), 1)).alias("salary_nulls"),
    count(when(isnan(col("salary")), 1)).alias("salary_nans"),
    count(when(col("department").isNull(), 1)).alias("dept_nulls"),
    count(when(col("department") == "", 1)).alias("dept_empty"),
]).show()

Output:

+---------+------------+-----------+----------+----------+
|age_nulls|salary_nulls|salary_nans|dept_nulls|dept_empty|
+---------+------------+-----------+----------+----------+
|        2|           1|          1|         1|         1|
+---------+------------+-----------+----------+----------+

Filter for rows with any null values using this pattern:

# Find rows with nulls in specific columns
df.filter(col("age").isNull() | col("salary").isNull()).show()

# Find rows with nulls in any column
from functools import reduce
from operator import or_

null_condition = reduce(or_, [col(c).isNull() for c in df.columns])
df.filter(null_condition).show()

Using fillna() / na.fill()

The fillna() method (aliased as na.fill()) is your primary tool for static null replacement. Both methods are identical—use whichever reads better in your codebase.

# Fill all numeric columns with 0
df_filled = df.fillna(0)

# Fill all string columns with "Unknown"
df_filled = df.fillna("Unknown")

# These are equivalent
df.fillna(0)
df.na.fill(0)

For column-specific fills, pass a dictionary mapping column names to replacement values:

# Column-specific fills with dictionary
fill_values = {
    "age": 0,
    "salary": 0.0,
    "department": "Unknown"
}

df_filled = df.fillna(fill_values)
df_filled.show()

Output:

+-------+---+-------+-----------+
|   name|age| salary| department|
+-------+---+-------+-----------+
|  Alice| 34|50000.0|Engineering|
|    Bob|  0|60000.0|      Sales|
|Charlie| 28|    0.0|Engineering|
|  Diana| 45|75000.0|    Unknown|
|    Eve| 31|    0.0|  Marketing|
|  Frank|  0|    0.0|           |
+-------+---+-------+-----------+

Notice that Eve’s NaN salary was replaced along with the true nulls—na.fill treats NaN like null in float and double columns—while Frank’s empty department string remains empty. fillna() never rewrites valid strings, so empty strings need explicit handling with a when() expression.

Type matching is strict: you cannot fill an integer column with a string value. PySpark silently ignores mismatched types rather than throwing an error, which can lead to confusing debugging sessions.

Filling with Aggregates (Mean, Median, Mode)

Static fills work for categorical data, but numeric columns often require statistical imputation. The key is computing the aggregate once and reusing it.

from pyspark.sql.functions import mean, col

# Calculate mean salary (excluding nulls and NaNs)
mean_salary = df.filter(~isnan(col("salary")) & col("salary").isNotNull()) \
    .agg(mean("salary")).collect()[0][0]

print(f"Mean salary: {mean_salary}")  # 61666.67

# Apply the computed mean
df_filled = df.fillna({"salary": mean_salary})

For reusability, wrap this pattern in a function:

def fill_with_mean(df, columns):
    """Fill null values with column means."""
    fill_values = {}
    for column in columns:
        avg_value = df.filter(col(column).isNotNull() & ~isnan(col(column))) \
            .agg(mean(column)).collect()[0][0]
        if avg_value is not None:
            fill_values[column] = avg_value
    return df.fillna(fill_values)

# Usage
numeric_cols = ["age", "salary"]
df_filled = fill_with_mean(df, numeric_cols)

For the median, use approxQuantile()—computing an exact median requires a full sort, which is expensive at scale:

def fill_with_median(df, columns, relative_error=0.01):
    """Fill null values with approximate column medians."""
    fill_values = {}
    for column in columns:
        # approxQuantile ignores nulls and returns an empty list
        # when a column contains nothing but nulls
        quantiles = df.approxQuantile(column, [0.5], relative_error)
        if quantiles:
            fill_values[column] = quantiles[0]
    return df.fillna(fill_values)

The relative_error parameter controls accuracy vs. performance. For most use cases, 0.01 (1% error) is acceptable and significantly faster than exact computation.

Using coalesce() for Conditional Fills

When you need to fill nulls from other columns rather than static values, coalesce() returns the first non-null value from a list of columns:

from pyspark.sql.functions import coalesce, lit

# Sample data with primary and backup contact info
contact_data = [
    ("Alice", "alice@work.com", "alice@personal.com"),
    ("Bob", None, "bob@personal.com"),
    ("Charlie", None, None),
]

contacts = spark.createDataFrame(
    contact_data, 
    ["name", "work_email", "personal_email"]
)

# Create fallback chain: work -> personal -> default
contacts_filled = contacts.withColumn(
    "contact_email",
    coalesce(
        col("work_email"),
        col("personal_email"),
        lit("no-email@placeholder.com")
    )
)

contacts_filled.show(truncate=False)

Output:

+-------+--------------+------------------+------------------------+
|name   |work_email    |personal_email    |contact_email           |
+-------+--------------+------------------+------------------------+
|Alice  |alice@work.com|alice@personal.com|alice@work.com          |
|Bob    |null          |bob@personal.com  |bob@personal.com        |
|Charlie|null          |null              |no-email@placeholder.com|
+-------+--------------+------------------+------------------------+

This pattern is invaluable for data consolidation scenarios where you’re merging multiple sources with overlapping but incomplete coverage.

Forward/Backward Fill with Window Functions

Time-series data often requires propagating the last known value forward (or the next known value backward). PySpark handles this with window functions:

from pyspark.sql.window import Window
from pyspark.sql.functions import last, first

# Sensor data with gaps
sensor_data = [
    ("sensor_1", "2024-01-01 10:00", 23.5),
    ("sensor_1", "2024-01-01 10:05", None),
    ("sensor_1", "2024-01-01 10:10", None),
    ("sensor_1", "2024-01-01 10:15", 24.1),
    ("sensor_1", "2024-01-01 10:20", None),
]

sensors = spark.createDataFrame(
    sensor_data, 
    ["sensor_id", "timestamp", "temperature"]
)

# Forward fill: carry last known value forward
window_spec = Window.partitionBy("sensor_id") \
    .orderBy("timestamp") \
    .rowsBetween(Window.unboundedPreceding, 0)

sensors_ffill = sensors.withColumn(
    "temp_filled",
    last("temperature", ignorenulls=True).over(window_spec)
)

sensors_ffill.show()

Output:

+---------+----------------+-----------+-----------+
|sensor_id|       timestamp|temperature|temp_filled|
+---------+----------------+-----------+-----------+
| sensor_1|2024-01-01 10:00|       23.5|       23.5|
| sensor_1|2024-01-01 10:05|       null|       23.5|
| sensor_1|2024-01-01 10:10|       null|       23.5|
| sensor_1|2024-01-01 10:15|       24.1|       24.1|
| sensor_1|2024-01-01 10:20|       null|       24.1|
+---------+----------------+-----------+-----------+

The ignorenulls=True parameter is critical—without it, last() would return null whenever it encounters one. For backward fill, use first() with a window bounded from the current row to the end.

Best Practices & Performance Tips

Know when to drop instead of fill. If more than 30-40% of a column’s values are null, filling may introduce more noise than signal. Consider dropping the column entirely or using it as a binary “has_value” indicator.

Compute aggregates once. Every call to .collect() triggers a full DataFrame evaluation. Compute all fill values in a single pass:

# Inefficient: multiple passes
mean_age = df.agg(mean("age")).collect()[0][0]
mean_salary = df.agg(mean("salary")).collect()[0][0]

# Efficient: single pass
stats = df.agg(
    mean("age").alias("mean_age"),
    mean("salary").alias("mean_salary")
).collect()[0]

fill_values = {
    "age": stats["mean_age"],
    "salary": stats["mean_salary"]
}

Cache before multiple fill operations. If you’re filling nulls differently for different downstream consumers, cache the base DataFrame to avoid recomputation:

df_clean = df.na.drop(subset=["critical_column"]).cache()
df_for_ml = fill_with_mean(df_clean, ["feature1", "feature2"])
df_for_reporting = df_clean.fillna({"feature1": 0, "feature2": 0})

Handle NaN explicitly. fillna() already covers NaN in float and double columns, but most other operations do not: aggregates like avg() return NaN if any remaining value is NaN, even though they skip nulls. Converting NaN to null early gives the whole pipeline one consistent notion of "missing":

from pyspark.sql.functions import col, isnan, when

df_clean = df.withColumn(
    "salary",
    when(isnan(col("salary")), None).otherwise(col("salary"))
).fillna({"salary": 0.0})

Null handling isn’t glamorous, but getting it right determines whether your pipeline produces trustworthy results or subtle data quality issues that surface months later. Choose your fill strategy based on the data’s semantics, not just convenience.