PySpark - Replace NULL Values (fillna/na.fill)
Key Insights
- fillna() and na.fill() are interchangeable methods in PySpark for replacing NULL values, with fillna() offering dictionary-based column-specific replacements that execute in a single pass for better performance
- NULL handling must happen before aggregations and joins to prevent silent data loss, as PySpark treats NULLs differently than pandas (NULLs propagate through most operations rather than being ignored)
- Type-aware replacement strategies prevent subtle bugs: use conditional logic with when() and otherwise() for sophisticated NULL handling that considers business context rather than blanket replacements
Introduction to NULL Handling in PySpark
NULL values in distributed DataFrames represent missing or undefined data, and they behave differently in PySpark than in pandas. In PySpark, NULLs propagate through most operations: adding a number to NULL yields NULL, comparing NULL to anything returns NULL, and aggregations can produce unexpected results if you don’t handle them explicitly.
The two primary methods for replacing NULLs are fillna() and na.fill(). These are aliases—they execute identical code paths. Most developers prefer fillna() because it’s familiar from pandas, but na.fill() follows PySpark’s DataFrameNaFunctions pattern. Both methods are optimized for distributed processing and execute replacement in a single pass when possible.
Here’s a DataFrame with NULLs across different column types:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
spark = SparkSession.builder.appName("NullHandling").getOrCreate()
schema = StructType([
StructField("user_id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("age", IntegerType(), True),
StructField("salary", DoubleType(), True),
StructField("department", StringType(), True)
])
data = [
(1, "Alice", 30, 75000.0, "Engineering"),
(2, "Bob", None, 65000.0, None),
(3, None, 25, None, "Marketing"),
(4, "David", None, None, "Sales"),
(5, "Eve", 35, 80000.0, None)
]
df = spark.createDataFrame(data, schema)
df.show()
This produces a DataFrame where NULLs appear across numeric and string columns—a realistic scenario in production data pipelines.
Basic fillna() Operations
The simplest fillna() operation replaces all NULLs with a single value. PySpark only replaces NULLs in columns where the replacement value’s type matches:
# Fill all NULL values with 0 (only affects numeric columns)
df_filled = df.fillna(0)
df_filled.show()
# Fill all NULL values with "Unknown" (only affects string columns)
df_filled = df.fillna("Unknown")
df_filled.show()
The first example replaces NULLs in age and salary columns but leaves string column NULLs untouched. The second does the opposite.
For column-specific replacement, use the subset parameter:
# Fill NULLs only in the age column
df_filled = df.fillna(0, subset=["age"])
df_filled.show()
# Fill NULLs in multiple specific columns
df_filled = df.fillna("Not Specified", subset=["name", "department"])
df_filled.show()
The most powerful approach uses a dictionary mapping column names to replacement values:
# Fill different columns with different values in one operation
fill_values = {
"age": 0,
"salary": 50000.0,
"name": "Anonymous",
"department": "Unassigned"
}
df_filled = df.fillna(fill_values)
df_filled.show()
This dictionary approach handles every column in a single call, which is both cleaner and typically faster than chaining multiple fillna() calls.
Filling by Data Type
Type-aware filling strategies prevent logical errors. You wouldn’t want to fill a salary column with “Unknown”—it should be a numeric value that makes sense for downstream calculations.
from pyspark.sql.functions import col, mean
# Separate strategies for different types
numeric_fill = {"age": 0, "salary": 0.0}
string_fill = {"name": "Unknown", "department": "Unassigned"}
# Apply both in sequence (or combine dictionaries)
df_filled = df.fillna(numeric_fill).fillna(string_fill)
df_filled.show()
# Or combine dictionaries
all_fills = {**numeric_fill, **string_fill}
df_filled = df.fillna(all_fills)
df_filled.show()
For more sophisticated numeric filling, calculate statistics first:
# Calculate mean salary for intelligent filling
mean_salary = df.select(mean("salary")).collect()[0][0]
# Fill with calculated mean
df_filled = df.fillna({"salary": mean_salary})
df_filled.show()
# Fill age with median (requires percentile approximation)
median_age = df.stat.approxQuantile("age", [0.5], 0.01)[0]
df_filled = df.fillna({"age": int(median_age)})
df_filled.show()
This approach ensures replacement values align with your data’s distribution rather than using arbitrary constants.
Advanced Replacement Strategies
For conditional NULL replacement based on other column values, combine when() and otherwise():
from pyspark.sql.functions import when, col
# Fill salary based on department
df_conditional = df.withColumn(
"salary",
when(col("salary").isNull() & (col("department") == "Engineering"), 70000.0)
.when(col("salary").isNull() & (col("department") == "Marketing"), 60000.0)
.when(col("salary").isNull() & (col("department") == "Sales"), 65000.0)
.when(col("salary").isNull(), 55000.0) # Default for unknown departments
.otherwise(col("salary"))
)
df_conditional.show()
This conditional logic applies business rules: different departments get different default salaries.
You can also use window functions for forward-fill or backward-fill patterns:
from pyspark.sql.window import Window
from pyspark.sql.functions import last, first
# Forward fill within a partition
window_spec = Window.partitionBy("department").orderBy("user_id").rowsBetween(Window.unboundedPreceding, 0)
df_ffill = df.withColumn(
"salary",
last("salary", ignorenulls=True).over(window_spec)
)
df_ffill.show()
Chain operations for complex transformations:
# Multi-step NULL handling pipeline
df_processed = (df
.fillna({"name": "Anonymous"}) # Fill names first
.withColumn("age", when(col("age").isNull(), 30).otherwise(col("age"))) # Conditional age
.fillna({"department": "General"}) # Fill departments
.withColumn("salary",
when(col("salary").isNull(), col("age") * 2000) # Salary based on age
.otherwise(col("salary")))
)
df_processed.show()
Performance Considerations & Best Practices
Performance matters in distributed processing. The fillna() method is optimized for Spark’s execution engine, while custom UDFs introduce serialization overhead.
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
import time
# Create a larger dataset for performance testing
large_df = spark.range(0, 1000000).select(
col("id"),
when(col("id") % 3 == 0, None).otherwise(col("id").cast("double")).alias("value")
)
# Method 1: fillna() - Optimized
start = time.time()
result1 = large_df.fillna({"value": 0.0})
result1.count() # Trigger execution
fillna_time = time.time() - start
# Method 2: UDF - Slower due to serialization
@udf(DoubleType())
def fill_null_udf(value):
return 0.0 if value is None else value
start = time.time()
result2 = large_df.withColumn("value", fill_null_udf(col("value")))
result2.count()
udf_time = time.time() - start
print(f"fillna() time: {fillna_time:.2f}s")
print(f"UDF time: {udf_time:.2f}s")
Best practices for NULL handling:
- Handle NULLs before joins: NULL keys don’t match in joins, causing silent data loss
- Use dictionary-based fillna(): Single-pass execution is faster than multiple calls
- Document replacement logic: Future maintainers need to understand why you chose specific values
- Validate after filling: Check that replacement didn’t introduce logical errors
- Consider keeping NULL indicators: Add boolean columns to track which values were originally NULL
# Add NULL indicator before filling
df_with_indicator = df.withColumn("salary_was_null", col("salary").isNull())
df_filled = df_with_indicator.fillna({"salary": 50000.0})
df_filled.show()
Common Pitfalls and Troubleshooting
The most common pitfall is confusing NULL with NaN. In PySpark these are distinct values: isnull() detects NULL but not NaN, while isnan() detects NaN but not NULL:
from pyspark.sql.functions import isnan, isnull
import math
# Create DataFrame with both NULL and NaN
data_with_nan = [
(1, 10.0),
(2, float('nan')),
(3, None),
(4, 20.0)
]
df_nan = spark.createDataFrame(data_with_nan, ["id", "value"])
# For float/double columns, fillna() treats NaN like NULL and replaces both
df_filled = df_nan.fillna({"value": 0.0})
df_filled.show()
# For separate, explicit control over the two cases, combine isnan() and isnull()
df_fixed = df_nan.withColumn(
    "value",
    when(isnan(col("value")), -1.0)   # distinct sentinel for NaN
    .when(isnull(col("value")), 0.0)  # distinct sentinel for NULL
    .otherwise(col("value"))
)
df_fixed.show()
Numeric type mismatches don't fail loudly; the fill value is cast to the column's type, which can silently truncate. Cross-family mismatches (a string value against a numeric column) are simply ignored:
# The fill value is cast to the column type: 25.5 is truncated to 25 for IntegerType
df_truncated = df.fillna({"age": 25.5})
df_truncated.show()  # age NULLs become 25, not 25.5
# A string fill value never touches a numeric column
df_ignored = df.fillna("Unknown", subset=["age"])
df_ignored.show()  # age NULLs remain
# Be explicit: match the value to the column type
df_correct = df.fillna({"age": 25})
df_correct.show()
Nested structures require special handling:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Nested structure
nested_schema = StructType([
StructField("id", IntegerType(), True),
StructField("details", StructType([
StructField("city", StringType(), True),
StructField("zip", StringType(), True)
]), True)
])
nested_data = [
(1, {"city": "NYC", "zip": "10001"}),
(2, {"city": None, "zip": "20001"}),
(3, None)
]
df_nested = spark.createDataFrame(nested_data, nested_schema)
# fillna() doesn't work on nested fields directly
# Rebuild the struct with withColumn() instead
from pyspark.sql.functions import lit, struct
df_nested_filled = df_nested.withColumn(
    "details",
    when(col("details").isNull(),
         struct(lit("Unknown").alias("city"), lit("00000").alias("zip")))
    .otherwise(col("details"))
)
df_nested_filled.show(truncate=False)
Master these patterns and you’ll handle NULL values confidently in production PySpark pipelines. The key is understanding that NULL handling isn’t just about replacing missing values—it’s about applying business logic that maintains data integrity throughout your transformations.