PySpark - Filter Rows with NULL Values
Key Insights
• PySpark provides isNull() and isNotNull() methods for filtering NULL values, which are more reliable than Python’s None comparisons in distributed environments
• The dropna() method offers a convenient alternative for removing NULL values with fine-grained control over thresholds and column subsets
• NULL filtering should be applied early in your transformation pipeline, with attention to partition distribution, to optimize performance in large-scale data processing
Introduction
NULL values are inevitable in real-world data. Whether you’re dealing with incomplete sensor readings, optional user profile fields, or failed data joins, knowing how to filter NULL values efficiently in PySpark is essential for data cleaning and preparation. Unlike pandas or standard Python, PySpark operates in a distributed environment where NULL handling must account for data partitioning, lazy evaluation, and cluster resources.
The challenge with NULL values in distributed computing isn’t just identifying them—it’s doing so efficiently across potentially billions of rows spread across multiple worker nodes. A poorly constructed NULL filter can trigger unnecessary shuffles or create data skew that cripples your job’s performance.
Let’s start with a sample DataFrame to demonstrate various NULL filtering techniques:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
spark = SparkSession.builder.appName("NullFiltering").getOrCreate()
# Create sample data with NULL values
data = [
    (1, "Alice", 29, "Engineering"),
    (2, "Bob", None, "Marketing"),
    (3, None, 35, "Sales"),
    (4, "David", 42, None),
    (5, None, None, None),
    (6, "Frank", 31, "Engineering")
]
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("department", StringType(), True)
])
df = spark.createDataFrame(data, schema)
df.show()
This creates a DataFrame with strategic NULL placements that we’ll use throughout our examples.
Filter Rows WHERE Column IS NULL
To identify rows containing NULL values in specific columns, use the isNull() method combined with the col() function. This approach is explicit, readable, and works consistently across all PySpark data types.
# Filter rows where age is NULL
null_ages = df.filter(col("age").isNull())
null_ages.show()
# Output:
# +---+-----+----+----------+
# | id| name| age|department|
# +---+-----+----+----------+
# | 2| Bob|null| Marketing|
# | 5| null|null| null|
# +---+-----+----+----------+
The where() method is an alias for filter(); the two are interchangeable, so use whichever reads more naturally to you (where() tends to feel familiar if you come from SQL):
# Equivalent using where()
null_ages = df.where(col("age").isNull())
For filtering rows with NULL values in multiple columns using OR logic, combine conditions with the | operator:
# Filter rows where name OR department is NULL
null_name_or_dept = df.filter(
    col("name").isNull() | col("department").isNull()
)
null_name_or_dept.show()
# Output:
# +---+-----+----+----------+
# | id| name| age|department|
# +---+-----+----+----------+
# | 3| null| 35| Sales|
# | 4|David| 42| null|
# | 5| null|null| null|
# +---+-----+----+----------+
You can extend this pattern to any number of columns. Just remember that each additional condition increases the complexity of the filter predicate that Catalyst (Spark's query optimizer) must process.
Filter Rows WHERE Column IS NOT NULL
More commonly, you’ll want to keep only valid rows by filtering out NULLs. The isNotNull() method handles this cleanly:
# Keep only rows with non-NULL age values
valid_ages = df.filter(col("age").isNotNull())
valid_ages.show()
# Output:
# +---+-----+---+------------+
# | id| name|age| department|
# +---+-----+---+------------+
# | 1|Alice| 29| Engineering|
# | 3| null| 35| Sales|
# | 4|David| 42| null|
# | 6|Frank| 31| Engineering|
# +---+-----+---+------------+
For stricter data quality requirements where multiple columns must all be non-NULL, chain conditions with the & operator:
# Keep only rows where BOTH name AND age are non-NULL
complete_records = df.filter(
    col("name").isNotNull() & col("age").isNotNull()
)
complete_records.show()
# Output:
# +---+-----+---+------------+
# | id| name|age| department|
# +---+-----+---+------------+
# | 1|Alice| 29| Engineering|
# | 4|David| 42| null|
# | 6|Frank| 31| Engineering|
# +---+-----+---+------------+
The dropna() method provides a more concise alternative for removing rows with NULL values:
# Drop rows with ANY NULL values
df.dropna().show()
# Equivalent to checking all columns with isNotNull()
# Only rows 1 and 6 survive this filter
Advanced NULL Filtering Techniques
The dropna() method offers powerful parameters for sophisticated NULL filtering strategies. The how parameter controls whether to drop rows with ANY NULL or ALL NULL values:
# Drop rows where ALL values are NULL (none in our dataset)
df.dropna(how='all').show()
# Drop rows where ANY value is NULL
df.dropna(how='any').show()
# Drop rows with NULL in specific columns only
df.dropna(subset=['name', 'age']).show()
The subset parameter is particularly useful when you only care about NULL values in specific columns:
# Keep rows where name and age are non-NULL, ignore department
critical_fields = df.dropna(subset=['name', 'age'])
critical_fields.show()
# Output:
# +---+-----+---+------------+
# | id| name|age| department|
# +---+-----+---+------------+
# | 1|Alice| 29| Engineering|
# | 4|David| 42| null|
# | 6|Frank| 31| Engineering|
# +---+-----+---+------------+
For threshold-based filtering, use the thresh parameter to specify the minimum number of non-NULL values required:
# Require all three subset columns to be non-NULL (thresh equals the subset size)
df.dropna(thresh=3, subset=['name', 'age', 'department']).show()
You can also build complex conditions by combining multiple NULL checks:
# Find rows where exactly one of name or age is NULL
one_null = df.filter(
    (col("name").isNull() & col("age").isNotNull()) |
    (col("name").isNotNull() & col("age").isNull())
)
one_null.show()
Handling NULL Values in Different Data Types
NULL behavior varies across data types, and it’s crucial to distinguish between NULL and empty values, especially for strings:
# Create DataFrame with NULL vs empty strings
string_data = [
    (1, "Alice", ""),
    (2, None, "text"),
    (3, "", None)
]
df_strings = spark.createDataFrame(
    string_data,
    ["id", "col1", "col2"]
)
# NULL and empty string are different
df_strings.filter(col("col1").isNull()).show() # Only row 2
df_strings.filter(col("col1") == "").show()  # Only row 3
This distinction matters because NULL represents missing data while an empty string is a valid value. Always be explicit about which you’re filtering.
For complex types like arrays and structs, isNull() checks if the entire structure is NULL, not individual elements:
from pyspark.sql import Row
# Create DataFrame with complex types (Row objects become structs)
complex_data = [
    (1, ["a", "b"], Row(name="Alice", age=29)),
    (2, None, Row(name="Bob", age=None)),
    (3, ["c"], None)
]
df_complex = spark.createDataFrame(
    complex_data,
    ["id", "tags", "person"]
)
# Check if entire array is NULL
df_complex.filter(col("tags").isNull()).show()
# Check if struct is NULL (not if fields within are NULL)
df_complex.filter(col("person").isNull()).show()
# To check nested fields, access them directly
df_complex.filter(col("person.age").isNull()).show()
Performance Best Practices
NULL filtering performance matters at scale. Here are strategies to optimize your NULL filtering operations:
Filter Early: Apply NULL filters as early as possible in your transformation pipeline to reduce data volume for subsequent operations:
# Good: Filter NULLs immediately after reading
df = spark.read.parquet("large_dataset.parquet") \
    .filter(col("critical_field").isNotNull())
# Then continue transformations on smaller dataset
result = df.groupBy("category").count()
Partition Awareness: NULL filtering can create partition skew if NULLs aren’t evenly distributed. Consider repartitioning after aggressive NULL filtering:
# If NULL filtering significantly reduces data size
filtered_df = df.filter(col("age").isNotNull())
# Check partition distribution
print(f"Partitions before: {df.rdd.getNumPartitions()}")
print(f"Partitions after: {filtered_df.rdd.getNumPartitions()}")
# Repartition if needed to avoid small-partition overhead.
# Note: count() triggers a full job, and this check runs two of them;
# in production, prefer size estimates you already have.
if filtered_df.count() < df.count() * 0.5:
    filtered_df = filtered_df.coalesce(df.rdd.getNumPartitions() // 2)
Avoid Repeated NULL Checks: If you’re filtering on the same NULL condition multiple times, cache the result (and call unpersist() once you no longer need it, to release executor memory):
# Cache filtered DataFrame if used multiple times
clean_df = df.filter(col("age").isNotNull()).cache()
# Now use clean_df in multiple operations without re-filtering
result1 = clean_df.groupBy("department").avg("age")
result2 = clean_df.filter(col("age") > 30).count()
Use Column Pruning: Only select columns you need before NULL filtering to reduce memory pressure:
# Select required columns first
df.select("id", "name", "age") \
    .filter(col("age").isNotNull()) \
    .show()
NULL filtering is foundational to data quality in PySpark. Master these techniques, understand the performance implications, and you’ll handle missing data efficiently even at petabyte scale. The key is being explicit about your NULL handling strategy, filtering early, and monitoring partition distribution to avoid performance pitfalls.