Spark Scala - Handle NULL Values

Key Insights

  • NULL values in Spark propagate through expressions and can silently corrupt your results—understanding this behavior is essential for writing correct data pipelines.
  • The DataFrameNaFunctions API (na.drop() and na.fill()) provides the most concise way to handle NULLs, but coalesce and when/otherwise offer finer control for complex business logic.
  • NULL-safe comparisons using the <=> operator prevent subtle bugs in joins and filters where standard equality treats NULL as unknown rather than a matchable value.

Introduction to NULL Handling in Spark

NULL values are the bane of distributed data processing. They represent missing, unknown, or inapplicable data—and Spark treats them with SQL semantics, meaning NULL propagates through most expressions. Add two numbers where one is NULL? You get NULL. Compare NULL to NULL with ===? You get NULL, not true.
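A quick demonstration of this propagation (a minimal sketch; like the examples below, it assumes an active SparkSession named spark):

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

val nums = Seq((Some(1), Some(2)), (Some(3), None)).toDF("a", "b")

// Arithmetic: a single NULL operand makes the whole expression NULL
nums.withColumn("sum", col("a") + col("b")).show()
// The (3, NULL) row gets sum = NULL

// Equality: comparing anything to NULL yields NULL, so the filter matches nothing
nums.filter(col("b") === lit(null)).count()  // 0
```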

Spark stores NULLs efficiently in its internal Tungsten format using null bitmaps, so there’s no significant storage penalty. But the computational cost of mishandling NULLs is high: incorrect aggregations, failed joins, and silent data corruption that surfaces only in production.

This article covers the practical techniques you need to detect, remove, replace, and safely compare NULL values in Spark Scala applications.

Detecting NULL Values

Before you can handle NULLs, you need to find them. Spark provides three primary functions for NULL detection in DataFrame columns.

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, Some("Alice"), Some(30)),
  (2, None, Some(25)),
  (3, Some("Charlie"), None),
  (4, None, None)
).toDF("id", "name", "age")

// Find rows where name is NULL
val nullNames = df.filter(col("name").isNull)

// Find rows where name is NOT NULL
val validNames = df.filter(col("name").isNotNull)

// Count NULLs per column
df.select(
  count(when(col("name").isNull, 1)).as("null_names"),
  count(when(col("age").isNull, 1)).as("null_ages")
).show()

The isNaN method on Column handles a different case—it detects IEEE 754 NaN (Not a Number) values in floating-point columns, which are distinct from NULL:

val dfWithNaN = Seq(
  (1, Some(10.0)),
  (2, Some(Double.NaN)),
  (3, None)
).toDF("id", "value")

// NaN and NULL are different
dfWithNaN.filter(col("value").isNaN).show()  // Row 2
dfWithNaN.filter(col("value").isNull).show() // Row 3

For comprehensive NULL auditing across an entire DataFrame, iterate over columns:

val nullCounts = df.columns.map { colName =>
  df.filter(col(colName).isNull).count()
}

df.columns.zip(nullCounts).foreach { case (name, count) =>
  println(s"$name: $count NULLs")
}
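The loop above is simple but launches one Spark job per column. For wide tables, the same audit can run as a single aggregation (a sketch using only the functions shown earlier, against the same df):

```scala
// Build one count(when(isNull)) expression per column, then run a single job
val auditExprs = df.columns.map { c =>
  count(when(col(c).isNull, 1)).as(s"${c}_nulls")
}

df.select(auditExprs: _*).show()
// One row with a NULL count for every column of df
```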

Removing NULL Values with na.drop()

When NULLs indicate invalid or unusable records, dropping them is often the cleanest solution. The DataFrameNaFunctions API, accessed via df.na, provides flexible row removal.

val df = Seq(
  (1, Some("Alice"), Some(30)),
  (2, None, Some(25)),
  (3, Some("Charlie"), None),
  (4, None, None)
).toDF("id", "name", "age")

// Drop rows with ANY NULL values (default behavior)
val noNulls = df.na.drop()
// Result: Only row 1 (Alice, 30)

// Equivalent explicit call
val noNullsExplicit = df.na.drop("any")

// Drop rows only if ALL specified columns are NULL
val keepPartial = df.na.drop("all", Seq("name", "age"))
// Result: Rows 1, 2, 3 (row 4 dropped because both name AND age are NULL)
// Note: df.na.drop("all") alone checks every column, so row 4 would
// survive -- its non-NULL id keeps it in the result

Column-specific dropping gives you precise control:

// Drop rows where name OR age is NULL
val cleanedSubset = df.na.drop(Seq("name", "age"))

// Drop rows where name is NULL (age can be NULL)
val nameRequired = df.na.drop(Seq("name"))

// Threshold-based: keep rows with at least 2 non-NULL values
val minValidCols = df.na.drop(2)

The threshold parameter is particularly useful for wide tables where you want to keep rows that have “enough” valid data without specifying every column.
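A sketch of that wide-table case (hypothetical columns c1 to c5; the explicit Seq type annotation is needed because columns c4 and c5 are NULL in every row, so toDF cannot infer their types):

```scala
// Hypothetical wide table: keep rows with at least 3 non-NULL values
val wide = Seq[(Option[Int], Option[String], Option[Double], Option[String], Option[String])](
  (Some(1), Some("a"), Some(2.0), None, None),
  (Some(2), None, None, None, None)
).toDF("c1", "c2", "c3", "c4", "c5")

val enoughData = wide.na.drop(3)
// Row 1 survives (3 non-NULL values); row 2 is dropped (only 1)
```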

Filling NULL Values with na.fill()

Replacing NULLs with default values is often preferable to dropping rows, especially when you need to preserve record counts for downstream aggregations.

val df = Seq(
  (1, Some("Alice"), Some(30), Some(50000.0)),
  (2, None, Some(25), None),
  (3, Some("Charlie"), None, Some(60000.0))
).toDF("id", "name", "age", "salary")

// Fill all string columns with "Unknown"
val filledStrings = df.na.fill("Unknown")

// Fill all numeric columns with 0
val filledNumerics = df.na.fill(0)

// Chain type-specific fills to cover both string and numeric columns
val filledMixed = df.na.fill("Unknown").na.fill(0)

For production code, column-specific fills with a Map provide explicit control:

// Different defaults for different columns
val filledSpecific = df.na.fill(Map(
  "name" -> "Unknown",
  "age" -> 0,
  "salary" -> 0.0
))

// Fill only specific columns, leave others as NULL
val partialFill = df.na.fill(Map("name" -> "N/A"))

One gotcha: na.fill() is type-sensitive. If a fill value's type doesn't match the column's type, Spark silently skips that column: no error, no fill. Always match types:

// This works
df.na.fill(Map("age" -> 0))

// This silently fails (string value for integer column)
df.na.fill(Map("age" -> "0"))  // age remains NULL

Using coalesce and when/otherwise for Conditional Replacement

When replacement logic depends on other columns or requires conditional evaluation, coalesce and when/otherwise provide the flexibility you need.

coalesce returns the first non-NULL value from a list of columns or expressions:

val df = Seq(
  (1, Some("alice@work.com"), Some("alice@personal.com")),
  (2, None, Some("bob@personal.com")),
  (3, None, None)
).toDF("id", "work_email", "personal_email")

// Use work email, fall back to personal, then to a default
val withEmail = df.withColumn(
  "contact_email",
  coalesce(col("work_email"), col("personal_email"), lit("no-email@example.com"))
)

For complex conditional logic, when/otherwise is more expressive:

val df = Seq(
  (1, Some("Alice"), Some(30)),
  (2, None, Some(25)),
  (3, Some("Charlie"), None)
).toDF("id", "name", "age")

// Replace NULL age based on name presence
val withDefaultAge = df.withColumn(
  "age_filled",
  when(col("age").isNull && col("name").isNotNull, lit(18))
    .when(col("age").isNull && col("name").isNull, lit(0))
    .otherwise(col("age"))
)

// Categorize NULL handling
val withCategory = df.withColumn(
  "data_quality",
  when(col("name").isNull && col("age").isNull, "poor")
    .when(col("name").isNull || col("age").isNull, "partial")
    .otherwise("complete")
)

Chaining multiple when clauses is cleaner than nested if logic and executes efficiently in Spark’s query optimizer.

NULL-Safe Comparisons and Joins

Standard equality in Spark follows SQL semantics: NULL === NULL evaluates to NULL (unknown), not true. This breaks joins and filters in subtle ways.

val df1 = Seq((1, Some("A")), (2, None)).toDF("id", "key")
val df2 = Seq((Some("A"), "value1"), (None, "value2")).toDF("key", "value")

// Standard join misses NULL matches
val standardJoin = df1.join(df2, df1("key") === df2("key"))
// Result: Only the "A" row matches

// NULL-safe join includes NULL === NULL
val nullSafeJoin = df1.join(df2, df1("key") <=> df2("key"))
// Result: Both rows match (A=A and NULL=NULL)

The <=> operator (null-safe equality) treats NULL as a concrete value that equals other NULLs. Use it when NULL represents a meaningful category rather than missing data.

For filtering, the same principle applies:

val df = Seq(
  (1, Some("A")),
  (2, None),
  (3, Some("A"))
).toDF("id", "status")

// This misses NULL rows when comparing
val filterValue: Option[String] = None
val standard = df.filter(col("status") === filterValue.orNull)
// Result: empty (any equality comparison with NULL yields NULL, which filter treats as false)

// NULL-safe comparison
val nullSafe = df.filter(col("status") <=> filterValue.orNull)
// Result: Row 2 (NULL <=> NULL is true)

Best Practices and Performance Considerations

Filter NULLs early. If you know certain NULLs make records invalid for your use case, drop them immediately after reading. This reduces data volume for all downstream operations:

val cleanDf = spark.read.parquet("data.parquet")
  .na.drop(Seq("required_column_1", "required_column_2"))
  .cache()  // Cache the cleaned version

Design schemas with NULL semantics in mind. Use nullable = false in your schema when columns genuinely cannot be NULL. Spark can optimize better when it knows NULLs are impossible:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("created_at", TimestampType, nullable = false)
))

Beware NULL propagation in aggregations. Most aggregate functions ignore NULLs (sum, avg, count(column)), but count(*) counts all rows. This discrepancy causes bugs:

// These return different results when NULLs exist
df.select(count("*"), count("nullable_column")).show()
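A concrete illustration of that discrepancy (a small sketch with one NULL in amount):

```scala
val sales = Seq(
  (1, Some(10.0)),
  (2, None),
  (3, Some(30.0))
).toDF("id", "amount")

sales.select(
  count(lit(1)).as("all_rows"),        // 3 -- counts every row, like count(*)
  count(col("amount")).as("non_null"), // 2 -- NULL rows skipped
  sum("amount").as("total"),           // 40.0 -- NULL ignored, not treated as 0
  avg("amount").as("mean")             // 20.0 -- divided by 2, not 3
).show()
```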

Use na.fill() before arithmetic operations. NULL in any operand makes the entire expression NULL:

// Bad: NULL salary makes total_comp NULL
df.withColumn("total_comp", col("salary") + col("bonus"))

// Good: Handle NULLs first
df.na.fill(Map("salary" -> 0.0, "bonus" -> 0.0))
  .withColumn("total_comp", col("salary") + col("bonus"))

Document your NULL handling strategy. Whether you treat NULLs as zeros, unknowns, or errors affects business logic. Make this explicit in code comments and data dictionaries.

NULL handling isn’t glamorous, but it’s fundamental to correct data processing. Master these techniques, and you’ll eliminate an entire class of subtle bugs from your Spark applications.
