Null Handling in PySpark vs Pandas vs SQL

Key Insights

  • Pandas treats NaN as a float value that propagates through operations, while PySpark and SQL treat NULL as a distinct “unknown” state with three-valued logic—this fundamental difference causes subtle bugs when migrating code between platforms.
  • Null behavior in joins is a common source of data loss: SQL and PySpark exclude null keys from inner joins entirely, while Pandas will match NaN to NaN by default, potentially creating unexpected cartesian products.
  • Always explicitly handle nulls before aggregations and comparisons rather than relying on default behavior—each platform’s defaults differ enough that implicit handling leads to inconsistent results across your data stack.

The Null Problem

Missing data is inevitable. Sensors fail, users skip form fields, and upstream systems send incomplete records. How you handle these gaps determines whether your pipeline produces reliable results or silently corrupts downstream analytics.

The challenge compounds when you work across multiple platforms. A null-handling strategy that works perfectly in Pandas will behave differently in PySpark, and SQL has its own set of rules entirely. Understanding these differences isn’t academic—it’s essential for anyone building data pipelines that span multiple technologies.

Let’s clarify terminology first. Null represents the absence of a value—an unknown. NaN (Not a Number) is a specific IEEE 754 floating-point value indicating an undefined mathematical result. None is Python’s singleton representing “no value.” In practice, these concepts blur together, and each platform treats them differently.
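A minimal sketch of the three in plain Python, NumPy, and pandas makes the distinction concrete:

```python
import numpy as np
import pandas as pd

# None is Python's singleton: equal to itself by identity
assert None is None

# NaN is an IEEE 754 float that is never equal to anything, itself included
assert np.nan != np.nan
assert isinstance(np.nan, float)

# pd.NA propagates through comparisons instead of answering them
result = pd.NA == 1
print(result)  # <NA> - the comparison itself is "unknown"

# pandas' isna() recognizes all three as missing
print(pd.isna(None), pd.isna(np.nan), pd.isna(pd.NA))  # True True True
```

Note that `pd.NA == 1` already behaves like SQL's three-valued logic, while `np.nan != np.nan` is pure IEEE 754.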

Null Representation Across Platforms

Each platform has its own internal model for missing data, and these differences affect everything from memory usage to comparison semantics.

Pandas has historically used np.nan (a float) for missing values, which forces integer columns to become floats when they contain nulls. The newer pd.NA scalar provides a more consistent missing value indicator that works across types, but adoption is still incomplete.

PySpark uses a true null type at the Spark SQL level. Nulls don’t force type coercion, and a nullable integer column remains an integer.

SQL treats NULL as a special marker meaning “unknown,” following three-valued logic where comparisons with NULL return NULL rather than true or false.

# Pandas - multiple representations of missing data
import pandas as pd
import numpy as np

df_pandas = pd.DataFrame({
    'int_col': [1, None, 3],           # Becomes float64 due to None
    'float_col': [1.0, np.nan, 3.0],   # NaN is native to floats
    'str_col': ['a', None, 'c'],       # None stays as None
    'nullable_int': pd.array([1, pd.NA, 3], dtype='Int64')  # Nullable integer
})

print(df_pandas.dtypes)
# int_col          float64
# float_col        float64
# str_col           object
# nullable_int       Int64

# PySpark - consistent null handling
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

df_spark = spark.createDataFrame([
    (1, 'a'),
    (None, None),
    (3, 'c')
], ['int_col', 'str_col'])

df_spark.printSchema()
# root
#  |-- int_col: long (nullable = true)
#  |-- str_col: string (nullable = true)

-- SQL - NULL is a first-class concept
CREATE TABLE example (
    int_col INTEGER,      -- Nullable by default
    str_col VARCHAR(10),
    required_col INTEGER NOT NULL  -- Explicitly non-nullable
);

INSERT INTO example (int_col, str_col, required_col) 
VALUES (1, 'a', 1), (NULL, NULL, 2), (3, 'c', 3);

Detecting Null Values

Finding nulls seems straightforward until you realize that standard equality checks don’t work the way you expect.

# Pandas null detection
df = pd.DataFrame({'value': [1, np.nan, None, 3]})

# These are equivalent
df['value'].isna()
df['value'].isnull()

# Count nulls per column
df.isna().sum()

# Filter rows with any null
df[df.isna().any(axis=1)]

# Filter rows where specific column is null
df[df['value'].isna()]

# PySpark null detection
from pyspark.sql.functions import col, isnan, when, count

# Check for nulls
df_spark.filter(col('int_col').isNull())
df_spark.filter(col('int_col').isNotNull())

# Count nulls per column - note: isnan() is for NaN, not null
df_spark.select([
    count(when(col(c).isNull(), c)).alias(c) 
    for c in df_spark.columns
]).show()

# For float columns, you might need both
df_spark.filter(col('float_col').isNull() | isnan(col('float_col')))

-- SQL null detection
SELECT * FROM example WHERE int_col IS NULL;
SELECT * FROM example WHERE int_col IS NOT NULL;

-- This returns NO rows, even for null values!
SELECT * FROM example WHERE int_col = NULL;  -- Wrong!

-- Count nulls per column
SELECT 
    COUNT(*) - COUNT(int_col) AS int_col_nulls,
    COUNT(*) - COUNT(str_col) AS str_col_nulls
FROM example;

The SQL example highlights a critical gotcha: NULL = NULL evaluates to NULL, not TRUE. This three-valued logic trips up even experienced developers.
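The same pitfall exists in pandas through NaN, which is never equal to itself. A sketch contrasting plain element-wise equality with a null-safe comparison:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, np.nan, 3.0])
s2 = pd.Series([1.0, np.nan, 4.0])

# Element-wise == behaves like SQL's NULL: NaN never equals NaN
print((s1 == s2).tolist())  # [True, False, False]

# A null-safe comparison: equal values OR missing in both positions
null_safe = (s1 == s2) | (s1.isna() & s2.isna())
print(null_safe.tolist())   # [True, True, False]

# Series.equals() is the whole-object null-safe check
assert s1.equals(pd.Series([1.0, np.nan, 3.0]))
```

PySpark's `eqNullSafe()` and SQL's `IS NOT DISTINCT FROM` play the same role as the null-safe expression above.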

Null Behavior in Operations

Nulls propagate through operations differently across platforms, and these differences cause real bugs.

# Pandas - NaN propagates in arithmetic, but aggregations skip it
df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [4, 5, 6]})

df['sum'] = df['a'] + df['b']  # Row 2 becomes NaN
df['a'].sum()   # Returns 4.0 (skips NaN)
df['a'].mean()  # Returns 2.0 (skips NaN)

# String concatenation
df_str = pd.DataFrame({'first': ['John', None], 'last': ['Doe', 'Smith']})
df_str['first'] + ' ' + df_str['last']  # Second row is NaN

# PySpark - null propagates consistently
from pyspark.sql.functions import col, sum as spark_sum, mean, concat, lit

df = spark.createDataFrame([(1, 4), (None, 5), (3, 6)], ['a', 'b'])

# Arithmetic with nulls
df.withColumn('sum', col('a') + col('b')).show()
# Row 2 sum is null

# Aggregations ignore nulls
df.agg(spark_sum('a'), mean('a')).show()
# sum: 4, mean: 2.0

# String concatenation - null poisons the result
df_str = spark.createDataFrame([('John', 'Doe'), (None, 'Smith')], ['first', 'last'])
df_str.withColumn('full', concat(col('first'), lit(' '), col('last'))).show()
# Second row's 'full' is null

-- SQL - NULL propagates, aggregations ignore
SELECT int_col + required_col AS total FROM example;  -- NULL + anything = NULL

SELECT SUM(int_col), AVG(int_col) FROM example;  -- Nulls excluded

-- String concatenation varies by database
-- PostgreSQL: NULL || 'text' = NULL
-- MySQL with CONCAT(): CONCAT(NULL, 'text') = NULL
-- SQL Server: NULL + 'text' = NULL (with default settings)

Filling and Replacing Nulls

Every platform provides mechanisms to replace nulls, but the syntax and capabilities differ significantly.

# Pandas fill strategies
df = pd.DataFrame({
    'value': [1, np.nan, np.nan, 4, np.nan],
    'category': ['A', None, 'B', None, 'C']
})

# Simple fill
df['value'].fillna(0)

# Forward fill (propagate last valid value)
df['value'].ffill()

# Backward fill
df['value'].bfill()

# Fill with column-specific values
df.fillna({'value': 0, 'category': 'Unknown'})

# Interpolation for numeric columns
df['value'].interpolate(method='linear')

# PySpark fill strategies
from pyspark.sql.functions import coalesce, lit, last, when

df = spark.createDataFrame([
    (1, 'A'), (None, None), (None, 'B'), (4, None), (None, 'C')
], ['value', 'category'])

# Simple fill
df.fillna(0, subset=['value'])
df.fillna({'value': 0, 'category': 'Unknown'})

# Coalesce - returns first non-null value
df.withColumn('value', coalesce(col('value'), lit(0)))

# Forward fill requires window functions and an explicit ordering column
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.window import Window

df = df.withColumn('row_id', monotonically_increasing_id())
window = Window.orderBy('row_id').rowsBetween(
    Window.unboundedPreceding, Window.currentRow
)
df.withColumn('value_filled', last('value', ignorenulls=True).over(window))

-- SQL fill strategies
-- COALESCE returns first non-null argument
SELECT COALESCE(int_col, 0) AS filled_int FROM example;

-- COALESCE with multiple fallbacks
SELECT COALESCE(primary_phone, secondary_phone, 'No phone') FROM contacts;

-- NULLIF - returns NULL if arguments are equal (inverse operation)
SELECT NULLIF(status, 'unknown') FROM records;  -- Converts 'unknown' to NULL

-- Conditional replacement
SELECT 
    CASE WHEN int_col IS NULL THEN 0 ELSE int_col END AS filled
FROM example;

Nulls in Joins and Grouping

This is where platform differences cause the most data quality issues. Null behavior in joins is genuinely different, not just syntactically different.

# Pandas - NaN matches NaN in joins by default!
left = pd.DataFrame({'key': [1, np.nan, 3], 'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [1, np.nan, 4], 'right_val': ['x', 'y', 'z']})

pd.merge(left, right, on='key', how='inner')
# Returns rows for key=1 AND key=NaN (both match!)

# Groupby drops NaN groups by default; dropna=False (pandas 1.1+) keeps them
df = pd.DataFrame({'group': ['A', None, 'A', None], 'value': [1, 2, 3, 4]})
df.groupby('group', dropna=False).sum()
# Shows a separate group for NaN values

# PySpark - nulls do NOT match in joins
left = spark.createDataFrame([(1, 'a'), (None, 'b'), (3, 'c')], ['key', 'left_val'])
right = spark.createDataFrame([(1, 'x'), (None, 'y'), (4, 'z')], ['key', 'right_val'])

left.join(right, 'key', 'inner').show()
# Only returns row for key=1, null keys excluded!

# To match nulls explicitly, use a null-safe equality join
left.join(right, left['key'].eqNullSafe(right['key']), 'inner').show()

# Alternative: replace nulls with a sentinel value before joining
from pyspark.sql.functions import coalesce, lit

left_filled = left.withColumn('join_key', coalesce(col('key'), lit(-999)))
right_filled = right.withColumn('join_key', coalesce(col('key'), lit(-999)))
left_filled.join(right_filled, 'join_key', 'inner').show()

# groupBy keeps nulls as their own group
df = spark.createDataFrame([('A', 1), (None, 2), ('A', 3), (None, 4)], ['group', 'value'])
df.groupBy('group').sum('value').show()
# Null group appears as separate row

-- SQL - NULL keys excluded from inner joins
SELECT * FROM left_table l
INNER JOIN right_table r ON l.key = r.key;
-- Rows where either key is NULL are excluded

-- To include null matches
SELECT * FROM left_table l
INNER JOIN right_table r ON l.key = r.key 
   OR (l.key IS NULL AND r.key IS NULL);

-- GROUP BY collects all NULLs into a single group (standard SQL behavior)
SELECT str_col, SUM(int_col)
FROM example
GROUP BY str_col;

Best Practices and Recommendations

After years of dealing with null-related bugs across platforms, here’s what actually works:

Explicitly handle nulls early. Don’t let nulls propagate through your pipeline hoping they’ll work out. Define your null-handling strategy at ingestion time.
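One way to pin that strategy down is a small, explicit policy applied at ingestion. A minimal sketch in pandas; the column names and fill values here are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical ingestion-time null policy: every known column gets a rule
NULL_POLICY = {
    'quantity': 0,          # missing quantity means none sold
    'region': 'UNKNOWN',    # missing region is flagged, not dropped
}

def apply_null_policy(df: pd.DataFrame) -> pd.DataFrame:
    """Fill known columns per policy; fail loudly on unexpected nulls."""
    df = df.fillna(NULL_POLICY)
    leftover = df.columns[df.isna().any()].tolist()
    if leftover:
        raise ValueError(f"Unhandled nulls in columns: {leftover}")
    return df

raw = pd.DataFrame({'quantity': [5, np.nan], 'region': ['EU', None]})
clean = apply_null_policy(raw)
print(clean)
```

Failing loudly on columns the policy doesn't cover is the point: new nullable columns surface immediately instead of propagating downstream.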

Document your null semantics. Does null mean “unknown,” “not applicable,” or “not yet collected”? These have different implications for analysis.

Test with null data. Your unit tests should include null values in every column. The bugs you catch will justify the effort.
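As a sketch, here is what such a test might look like for a toy pandas transformation (the function and column names are hypothetical):

```python
import numpy as np
import pandas as pd

def revenue_per_unit(df: pd.DataFrame) -> pd.Series:
    # Hypothetical transformation under test
    return df['revenue'] / df['units']

def test_revenue_per_unit_with_nulls():
    df = pd.DataFrame({
        'revenue': [100.0, np.nan, 300.0],
        'units':   [10.0, 5.0, np.nan],
    })
    result = revenue_per_unit(df)
    # A null in either input should yield a null output, not a crash
    assert result.iloc[0] == 10.0
    assert pd.isna(result.iloc[1])
    assert pd.isna(result.iloc[2])

test_revenue_per_unit_with_nulls()
```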

Use platform-appropriate patterns:

Operation         Pandas      PySpark                 SQL
Check null        isna()      isNull()                IS NULL
Fill null         fillna()    fillna() / coalesce()   COALESCE()
Drop nulls        dropna()    dropna()                WHERE col IS NOT NULL
Null-safe equals  equals()    eqNullSafe()            IS NOT DISTINCT FROM

Watch your joins. If you’re migrating from Pandas to PySpark and your row counts change, check for null keys first. This is the most common source of “missing data” bugs.
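A quick pre-join check in pandas makes the problem visible before it bites; the column names here are hypothetical:

```python
import numpy as np
import pandas as pd

orders = pd.DataFrame({'customer_id': [1.0, np.nan, 3.0, np.nan],
                       'amount': [50, 20, 75, 10]})
customers = pd.DataFrame({'customer_id': [1.0, 2.0], 'name': ['Ann', 'Bo']})

# Count null join keys first - these rows behave differently across
# platforms (pandas matches NaN keys; PySpark and SQL drop them)
null_keys = int(orders['customer_id'].isna().sum())
print(f"{null_keys} of {len(orders)} rows have a null join key")
# 2 of 4 rows have a null join key

# indicator=True labels each row's join outcome for auditing
joined = orders.merge(customers, on='customer_id', how='left', indicator=True)
print(joined[['customer_id', '_merge']])
```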

Prefer explicit over implicit. Instead of relying on aggregation functions to skip nulls automatically, filter them out explicitly. Your code becomes self-documenting, and you won’t be surprised when platform defaults differ.
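A small illustration of the same aggregation written both ways:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0])

# Implicit: mean() silently skips the NaN - fine until a platform differs
implicit = s.mean()

# Explicit: state the policy (drop nulls), then aggregate
explicit = s.dropna().mean()

assert implicit == explicit == 20.0
```

Both return the same number here, but the second version records the decision in the code rather than in a platform default.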

Null handling isn’t glamorous, but it’s foundational. Get it wrong, and every downstream analysis inherits the problem. Get it right, and you build pipelines that produce consistent results regardless of which platform executes them.
