PySpark - Drop Rows with NULL Values (dropna)

Key Insights

  • PySpark’s dropna() method provides fine-grained control over NULL removal with parameters like how, subset, and thresh to match different data quality requirements
  • The default behavior (how='any') aggressively drops rows with any NULL value, which can lead to significant data loss—always specify parameters based on your business logic
  • Use subset to focus on critical columns and thresh for flexible tolerance of missing data, especially when dealing with wide DataFrames where some NULLs are acceptable

Understanding NULL Values in PySpark DataFrames

NULL values are inevitable in real-world data. Whether they come from incomplete user inputs, failed API calls, or data integration issues, you need a systematic approach to handle them. PySpark’s dropna() method (also accessible as na.drop()) gives you precise control over removing rows with NULL values, but using it effectively requires understanding its parameters.

Unlike pandas where you might casually drop NULLs on smaller datasets, PySpark operations run on distributed data across clusters. Dropping rows incorrectly can eliminate valuable data or fail to clean problematic records, both of which waste computational resources and compromise analysis quality.

Let’s create a sample DataFrame that we’ll use throughout this article:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("DropNAExample").getOrCreate()

data = [
    (1, "Alice", "alice@email.com", 25),
    (2, "Bob", None, 30),
    (3, None, "charlie@email.com", None),
    (4, "David", "david@email.com", 35),
    (None, "Eve", None, None),
    (6, "Frank", "frank@email.com", 28),
    (7, None, None, None)
]

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("email", StringType(), True),
    StructField("age", IntegerType(), True)
])

df = spark.createDataFrame(data, schema)
df.show()

Output:

+----+-----+-----------------+----+
|  id| name|            email| age|
+----+-----+-----------------+----+
|   1|Alice|  alice@email.com|  25|
|   2|  Bob|             NULL|  30|
|   3| NULL|charlie@email.com|NULL|
|   4|David|  david@email.com|  35|
|NULL|  Eve|             NULL|NULL|
|   6|Frank|  frank@email.com|  28|
|   7| NULL|             NULL|NULL|
+----+-----+-----------------+----+

Basic dropna() Usage

The default behavior of dropna() drops any row containing at least one NULL value in any column. This is the most aggressive approach:

# Default dropna - removes rows with ANY NULL
cleaned_df = df.dropna()
cleaned_df.show()

Output:

+---+-----+---------------+---+
| id| name|          email|age|
+---+-----+---------------+---+
|  1|Alice|alice@email.com| 25|
|  4|David|david@email.com| 35|
|  6|Frank|frank@email.com| 28|
+---+-----+---------------+---+

We went from 7 rows to 3 rows. While this ensures complete data quality, you’ve lost 57% of your dataset. In many scenarios, this is unacceptable. Rows 2 and 3 have valuable information despite missing some fields.

Controlling NULL Threshold with the ‘how’ Parameter

The how parameter gives you two options:

  • how='any': Drop row if ANY column contains NULL (default)
  • how='all': Drop row only if ALL columns are NULL

# Compare the two modes side by side
df_any = df.dropna(how='any')
df_all = df.dropna(how='all')

print("how='any' (default):")
df_any.show()

print("\nhow='all':")
df_all.show()

Output for how='all':

+----+-----+-----------------+----+
|  id| name|            email| age|
+----+-----+-----------------+----+
|   1|Alice|  alice@email.com|  25|
|   2|  Bob|             NULL|  30|
|   3| NULL|charlie@email.com|NULL|
|   4|David|  david@email.com|  35|
|NULL|  Eve|             NULL|NULL|
|   6|Frank|  frank@email.com|  28|
+----+-----+-----------------+----+

Row 7 disappeared because all its values were NULL. The how='all' approach is useful when you want to eliminate completely empty rows while preserving partially complete records.

Targeting Specific Columns with ‘subset’

In most real-world scenarios, not all columns are equally important. You might require certain critical fields to be non-NULL while tolerating NULLs elsewhere. The subset parameter accepts a list of column names to check:

# Drop rows only if 'id' or 'email' contains NULL
critical_columns = ['id', 'email']
df_subset = df.dropna(subset=critical_columns)
df_subset.show()

Output:

+---+-----+-----------------+----+
| id| name|            email| age|
+---+-----+-----------------+----+
|  1|Alice|  alice@email.com|  25|
|  3| NULL|charlie@email.com|NULL|
|  4|David|  david@email.com|  35|
|  6|Frank|  frank@email.com|  28|
+---+-----+-----------------+----+

Notice row 3 survived despite having NULL name and age, because both id and email are present. Row 2 was dropped because email is NULL, and rows 5 and 7 were dropped due to missing id or email.

You can combine subset with how:

# Drop only if ALL specified columns are NULL
df_subset_all = df.dropna(subset=['name', 'email'], how='all')
df_subset_all.show()

This keeps rows where at least one of name or email is non-NULL.

Using the ‘thresh’ Parameter for Minimum Non-NULL Values

The thresh (threshold) parameter specifies the minimum number of non-NULL values a row must have to be kept. This is particularly useful for wide DataFrames where you can tolerate some missing data:

# Keep rows with at least 3 non-NULL values
df_thresh = df.dropna(thresh=3)
df_thresh.show()

Output:

+---+-----+-----------------+---+
| id| name|            email|age|
+---+-----+-----------------+---+
|  1|Alice|  alice@email.com| 25|
|  2|  Bob|             NULL| 30|
|  4|David|  david@email.com| 35|
|  6|Frank|  frank@email.com| 28|
+---+-----+-----------------+---+

Row 2 has 3 non-NULL values (id, name, age), so it’s kept. Row 3 has only 2 non-NULL values, so it’s dropped. This approach works well when you need “mostly complete” records without requiring every field.

Here’s a practical example with more columns:

# Wider DataFrame example
wide_data = [
    (1, "A", "B", "C", None, None, "G"),  # 5 non-NULL
    (2, None, None, None, None, None, "G"),  # 2 non-NULL
    (3, "A", "B", None, "E", None, None),  # 4 non-NULL
]

wide_df = spark.createDataFrame(wide_data, ["c1", "c2", "c3", "c4", "c5", "c6", "c7"])

# Keep rows with at least 4 non-NULL values
wide_df.dropna(thresh=4).show()

This filters out row 2 (only 2 non-NULL values) while keeping rows 1 and 3.

Practical Use Cases and Best Practices

Use Case 1: User Registration Data Cleaning

When processing user registrations, you might require certain fields while others are optional:

# Simulated user registration data
users = spark.createDataFrame([
    (1, "john@example.com", "John", "Doe", None, None),
    (2, "jane@example.com", "Jane", None, "555-0123", "Engineer"),
    (3, None, "Bob", "Smith", None, None),  # Invalid: no email
    (4, "alice@example.com", None, None, None, None),  # Valid: has email
], ["user_id", "email", "first_name", "last_name", "phone", "job_title"])

# Email is mandatory for user accounts
valid_users = users.dropna(subset=['email'])

# Further processing pipeline
result = (valid_users
    .dropna(subset=['first_name', 'last_name'], how='all')  # Need at least one name
    .filter("user_id IS NOT NULL")  # Additional validation
)

result.show()

Use Case 2: Sensor Data with Expected Gaps

For IoT sensor data where occasional missing readings are normal:

# Keep sensor readings with at least 70% of measurements
# If you have 10 sensors, thresh=7 means at least 7 must report
sensor_data = spark.read.parquet("s3://bucket/sensor_data/")
clean_sensors = sensor_data.dropna(thresh=7)

Performance Considerations

  1. Use subset to reduce processing: Checking fewer columns is faster than checking all columns
  2. Apply dropna() early: Remove bad data before expensive operations like joins
  3. Consider alternatives: Sometimes fillna() with default values is better than losing data
  4. Partition awareness: dropna() is a narrow transformation—it doesn’t shuffle data, making it relatively efficient

When NOT to Use dropna()

Don’t blindly drop NULLs when:

  • NULL has semantic meaning (e.g., “not applicable” vs. “unknown”)
  • You can impute missing values with domain knowledge
  • The data loss would be too significant (>20-30% of rows)
  • Downstream processes can handle NULLs appropriately

# Alternative: Fill with defaults instead of dropping
df_filled = df.fillna({
    'age': 0,
    'email': 'unknown@example.com',
    'name': 'Anonymous'
})

Choosing the Right Approach

Here’s a decision framework:

  • Use default dropna(): When data quality is critical and you can afford to lose incomplete records
  • Use how='all': To remove completely empty rows while preserving partial data
  • Use subset: When specific columns are business-critical (IDs, timestamps, keys)
  • Use thresh: For wide DataFrames where you need “mostly complete” records
  • Combine parameters: dropna(subset=['id', 'timestamp'], how='any') for precise control

The key is understanding your data requirements. A user ID might be mandatory, but a missing phone number might be acceptable. Define these rules explicitly in your data pipeline rather than applying blanket NULL removal.

PySpark’s dropna() is powerful precisely because it offers this granular control. Use it thoughtfully, and your data cleaning operations will be both effective and efficient.
