PySpark - Filter Rows with Multiple Conditions

Key Insights

  • PySpark requires bitwise operators (&, |, ~) instead of Python’s logical operators (and, or, not) for DataFrame filtering, and each condition must be wrapped in parentheses to avoid precedence errors.
  • SQL-style string expressions offer familiar syntax for complex filters, but Column-based expressions provide better type safety and IDE support for production code.
  • Filter ordering matters for performance—placing the most selective conditions first and leveraging predicate pushdown can dramatically reduce data shuffling and computation time on large datasets.

Introduction

Filtering rows in PySpark is fundamental to data processing workflows, but real-world scenarios rarely involve simple single-condition filters. You typically need to combine multiple criteria—filtering customers by region AND purchase amount, excluding certain product categories OR status codes, handling NULL values appropriately. With datasets containing millions or billions of rows, understanding how to write efficient multi-condition filters isn’t just about correctness; it directly impacts job runtime and cluster costs.

Here’s a baseline single-condition filter to establish context:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("FilterExample").getOrCreate()

# Sample data
data = [
    (1, "Alice", 28, "NY", 5000),
    (2, "Bob", 35, "CA", 7500),
    (3, "Charlie", 42, "TX", 6200),
    (4, "Diana", 31, "NY", 8100)
]
df = spark.createDataFrame(data, ["id", "name", "age", "state", "salary"])

# Simple filter
filtered_df = df.filter(col("age") > 30)
filtered_df.show()

This works fine, but production scenarios demand more sophisticated filtering logic.

Using Multiple Conditions with Column Expressions

The most common mistake when filtering PySpark DataFrames is using Python’s and/or operators. This will not work: Python’s logical operators cannot be overloaded, so they try to coerce each Column to a boolean, and PySpark raises ValueError: Cannot convert column into bool. Instead, PySpark overloads the bitwise operators to build lazy filter expressions:

  • & for AND
  • | for OR
  • ~ for NOT

Each condition must be wrapped in parentheses due to operator precedence. Here’s the correct syntax:

# Multiple AND conditions
result = df.filter((col("age") > 30) & (col("salary") > 7000))
result.show()
# Returns: Bob (35, 7500) and Diana (31, 8100)

# Multiple OR conditions
result = df.filter((col("state") == "CA") | (col("state") == "TX"))
result.show()
# Returns: Bob (CA) and Charlie (TX)

# Mixed AND/OR with proper parentheses
result = df.filter(
    ((col("state") == "NY") & (col("age") < 30)) |
    ((col("state") == "CA") & (col("salary") > 7000))
)
result.show()
# Returns: Alice (NY, 28) and Bob (CA, 7500)

The parentheses aren’t optional—without them, you’ll get cryptic errors or incorrect results due to Python’s operator precedence rules. The bitwise & operator binds more tightly than comparison operators, so col("age") > 30 & col("salary") > 7000 is parsed as col("age") > (30 & col("salary")) > 7000, a chained comparison that tries to convert a Column to a boolean and fails at runtime.
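You can see the parse order in plain Python, with no Spark involved. This toy example uses arbitrary integers (chosen only so the two groupings disagree) to show & grabbing its operands before the comparison runs:

```python
# Python parses "a > b & c" as "a > (b & c)" because the bitwise &
# binds more tightly than comparison operators.
surprising = 1 > 2 & 1   # parsed as 1 > (2 & 1), i.e. 1 > 0 -> True
intended = (1 > 2) & 1   # (False) & 1 -> 0, which is falsy

print(surprising)        # True
print(bool(intended))    # False
```

The same parse order applies to Column expressions; wrapping each comparison in parentheses is what keeps & operating on the two boolean Columns you meant.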

You can also chain multiple filter() calls, which is equivalent to AND operations and sometimes more readable:

result = df.filter(col("age") > 30) \
           .filter(col("salary") > 7000) \
           .filter(col("state") != "TX")

Spark’s optimizer will combine these into a single filter operation, so there’s no performance penalty for chaining.

Using the filter() Method with SQL-like Syntax

If you come from a SQL background, PySpark supports string-based filter expressions that mirror WHERE clauses:

# SQL-style WHERE clause
result = df.filter("age > 30 AND salary > 7000")
result.show()

# Complex conditions with SQL operators
result = df.filter("state IN ('NY', 'CA') AND age BETWEEN 25 AND 35")
result.show()

# Using LIKE for pattern matching
data_with_names = [
    (1, "Alice Smith", 28),
    (2, "Bob Johnson", 35),
    (3, "Alice Jones", 42)
]
df2 = spark.createDataFrame(data_with_names, ["id", "name", "age"])

result = df2.filter("name LIKE 'Alice%' AND age > 30")
result.show()
# Returns: Alice Jones

SQL syntax is concise and familiar, but it has drawbacks. You lose type checking, IDE autocomplete, and refactoring support; typos and syntax errors inside the string won’t surface until the expression is parsed at runtime. For quick exploratory analysis, SQL strings are fine. For production pipelines, Column expressions are safer.

Using isin() and between() for Cleaner Code

PySpark provides specialized methods that make common filtering patterns more readable than raw boolean logic:

# Instead of multiple OR conditions
# Bad: df.filter((col("state") == "NY") | (col("state") == "CA") | (col("state") == "TX"))

# Good: Use isin()
result = df.filter(col("state").isin(["NY", "CA", "TX"]))
result.show()

# Range filtering with between()
result = df.filter(col("age").between(30, 40))
result.show()
# Returns: Bob (35) and Diana (31); between() is inclusive on both ends

# Combining specialized methods with other conditions
result = df.filter(
    (col("state").isin(["NY", "CA"])) &
    (col("salary").between(6000, 8000)) &
    (col("age") > 25)
)
result.show()

The isin() method is particularly valuable when filtering against a large list of values, potentially sourced from another DataFrame or external configuration. It’s more readable than chaining OR conditions, and Spark compiles it into a single IN predicate that can be pushed down to the data source.

Handling NULL Values in Filters

NULL handling is where many PySpark filters produce unexpected results. NULL values don’t equal anything—not even other NULLs—and they propagate through boolean expressions in ways that might surprise you:

# Data with NULLs
data_with_nulls = [
    (1, "Alice", 28, 5000),
    (2, "Bob", None, 7500),
    (3, "Charlie", 42, None),
    (4, "Diana", None, None)
]
df_nulls = spark.createDataFrame(data_with_nulls, ["id", "name", "age", "salary"])

# This filter silently excludes NULL ages (NULL > 30 evaluates to NULL,
# and filter keeps only rows where the predicate is true)
result = df_nulls.filter(col("age") > 30)
result.show()
# Returns: Only Charlie (42)

# Explicitly handling NULLs
result = df_nulls.filter(col("age").isNotNull() & (col("age") > 30))
result.show()

# Finding NULLs in multiple columns
result = df_nulls.filter(col("age").isNull() | col("salary").isNull())
result.show()
# Returns: Bob, Charlie, Diana

# Filtering with NULL-safe operations
result = df_nulls.filter(
    (col("age").isNotNull()) &
    (col("salary").isNotNull()) &
    (col("age") > 30) &
    (col("salary") > 6000)
)
result.show()

For cleaner code when NULLs represent missing data that should be excluded, consider using na.drop() before filtering:

# Drop rows with any NULLs, then filter
result = df_nulls.na.drop().filter((col("age") > 30) & (col("salary") > 6000))

# Or drop NULLs from specific columns only
result = df_nulls.na.drop(subset=["age", "salary"]).filter(col("age") > 30)

Performance Optimization Tips

Filter performance matters at scale. Here are strategies to optimize multi-condition filters:

1. Predicate Pushdown: Spark automatically pushes filters down to data sources (Parquet, ORC, databases) when possible. This happens before data is loaded into memory. Filters on partition columns are especially efficient.

2. Condition Ordering: Place the most selective filters first. While Spark’s optimizer is smart, helping it by filtering out the most rows early reduces downstream computation.

3. Column Pruning: Select only needed columns before filtering when working with wide tables.

# Check execution plan
df.filter(
    (col("age") > 30) & (col("salary") > 7000)
).explain(True)

# Compare with different filter order
df.filter(
    (col("salary") > 7000) & (col("age") > 30)
).explain(True)

# Optimal: Select columns first, then filter
df.select("id", "age", "salary", "state") \
  .filter((col("age") > 30) & (col("salary") > 7000)) \
  .explain(True)

The explain() method shows the physical plan. Look for filter operations pushed to the scan stage—that’s predicate pushdown in action.

Complete Working Example

Here’s a realistic scenario: filtering transaction data for a business report.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("TransactionFilter").getOrCreate()

# Realistic transaction dataset
transactions = [
    (1, "2024-01-15", "Electronics", 1200, "NY", "completed"),
    (2, "2024-01-16", "Clothing", 350, "CA", "completed"),
    (3, "2024-01-17", "Electronics", 2500, "TX", "pending"),
    (4, "2024-01-18", "Home", 800, "NY", "completed"),
    (5, "2024-01-19", "Electronics", 1800, "CA", "cancelled"),
    (6, "2024-01-20", "Clothing", 150, "TX", "completed"),
    (7, "2024-01-21", "Electronics", 3200, "NY", None),
    (8, "2024-01-22", "Electronics", 2100, "CA", "completed"),
]

df = spark.createDataFrame(
    transactions, 
    ["id", "date", "category", "amount", "state", "status"]
)

# Business requirement: Find high-value electronics transactions
# in specific states, excluding cancelled/pending orders
result = df.filter(
    (col("category") == "Electronics") &
    (col("amount") >= 1500) &
    (col("state").isin(["NY", "CA"])) &
    (col("status").isNotNull()) &
    (~col("status").isin(["cancelled", "pending"]))
)

print("High-value completed electronics orders:")
result.show()

# Alternative using SQL syntax
result_sql = df.filter("""
    category = 'Electronics' 
    AND amount >= 1500 
    AND state IN ('NY', 'CA')
    AND status IS NOT NULL
    AND status NOT IN ('cancelled', 'pending')
""")

print("Same result using SQL syntax:")
result_sql.show()

# Performance comparison
print("\nColumn expression plan:")
result.explain()

print("\nSQL expression plan:")
result_sql.explain()

Both approaches produce identical results and execution plans. Choose Column expressions for production code where type safety and maintainability matter, or SQL syntax for ad-hoc analysis where speed of writing matters more.

The key to effective PySpark filtering is understanding that you’re building expression trees, not evaluating boolean logic. Master the bitwise operators, embrace parentheses, handle NULLs explicitly, and always check your execution plans when performance matters. Your data pipelines will be faster, more reliable, and easier to maintain.
