How to Filter by Multiple Conditions in PySpark
Key Insights
- Always wrap individual conditions in parentheses when combining with &, |, or ~ operators—PySpark’s operator precedence will silently produce wrong results otherwise.
- SQL-style string expressions offer better readability for complex filters, but column-based expressions surface syntax mistakes earlier and get better IDE support.
- Order your filter conditions strategically: place the most selective (eliminates most rows) first, and always handle nulls explicitly in compound conditions.
Introduction
Filtering data is the bread and butter of data engineering. Whether you’re cleaning datasets, building ETL pipelines, or preparing data for machine learning, you’ll spend a significant portion of your time selecting subsets of rows based on various criteria.
In real-world scenarios, single-condition filters rarely cut it. You need to find active users who signed up in the last 30 days, transactions above a threshold from specific regions, or products in certain categories that are currently in stock. This means combining multiple conditions—and doing it correctly in PySpark has a few gotchas that trip up even experienced developers.
This article covers every practical approach to multi-condition filtering in PySpark, from basic logical operators to dynamic filter construction for complex use cases.
Basic Filtering Refresher
PySpark provides two equivalent methods for filtering: filter() and where(). They’re interchangeable—use whichever reads better in your context.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("FilteringDemo").getOrCreate()
# Sample data
data = [
("Alice", 32, "Engineering", 95000, "active"),
("Bob", 28, "Sales", 65000, "active"),
("Charlie", 45, "Engineering", 120000, "inactive"),
("Diana", 35, "Marketing", 78000, "active"),
("Eve", 29, "Sales", 55000, "inactive"),
]
df = spark.createDataFrame(data, ["name", "age", "department", "salary", "status"])
# Single condition filtering - all equivalent
df.filter(col("age") > 30).show()
df.where(col("age") > 30).show()
df.filter(df.age > 30).show()
df.filter("age > 30").show()
The col() function is the preferred approach because it’s explicit and works consistently across all contexts, including when referencing columns from joined DataFrames.
Combining Conditions with Logical Operators
Here’s where most developers first encounter trouble. PySpark uses bitwise operators for logical operations: & for AND, | for OR, and ~ for NOT. The critical rule: always wrap each condition in parentheses.
AND Conditions
# Find active employees over 30
# WRONG - will throw an error or produce unexpected results
# df.filter(col("age") > 30 & col("status") == "active")
# CORRECT - parentheses around each condition
df.filter((col("age") > 30) & (col("status") == "active")).show()
The parentheses aren’t optional. Python’s operator precedence evaluates > and == after &, which means col("age") > 30 & col("status") gets parsed as col("age") > (30 & col("status"))—not what you want.
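You can see the same precedence rule with plain Python integers, since & binds tighter than the comparison operators:

```python
# & binds tighter than >, so the bitwise AND evaluates first.
masked = 30 & 2           # 0b11110 & 0b00010 -> 2
parsed = 40 > 30 & 2      # parsed as 40 > (30 & 2), i.e. 40 > 2 -> True

# With explicit parentheses the comparison happens first, and the
# boolean result gets bitwise-ANDed with 2 (True is 1, so 1 & 2 -> 0):
explicit = (40 > 30) & 2  # -> 0
```

With PySpark Columns the failure mode is the same mechanism, which is why the parentheses rule applies to every individual condition.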
OR Conditions
# Find employees in Sales OR Marketing
df.filter(
(col("department") == "Sales") | (col("department") == "Marketing")
).show()
# Find employees who are either highly paid OR in engineering
df.filter(
(col("salary") > 100000) | (col("department") == "Engineering")
).show()
Complex Combinations with Precedence
When mixing AND and OR, use parentheses to make precedence explicit:
# Active employees who are either in Sales with salary > 60k
# OR in Engineering regardless of salary
df.filter(
(col("status") == "active") &
(
((col("department") == "Sales") & (col("salary") > 60000)) |
(col("department") == "Engineering")
)
).show()
NOT Conditions
# Employees NOT in Engineering
df.filter(~(col("department") == "Engineering")).show()
# Alternative using !=
df.filter(col("department") != "Engineering").show()
# NOT active AND NOT in Sales
df.filter(
~(col("status") == "active") & ~(col("department") == "Sales")
).show()
Using SQL-Style Expressions
For complex filters, SQL-style string expressions often produce more readable code. If you’re comfortable with SQL, this approach feels natural:
# Simple AND condition
df.filter("age > 30 AND status = 'active'").show()
# Complex condition with OR
df.filter("""
(department = 'Sales' AND salary > 60000)
OR (department = 'Engineering' AND age < 40)
""").show()
Using BETWEEN, IN, and LIKE
SQL expressions support familiar SQL operators that would require more verbose column expressions:
# BETWEEN for ranges
df.filter("salary BETWEEN 60000 AND 100000").show()
# IN for multiple values
df.filter("department IN ('Sales', 'Marketing', 'HR')").show()
# LIKE for pattern matching
df.filter("name LIKE 'A%'").show()
# Combining everything
df.filter("""
department IN ('Sales', 'Engineering')
AND salary BETWEEN 50000 AND 100000
AND name LIKE '%e'
AND status = 'active'
""").show()
The tradeoff: SQL strings aren’t parsed until filter() is called at runtime. Typos in column names or syntax errors inside the string won’t surface while you write the code, and if the string is assembled late in a pipeline, potentially not for hours. Column expressions at least surface Python-level syntax mistakes the moment the expression is constructed.
Filtering with isin() and between()
PySpark provides dedicated methods for common multi-value patterns that are cleaner than chaining OR conditions:
Using isin()
# Instead of this verbose approach
df.filter(
(col("department") == "Sales") |
(col("department") == "Marketing") |
(col("department") == "HR")
).show()
# Use isin() for cleaner code
target_departments = ["Sales", "Marketing", "HR"]
df.filter(col("department").isin(target_departments)).show()
# Combine with other conditions
df.filter(
(col("department").isin(["Sales", "Marketing"])) &
(col("status") == "active")
).show()
# Negate with ~
df.filter(~col("department").isin(["Engineering"])).show()
Using between()
# Filter salary range
df.filter(col("salary").between(60000, 100000)).show()
# between() works with dates too. Assuming a hire_date column exists:
# df.filter(col("hire_date").between("2023-01-01", "2024-12-31"))
# Combine between with other conditions
df.filter(
(col("salary").between(50000, 90000)) &
(col("age").between(25, 35)) &
(col("status") == "active")
).show()
Dynamic Filtering with reduce()
Sometimes you need to build filters programmatically—from configuration files, user input, or other runtime sources. The functools.reduce function handles this elegantly:
from functools import reduce
from pyspark.sql.functions import col, lit
# Define conditions as a list
conditions = [
col("age") > 25,
col("status") == "active",
col("salary") > 50000,
]
# Combine with AND
combined_filter = reduce(lambda a, b: a & b, conditions)
df.filter(combined_filter).show()
# Combine with OR
combined_filter_or = reduce(lambda a, b: a | b, conditions)
df.filter(combined_filter_or).show()
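If the lambdas feel noisy, the standard-library operator module provides the same binary functions: operator.and_ and operator.or_ apply Python’s & and | to any objects that overload them, PySpark Columns included. A quick illustration with plain sets, which overload the same operators:

```python
from functools import reduce
import operator

# The same reduction pattern, demonstrated with sets (& is intersection,
# | is union, because sets overload those operators):
sets = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]
intersection = reduce(operator.and_, sets)  # {3}
union = reduce(operator.or_, sets)          # {1, 2, 3, 4, 5}

# With a list of PySpark Column conditions the call is identical:
# combined_filter = reduce(operator.and_, conditions)
```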
Building Filters from Dictionaries
# Filter specifications from config or API
filter_specs = {
"department": ["Sales", "Marketing"],
"status": ["active"],
}
# Build conditions dynamically
conditions = []
for column, values in filter_specs.items():
if len(values) == 1:
conditions.append(col(column) == values[0])
else:
conditions.append(col(column).isin(values))
if conditions:
final_filter = reduce(lambda a, b: a & b, conditions)
df.filter(final_filter).show()
Handling Empty Condition Lists
from pyspark.sql.functions import lit
def build_filter(conditions):
"""Safely build a filter from a list of conditions."""
if not conditions:
return lit(True) # No filter = return all rows
return reduce(lambda a, b: a & b, conditions)
# Works even with empty list
df.filter(build_filter([])).show() # Returns all rows
Performance Considerations and Best Practices
Filter Ordering Matters
Place the most selective conditions first. Spark’s optimizer is good, but helping it doesn’t hurt:
# If status='inactive' eliminates 90% of rows, put it first
df.filter(
(col("status") == "inactive") & # Most selective
(col("department") == "Engineering") &
(col("salary") > 100000) # Least selective
).show()
Handling Nulls Explicitly
Null handling in compound conditions catches many developers off guard. Any comparison with null returns null (not true or false), which filters out the row:
# Data with nulls
data_with_nulls = [
("Alice", 32, "active"),
("Bob", None, "active"),
("Charlie", 45, None),
]
df_nulls = spark.createDataFrame(data_with_nulls, ["name", "age", "status"])
# This EXCLUDES rows where age is null
df_nulls.filter(col("age") > 30).show()
# To include nulls, handle explicitly
df_nulls.filter(
(col("age") > 30) | (col("age").isNull())
).show()
# Filter for non-null values in compound conditions
df_nulls.filter(
(col("age").isNotNull()) &
(col("status").isNotNull()) &
(col("age") > 30) &
(col("status") == "active")
).show()
# Or use coalesce for default values
from pyspark.sql.functions import coalesce, lit
df_nulls.filter(
coalesce(col("age"), lit(0)) > 30
).show()
Partition Pruning
When filtering on partitioned columns, Spark can skip entire partitions. Always filter on partition columns when possible:
# If data is partitioned by date and region
# df.filter(
# (col("date") == "2024-01-15") & # Partition column - enables pruning
# (col("region") == "US") & # Partition column - enables pruning
# (col("amount") > 1000) # Regular column
# )
Avoid Repeated Filter Calls
Chain conditions in a single filter rather than multiple filter calls when possible:
# Multiple chained calls - the optimizer usually collapses these anyway
df.filter(col("age") > 30).filter(col("status") == "active").filter(col("salary") > 50000)
# Clearer - a single filter with combined conditions
df.filter(
(col("age") > 30) &
(col("status") == "active") &
(col("salary") > 50000)
)
Spark’s optimizer often combines these anyway, but being explicit reduces planning overhead and makes intent clearer.
Multi-condition filtering in PySpark is straightforward once you internalize the parentheses rule and understand your options. Use column expressions for type safety, SQL strings for readability in complex cases, and reduce() when conditions need to be dynamic. Handle nulls explicitly, and you’ll avoid the most common debugging sessions.