How to Filter Rows in PySpark
Key Insights
- PySpark's filter() and where() methods are completely interchangeable; choose one for consistency and stick with it across your codebase
- Always wrap individual conditions in parentheses when combining them with the &, |, and ~ operators; Python's operator precedence will otherwise raise errors or silently change the logic
- Filter early in your pipeline, especially on partitioned columns, to leverage predicate pushdown and avoid shuffling unnecessary data through expensive operations
Introduction
Row filtering is the bread and butter of data processing. Whether you’re cleaning messy datasets, extracting subsets for analysis, or preparing data for machine learning, you’ll filter rows constantly. In PySpark, this operation is deceptively simple on the surface but has nuances that can trip up even experienced developers.
This guide covers everything from basic filtering syntax to advanced pattern matching and performance optimization. By the end, you’ll know not just how to filter rows, but how to do it efficiently at scale.
Basic Filtering with filter() and where()
PySpark gives you two methods for filtering rows: filter() and where(). Here’s the thing—they’re identical. Spark’s source code literally has where as an alias for filter. The dual naming exists to accommodate developers coming from different backgrounds (SQL users prefer where, functional programmers prefer filter).
Pick one and use it consistently. I prefer filter() because it’s more explicit about what’s happening: you’re filtering rows based on a condition.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("FilterExample").getOrCreate()
# Sample data
data = [
("Alice", 35, "Engineering"),
("Bob", 28, "Marketing"),
("Charlie", 42, "Engineering"),
("Diana", 31, "Sales")
]
df = spark.createDataFrame(data, ["name", "age", "department"])
# These two lines produce identical results
filtered_df = df.filter(df.age > 30)
filtered_df = df.where(df.age > 30)
filtered_df.show()
# +-------+---+-----------+
# | name|age| department|
# +-------+---+-----------+
# | Alice| 35|Engineering|
# |Charlie| 42|Engineering|
# | Diana| 31| Sales|
# +-------+---+-----------+
The condition df.age > 30 creates a Column expression that Spark evaluates for each row. Rows where the expression evaluates to True pass through; everything else gets dropped.
Filtering with Column Expressions
You have three ways to reference columns in filter conditions. Each has its place.
from pyspark.sql import functions as F
# Method 1: DataFrame attribute access
df.filter(df.status == "active")
# Method 2: DataFrame bracket notation
df.filter(df["status"] == "active")
# Method 3: col() function
df.filter(F.col("status") == "active")
Method 1 is concise but breaks if your column name contains spaces or special characters. Method 2 handles any column name but requires the DataFrame reference. Method 3 is the most flexible—it works with any column name and doesn’t require the DataFrame object, making it ideal for reusable functions.
All comparison operators work as expected:
# Equality
df.filter(F.col("status") == "active")
# Inequality
df.filter(F.col("status") != "inactive")
# Greater than / less than
df.filter(F.col("salary") > 50000)
df.filter(F.col("age") <= 65)
# Greater than or equal / less than or equal
df.filter(F.col("score") >= 80)
df.filter(F.col("quantity") < 100)
One gotcha: use == for equality, not =. The single equals sign is assignment in Python, and using it inside a filter call raises a SyntaxError.
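You don't need a Spark cluster to verify this; Python rejects the = form at compile time, as this small check shows:

```python
# A plain-Python check: '=' inside a call argument is invalid syntax,
# so the mistake fails before Spark ever evaluates anything.
try:
    compile('df.filter(col("status") = "active")', "<example>", "eval")
    is_syntax_error = False
except SyntaxError:
    is_syntax_error = True
```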
Combining Multiple Conditions
Real-world filtering rarely involves a single condition. PySpark uses bitwise operators for combining conditions: & for AND, | for OR, and ~ for NOT.
Here’s the critical rule: always wrap individual conditions in parentheses. Python’s operator precedence evaluates & and | before comparison operators, which produces unexpected results.
# WRONG - Python evaluates & first, so this raises an error
# (or, in some expressions, silently changes the logic)
df.filter(F.col("salary") > 50000 & F.col("department") == "Engineering")
# CORRECT - parentheses ensure proper evaluation order
df.filter((F.col("salary") > 50000) & (F.col("department") == "Engineering"))
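The trap is plain Python operator precedence, not anything Spark-specific; you can see it with ordinary integers:

```python
# & binds more tightly than >, so the two expressions parse differently.
loose = 5 > 4 & 0     # parsed as 5 > (4 & 0), i.e. 5 > 0 -> True
strict = (5 > 4) & 0  # parsed as True & 0 -> 0, which is falsy
```

PySpark Columns overload these same operators, so the same parse applies; without parentheses, Spark typically fails with a "cannot convert column into bool" error rather than returning wrong data, but parenthesizing every condition is the only safe habit.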
Here’s a comprehensive example:
# Sample employee data
employees = [
("Alice", 75000, "Engineering", "active"),
("Bob", 45000, "Marketing", "active"),
("Charlie", 85000, "Engineering", "inactive"),
("Diana", 55000, "Sales", "active"),
("Eve", 95000, "Engineering", "active")
]
df = spark.createDataFrame(employees, ["name", "salary", "department", "status"])
# AND condition: high-earning active engineers
high_earning_engineers = df.filter(
(F.col("salary") > 50000) &
(F.col("department") == "Engineering") &
(F.col("status") == "active")
)
# OR condition: either Engineering or high salary
eng_or_high_salary = df.filter(
(F.col("department") == "Engineering") |
(F.col("salary") > 70000)
)
# NOT condition: everyone except Marketing
not_marketing = df.filter(~(F.col("department") == "Marketing"))
# Complex combination
complex_filter = df.filter(
((F.col("department") == "Engineering") | (F.col("department") == "Sales")) &
(F.col("salary") >= 50000) &
~(F.col("status") == "inactive")
)
Filtering with SQL Expressions
Sometimes SQL syntax is more readable, especially for complex conditions. PySpark accepts SQL expression strings directly in filter():
# SQL string syntax
df.filter("age BETWEEN 25 AND 40 AND city = 'NYC'")
# More examples
df.filter("salary > 50000")
df.filter("department IN ('Engineering', 'Sales')")
df.filter("name LIKE 'A%'")
df.filter("status IS NOT NULL")
This approach shines when you’re translating existing SQL queries or when the condition is complex enough that the SQL reads more clearly:
# Complex condition - SQL version is arguably more readable
df.filter("""
(department = 'Engineering' OR department = 'Sales')
AND salary BETWEEN 50000 AND 100000
AND hire_date >= '2020-01-01'
AND status != 'terminated'
""")
# Equivalent PySpark column expression
df.filter(
((F.col("department") == "Engineering") | (F.col("department") == "Sales")) &
(F.col("salary") >= 50000) & (F.col("salary") <= 100000) &
(F.col("hire_date") >= "2020-01-01") &
(F.col("status") != "terminated")
)
Note that these filter strings accept SQL expressions, not full queries. For subqueries and other advanced SQL features, you'll need to register the DataFrame as a temporary view and query it with spark.sql().
Advanced Filtering Techniques
PySpark provides specialized methods for common filtering patterns that are more expressive than raw comparisons.
isin() for Multiple Values
# Filter for specific categories
target_departments = ["Engineering", "Sales", "Marketing"]
df.filter(F.col("department").isin(target_departments))
# Negation: exclude specific values
df.filter(~F.col("department").isin(["HR", "Legal"]))
Null Handling
# Find rows with null values
df.filter(F.col("email").isNull())
# Find rows with non-null values
df.filter(F.col("email").isNotNull())
# Combine with other conditions
df.filter(F.col("email").isNotNull() & (F.col("status") == "active"))
Pattern Matching with like() and rlike()
# SQL LIKE patterns (% = any characters, _ = single character)
df.filter(F.col("name").like("A%")) # Names starting with A
df.filter(F.col("email").like("%@gmail.com")) # Gmail addresses
df.filter(F.col("code").like("AB_")) # AB followed by any single character
# Regular expressions with rlike()
df.filter(F.col("phone").rlike(r"^\d{3}-\d{3}-\d{4}$")) # Phone number format
df.filter(F.col("email").rlike(r"@(gmail|yahoo)\.com$")) # Gmail or Yahoo
df.filter(F.col("name").rlike(r"^[A-Z][a-z]+$")) # Capitalized names
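rlike() uses Java's regex dialect and returns true when the pattern is found anywhere in the string, much like re.search in Python. The two dialects differ in corners, but for simple patterns like the ones above they agree, so you can sanity-check a pattern locally before shipping it into a long-running job:

```python
import re

# For simple patterns, Java (rlike) and Python (re) regex agree, and
# rlike's match-anywhere behavior mirrors re.search.
phone_ok = bool(re.search(r"^\d{3}-\d{3}-\d{4}$", "555-867-5309"))
phone_bad = bool(re.search(r"^\d{3}-\d{3}-\d{4}$", "5558675309"))

mail_ok = bool(re.search(r"@(gmail|yahoo)\.com$", "alice@yahoo.com"))
mail_bad = bool(re.search(r"@(gmail|yahoo)\.com$", "alice@outlook.com"))
```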
between() for Range Queries
# Numeric ranges
df.filter(F.col("age").between(25, 40))
# Date ranges
df.filter(F.col("created_at").between("2023-01-01", "2023-12-31"))
# Equivalent to
df.filter((F.col("age") >= 25) & (F.col("age") <= 40))
Performance Considerations
Filtering seems straightforward, but placement and strategy significantly impact performance at scale.
Filter Early, Filter Often
The most important optimization: push filters as early as possible in your pipeline. Every row you eliminate early is a row that doesn’t flow through subsequent operations.
# INEFFICIENT: Filter after expensive join
result = (
large_orders_df
.join(customers_df, "customer_id")
.join(products_df, "product_id")
.filter(F.col("order_date") >= "2024-01-01") # Late filter
)
# EFFICIENT: Filter before joins
result = (
large_orders_df
.filter(F.col("order_date") >= "2024-01-01") # Early filter
.join(customers_df, "customer_id")
.join(products_df, "product_id")
)
The second version potentially eliminates millions of rows before the expensive join operations.
Leverage Partitioned Columns
If your data is partitioned (common in data lakes), filtering on partition columns enables partition pruning—Spark reads only relevant partitions instead of scanning everything.
# If data is partitioned by date, this filter triggers partition pruning
df.filter(F.col("date") == "2024-01-15")
# Spark only reads the 2024-01-15 partition, skipping all others
Predicate Pushdown
When reading from external sources (Parquet, JDBC, Delta Lake), Spark can push filter predicates down to the data source. The source handles filtering, reducing data transfer.
# Predicate pushdown happens automatically for supported sources
df = spark.read.parquet("s3://bucket/sales_data/")
filtered = df.filter(F.col("region") == "US") # Pushed to Parquet reader
# Verify with explain()
filtered.explain(True)
Check the physical plan output—you’ll see PushedFilters showing which predicates were pushed down.
Avoid Filter Anti-Patterns
# ANTI-PATTERN: UDF in filter (prevents predicate pushdown)
from pyspark.sql.types import BooleanType
@F.udf(BooleanType())
def is_valid(value):
return value is not None and len(value) > 0
df.filter(is_valid(F.col("name"))) # Can't be pushed down
# BETTER: Use built-in functions
df.filter(F.col("name").isNotNull() & (F.length(F.col("name")) > 0))
Built-in functions can be optimized and pushed down; UDFs cannot.
Conclusion
Row filtering in PySpark is fundamental but nuanced. Master the syntax, understand the operators, and always think about where in your pipeline filters belong. Your future self debugging a slow job will thank you.