PySpark - Filter Rows in DataFrame (where/filter)

Key Insights

  • where() and filter() are complete aliases in PySpark—use whichever reads better in your context, but be consistent within your codebase
  • Always wrap individual conditions in parentheses when combining with logical operators (&, |, ~) to avoid operator precedence issues that will break your queries
  • Filter as early as possible in your transformation pipeline to leverage Spark’s predicate pushdown optimization and reduce data shuffling across the cluster

Introduction

Filtering rows is one of the most fundamental operations in PySpark data processing. Whether you’re cleaning data, extracting subsets for analysis, or implementing business logic, you’ll use row filtering constantly. PySpark provides two methods for this: where() and filter(). These are complete aliases—they execute identical code under the hood and have zero performance difference.

The choice between them is purely stylistic. SQL developers often prefer where() because it mirrors SQL syntax, while developers from functional programming backgrounds might favor filter(). Pick one and stick with it for consistency.

Here’s proof they’re identical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FilterDemo").getOrCreate()

data = [
    (1, "Alice", 28, "NYC"),
    (2, "Bob", 35, "LA"),
    (3, "Charlie", 22, "NYC"),
    (4, "Diana", 31, "Chicago")
]

df = spark.createDataFrame(data, ["id", "name", "age", "city"])

# These produce identical results
result_where = df.where(df.age > 25)
result_filter = df.filter(df.age > 25)

result_where.show()
result_filter.show()
# Both output the same three rows

Row filtering is essential in distributed processing because it reduces the data volume early in your pipeline. Less data means less network transfer, less memory consumption, and faster subsequent operations.

Basic Filtering with Single Conditions

Single-condition filters use standard comparison operators. You can reference columns using either the DataFrame attribute notation (df.column_name) or the bracket notation (df["column_name"]).

Filter rows where age exceeds 25:

df.filter(df.age > 25).show()

# Output:
# +---+-----+---+-------+
# | id| name|age|   city|
# +---+-----+---+-------+
# |  1|Alice| 28|    NYC|
# |  2|  Bob| 35|     LA|
# |  4|Diana| 31|Chicago|
# +---+-----+---+-------+

Filter for exact matches using the equality operator:

df.filter(df.city == "NYC").show()

# Output:
# +---+-------+---+----+
# | id|   name|age|city|
# +---+-------+---+----+
# |  1|  Alice| 28| NYC|
# |  3|Charlie| 22| NYC|
# +---+-------+---+----+

Handling null values requires special methods because NULL comparisons don’t work with standard operators:

data_with_nulls = [
    (1, "Alice", 28, "NYC"),
    (2, "Bob", None, "LA"),
    (3, "Charlie", 22, None),
]

df_nulls = spark.createDataFrame(data_with_nulls, ["id", "name", "age", "city"])

# Filter rows where age is null
df_nulls.filter(df_nulls.age.isNull()).show()

# Filter rows where city is NOT null
df_nulls.filter(df_nulls.city.isNotNull()).show()

Don't use == None or != None. Under Spark's SQL semantics, any comparison with NULL evaluates to NULL, which a filter treats as false, so both conditions silently match no rows.

Multiple Conditions and Logical Operators

Combining conditions requires the bitwise operators: & for AND, | for OR, and ~ for NOT. Python's standard and, or, and not keywords won't work here: they try to convert a Column into a boolean, which raises an error.

Critical requirement: Wrap each condition in parentheses. Python’s operator precedence will break your query without them.

# Filter with AND: age > 25 AND city is NYC
df.filter((df.age > 25) & (df.city == "NYC")).show()

# Output:
# +---+-----+---+----+
# | id| name|age|city|
# +---+-----+---+----+
# |  1|Alice| 28| NYC|
# +---+-----+---+----+

OR conditions for multiple acceptable values:

# Filter rows where city is NYC OR LA
df.filter((df.city == "NYC") | (df.city == "LA")).show()

# Output:
# +---+-------+---+----+
# | id|   name|age|city|
# +---+-------+---+----+
# |  1|  Alice| 28| NYC|
# |  2|    Bob| 35|  LA|
# |  3|Charlie| 22| NYC|
# +---+-------+---+----+

Complex conditions with NOT operator:

# Age over 25 AND NOT in Chicago
df.filter((df.age > 25) & ~(df.city == "Chicago")).show()

# Nested conditions: (age > 30) OR (age < 25 AND city == "NYC")
df.filter((df.age > 30) | ((df.age < 25) & (df.city == "NYC"))).show()

Forgetting the parentheses is one of the most common PySpark mistakes. The error messages are cryptic, so save yourself the debugging time and always include them.

String Filtering Operations

PySpark provides specialized methods for string filtering beyond simple equality checks.

Pattern matching with like() uses SQL wildcards (% for multiple characters, _ for single character):

data_products = [
    (1, "iPhone 14 Pro"),
    (2, "iPad Air"),
    (3, "iPhone 13"),
    (4, "MacBook Pro"),
    (5, "iMac")
]

df_products = spark.createDataFrame(data_products, ["id", "product"])

# Find all iPhone products
df_products.filter(df_products.product.like("iPhone%")).show()

# Products ending with "Pro"
df_products.filter(df_products.product.like("%Pro")).show()

# Products containing "Book"
df_products.filter(df_products.product.like("%Book%")).show()

For complex pattern matching, use rlike() with regular expressions:

# Match products starting with 'i' followed by uppercase letter
df_products.filter(df_products.product.rlike("^i[A-Z]")).show()

# Match products with numbers
df_products.filter(df_products.product.rlike("\\d+")).show()

Simpler string methods for common cases:

# Products starting with "iP"
df_products.filter(df_products.product.startswith("iP")).show()

# Products ending with "Air"
df_products.filter(df_products.product.endswith("Air")).show()

# Products containing "Mac"
df_products.filter(df_products.product.contains("Mac")).show()

These methods are more readable than like() for simple prefix/suffix/contains operations.

Advanced Filtering Techniques

The isin() method efficiently filters for multiple discrete values—cleaner than chaining OR conditions:

# Filter for multiple cities
df.filter(df.city.isin("NYC", "Chicago", "LA")).show()

# Equivalent to, but cleaner than:
df.filter((df.city == "NYC") | (df.city == "Chicago") | (df.city == "LA")).show()

# Works with lists too
valid_cities = ["NYC", "LA"]
df.filter(df.city.isin(valid_cities)).show()

For complex logic, SQL expressions via expr() can be more readable:

from pyspark.sql.functions import expr

# Complex condition as SQL string
df.filter(expr("age > 25 AND city IN ('NYC', 'LA')")).show()

# Useful for dynamic query construction
age_threshold = 30
city_list = "('NYC', 'Chicago')"
query = f"age > {age_threshold} AND city IN {city_list}"
df.filter(expr(query)).show()

Using col() from pyspark.sql.functions allows filtering without referencing the DataFrame:

from pyspark.sql.functions import col

# Useful in method chains where df isn't in scope
df.filter(col("age") > 25).filter(col("city") == "NYC").show()

# Particularly helpful in reusable functions
def filter_by_age(dataframe, min_age):
    return dataframe.filter(col("age") > min_age)

result = filter_by_age(df, 25)

Performance Considerations

Filter placement significantly impacts performance. Spark’s Catalyst optimizer performs predicate pushdown—moving filters as close to the data source as possible. When reading from formats like Parquet or databases, filters can eliminate data before it’s even loaded into memory.

# Good: Filter immediately after reading
df = spark.read.parquet("large_dataset.parquet")
df_filtered = df.filter(col("date") >= "2024-01-01")

# Less efficient: multiple operations before filtering
# (Catalyst can often still push the filter through, but don't rely on it)
df_processed = df.select("*").withColumn("new_col", col("age") * 2)
df_filtered = df_processed.filter(col("date") >= "2024-01-01")

Where you can, order filters from most selective to least selective. Catalyst usually merges consecutive filters into a single predicate, so this rarely changes the physical plan, but it makes your intent clear and lets the combined condition short-circuit on the first failing clause:

# Better: Very selective filter first
df.filter(col("user_id") == 12345).filter(col("age") > 18)

# Less optimal: Broad filter first
df.filter(col("age") > 18).filter(col("user_id") == 12345)

Use explain() to verify your filters are being pushed down:

df_filtered = df.filter(col("age") > 25)
df_filtered.explain(True)

# Look for "PushedFilters" in the output—shows what got pushed to the source

Avoid filtering with UDFs (User Defined Functions) when possible. UDFs prevent predicate pushdown and force Python serialization:

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# Avoid this
def is_valid_age(age):
    return age > 25 and age < 65

is_valid_udf = udf(is_valid_age, BooleanType())
df.filter(is_valid_udf(col("age")))  # Slow, no pushdown

# Do this instead
df.filter((col("age") > 25) & (col("age") < 65))  # Fast, optimized

When working with partitioned data, filter on partition columns first. This allows entire partitions to be skipped:

# If data is partitioned by year and month
df = spark.read.parquet("partitioned_data/")
df.filter((col("year") == 2024) & (col("month") == 1)).show()
# Partition pruning: Spark reads only the year=2024/month=1 partition

Row filtering is straightforward but has significant performance implications in distributed systems. Filter early, filter often, and use built-in methods over custom functions. Your cluster will thank you.
