Filter/Where in PySpark vs Pandas vs SQL


Key Insights

  • PySpark’s filter() and where() are identical—use whichever reads better in context, but be consistent across your codebase
  • Null handling is the biggest gotcha when translating filters between tools: Pandas lets NaN rows pass != filters, while PySpark and SQL silently drop null rows from every comparison
  • PySpark’s lazy evaluation means you can chain multiple filters without performance penalty—the optimizer combines them automatically

The Universal Operation Nobody Agrees On

Filtering rows is the most common data operation you’ll write. Every analysis starts with “give me the rows where X.” Yet the syntax and behavior differ enough between Pandas, PySpark, and SQL that direct translation leads to subtle bugs.

If you’re a data engineer moving between local prototyping in Pandas and production PySpark jobs—or translating analyst SQL into pipeline code—you need to understand these differences. Not just the syntax, but the semantics.

Basic Filtering Syntax Comparison

Let’s start with the simplest case: filter rows where age exceeds 30.

Pandas:

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 35, 42]
})

# Boolean indexing - the Pandas way
result = df[df['age'] > 30]

# Using query() for SQL-like syntax
result = df.query('age > 30')

PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    ('Alice', 25), ('Bob', 35), ('Charlie', 42)
], ['name', 'age'])

# Using filter() with column expression
result = df.filter(col('age') > 30)

# Using where() - identical behavior
result = df.where(col('age') > 30)

# String expression syntax
result = df.filter('age > 30')

SQL:

SELECT * FROM people WHERE age > 30;

The PySpark filter() vs where() question comes up constantly. They’re aliases—literally the same method. Use where() when you want SQL-familiar code, filter() when you’re thinking functionally. Pick one convention for your team and stick with it.

Multiple Conditions and Boolean Logic

Real filters combine conditions. Here’s where syntax divergence gets interesting.

Pandas:

# AND conditions - use & operator with parentheses
result = df[(df['age'] > 30) & (df['department'] == 'Engineering')]

# OR conditions - use | operator
result = df[(df['age'] > 30) | (df['salary'] > 100000)]

# Complex combination
result = df[
    ((df['age'] > 30) & (df['department'] == 'Engineering')) | 
    (df['salary'] > 100000)
]

# query() is cleaner for complex logic
result = df.query('(age > 30 and department == "Engineering") or salary > 100000')
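
A query() convenience worth knowing when conditions get long: local Python variables can be referenced with @, keeping thresholds out of the expression string. A small sketch (the frame and column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'age': [25, 35, 42],
    'salary': [90000, 120000, 80000],
})

threshold = 30  # reference Python variables inside query() with @
result = df.query('age > @threshold and salary > 100000')
# Only the 35-year-old earning 120000 passes both conditions
```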

PySpark:

from pyspark.sql.functions import col

# AND conditions - use & operator
result = df.filter((col('age') > 30) & (col('department') == 'Engineering'))

# OR conditions - use | operator  
result = df.filter((col('age') > 30) | (col('salary') > 100000))

# Complex combination
result = df.filter(
    ((col('age') > 30) & (col('department') == 'Engineering')) | 
    (col('salary') > 100000)
)

# String expression alternative
result = df.filter(
    "(age > 30 AND department = 'Engineering') OR salary > 100000"
)

SQL:

SELECT * FROM people 
WHERE (age > 30 AND department = 'Engineering') 
   OR salary > 100000;

The critical gotcha: parentheses are mandatory in Pandas and PySpark when combining conditions, because Python binds & and | more tightly than comparison operators. Without them, df['age'] > 30 & df['department'] == 'Engineering' tries to evaluate 30 & df['department'] first, which typically raises a TypeError; with integer columns it can even run and silently produce a nonsense mask.
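
You can see the failure mode directly (toy frame, illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    'age': [25, 35],
    'department': ['Sales', 'Engineering'],
})

# Without parentheses, Python evaluates 30 & df['department'] first
try:
    bad = df[df['age'] > 30 & df['department'] == 'Engineering']
    raised = False
except Exception:  # typically a TypeError from int & str
    raised = True

# With parentheses: the intended filter
good = df[(df['age'] > 30) & (df['department'] == 'Engineering')]
```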

For negation, Pandas uses ~, PySpark uses ~ in column expressions (or NOT in string expressions), and SQL uses NOT:

# Pandas
result = df[~(df['age'] > 30)]

# PySpark
result = df.filter(~(col('age') > 30))

# SQL
# SELECT * FROM people WHERE NOT age > 30;

Filtering with Null/None Values

This is where most translation bugs hide. Each tool handles nulls differently in comparisons.

Pandas represents missing values as NaN, which fails <, >, and == comparisons but passes !=:

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, None, 42]
})

# NaN fails any >, <, or == comparison, so Bob's row is excluded here
result = df[df['age'] > 30]  # Returns only Charlie
# But NaN passes !=, so df[df['age'] != 30] would KEEP Bob's row

# Explicit null checks
result = df[df['age'].isna()]       # Rows where age is null
result = df[df['age'].notna()]      # Rows where age is not null
result = df[df['age'].fillna(0) > 30]  # Handle nulls explicitly
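
The asymmetry is easiest to see side by side: a NaN row fails > but passes !=, which is exactly where SQL and PySpark disagree with Pandas. A minimal demonstration:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, None, 42],
})

gt = df[df['age'] > 30]    # NaN > 30 is False: Bob dropped
ne = df[df['age'] != 30]   # NaN != 30 is True: Bob KEPT
# SQL would drop Bob from both queries (NULL != 30 is UNKNOWN)
```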

PySpark excludes nulls from all comparisons silently:

from pyspark.sql.functions import col

df = spark.createDataFrame([
    ('Alice', 25), ('Bob', None), ('Charlie', 42)
], ['name', 'age'])

# This excludes Bob entirely - no error, no warning
result = df.filter(col('age') > 30)

# Explicit null handling
result = df.filter(col('age').isNull())      # Rows where age is null
result = df.filter(col('age').isNotNull())   # Rows where age is not null

# Include nulls in your logic explicitly
result = df.filter((col('age') > 30) | col('age').isNull())

SQL follows three-valued logic:

-- Excludes NULL rows (NULL > 30 evaluates to UNKNOWN, not TRUE)
SELECT * FROM people WHERE age > 30;

-- Explicit null handling
SELECT * FROM people WHERE age IS NULL;
SELECT * FROM people WHERE age IS NOT NULL;
SELECT * FROM people WHERE age > 30 OR age IS NULL;

The danger: when you translate a Pandas filter to PySpark, you might silently drop null rows you were keeping. Always audit your null handling when porting code.
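
When porting in the other direction, you can reproduce SQL's semantics in Pandas by adding an explicit notna() guard. A sketch of the translation:

```python
import pandas as pd

df = pd.DataFrame({'age': [25, None, 42]})

# Pandas default: the NaN row passes !=
pandas_style = df[df['age'] != 30]                     # 3 rows

# SQL/PySpark semantics: NULL != 30 is UNKNOWN, row dropped
sql_style = df[(df['age'] != 30) & df['age'].notna()]  # 2 rows
```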

String and Pattern Matching Filters

Text filtering requires different functions across tools.

Pandas:

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Amanda', 'Charlie'],
    'email': ['alice@test.com', 'bob@example.org', 'amanda@test.com', 'charlie@work.net']
})

# Starts with
result = df[df['name'].str.startswith('A')]

# Contains substring
result = df[df['email'].str.contains('test')]

# Regex matching
result = df[df['email'].str.match(r'.*@.*\.com$')]

# Case-insensitive
result = df[df['name'].str.lower().str.startswith('a')]
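
One more Pandas-specific trap before translating string filters: the .str methods return NaN for missing values, and boolean indexing with a NaN-containing mask raises. Passing na=False treats missing values as non-matching (toy data):

```python
import pandas as pd

df = pd.DataFrame({'email': ['alice@test.com', None, 'bob@example.org']})

# str.contains yields NaN for the missing email; indexing with it raises
try:
    df[df['email'].str.contains('test')]
    raised = False
except ValueError:
    raised = True

# na=False maps missing values to False, so the filter just drops them
hits = df[df['email'].str.contains('test', na=False)]
```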

PySpark:

from pyspark.sql.functions import col, lower

# Starts with
result = df.filter(col('name').startswith('A'))

# Contains substring
result = df.filter(col('email').contains('test'))

# Regex matching (uses Java regex syntax)
result = df.filter(col('email').rlike(r'.*@.*\.com$'))

# Case-insensitive
result = df.filter(lower(col('name')).startswith('a'))

# LIKE patterns (SQL-style wildcards)
result = df.filter(col('name').like('A%'))

SQL:

-- Starts with (using LIKE)
SELECT * FROM people WHERE name LIKE 'A%';

-- Contains substring
SELECT * FROM people WHERE email LIKE '%test%';

-- Regex (syntax varies by database)
SELECT * FROM people WHERE email REGEXP '.*@.*\\.com$';  -- MySQL (backslash doubled in the string literal)
SELECT * FROM people WHERE email ~ '.*@.*\.com$';       -- PostgreSQL

Watch out: PySpark’s rlike() uses Java regex syntax, which differs subtly from Python’s re module. Escape sequences and character classes behave differently.
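
Anchoring is the other translation hazard: Pandas str.match anchors at the start of the string (like Python's re.match), str.fullmatch must cover the whole string, and str.contains searches anywhere, which is closest to how rlike is generally used. A quick Pandas-side comparison:

```python
import pandas as pd

s = pd.Series(['alice@test.com', 'bob@example.org'])

m_match = s.str.match(r'\w+@test')      # anchored at the start only
m_full = s.str.fullmatch(r'.*\.com')    # must cover the whole string
m_contains = s.str.contains(r'\.com')   # searches anywhere, like re.search
```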

Performance Considerations

Pandas filters execute immediately on in-memory data. PySpark filters are lazy—they build an execution plan that’s optimized before running.

# PySpark: These two approaches produce identical execution plans
result1 = df.filter(col('age') > 30).filter(col('department') == 'Engineering')
result2 = df.filter((col('age') > 30) & (col('department') == 'Engineering'))

# Verify with explain()
result1.explain()

Output:

== Physical Plan ==
*(1) Filter ((age#1 > 30) AND (department#2 = Engineering))
+- *(1) Scan [name#0, age#1, department#2]

The optimizer combines chained filters automatically. Write readable code; let Spark optimize.

For data sources that support it, PySpark pushes filters down to the source:

# Reading from Parquet with filter pushdown
df = spark.read.parquet('people.parquet').filter(col('age') > 30)
df.explain()
== Physical Plan ==
*(1) FileScan parquet [name#0,age#1] 
     PushedFilters: [IsNotNull(age), GreaterThan(age,30)]

The filter happens during the scan—Spark never loads filtered-out rows into memory. This is a massive performance win on large datasets that Pandas can’t match.

Quick Reference Cheat Sheet

Operation      Pandas                   PySpark                       SQL
-------------  -----------------------  ----------------------------  ---------------
Basic filter   df[df['col'] > x]        df.filter(col('col') > x)     WHERE col > x
AND            (cond1) & (cond2)        (cond1) & (cond2)             cond1 AND cond2
OR             (cond1) | (cond2)        (cond1) | (cond2)             cond1 OR cond2
NOT            ~(condition)             ~(condition)                  NOT condition
Is null        df['col'].isna()         col('col').isNull()           col IS NULL
Not null       df['col'].notna()        col('col').isNotNull()        col IS NOT NULL
Starts with    .str.startswith('x')     .startswith('x')              LIKE 'x%'
Contains       .str.contains('x')       .contains('x')                LIKE '%x%'
Regex          .str.match(r'...')       .rlike(r'...')                REGEXP / ~
In list        .isin([1,2,3])           .isin([1,2,3])                IN (1,2,3)
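
The isin row is the only one that didn't appear earlier; it replaces a chain of OR-ed equality checks and reads the same in Pandas and PySpark. A Pandas-side sketch:

```python
import pandas as pd

df = pd.DataFrame({'department': ['Sales', 'Engineering', 'HR']})

# Equivalent to: (dept == 'Engineering') | (dept == 'HR')
result = df[df['department'].isin(['Engineering', 'HR'])]
# PySpark mirror: df.filter(col('department').isin('Engineering', 'HR'))
# SQL:            WHERE department IN ('Engineering', 'HR')
```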

The syntax differences are learnable. The null handling differences will bite you. Test your filters with null data before deploying to production.
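
A cheap habit that catches most of these bugs: run every filter against a tiny fixture that deliberately contains a null, and assert what should happen to that row. A minimal sketch (function and fixture names are illustrative):

```python
import pandas as pd

def filter_adults(df):
    # The filter under test; decide explicitly what nulls should do
    return df[df['age'] > 30]

fixture = pd.DataFrame({
    'name': ['keep', 'drop', 'null_row'],
    'age': [45, 20, None],
})

out = filter_adults(fixture)
# Document the intended behavior: the null row is dropped here
assert list(out['name']) == ['keep']
```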
