Filter/Where in PySpark vs Pandas vs SQL
Key Insights
- PySpark's `filter()` and `where()` are identical: use whichever reads better in context, but be consistent across your codebase
- Null handling is the biggest gotcha when translating filters between tools: Pandas keeps null rows under `!=` and negated comparisons (NaN != x evaluates to True), while PySpark and SQL silently exclude nulls from every comparison and its negation
- PySpark’s lazy evaluation means you can chain multiple filters without performance penalty—the optimizer combines them automatically
The Universal Operation Nobody Agrees On
Filtering rows is the most common data operation you’ll write. Every analysis starts with “give me the rows where X.” Yet the syntax and behavior differ enough between Pandas, PySpark, and SQL that direct translation leads to subtle bugs.
If you’re a data engineer moving between local prototyping in Pandas and production PySpark jobs—or translating analyst SQL into pipeline code—you need to understand these differences. Not just the syntax, but the semantics.
Basic Filtering Syntax Comparison
Let’s start with the simplest case: filter rows where age exceeds 30.
Pandas:
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 35, 42]
})
# Boolean indexing - the Pandas way
result = df[df['age'] > 30]
# Using query() for SQL-like syntax
result = df.query('age > 30')
PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
('Alice', 25), ('Bob', 35), ('Charlie', 42)
], ['name', 'age'])
# Using filter() with column expression
result = df.filter(col('age') > 30)
# Using where() - identical behavior
result = df.where(col('age') > 30)
# String expression syntax
result = df.filter('age > 30')
SQL:
SELECT * FROM people WHERE age > 30;
The PySpark filter() vs where() question comes up constantly. They’re aliases—literally the same method. Use where() when you want SQL-familiar code, filter() when you’re thinking functionally. Pick one convention for your team and stick with it.
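All three spellings select the same rows. As a quick local sanity check (sample data is mine, with sqlite3 standing in for the SQL engine):

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'],
                   'age': [25, 35, 42]})

# Pandas boolean indexing
pandas_names = df[df['age'] > 30]['name'].tolist()

# The same predicate through a real SQL engine
conn = sqlite3.connect(':memory:')
df.to_sql('people', conn, index=False)
sql_names = [row[0] for row in
             conn.execute('SELECT name FROM people WHERE age > 30')]
conn.close()

print(pandas_names)  # ['Bob', 'Charlie']
print(sql_names)     # ['Bob', 'Charlie']
```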
Multiple Conditions and Boolean Logic
Real filters combine conditions. Here’s where syntax divergence gets interesting.
Pandas:
# AND conditions - use & operator with parentheses
result = df[(df['age'] > 30) & (df['department'] == 'Engineering')]
# OR conditions - use | operator
result = df[(df['age'] > 30) | (df['salary'] > 100000)]
# Complex combination
result = df[
((df['age'] > 30) & (df['department'] == 'Engineering')) |
(df['salary'] > 100000)
]
# query() is cleaner for complex logic
result = df.query('(age > 30 and department == "Engineering") or salary > 100000')
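If you want to convince yourself the two Pandas spellings agree, a quick equivalence check (sample data is mine):

```python
import pandas as pd

df = pd.DataFrame({
    'age': [25, 35, 42, 29],
    'department': ['Sales', 'Engineering', 'Engineering', 'Sales'],
    'salary': [90000, 95000, 120000, 110000],
})

# Boolean indexing and query() for the same complex condition
mask = (((df['age'] > 30) & (df['department'] == 'Engineering'))
        | (df['salary'] > 100000))
via_mask = df[mask]
via_query = df.query('(age > 30 and department == "Engineering") or salary > 100000')

print(via_mask.equals(via_query))  # True
```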
PySpark:
from pyspark.sql.functions import col
# AND conditions - use & operator
result = df.filter((col('age') > 30) & (col('department') == 'Engineering'))
# OR conditions - use | operator
result = df.filter((col('age') > 30) | (col('salary') > 100000))
# Complex combination
result = df.filter(
((col('age') > 30) & (col('department') == 'Engineering')) |
(col('salary') > 100000)
)
# String expression alternative
result = df.filter(
"(age > 30 AND department = 'Engineering') OR salary > 100000"
)
SQL:
SELECT * FROM people
WHERE (age > 30 AND department = 'Engineering')
OR salary > 100000;
The critical gotcha: parentheses are mandatory in Pandas and PySpark when combining conditions. Python’s operator precedence puts & and | higher than comparison operators. Without parentheses, df['age'] > 30 & df['department'] == 'Engineering' evaluates as df['age'] > (30 & df['department']) == 'Engineering'—which throws an error or produces nonsense.
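You can watch the failure happen in plain Pandas (sample data is mine):

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 35],
                   'department': ['Sales', 'Engineering']})

# Correct: each comparison is parenthesized before combining with &
ok = df[(df['age'] > 30) & (df['department'] == 'Engineering')]
print(len(ok))  # 1

# Without parentheses, Python tries 30 & df['department'] first,
# a bitwise AND between an int and strings, which raises TypeError
try:
    df[df['age'] > 30 & df['department'] == 'Engineering']
    raised = False
except TypeError:
    raised = True
print(raised)  # True
```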
For negation, Pandas uses ~, PySpark uses ~ in column expressions (or NOT inside string expressions), and SQL uses NOT:
# Pandas
result = df[~(df['age'] > 30)]
# PySpark
result = df.filter(~(col('age') > 30))
# SQL
# SELECT * FROM people WHERE NOT age > 30;
Filtering with Null/None Values
This is where most translation bugs hide. Each tool handles nulls differently in comparisons.
Pandas represents missing values as NaN. Ordering comparisons against NaN evaluate to False, while != evaluates to True:
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, None, 42]
})
# NaN > 30 evaluates to False, so Bob's row is silently excluded here
result = df[df['age'] > 30] # Returns only Charlie
# But NaN != 25 evaluates to True, so Bob WOULD appear in df[df['age'] != 25]
# Explicit null checks
result = df[df['age'].isna()] # Rows where age is null
result = df[df['age'].notna()] # Rows where age is not null
result = df[df['age'].fillna(0) > 30] # Handle nulls explicitly
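A minimal demonstration of those semantics:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'],
                   'age': [25, None, 42]})

# Every ordering comparison against NaN is False...
print(np.nan > 30, np.nan <= 30)  # False False

# ...so Bob is dropped by the > filter
gt_names = df[df['age'] > 30]['name'].tolist()
print(gt_names)  # ['Charlie']

# ...but != treats NaN as "not equal to anything", so Bob comes back
ne_names = df[df['age'] != 25]['name'].tolist()
print(ne_names)  # ['Bob', 'Charlie']
```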
PySpark excludes nulls from all comparisons silently:
from pyspark.sql.functions import col
df = spark.createDataFrame([
('Alice', 25), ('Bob', None), ('Charlie', 42)
], ['name', 'age'])
# This excludes Bob entirely - no error, no warning
result = df.filter(col('age') > 30)
# Explicit null handling
result = df.filter(col('age').isNull()) # Rows where age is null
result = df.filter(col('age').isNotNull()) # Rows where age is not null
# Include nulls in your logic explicitly
result = df.filter((col('age') > 30) | col('age').isNull())
SQL follows three-valued logic:
-- Excludes NULL rows (NULL > 30 evaluates to UNKNOWN, not TRUE)
SELECT * FROM people WHERE age > 30;
-- Explicit null handling
SELECT * FROM people WHERE age IS NULL;
SELECT * FROM people WHERE age IS NOT NULL;
SELECT * FROM people WHERE age > 30 OR age IS NULL;
The danger: when you translate a Pandas filter to PySpark, negated and != filters silently change meaning. Pandas keeps NaN rows under ~(df['age'] > 30) or df['age'] != 25, while PySpark and SQL drop null rows from both a comparison and its negation. Always audit your null handling when porting code.
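The negation case is the trap in miniature. Here sqlite3 stands in for the SQL engine (sample data is mine):

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'],
                   'age': [25, None, 42]})

# Pandas: NaN > 30 is False, so ~ flips Bob's row back IN
pandas_not = df[~(df['age'] > 30)]['name'].tolist()
print(pandas_not)  # ['Alice', 'Bob']

# SQL: NOT (NULL > 30) is still NULL, so Bob stays OUT
conn = sqlite3.connect(':memory:')
df.to_sql('people', conn, index=False)  # NaN is written as NULL
sql_not = [r[0] for r in
           conn.execute('SELECT name FROM people WHERE NOT age > 30')]
conn.close()
print(sql_not)  # ['Alice']
```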
String and Pattern Matching Filters
Text filtering requires different functions across tools.
Pandas:
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Amanda', 'Charlie'],
'email': ['alice@test.com', 'bob@example.org', 'amanda@test.com', 'charlie@work.net']
})
# Starts with
result = df[df['name'].str.startswith('A')]
# Contains substring
result = df[df['email'].str.contains('test')]
# Regex matching
result = df[df['email'].str.match(r'.*@.*\.com$')]
# Case-insensitive
result = df[df['name'].str.lower().str.startswith('a')]
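One extra Pandas wrinkle worth knowing: str.contains treats its pattern as a regular expression by default, so an unescaped dot matches any character. A small demonstration (sample data is mine):

```python
import pandas as pd

df = pd.DataFrame({'email': ['alice@test.com', 'carol@xcompany.org']})

# Regex mode (the default): '.' matches the 'x' in 'xcompany'
regex_hits = df[df['email'].str.contains('.com')]['email'].tolist()
print(regex_hits)  # ['alice@test.com', 'carol@xcompany.org']

# Pass regex=False (or escape the dot) for a literal substring match
literal_hits = df[df['email'].str.contains('.com', regex=False)]['email'].tolist()
print(literal_hits)  # ['alice@test.com']
```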
PySpark:
from pyspark.sql.functions import col, lower
# Starts with
result = df.filter(col('name').startswith('A'))
# Contains substring
result = df.filter(col('email').contains('test'))
# Regex matching (uses Java regex syntax)
result = df.filter(col('email').rlike(r'.*@.*\.com$'))
# Case-insensitive
result = df.filter(lower(col('name')).startswith('a'))
# LIKE patterns (SQL-style wildcards)
result = df.filter(col('name').like('A%'))
SQL:
-- Starts with (using LIKE)
SELECT * FROM people WHERE name LIKE 'A%';
-- Contains substring
SELECT * FROM people WHERE email LIKE '%test%';
-- Regex (syntax varies by database)
SELECT * FROM people WHERE email REGEXP '.*@.*\.com$'; -- MySQL
SELECT * FROM people WHERE email ~ '.*@.*\.com$'; -- PostgreSQL
Watch out: PySpark’s rlike() uses Java regex syntax, which differs subtly from Python’s re module. Escape sequences and character classes behave differently.
Performance Considerations
Pandas filters execute immediately on in-memory data. PySpark filters are lazy—they build an execution plan that’s optimized before running.
# PySpark: These two approaches produce identical execution plans
result1 = df.filter(col('age') > 30).filter(col('department') == 'Engineering')
result2 = df.filter((col('age') > 30) & (col('department') == 'Engineering'))
# Verify with explain()
result1.explain()
Output:
== Physical Plan ==
*(1) Filter ((age#1 > 30) AND (department#2 = Engineering))
+- *(1) Scan [name#0, age#1, department#2]
The optimizer combines chained filters automatically. Write readable code; let Spark optimize.
For data sources that support it, PySpark pushes filters down to the source:
# Reading from Parquet with filter pushdown
df = spark.read.parquet('people.parquet').filter(col('age') > 30)
df.explain()
== Physical Plan ==
*(1) FileScan parquet [name#0,age#1]
PushedFilters: [IsNotNull(age), GreaterThan(age,30)]
The filter happens during the scan—Spark never loads filtered-out rows into memory. This is a massive performance win on large datasets that Pandas can’t match.
Quick Reference Cheat Sheet
| Operation | Pandas | PySpark | SQL |
|---|---|---|---|
| Basic filter | `df[df['col'] > x]` | `df.filter(col('col') > x)` | `WHERE col > x` |
| AND | `(cond1) & (cond2)` | `(cond1) & (cond2)` | `cond1 AND cond2` |
| OR | `(cond1) \| (cond2)` | `(cond1) \| (cond2)` | `cond1 OR cond2` |
| NOT | `~(condition)` | `~(condition)` | `NOT condition` |
| Is null | `df['col'].isna()` | `col('col').isNull()` | `col IS NULL` |
| Not null | `df['col'].notna()` | `col('col').isNotNull()` | `col IS NOT NULL` |
| Starts with | `.str.startswith('x')` | `.startswith('x')` | `LIKE 'x%'` |
| Contains | `.str.contains('x')` | `.contains('x')` | `LIKE '%x%'` |
| Regex | `.str.match(r'...')` | `.rlike(r'...')` | `REGEXP` / `~` |
| In list | `.isin([1,2,3])` | `.isin([1,2,3])` | `IN (1,2,3)` |
The syntax differences are learnable. The null handling differences will bite you. Test your filters with null data before deploying to production.