PySpark - SQL WHERE Clause Examples

Key Insights

  • PySpark’s SQL WHERE clause and DataFrame filter() method are functionally equivalent, but WHERE clauses offer more readable syntax for complex conditions while filter() integrates better with programmatic DataFrame operations.
  • Filter pushdown optimization automatically moves WHERE clause predicates closer to the data source, dramatically improving query performance by reducing the amount of data scanned and transferred.
  • NULL handling in PySpark SQL requires explicit IS NULL/IS NOT NULL checks: a standard equality comparison (= NULL) evaluates to NULL for every row, so the filter silently returns no rows instead of raising an error.

Introduction to WHERE Clause in PySpark

Filtering data is fundamental to any data processing pipeline. PySpark provides two primary approaches: SQL-style WHERE clauses through spark.sql() and the DataFrame API’s filter() method. Both compile to the same execution plan, but they serve different use cases.

Use WHERE clauses when you’re writing complex analytical queries that benefit from SQL’s declarative syntax. Use filter() when you’re building programmatic pipelines where conditions are dynamically constructed. For teams with strong SQL backgrounds, WHERE clauses reduce the learning curve and make code reviews easier.

Let’s create a sample dataset for our examples:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType
from datetime import date

spark = SparkSession.builder.appName("WHEREExamples").getOrCreate()

data = [
    (1, "John Smith", 35, "Engineering", 75000.0, date(2020, 1, 15)),
    (2, "Sarah Johnson", 28, "Sales", 62000.0, date(2021, 3, 22)),
    (3, "Mike Wilson", 42, "Engineering", 95000.0, date(2018, 7, 10)),
    (4, "Emily Brown", 31, "Marketing", 58000.0, date(2019, 11, 5)),
    (5, "David Lee", 29, "Sales", 55000.0, date(2022, 2, 18)),
    (6, "Lisa Anderson", 38, "Engineering", 88000.0, None),
    (7, "James Taylor", 26, "Marketing", None, date(2023, 1, 9)),
    (8, "Maria Garcia", 33, "Sales", 68000.0, date(2020, 9, 14))
]

schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("department", StringType(), True),
    StructField("salary", DoubleType(), True),
    StructField("hire_date", DateType(), True)
])

df = spark.createDataFrame(data, schema)
df.createOrReplaceTempView("employees")

Basic WHERE Clause Filtering

The WHERE clause filters rows based on conditions. Standard comparison operators work exactly as you’d expect from SQL:

# Filter employees older than 30
result = spark.sql("""
    SELECT * FROM employees
    WHERE age > 30
""")
result.show()

# Equivalent using filter()
df.filter(df.age > 30).show()

# Multiple simple conditions
spark.sql("""
    SELECT name, department, salary 
    FROM employees
    WHERE salary >= 60000
""").show()

# Not equal operator
spark.sql("""
    SELECT * FROM employees
    WHERE department != 'Sales'
""").show()

The SQL syntax is cleaner for complex queries, especially when you’re selecting specific columns and applying multiple transformations. The filter() method shines when you’re chaining operations programmatically.

Multiple Conditions with Logical Operators

Combine conditions using AND, OR, and NOT operators. Parentheses control evaluation order—use them liberally for clarity:

# AND operator - both conditions must be true
spark.sql("""
    SELECT name, age, department, salary
    FROM employees
    WHERE age > 25 AND department = 'Sales'
""").show()

# OR operator - at least one condition must be true
spark.sql("""
    SELECT name, department, salary
    FROM employees
    WHERE department = 'Engineering' OR salary > 70000
""").show()

# Complex conditions with parentheses
spark.sql("""
    SELECT name, age, department, salary
    FROM employees
    WHERE (age > 30 AND department = 'Engineering')
       OR (department = 'Sales' AND salary > 60000)
""").show()

# NOT operator
spark.sql("""
    SELECT * FROM employees
    WHERE NOT (department = 'Marketing' OR age < 30)
""").show()

Operator precedence in PySpark SQL follows standard SQL rules: NOT evaluates first, then AND, then OR. Always use parentheses when mixing AND/OR to make your intent explicit—future maintainers will thank you.

Pattern Matching and String Operations

String filtering goes beyond exact matches. LIKE handles simple patterns while RLIKE provides full regex power:

# LIKE for pattern matching (% matches any characters)
spark.sql("""
    SELECT name FROM employees
    WHERE name LIKE '%son%'
""").show()

# LIKE with specific patterns
spark.sql("""
    SELECT name FROM employees
    WHERE name LIKE 'M%'  -- Names starting with M
""").show()

# RLIKE for regex patterns
spark.sql("""
    SELECT name FROM employees
    WHERE name RLIKE '^[JS]'  -- Names starting with J or S
""").show()

# Case-insensitive filtering with LOWER
spark.sql("""
    SELECT name, department FROM employees
    WHERE LOWER(department) = 'engineering'
""").show()

# Combining string functions
spark.sql("""
    SELECT name FROM employees
    WHERE LOWER(name) LIKE '%john%'
""").show()

RLIKE accepts Java regex patterns, which are slightly different from Python’s re module. Test your patterns thoroughly. For simple prefix/suffix matching, LIKE is faster and more readable.

Working with NULL Values and IN Clause

NULL handling trips up many developers. In SQL, NULL represents unknown values, so equality comparisons always return NULL (treated as false):

# WRONG - This won't work as expected
spark.sql("""
    SELECT * FROM employees
    WHERE salary = NULL  -- Returns no rows!
""").show()

# CORRECT - Use IS NULL
spark.sql("""
    SELECT name, salary FROM employees
    WHERE salary IS NULL
""").show()

# IS NOT NULL
spark.sql("""
    SELECT name, hire_date FROM employees
    WHERE hire_date IS NOT NULL
""").show()

# IN clause for multiple values
spark.sql("""
    SELECT name, department FROM employees
    WHERE department IN ('Engineering', 'Sales')
""").show()

# NOT IN clause
spark.sql("""
    SELECT name, department FROM employees
    WHERE department NOT IN ('Marketing')
""").show()

# Combining NULL checks with other conditions
spark.sql("""
    SELECT name, salary, hire_date FROM employees
    WHERE salary IS NOT NULL AND hire_date IS NOT NULL
""").show()

Be careful with NOT IN when the list contains NULLs—the entire condition evaluates to NULL. Use NOT EXISTS with subqueries for safer null-handling in complex scenarios.

Advanced WHERE Clause Patterns

BETWEEN simplifies range queries. Subqueries enable sophisticated filtering based on aggregated or derived data:

# BETWEEN for inclusive ranges
spark.sql("""
    SELECT name, age FROM employees
    WHERE age BETWEEN 28 AND 35
""").show()

# Date range filtering
spark.sql("""
    SELECT name, hire_date FROM employees
    WHERE hire_date BETWEEN '2020-01-01' AND '2021-12-31'
""").show()

# Subquery in WHERE clause
spark.sql("""
    SELECT name, salary FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees WHERE salary IS NOT NULL)
""").show()

# Correlated subquery
spark.sql("""
    SELECT e1.name, e1.department, e1.salary
    FROM employees e1
    WHERE e1.salary > (
        SELECT AVG(e2.salary) 
        FROM employees e2 
        WHERE e2.department = e1.department AND e2.salary IS NOT NULL
    )
""").show()

# Date functions in WHERE
spark.sql("""
    SELECT name, hire_date FROM employees
    WHERE YEAR(hire_date) = 2020
""").show()

Subqueries add computational overhead. When possible, use joins or window functions instead. Correlated subqueries are particularly expensive because they execute once per outer row.

Performance Tips and Best Practices

Understanding how Spark optimizes WHERE clauses is crucial for performance. The Catalyst optimizer performs predicate pushdown, moving filters as close to the data source as possible:

# Demonstrate pushdown with partitioned data

# Create a partitioned copy of the sample data
df.write.partitionBy("department").mode("overwrite").parquet("/tmp/employees_partitioned")
df_partitioned = spark.read.parquet("/tmp/employees_partitioned")
df_partitioned.createOrReplaceTempView("employees_partitioned")

# Filter on the partition column - Spark prunes entire partitions before reading
result = spark.sql("""
    SELECT * FROM employees_partitioned
    WHERE department = 'Engineering'
""")

# View the physical plan - look for PartitionFilters in the scan node
result.explain()

# Filter early, select late
# GOOD - filter first
spark.sql("""
    SELECT name, salary FROM employees
    WHERE department = 'Engineering' AND salary > 70000
""").show()

# Avoid functions on filtered columns when possible
# SLOWER - wrapping the column in a function blocks predicate pushdown
spark.sql("""
    SELECT * FROM employees
    WHERE YEAR(hire_date) = 2020
""").show()

# FASTER - use range comparison
spark.sql("""
    SELECT * FROM employees
    WHERE hire_date >= '2020-01-01' AND hire_date < '2021-01-01'
""").show()

Key performance guidelines:

Filter on partition columns first: When data is partitioned by department, filtering on department prunes entire partitions before reading data.

Avoid UDFs in WHERE clauses: User-defined functions prevent pushdown optimization and force row-by-row processing.

Use specific data types: Casting columns in WHERE clauses (e.g., CAST(age AS STRING)) prevents predicate pushdown. Store data in the correct type.

Combine filters efficiently: Multiple WHERE conditions connected by AND are more efficient than separate filter operations because they’re evaluated together.

The choice between WHERE and filter() matters less than you think—both compile to the same optimized plan. Choose based on readability and team preferences. SQL WHERE clauses excel in analytical contexts with complex conditions, while filter() fits better in programmatic ETL pipelines where conditions are built dynamically.

Always examine query plans with explain() when optimizing. The physical plan reveals whether your filters are being pushed down to the data source and shows the actual execution strategy Spark will use.
