PySpark - Filter Rows Between Two Values
Key Insights
- PySpark offers three primary methods for range filtering: the between() function for simplicity, comparison operators for complex conditions, and SQL syntax for SQL-familiar teams. All produce identical results with negligible performance differences.
- Always wrap comparison operators in parentheses when combining conditions with & or | to avoid operator precedence errors that will crash your application.
- Null values are excluded by default in range filters, but explicit null handling with isNotNull() makes your intent clear and prevents unexpected results in production datasets.
Introduction
Filtering rows within a specific range is one of the most common operations in data processing. Whether you’re analyzing sales data within a date range, identifying employees within a salary band, or filtering sensor readings within acceptable thresholds, range-based filtering is essential for practical data analysis.
PySpark provides multiple approaches to filter rows between two values, each with distinct syntax and use cases. Understanding these methods allows you to write cleaner, more maintainable code while avoiding common pitfalls that can lead to incorrect results or runtime errors.
This article demonstrates three production-ready approaches: the between() method for straightforward cases, comparison operators for complex conditions, and SQL syntax for teams comfortable with SQL. You’ll learn when to use each approach and how to handle edge cases like null values properly.
Setup and Sample Data
Let’s create a realistic dataset to demonstrate filtering operations. We’ll use employee data with salaries, hire dates, and ages—typical fields where range filtering applies.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date
from datetime import date
# Initialize SparkSession
spark = SparkSession.builder \
    .appName("FilterBetweenValues") \
    .getOrCreate()
# Create sample employee data
data = [
    (1, "Alice", 75000, "2019-03-15", 32),
    (2, "Bob", 45000, "2020-07-22", 28),
    (3, "Charlie", 95000, "2018-01-10", 45),
    (4, "Diana", 120000, "2017-11-05", 38),
    (5, "Eve", 55000, "2021-02-18", 26),
    (6, "Frank", None, "2022-05-30", 29),
    (7, "Grace", 85000, None, 35),
]
columns = ["id", "name", "salary", "hire_date", "age"]
df = spark.createDataFrame(data, columns)
# Convert hire_date to proper date type
df = df.withColumn("hire_date", to_date(col("hire_date")))
df.show()
This creates a DataFrame with intentional null values to demonstrate proper null handling—a critical consideration in production environments.
Using the between() Method
The between() method is PySpark’s most readable and concise approach for range filtering. It takes two arguments: the lower and upper bounds (both inclusive).
# Filter employees with salaries between 50,000 and 100,000
salary_filtered = df.filter(col("salary").between(50000, 100000))
print("Employees with salaries between 50K and 100K:")
salary_filtered.show()
Output:
+---+-------+------+----------+---+
| id| name|salary| hire_date|age|
+---+-------+------+----------+---+
| 1| Alice| 75000|2019-03-15| 32|
| 3|Charlie| 95000|2018-01-10| 45|
| 5| Eve| 55000|2021-02-18| 26|
| 7| Grace| 85000| null| 35|
+---+-------+------+----------+---+
The between() method works equally well with dates. Here’s how to filter employees hired within a specific date range:
from datetime import date
# Filter employees hired between 2018 and 2020
start_date = date(2018, 1, 1)
end_date = date(2020, 12, 31)
date_filtered = df.filter(col("hire_date").between(start_date, end_date))
print("Employees hired between 2018 and 2020:")
date_filtered.show()
Output:
+---+-------+------+----------+---+
| id| name|salary| hire_date|age|
+---+-------+------+----------+---+
| 1| Alice| 75000|2019-03-15| 32|
| 3|Charlie| 95000|2018-01-10| 45|
+---+-------+------+----------+---+
The between() method is inclusive on both ends, meaning it includes rows where the column value equals either boundary value.
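The inclusive semantics can be sketched in plain Python. This is a stand-in for the Column expression to make the boundary behavior concrete, not PySpark itself:

```python
def between_inclusive(value, lower, upper):
    """Mirror SQL BETWEEN semantics: both bounds are included."""
    if value is None:
        return False  # a null never satisfies a range predicate
    return lower <= value <= upper

print(between_inclusive(50000, 50000, 100000))   # True: lower bound included
print(between_inclusive(100000, 50000, 100000))  # True: upper bound included
print(between_inclusive(100001, 50000, 100000))  # False: just past the upper bound
```

If you need a half-open range, between() cannot express it; that is where the comparison operators in the next section come in.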
Using Comparison Operators
For more complex filtering logic, or when you need asymmetric boundaries (exclusive on one end), use comparison operators with the filter() or where() methods. These methods are functionally identical; use whichever reads better for your team.
# Equivalent to between() using comparison operators
salary_filtered_ops = df.filter(
    (col("salary") >= 50000) & (col("salary") <= 100000)
)
print("Using comparison operators:")
salary_filtered_ops.show()
Critical: Always wrap each condition in parentheses. Python’s operator precedence can cause unexpected behavior without them:
# WRONG - Will cause errors or unexpected results
# df.filter(col("salary") >= 50000 & col("salary") <= 100000)
# CORRECT - Parentheses ensure proper evaluation
df.filter((col("salary") >= 50000) & (col("salary") <= 100000))
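The precedence problem is visible with plain integers. Python's & binds more tightly than the comparison operators, so the unparenthesized form is parsed as a bitwise AND inside a chained comparison. On PySpark Columns the same mistake typically raises an error rather than silently returning a wrong answer; the pure-Python sketch below shows the underlying parse:

```python
salary = 40000  # outside the 50K-100K range

# Without parentheses, & binds tighter than >= and <=, so Python parses
# the expression as: salary >= (50000 & salary) <= 100000
wrong = salary >= 50000 & salary <= 100000
right = (salary >= 50000) & (salary <= 100000)

print(wrong)  # True  -- the bitwise AND silently changed the comparison
print(right)  # False -- 40000 is below the lower bound
```

With integers the bug is silent; with Columns it crashes. Either way, the parentheses are not optional.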
Comparison operators shine when you need exclusive boundaries or complex multi-condition filters:
# Exclusive upper bound: salary >= 50000 AND salary < 100000
exclusive_upper = df.filter(
    (col("salary") >= 50000) & (col("salary") < 100000)
)
# Multiple range conditions
complex_filter = df.filter(
    ((col("salary") >= 50000) & (col("salary") <= 100000)) &
    ((col("age") >= 30) & (col("age") <= 40))
)
print("Employees with salary 50K-100K and age 30-40:")
complex_filter.show()
Output:
+---+-----+------+----------+---+
| id| name|salary| hire_date|age|
+---+-----+------+----------+---+
| 1|Alice| 75000|2019-03-15| 32|
| 7|Grace| 85000| null| 35|
+---+-----+------+----------+---+
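You can also invert a range with ~, e.g. df.filter(~col("salary").between(50000, 100000)) to select rows outside the band. The logic, sketched in plain Python with SQL's null behavior:

```python
def outside_range(value, lower, upper):
    """NOT BETWEEN in SQL terms: true only for non-null values outside the range."""
    if value is None:
        return False  # NOT(NULL) is still NULL, so the row is filtered out
    return not (lower <= value <= upper)

print(outside_range(45000, 50000, 100000))   # True:  below the range
print(outside_range(120000, 50000, 100000))  # True:  above the range
print(outside_range(75000, 50000, 100000))   # False: inside the range
print(outside_range(None, 50000, 100000))    # False: null salary is dropped
```

Note that negating a range does not turn null rows into matches; they are excluded on both sides of the filter.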
SQL-Style Filtering with spark.sql()
Teams with strong SQL backgrounds often prefer SQL syntax. PySpark allows you to register DataFrames as temporary views and query them with standard SQL.
# Register DataFrame as temporary view
df.createOrReplaceTempView("employees")
# Use SQL BETWEEN syntax
sql_filtered = spark.sql("""
    SELECT *
    FROM employees
    WHERE salary BETWEEN 50000 AND 100000
""")
print("Using SQL BETWEEN:")
sql_filtered.show()
SQL syntax is particularly valuable when migrating existing SQL queries to PySpark or when working with analysts more comfortable with SQL than Python:
# More complex SQL query with multiple conditions
complex_sql = spark.sql("""
    SELECT name, salary, hire_date
    FROM employees
    WHERE salary BETWEEN 50000 AND 100000
      AND hire_date BETWEEN '2018-01-01' AND '2020-12-31'
      AND age >= 30
    ORDER BY salary DESC
""")
print("Complex SQL query results:")
complex_sql.show()
The SQL approach produces identical execution plans to the DataFrame API methods—choose based on team preference and code maintainability, not performance.
Performance Considerations and Best Practices
All three filtering methods compile to the same logical plan in PySpark’s Catalyst optimizer, resulting in identical performance. Choose based on readability and team familiarity.
Null Handling: Range filters automatically exclude null values, but explicit null handling makes your intent clear:
# Explicit null handling
salary_with_null_check = df.filter(
    col("salary").isNotNull() &
    (col("salary").between(50000, 100000))
)
print("With explicit null handling:")
salary_with_null_check.show()
This produces the same result as between() alone but documents that null exclusion is intentional, not an oversight.
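The reason nulls vanish from range filters is SQL's three-valued logic: a comparison involving NULL yields NULL (not False), and a WHERE clause keeps only rows where the predicate is exactly true. A minimal Python model, with None standing in for NULL:

```python
def sql_ge(a, b):
    """SQL >=: any comparison involving NULL yields NULL (None here)."""
    return None if a is None or b is None else a >= b

def sql_and(p, q):
    """SQL AND under three-valued logic."""
    if p is False or q is False:
        return False
    if p is None or q is None:
        return None
    return p and q

def row_kept(predicate):
    """A WHERE clause keeps a row only when the predicate is exactly TRUE."""
    return predicate is True

# Frank's null salary: 50000 <= NULL <= 100000 evaluates to NULL, not False
pred = sql_and(sql_ge(None, 50000), sql_ge(100000, None))
print(row_kept(pred))  # False: the row is silently excluded
```

This is why the explicit isNotNull() check above changes nothing about the result, only about how clearly the code states its intent.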
Partition Pruning: When filtering on partitioned columns, PySpark can skip entire partitions, dramatically improving performance:
# If your data is partitioned by hire_date
# Only partitions within the date range are scanned
optimized_filter = df.filter(
    col("hire_date").between(date(2019, 1, 1), date(2020, 12, 31))
)
Predicate Pushdown: When reading from Parquet, ORC, or databases, range filters are pushed down to the data source, reducing the amount of data transferred:
# Filter is pushed to Parquet reader - only matching rows are read
df_parquet = spark.read.parquet("employees.parquet") \
    .filter(col("salary").between(50000, 100000))
Type Safety: Ensure your boundary values match the column type to avoid implicit conversions:
# Good: matching types
df.filter(col("salary").between(50000, 100000))
# Avoid: type mismatches that cause implicit conversions
# df.filter(col("salary").between("50000", "100000"))
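The hazard with string bounds is lexicographic comparison. Plain Python makes it visible; Spark's actual behavior depends on its implicit cast rules, so treat this as an illustration of the risk rather than of Spark's exact semantics:

```python
# Strings compare character by character, not numerically
print("9" <= "50000")   # False: "9" sorts after "5"
print(9 <= 50000)       # True:  numeric comparison behaves as expected

# A list of string salaries silently misorders values
salaries = ["45000", "9500", "120000"]
print(sorted(salaries))           # ['120000', '45000', '9500'] -- lexicographic
print(sorted(salaries, key=int))  # ['9500', '45000', '120000'] -- numeric
```

Keeping boundary literals the same type as the column sidesteps the issue entirely.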
Conclusion
PySpark offers three equivalent approaches for filtering rows between two values: the between() method for simplicity, comparison operators for flexibility, and SQL syntax for SQL-familiar teams. All three produce identical performance through Catalyst optimization.
Use between() for straightforward range filters—it’s the most readable option. Switch to comparison operators when you need exclusive boundaries or complex multi-condition logic. Choose SQL syntax when migrating existing queries or working with SQL-oriented teams.
Always handle nulls explicitly in production code, wrap comparison operators in parentheses to avoid precedence issues, and leverage partition pruning and predicate pushdown for optimal performance on large datasets. These practices will help you write robust, maintainable data processing pipelines.