PySpark - SQL BETWEEN Operator

Key Insights

  • The BETWEEN operator in PySpark provides inclusive range filtering for numeric, date, and string data types, with both SQL and DataFrame API syntax options that perform identically under the hood.
  • While col.between(lower, upper) offers cleaner syntax than manual comparison operators, both compile to the same execution plan, so choose based on code readability rather than performance concerns.
  • BETWEEN filters out rows where the tested column is NULL, because a NULL comparison evaluates to NULL and fails the filter; if you want to keep those rows, you must add an explicit isNull() condition.

Introduction to BETWEEN Operator in PySpark

The BETWEEN operator filters data within a specified range, making it essential for analytics workflows involving date ranges, price brackets, or any bounded numeric criteria. In PySpark, you have multiple ways to express this logic: native SQL syntax within spark.sql(), the DataFrame API’s between() method, or manual comparison operators.

Here’s the fundamental translation from SQL to PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("BetweenExample").getOrCreate()

# Sample data
data = [(1, 25), (2, 35), (3, 45), (4, 55)]
df = spark.createDataFrame(data, ["id", "age"])

# SQL approach
df.createOrReplaceTempView("people")
result_sql = spark.sql("SELECT * FROM people WHERE age BETWEEN 30 AND 50")

# DataFrame API approach
result_df = df.filter(col("age").between(30, 50))

# Manual comparison (equivalent)
result_manual = df.filter((col("age") >= 30) & (col("age") <= 50))

All three approaches return identical results: records with ages 35 and 45. The BETWEEN operator is inclusive on both boundaries, a critical detail that catches developers off guard when they expect exclusive ranges.

Basic Syntax and Usage

The between() method accepts two parameters: the lower and upper bounds. Both boundaries are included in the result set. This inclusive behavior differs from some programming constructs like Python’s range(), which excludes the upper bound.

For numeric ranges:

# Create sample employee data
employees = [
    (1, "Alice", 32, 75000),
    (2, "Bob", 45, 85000),
    (3, "Charlie", 28, 62000),
    (4, "Diana", 52, 95000),
    (5, "Eve", 38, 78000)
]
df_emp = spark.createDataFrame(employees, ["id", "name", "age", "salary"])

# Filter employees aged 30-40 (inclusive)
filtered_age = df_emp.filter(col("age").between(30, 40))
filtered_age.show()

# Output includes ages 32 and 38, excludes 28, 45, 52

For date ranges, you need to work with proper date types:

from pyspark.sql.functions import to_date, lit

# Sample order data
orders = [
    (1, "2024-01-15", 150.00),
    (2, "2024-02-20", 200.00),
    (3, "2024-03-10", 175.00),
    (4, "2024-04-05", 220.00)
]
df_orders = spark.createDataFrame(orders, ["order_id", "order_date", "amount"])
df_orders = df_orders.withColumn("order_date", to_date(col("order_date")))

# Filter orders in Q1 2024
q1_orders = df_orders.filter(
    col("order_date").between(lit("2024-01-01"), lit("2024-03-31"))
)
q1_orders.show()

# Returns orders from January, February, and March

The lit() function wraps literal values as column expressions, making the intent explicit. It is optional here: when the column is already a date type, Spark automatically casts plain string bounds like "2024-01-01" to dates.

BETWEEN with Different Data Types

The BETWEEN operator adapts to various data types with intuitive behavior:

Integer and Float Ranges:

products = [
    (1, "Widget A", 19.99),
    (2, "Widget B", 45.50),
    (3, "Widget C", 89.99),
    (4, "Widget D", 120.00)
]
df_products = spark.createDataFrame(products, ["product_id", "name", "price"])

# Mid-range products
mid_range = df_products.filter(col("price").between(40.0, 100.0))
mid_range.show()
# Returns Widget B and Widget C

Timestamp Ranges:

from pyspark.sql.functions import to_timestamp

logs = [
    (1, "2024-01-15 08:30:00"),
    (2, "2024-01-15 12:45:00"),
    (3, "2024-01-15 18:20:00"),
    (4, "2024-01-15 23:10:00")
]
df_logs = spark.createDataFrame(logs, ["log_id", "timestamp"])
df_logs = df_logs.withColumn("timestamp", to_timestamp(col("timestamp")))

# Business hours (9 AM to 6 PM)
business_hours = df_logs.filter(
    col("timestamp").between(
        lit("2024-01-15 09:00:00").cast("timestamp"),
        lit("2024-01-15 18:00:00").cast("timestamp")
    )
)
business_hours.show()

String Alphabetical Ranges:

names = [("Alice",), ("Bob",), ("Charlie",), ("Diana",), ("Frank",)]
df_names = spark.createDataFrame(names, ["name"])

# Names between B and D alphabetically
b_to_d = df_names.filter(col("name").between("B", "D"))
b_to_d.show()
# Returns Bob and Charlie only

String comparisons use lexicographic ordering. "Diana" is excluded because the bound "D" is a prefix of "Diana", which makes "Diana" sort after "D". To capture names that start with "D", you need to widen the upper bound.

Alternative Approaches and Comparisons

You have three syntactically different but functionally equivalent ways to express range filters:

sales_data = [(i, i * 100) for i in range(1, 11)]
df_sales = spark.createDataFrame(sales_data, ["id", "revenue"])

# Method 1: between() method (most readable)
method1 = df_sales.filter(col("revenue").between(300, 700))

# Method 2: SQL BETWEEN
df_sales.createOrReplaceTempView("sales")
method2 = spark.sql("SELECT * FROM sales WHERE revenue BETWEEN 300 AND 700")

# Method 3: Manual comparison operators
method3 = df_sales.filter((col("revenue") >= 300) & (col("revenue") <= 700))

# All return identical results
print("Method 1 count:", method1.count())  # 5
print("Method 2 count:", method2.count())  # 5
print("Method 3 count:", method3.count())  # 5

Performance-wise, these compile to identical execution plans. Use explain() to verify:

method1.explain()
method3.explain()
# Both show the same physical plan with equivalent filter predicates

Choose between() for readability. Use manual operators when you need asymmetric boundaries (e.g., > instead of >= on one side).

Combining BETWEEN with Other Conditions

Real-world queries rarely use BETWEEN in isolation. Combine it with other predicates using & (AND) and | (OR):

transactions = [
    (1, "2024-01-10", 150, "Electronics"),
    (2, "2024-01-15", 75, "Books"),
    (3, "2024-02-05", 200, "Electronics"),
    (4, "2024-02-20", 50, "Clothing"),
    (5, "2024-03-01", 300, "Electronics")
]
df_trans = spark.createDataFrame(
    transactions, 
    ["trans_id", "date", "amount", "category"]
)
df_trans = df_trans.withColumn("date", to_date(col("date")))

# Electronics purchases in January-February, amount $100-$250
filtered = df_trans.filter(
    (col("date").between(lit("2024-01-01"), lit("2024-02-29"))) &  # 2024 is a leap year
    (col("amount").between(100, 250)) &
    (col("category") == "Electronics")
)
filtered.show()
# Returns transaction 1 and 3

NOT BETWEEN for exclusion filtering:

# Exclude mid-range amounts (keep very low and very high)
extreme_values = df_trans.filter(
    ~col("amount").between(100, 200)
)
extreme_values.show()
# Returns transactions with amounts 75, 50, and 300

The ~ operator negates the condition. In SQL syntax, you’d use NOT BETWEEN:

df_trans.createOrReplaceTempView("transactions")
spark.sql("""
    SELECT * FROM transactions 
    WHERE amount NOT BETWEEN 100 AND 200
""").show()

Common Pitfalls and Best Practices

NULL Handling:

BETWEEN excludes NULL values, which can silently filter out more data than intended:

data_with_nulls = [
    (1, 25), (2, None), (3, 35), (4, None), (5, 45)
]
df_nulls = spark.createDataFrame(data_with_nulls, ["id", "age"])

result = df_nulls.filter(col("age").between(20, 50))
result.show()
# Only returns rows 1, 3, 5 - NULLs are excluded

To include NULLs explicitly:

result_with_nulls = df_nulls.filter(
    col("age").between(20, 50) | col("age").isNull()
)
result_with_nulls.show()

Boundary Precision:

With floating-point numbers, be aware of precision issues:

floats = [(1, 10.0), (2, 10.0000001), (3, 20.0)]
df_floats = spark.createDataFrame(floats, ["id", "value"])

# Might not behave as expected with very close boundaries
result = df_floats.filter(col("value").between(10.0, 20.0))

For financial data, use decimal types instead of floats to avoid precision errors.

Performance Optimization:

When filtering large datasets, partition the data on the columns used in BETWEEN clauses; Spark has no traditional indexes, but it can prune entire partitions when the filter touches a partition column:

# Partition by date for time-series data
df_large = spark.read.parquet("s3://bucket/large-dataset/")
df_large.write.partitionBy("date").parquet("s3://bucket/partitioned-dataset/")

# Range filters on the partition column can now skip non-matching partitions
optimized_query = spark.read.parquet("s3://bucket/partitioned-dataset/") \
    .filter(col("date").between(lit("2024-01-01"), lit("2024-01-31")))

Practical Use Case: E-Commerce Analytics

Here’s a complete example analyzing sales data:

from pyspark.sql.functions import to_date, col, sum as _sum, count

# Simulate e-commerce transaction data
sales_data = [
    (1, "2024-01-05", 45.99, "Electronics", "California"),
    (2, "2024-01-10", 120.50, "Clothing", "New York"),
    (3, "2024-01-15", 89.99, "Electronics", "Texas"),
    (4, "2024-02-01", 200.00, "Furniture", "California"),
    (5, "2024-02-10", 15.99, "Books", "Florida"),
    (6, "2024-02-15", 175.00, "Electronics", "New York"),
    (7, "2024-03-01", 95.00, "Clothing", "Texas"),
    (8, "2024-03-10", 300.00, "Furniture", "California")
]

df_sales = spark.createDataFrame(
    sales_data,
    ["order_id", "order_date", "amount", "category", "state"]
)
df_sales = df_sales.withColumn("order_date", to_date(col("order_date")))

# Business question: Find Q1 electronics and furniture sales 
# with transaction amounts between $50 and $250
q1_analysis = df_sales.filter(
    (col("order_date").between(lit("2024-01-01"), lit("2024-03-31"))) &
    (col("amount").between(50.0, 250.0)) &
    (col("category").isin("Electronics", "Furniture"))
)

# Aggregate results
summary = q1_analysis.groupBy("category") \
    .agg(
        count("order_id").alias("transaction_count"),
        _sum("amount").alias("total_revenue")
    )

summary.show()
# Provides actionable insights for inventory and marketing decisions

This pattern applies to log analysis, IoT sensor data filtering, financial transaction monitoring, and any domain requiring bounded range queries. The BETWEEN operator simplifies these common analytical tasks while maintaining query readability and performance.
