PySpark - SQL HAVING Clause

Key Insights

  • The HAVING clause filters aggregated data after GROUP BY operations, while WHERE filters raw rows before aggregation—using them together gives you precise control over data at both levels
  • PySpark supports HAVING through both SQL strings and DataFrame API methods, but SQL syntax often provides clearer intent for complex aggregation filters
  • Performance-wise, push row-level filters into WHERE clauses and reserve HAVING for aggregate conditions—this reduces the data volume before expensive grouping operations

Introduction to HAVING Clause in PySpark

The HAVING clause is SQL’s mechanism for filtering grouped data based on aggregate conditions. While WHERE filters individual rows before aggregation, HAVING operates on the results after GROUP BY has created aggregated groups. This distinction is critical: you cannot use aggregate functions like SUM() or COUNT() in a WHERE clause, but you can—and should—use them in HAVING.

In PySpark, you’ll encounter scenarios where you need to filter based on computed aggregates: finding customers who made more than 10 purchases, identifying products with average ratings above 4.5, or locating departments where total salary exceeds budget thresholds. The HAVING clause handles these requirements elegantly.

PySpark provides two approaches for using HAVING: executing SQL strings via spark.sql() or using the DataFrame API with .filter() or .where() after aggregation. Both methods achieve the same results, but SQL syntax often reads more naturally for complex aggregate filtering.

Basic Syntax and Setup

Let’s establish a working environment with realistic sample data. We’ll use sales transaction data—a common scenario where HAVING clauses prove invaluable.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, avg  # importing functions.sum/min/max directly would shadow Python built-ins

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("HAVING Clause Examples") \
    .getOrCreate()

# Create sample sales data
sales_data = [
    ("Electronics", "Laptop", 1200, "2024-01-15"),
    ("Electronics", "Mouse", 25, "2024-01-16"),
    ("Electronics", "Keyboard", 75, "2024-01-17"),
    ("Electronics", "Monitor", 300, "2024-01-18"),
    ("Clothing", "Shirt", 40, "2024-01-15"),
    ("Clothing", "Pants", 60, "2024-01-16"),
    ("Clothing", "Shoes", 80, "2024-01-17"),
    ("Books", "Novel", 15, "2024-01-15"),
    ("Books", "Textbook", 95, "2024-01-16"),
    ("Furniture", "Chair", 150, "2024-01-15"),
    ("Furniture", "Desk", 400, "2024-01-16"),
    ("Furniture", "Lamp", 45, "2024-01-17"),
    ("Furniture", "Shelf", 120, "2024-01-18"),
]

columns = ["category", "product", "price", "sale_date"]
df = spark.createDataFrame(sales_data, columns)

# Register as temporary view for SQL queries
df.createOrReplaceTempView("sales")

The basic HAVING syntax in SQL follows this pattern:

SELECT column, AGGREGATE_FUNCTION(column)
FROM table
GROUP BY column
HAVING AGGREGATE_FUNCTION(column) condition

Simple HAVING with Aggregations

Let’s explore fundamental HAVING operations with common aggregate functions. We’ll start by finding categories with more than three products.

# Using SQL syntax
result = spark.sql("""
    SELECT category, COUNT(*) as product_count
    FROM sales
    GROUP BY category
    HAVING COUNT(*) > 3
""")
result.show()

# Output:
# +-----------+-------------+
# |   category|product_count|
# +-----------+-------------+
# |Electronics|            4|
# |  Furniture|            4|
# +-----------+-------------+

The DataFrame API equivalent:

# Using DataFrame API
df.groupBy("category") \
    .agg(count("*").alias("product_count")) \
    .filter(col("product_count") > 3) \
    .show()

Now let’s filter categories where total sales exceed $500:

# Find categories with total sales > $500
spark.sql("""
    SELECT category, SUM(price) as total_sales
    FROM sales
    GROUP BY category
    HAVING SUM(price) > 500
""").show()

# Output:
# +-----------+-----------+
# |   category|total_sales|
# +-----------+-----------+
# |Electronics|       1600|
# |  Furniture|        715|
# +-----------+-----------+

Using AVG to find categories with average product price above $100:

spark.sql("""
    SELECT category, 
           AVG(price) as avg_price,
           COUNT(*) as product_count
    FROM sales
    GROUP BY category
    HAVING AVG(price) > 100
""").show()

# Output:
# +-----------+---------+-------------+
# |   category|avg_price|product_count|
# +-----------+---------+-------------+
# |Electronics|    400.0|            4|
# |  Furniture|   178.75|            4|
# +-----------+---------+-------------+

HAVING with Multiple Conditions

Real-world analytics often require filtering on multiple aggregate conditions simultaneously. Combine conditions using AND/OR operators:

# Categories with more than 2 products AND average price > $50
spark.sql("""
    SELECT category,
           COUNT(*) as product_count,
           AVG(price) as avg_price,
           SUM(price) as total_sales
    FROM sales
    GROUP BY category
    HAVING COUNT(*) > 2 AND AVG(price) > 50
""").show()

# Output:
# +-----------+-------------+---------+-----------+
# |   category|product_count|avg_price|total_sales|
# +-----------+-------------+---------+-----------+
# |Electronics|            4|    400.0|       1600|
# |   Clothing|            3|     60.0|        180|
# |  Furniture|            4|   178.75|        715|
# +-----------+-------------+---------+-----------+

Using OR conditions to capture different criteria:

# Categories with either high volume (>3 products) OR high value (total > $600)
spark.sql("""
    SELECT category,
           COUNT(*) as product_count,
           SUM(price) as total_sales
    FROM sales
    GROUP BY category
    HAVING COUNT(*) > 3 OR SUM(price) > 600
""").show()

HAVING vs WHERE: Practical Comparison

Understanding when to use WHERE versus HAVING is crucial for writing efficient queries. WHERE filters rows before grouping; HAVING filters groups after aggregation.

# WHERE only - filters individual sales before grouping
spark.sql("""
    SELECT category, COUNT(*) as product_count
    FROM sales
    WHERE price > 50
    GROUP BY category
""").show()

# HAVING only - filters categories after counting all products
spark.sql("""
    SELECT category, COUNT(*) as product_count
    FROM sales
    GROUP BY category
    HAVING COUNT(*) > 2
""").show()

# WHERE + HAVING combined - filters rows first, then groups
spark.sql("""
    SELECT category, 
           COUNT(*) as expensive_product_count,
           AVG(price) as avg_price
    FROM sales
    WHERE price > 50
    GROUP BY category
    HAVING COUNT(*) > 2 AND AVG(price) > 100
""").show()

# Output (avg_price rounded):
# +-----------+-----------------------+---------+
# |   category|expensive_product_count|avg_price|
# +-----------+-----------------------+---------+
# |Electronics|                      3|    525.0|
# |  Furniture|                      3|   223.33|
# +-----------+-----------------------+---------+

The combined approach is most powerful: use WHERE to eliminate irrelevant rows early (reducing data volume), then use HAVING to filter the aggregated results.

Advanced HAVING Scenarios

HAVING clauses can handle sophisticated analytical requirements. Here’s how to use HAVING with CASE statements for conditional aggregation:

# Count products by price tier within each category
spark.sql("""
    SELECT category,
           SUM(CASE WHEN price > 100 THEN 1 ELSE 0 END) as premium_count,
           SUM(CASE WHEN price <= 100 THEN 1 ELSE 0 END) as standard_count
    FROM sales
    GROUP BY category
    HAVING SUM(CASE WHEN price > 100 THEN 1 ELSE 0 END) > 1
""").show()

Using HAVING with JOIN operations to filter aggregated data from multiple tables:

# Create a second dataset for demonstration
inventory_data = [
    ("Electronics", 50),
    ("Clothing", 200),
    ("Books", 150),
    ("Furniture", 75),
]

inventory_df = spark.createDataFrame(inventory_data, ["category", "stock_level"])
inventory_df.createOrReplaceTempView("inventory")

# Join and filter categories with high sales but low inventory
spark.sql("""
    SELECT s.category,
           COUNT(s.product) as products_sold,
           i.stock_level,
           SUM(s.price) as total_revenue
    FROM sales s
    JOIN inventory i ON s.category = i.category
    GROUP BY s.category, i.stock_level
    HAVING COUNT(s.product) > 3 AND i.stock_level < 100
""").show()

Performance comparison between DataFrame API and SQL:

from time import time

# SQL approach
start = time()
spark.sql("""
    SELECT category, COUNT(*) as cnt, AVG(price) as avg_p
    FROM sales
    GROUP BY category
    HAVING COUNT(*) > 2
""").collect()
sql_time = time() - start

# DataFrame API approach
start = time()
df.groupBy("category") \
    .agg(count("*").alias("cnt"), avg("price").alias("avg_p")) \
    .filter(col("cnt") > 2) \
    .collect()
df_time = time() - start

print(f"SQL time: {sql_time:.4f}s, DataFrame time: {df_time:.4f}s")

On a dataset this small, the timing difference is noise, and the first query also pays one-time warm-up costs. More importantly, SQL strings and DataFrame API queries compile to the same Catalyst execution plan, so their performance at scale is equivalent; choose based on readability and team preference.

Common Pitfalls and Best Practices

Pitfall 1: Referencing non-aggregated columns in HAVING

# WRONG - price is neither aggregated nor a GROUP BY column, so it can't appear bare in HAVING
# This will cause an error
try:
    spark.sql("""
        SELECT category, COUNT(*) as cnt
        FROM sales
        GROUP BY category
        HAVING price > 100
    """).show()
except Exception as e:
    print(f"Error: {e}")

# CORRECT - use aggregate function or move to WHERE
spark.sql("""
    SELECT category, COUNT(*) as cnt, AVG(price) as avg_price
    FROM sales
    GROUP BY category
    HAVING AVG(price) > 100
""").show()

Pitfall 2: Using HAVING when WHERE would be more efficient

# INEFFICIENT - filters after grouping
spark.sql("""
    SELECT category, COUNT(*) as cnt
    FROM sales
    GROUP BY category
    HAVING category = 'Electronics'
""").show()

# EFFICIENT - filters before grouping
spark.sql("""
    SELECT category, COUNT(*) as cnt
    FROM sales
    WHERE category = 'Electronics'
    GROUP BY category
""").show()

Best Practices:

  1. Filter early: Use WHERE for row-level conditions to reduce data volume before aggregation
  2. Reserve HAVING for aggregates: Only use HAVING when filtering based on aggregate function results
  3. Alias aggregate expressions: Use meaningful aliases for readability and reusability
  4. Consider caching: If running multiple HAVING queries on the same grouped data, cache the intermediate result
  5. Test with EXPLAIN: Use df.explain() to verify query plans and ensure optimal execution

The HAVING clause is indispensable for aggregate-based filtering in PySpark. Master the distinction between WHERE and HAVING, combine them strategically, and you’ll write cleaner, more efficient analytical queries. Whether you prefer SQL strings or DataFrame API methods, understanding HAVING’s role in the query execution pipeline will make you a more effective PySpark developer.
