SQL - GROUP BY vs HAVING vs WHERE | Application Architect

Key Insights

WHERE filters individual rows before grouping occurs; HAVING filters aggregated groups after grouping—using the wrong one causes errors or performance problems
SQL executes in a specific order (FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY), and understanding this order eliminates most confusion about which clause to use
Always prefer WHERE over HAVING when possible; filtering rows early reduces the dataset your database must aggregate, dramatically improving query performance

The Confusion Problem

Every developer learning SQL hits the same wall: you need to filter data, but sometimes WHERE works and sometimes it throws an error. You try HAVING, and suddenly the query runs. Or worse, both seem to work, but one returns wrong results.

The confusion stems from a fundamental misunderstanding of SQL’s execution order. Most developers write queries in SELECT-FROM-WHERE order, but that’s not how databases execute them. Understanding the actual execution pipeline makes these three clauses click into place permanently.

Here’s the execution order that matters:

FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY

WHERE runs before grouping. HAVING runs after. That’s the entire mental model you need.

WHERE: Filtering Individual Rows

WHERE operates on raw table data before any aggregation happens. It examines each row independently and decides whether to include it in the result set.

Consider an orders table:

CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    product_id INT,
    amount DECIMAL(10,2),
    order_date DATE,
    status VARCHAR(20)
);

To find all completed orders from 2024:

SELECT order_id, customer_id, amount, order_date
FROM orders
WHERE status = 'completed'
  AND order_date >= '2024-01-01';

The database scans the orders table, evaluates each row against the WHERE conditions, and discards rows that don’t match. The surviving rows move to the next stage of query processing.

WHERE can reference any column in the table. It cannot reference aggregate functions because aggregation hasn’t happened yet. This query fails:

-- This will error
SELECT customer_id, COUNT(*) as order_count
FROM orders
WHERE COUNT(*) > 5  -- ERROR: aggregate functions not allowed in WHERE
GROUP BY customer_id;

The error makes sense once you understand execution order. WHERE runs before GROUP BY, so there’s nothing to count yet.

GROUP BY: Aggregating Data into Groups

GROUP BY collapses multiple rows into single summary rows based on shared column values. It transforms row-level data into group-level data.

To calculate total spending per customer:

SELECT 
    customer_id,
    COUNT(*) as order_count,
    SUM(amount) as total_spent,
    AVG(amount) as avg_order_value
FROM orders
WHERE status = 'completed'
GROUP BY customer_id;

This query first filters to completed orders (WHERE), then groups the remaining rows by customer_id. Each group produces one output row containing the aggregate calculations.

A critical rule: every column in your SELECT must either appear in GROUP BY or be wrapped in an aggregate function. This query fails in strict SQL modes:

-- This fails: order_date isn't grouped or aggregated
SELECT customer_id, order_date, SUM(amount)
FROM orders
GROUP BY customer_id;

The database doesn’t know which order_date to display when multiple rows collapse into one group. You must either group by it or aggregate it:

-- Option 1: Group by both columns
SELECT customer_id, order_date, SUM(amount)
FROM orders
GROUP BY customer_id, order_date;

-- Option 2: Aggregate the date
SELECT customer_id, MAX(order_date) as last_order, SUM(amount)
FROM orders
GROUP BY customer_id;

HAVING: Filtering Groups

HAVING filters the results of GROUP BY. It operates on aggregated values, not individual rows.

To find customers with more than five orders:

SELECT 
    customer_id,
    COUNT(*) as order_count,
    SUM(amount) as total_spent
FROM orders
WHERE status = 'completed'
GROUP BY customer_id
HAVING COUNT(*) > 5;

The execution flow:

FROM: Access the orders table
WHERE: Keep only completed orders
GROUP BY: Collapse rows by customer_id, calculate aggregates
HAVING: Keep only groups where COUNT(*) > 5
SELECT: Return the specified columns

HAVING can reference aggregate functions because grouping has already occurred. It can also reference columns in the GROUP BY clause:

SELECT 
    customer_id,
    COUNT(*) as order_count
FROM orders
GROUP BY customer_id
HAVING customer_id > 100 AND COUNT(*) > 5;

However, filtering customer_id in HAVING is wasteful. That filter could run earlier in WHERE.

WHERE vs HAVING: The Performance Difference

Both of these queries return the same results:

-- Query A: Filter in WHERE (correct)
SELECT 
    customer_id,
    COUNT(*) as order_count
FROM orders
WHERE customer_id > 100
GROUP BY customer_id
HAVING COUNT(*) > 5;

-- Query B: Filter in HAVING (wasteful)
SELECT 
    customer_id,
    COUNT(*) as order_count
FROM orders
GROUP BY customer_id
HAVING customer_id > 100 AND COUNT(*) > 5;

Query A filters rows before grouping. If 60% of orders have customer_id <= 100, the GROUP BY processes 40% of the data.

Query B groups all rows first, then discards groups. The database does unnecessary work aggregating data it will throw away.

On a table with 10 million orders, this difference can mean seconds versus minutes. Always push filters as early in the pipeline as possible.

Here’s a clear rule: if you can express a condition in WHERE, do it. Use HAVING only for conditions that require aggregated values.

-- Must use HAVING: condition depends on COUNT()
HAVING COUNT(*) > 5

-- Must use HAVING: condition depends on SUM()
HAVING SUM(amount) > 1000

-- Should use WHERE: condition on raw column value
WHERE status = 'completed'

-- Should use WHERE: condition on raw column value  
WHERE order_date >= '2024-01-01'

Putting It All Together

Real queries often combine all three clauses. Here’s a business question: “Find product categories with more than $10,000 in sales from completed orders in 2024, showing only categories that have at least 50 orders.”

SELECT 
    p.category,
    COUNT(DISTINCT o.order_id) as order_count,
    COUNT(DISTINCT o.customer_id) as unique_customers,
    SUM(o.amount) as total_revenue,
    AVG(o.amount) as avg_order_value
FROM orders o
JOIN products p ON o.product_id = p.product_id
WHERE o.status = 'completed'
  AND o.order_date >= '2024-01-01'
  AND o.order_date < '2025-01-01'
GROUP BY p.category
HAVING COUNT(DISTINCT o.order_id) >= 50
   AND SUM(o.amount) > 10000
ORDER BY total_revenue DESC;

Breaking down the execution:

FROM/JOIN: Combine orders and products tables
WHERE: Keep only completed orders from 2024 (row-level filters)
GROUP BY: Collapse into category groups, calculate aggregates
HAVING: Keep groups with 50+ orders and $10K+ revenue (group-level filters)
SELECT: Return the specified columns with aliases
ORDER BY: Sort by total revenue descending

Notice how WHERE handles conditions on raw column values (status, order_date) while HAVING handles conditions on aggregates (COUNT, SUM).

Common Mistakes and How to Fix Them

Mistake 1: Using HAVING without GROUP BY

-- This is technically valid but pointless
SELECT COUNT(*) as total_orders
FROM orders
HAVING COUNT(*) > 100;

Without GROUP BY, the entire result set is one implicit group. This works but confuses readers. If you’re not grouping, you probably meant WHERE with a subquery:

-- Clearer intent
SELECT COUNT(*) as total_orders
FROM orders
WHERE (SELECT COUNT(*) FROM orders) > 100;

Mistake 2: Filtering aggregates in WHERE

-- Fails: can't use aggregate in WHERE
SELECT customer_id, SUM(amount)
FROM orders
WHERE SUM(amount) > 500
GROUP BY customer_id;

Fix by moving to HAVING:

SELECT customer_id, SUM(amount)
FROM orders
GROUP BY customer_id
HAVING SUM(amount) > 500;

Mistake 3: Inefficient filtering in HAVING

-- Inefficient: filters after grouping
SELECT customer_id, SUM(amount)
FROM orders
GROUP BY customer_id
HAVING customer_id IN (1, 2, 3, 4, 5);

Refactor to filter early:

-- Efficient: filters before grouping
SELECT customer_id, SUM(amount)
FROM orders
WHERE customer_id IN (1, 2, 3, 4, 5)
GROUP BY customer_id;

The Decision Framework

When writing a filter condition, ask one question: does this condition require an aggregated value?

If yes, use HAVING. If no, use WHERE.

That’s it. The execution order handles the rest. WHERE filters rows before grouping because it runs first. HAVING filters groups after aggregation because it runs later. GROUP BY sits in the middle, transforming your data from rows to groups.

Stop memorizing syntax. Understand the pipeline, and the correct clause becomes obvious every time.