SQL GROUP BY and HAVING: Aggregation Queries

Key Insights

GROUP BY partitions rows into groups based on column values, allowing aggregate functions to operate on each group independently rather than the entire dataset
HAVING filters groups after aggregation while WHERE filters individual rows before grouping—understanding this distinction prevents common query errors and performance issues
Multi-column grouping creates unique groups from the combination of values, enabling hierarchical analysis like sales by category and region simultaneously

Understanding Aggregation and Why Grouping Matters

Aggregation functions—COUNT, SUM, AVG, MAX, and MIN—collapse multiple rows into summary values. Without GROUP BY, these functions operate on your entire result set, giving you a single answer. That’s useful for questions like “What’s our total revenue?” but falls short when you need segmented insights.

-- Simple aggregation: entire dataset
SELECT 
    COUNT(*) as total_orders,
    SUM(amount) as total_revenue,
    AVG(amount) as average_order_value
FROM orders;

This query returns one row with three numbers. But what if you need revenue broken down by product category, or average order value per customer? That’s where GROUP BY becomes essential—it transforms aggregation from a blunt instrument into a precision tool for analytical queries.

GROUP BY Fundamentals

The GROUP BY clause partitions your data into distinct groups based on one or more columns. Each aggregate function then calculates its result for each group separately. The syntax is straightforward: specify the columns to group by, and any aggregate functions will automatically apply to each group.

-- Group sales by product category
SELECT 
    category,
    COUNT(*) as num_products,
    SUM(revenue) as total_revenue,
    AVG(price) as avg_price
FROM products
GROUP BY category;

This query creates one result row per unique category value. If you have five categories, you’ll get five rows back. The COUNT, SUM, and AVG functions operate independently on each category’s subset of products.

Here’s a practical example counting users by country:

-- Count users by country
SELECT 
    country,
    COUNT(*) as user_count,
    COUNT(DISTINCT email_domain) as unique_domains
FROM users
GROUP BY country
ORDER BY user_count DESC;

The ORDER BY clause works with grouped results just like ungrouped queries. You can sort by the grouping column or any aggregate expression.

Another common pattern calculates metrics per customer:

-- Average order value by customer
SELECT 
    customer_id,
    COUNT(*) as order_count,
    SUM(amount) as lifetime_value,
    AVG(amount) as avg_order_value,
    MAX(amount) as largest_order
FROM orders
GROUP BY customer_id;

Each customer gets exactly one row in the results, regardless of how many orders they’ve placed. This is the core concept: GROUP BY collapses many rows into one per group.

Multi-Column Grouping

Grouping by multiple columns creates groups based on unique combinations of values. This enables hierarchical or cross-sectional analysis that single-column grouping can’t provide.

-- Sales grouped by category AND region
SELECT 
    category,
    region,
    COUNT(*) as num_sales,
    SUM(amount) as total_sales,
    AVG(amount) as avg_sale
FROM sales
GROUP BY category, region
ORDER BY category, region;

If you have 3 categories and 4 regions, you could have up to 12 result rows (3 × 4), though some combinations might not exist in your data. Each group represents a specific category-region pair.

This pattern is invaluable for time-series analysis:

-- Monthly revenue by product and store location
SELECT 
    product_id,
    store_location,
    DATE_TRUNC('month', sale_date) as month,
    COUNT(*) as transactions,
    SUM(quantity) as units_sold,
    SUM(amount) as revenue
FROM sales
WHERE sale_date >= '2024-01-01'
GROUP BY product_id, store_location, DATE_TRUNC('month', sale_date)
ORDER BY month, product_id, store_location;

This query produces one row for each unique combination of product, location, and month. You’re building a multi-dimensional view of your sales data, perfect for feeding into dashboards or further analysis.

Filtering Aggregated Results with HAVING

WHERE filters rows before grouping happens. But what if you need to filter based on aggregate values? That’s where HAVING comes in—it filters groups after aggregation is complete.

-- Find categories with total sales over $10,000
SELECT 
    category,
    SUM(amount) as total_sales,
    COUNT(*) as num_sales
FROM sales
GROUP BY category
HAVING SUM(amount) > 10000
ORDER BY total_sales DESC;

The HAVING clause evaluates after groups are formed and aggregates calculated. You can reference any aggregate function in HAVING, even ones not in your SELECT list.

Here’s a customer segmentation example:

-- Customers with more than 5 orders
SELECT 
    customer_id,
    COUNT(*) as order_count,
    SUM(amount) as total_spent
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 5
ORDER BY order_count DESC;

This identifies your frequent buyers. The HAVING clause eliminates any customer groups with 5 or fewer orders before returning results.

You can use multiple conditions in HAVING:

-- Products with high ratings and sufficient reviews
SELECT 
    product_id,
    AVG(rating) as avg_rating,
    COUNT(*) as review_count
FROM reviews
GROUP BY product_id
HAVING AVG(rating) > 4.0 AND COUNT(*) >= 10
ORDER BY avg_rating DESC;

This ensures you’re only seeing products with both high ratings and statistical significance (at least 10 reviews).

Combining WHERE and HAVING

Using WHERE and HAVING together is a powerful pattern: WHERE filters the input rows before grouping, and HAVING filters the resulting groups. This two-stage filtering is both more efficient and more expressive than either clause alone.

-- Sales by region for 2024 orders, showing only regions over $50,000
SELECT 
    region,
    COUNT(*) as order_count,
    SUM(amount) as total_sales,
    AVG(amount) as avg_order
FROM orders
WHERE order_date >= '2024-01-01' 
    AND order_date < '2025-01-01'
    AND status = 'completed'
GROUP BY region
HAVING SUM(amount) > 50000
ORDER BY total_sales DESC;

The WHERE clause first filters to completed orders from 2024. Then GROUP BY partitions those filtered rows by region. Finally, HAVING eliminates any regions that didn’t hit the $50,000 threshold. The order of operations matters: WHERE → GROUP BY → HAVING → ORDER BY.

This approach is more efficient than filtering everything in HAVING because WHERE reduces the dataset before the expensive grouping operation happens.

Common Patterns and Best Practices

Always order your grouped results intentionally. Databases don’t guarantee any particular order for grouped data:

-- Explicit ordering with aggregates
SELECT 
    category,
    COUNT(*) as product_count,
    AVG(price) as avg_price
FROM products
GROUP BY category
ORDER BY product_count DESC, avg_price DESC;

Handle NULLs explicitly in grouping columns. NULL values form their own group:

-- Handle NULL categories
SELECT 
    COALESCE(category, 'Uncategorized') as category,
    COUNT(*) as count
FROM products
GROUP BY category;

Critical pitfall: You cannot select columns that aren’t in your GROUP BY clause or wrapped in an aggregate function. This query is invalid:

-- ANTI-PATTERN: Don't do this
SELECT 
    category,
    product_name,  -- ERROR: not grouped or aggregated
    SUM(sales)
FROM products
GROUP BY category;

Which product_name would the database return when you have multiple products per category? The query is ambiguous. Some databases reject it outright; others return arbitrary values. Always aggregate or group every selected column.

For performance, index your GROUP BY columns, especially on large tables. Grouping requires sorting or hashing, and indexes dramatically speed up these operations.

Real-World Use Case: Sales Performance Dashboard

Let’s build a comprehensive sales report that identifies top-performing products by category with year-over-year comparison:

-- Top products by category with YoY growth
WITH current_year AS (
    SELECT 
        p.category,
        p.product_id,
        p.product_name,
        COUNT(*) as sales_count,
        SUM(s.amount) as revenue,
        AVG(s.amount) as avg_sale
    FROM sales s
    JOIN products p ON s.product_id = p.product_id
    WHERE s.sale_date >= '2024-01-01' 
        AND s.sale_date < '2025-01-01'
    GROUP BY p.category, p.product_id, p.product_name
    HAVING SUM(s.amount) > 5000
),
prior_year AS (
    SELECT 
        p.product_id,
        SUM(s.amount) as revenue
    FROM sales s
    JOIN products p ON s.product_id = p.product_id
    WHERE s.sale_date >= '2023-01-01' 
        AND s.sale_date < '2024-01-01'
    GROUP BY p.product_id
)
SELECT 
    cy.category,
    cy.product_name,
    cy.sales_count,
    cy.revenue as current_revenue,
    py.revenue as prior_revenue,
    ROUND(((cy.revenue - COALESCE(py.revenue, 0)) / 
           COALESCE(py.revenue, cy.revenue) * 100), 2) as yoy_growth_pct
FROM current_year cy
LEFT JOIN prior_year py ON cy.product_id = py.product_id
WHERE cy.revenue > COALESCE(py.revenue, 0)  -- Only growing products
ORDER BY cy.category, cy.revenue DESC;

This query demonstrates several advanced patterns: multi-column grouping, HAVING for threshold filtering, CTEs for organizing complex logic, and combining multiple aggregation levels. It answers the question: “Which products in each category are growing year-over-year and generating significant revenue?”

Mastering GROUP BY and HAVING transforms SQL from a simple data retrieval language into a powerful analytical tool. These clauses let you slice data into meaningful segments, calculate metrics for each segment, and filter results based on those metrics—the foundation of virtually all business intelligence and reporting queries.