SQL - Subquery (Nested Query) Tutorial
A subquery is a query nested inside another SQL statement. It's a query within a query, enclosed in parentheses, that the database evaluates to produce a result used by the outer query. Think of it...
Key Insights
- Subqueries are powerful tools for breaking complex problems into logical steps, but they’re not always the best choice—JOINs and CTEs often perform better and read more clearly
- Correlated subqueries execute once per row in the outer query, making them potential performance killers on large datasets; always test with realistic data volumes
- The placement of a subquery (WHERE, FROM, or SELECT clause) fundamentally changes its behavior and performance characteristics—choose deliberately
Introduction to Subqueries
A subquery is a query nested inside another SQL statement. It’s a query within a query, enclosed in parentheses, that the database evaluates to produce a result used by the outer query. Think of it as breaking a complex question into smaller, answerable pieces.
The classic debate: subqueries versus JOINs. Both can often solve the same problem, but they’re not interchangeable in every situation. Subqueries excel when you need to filter based on aggregate values, when the logic is naturally hierarchical, or when you want self-documenting code that reads like the business requirement. JOINs typically perform better for combining data from multiple tables and are the standard approach when you need columns from both tables in your result.
Here’s the same problem solved both ways:
-- Subquery approach: Find all products with above-average price
SELECT product_name, price
FROM products
WHERE price > (SELECT AVG(price) FROM products);
-- JOIN approach (requires a cross join with derived table)
SELECT p.product_name, p.price
FROM products p
CROSS JOIN (SELECT AVG(price) AS avg_price FROM products) avg_data
WHERE p.price > avg_data.avg_price;
The subquery version is cleaner here. The intent is obvious: find products where the price exceeds the average. The JOIN version works but adds cognitive overhead. This is a case where subqueries win on readability.
Subqueries in the WHERE Clause
WHERE clause subqueries are the most common type. They filter rows based on values computed by the inner query.
Single-value subqueries return exactly one value and work with standard comparison operators:
-- Find employees earning more than the company average
SELECT employee_name, salary, department_id
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
-- Find the most recent order
SELECT order_id, customer_id, order_date
FROM orders
WHERE order_date = (SELECT MAX(order_date) FROM orders);
If your subquery returns multiple values when the operator expects one, you’ll get an error. This is a common bug when data assumptions change.
Multi-value subqueries return a set of values and require IN, ANY, or ALL operators:
-- Find products in categories that have more than 100 items
SELECT product_id, product_name, category_id
FROM products
WHERE category_id IN (
SELECT category_id
FROM products
GROUP BY category_id
HAVING COUNT(*) > 100
);
-- Find employees earning more than ANY manager (at least one)
SELECT employee_name, salary
FROM employees
WHERE salary > ANY (
SELECT salary
FROM employees
WHERE job_title = 'Manager'
);
-- Find employees earning more than ALL managers (every single one)
SELECT employee_name, salary
FROM employees
WHERE salary > ALL (
SELECT salary
FROM employees
WHERE job_title = 'Manager'
);
The difference between ANY and ALL trips people up. ANY means “at least one value in the set.” ALL means “every value in the set.” Choose carefully.
Subqueries in the FROM Clause (Derived Tables)
When you place a subquery in the FROM clause, you create a derived table—a temporary result set that exists only for the duration of the query. This is useful when you need to aggregate data first, then filter or join against those aggregates.
-- Find customers who have placed more than 5 orders
SELECT
c.customer_name,
order_stats.order_count,
order_stats.total_spent
FROM customers c
INNER JOIN (
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 5
) order_stats ON c.customer_id = order_stats.customer_id
ORDER BY order_stats.total_spent DESC;
Critical rule: derived tables must have an alias. The query above uses order_stats. Without it, most databases will throw a syntax error.
Derived tables are particularly useful for “top N per group” problems:
-- Find the top 3 products by revenue in each category
SELECT category_name, product_name, revenue
FROM (
SELECT
c.category_name,
p.product_name,
SUM(oi.quantity * oi.unit_price) AS revenue,
ROW_NUMBER() OVER (PARTITION BY c.category_id ORDER BY SUM(oi.quantity * oi.unit_price) DESC) AS rank
FROM categories c
INNER JOIN products p ON c.category_id = p.category_id
INNER JOIN order_items oi ON p.product_id = oi.product_id
GROUP BY c.category_id, c.category_name, p.product_id, p.product_name
) ranked_products
WHERE rank <= 3;
Subqueries in the SELECT Clause (Scalar Subqueries)
Scalar subqueries appear in the SELECT list and must return exactly one value. They compute a value for each row in the result:
-- Display each order with the customer's total order count
SELECT
o.order_id,
o.order_date,
o.total_amount,
(SELECT COUNT(*)
FROM orders o2
WHERE o2.customer_id = o.customer_id) AS customer_total_orders
FROM orders o
WHERE o.order_date >= '2024-01-01';
This looks elegant but hides a performance trap. The subquery executes once per row in the outer query. With 10,000 orders, that’s 10,000 subquery executions. A JOIN typically performs better:
-- Better approach using a JOIN
SELECT
o.order_id,
o.order_date,
o.total_amount,
customer_counts.total_orders AS customer_total_orders
FROM orders o
INNER JOIN (
SELECT customer_id, COUNT(*) AS total_orders
FROM orders
GROUP BY customer_id
) customer_counts ON o.customer_id = customer_counts.customer_id
WHERE o.order_date >= '2024-01-01';
Use scalar subqueries sparingly. They’re acceptable for small result sets or when the subquery can be efficiently cached, but they rarely scale well.
Correlated Subqueries
Correlated subqueries reference columns from the outer query, creating a dependency that forces the subquery to execute once per outer row. They’re powerful but expensive.
-- Find employees who earn more than their department's average
SELECT e.employee_name, e.salary, e.department_id
FROM employees e
WHERE e.salary > (
SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.department_id = e.department_id -- Correlation here
);
The subquery references e.department_id from the outer query. For each employee, the database calculates that specific department’s average.
EXISTS and NOT EXISTS are the most common use cases for correlated subqueries:
-- Find customers who have never placed an order
SELECT c.customer_id, c.customer_name
FROM customers c
WHERE NOT EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
);
-- Find the latest order for each customer
SELECT o.order_id, o.customer_id, o.order_date
FROM orders o
WHERE o.order_date = (
SELECT MAX(o2.order_date)
FROM orders o2
WHERE o2.customer_id = o.customer_id
);
EXISTS checks whether the subquery returns any rows—it doesn’t care about the actual values. Using SELECT 1 is a convention that signals this intent.
Common Mistakes and Performance Tips
Mistake 1: Using subqueries when JOINs are clearer and faster
-- Slow: Subquery in SELECT executed per row
SELECT
p.product_name,
(SELECT category_name FROM categories WHERE category_id = p.category_id) AS category
FROM products p;
-- Better: Simple JOIN
SELECT p.product_name, c.category_name AS category
FROM products p
INNER JOIN categories c ON p.category_id = c.category_id;
Mistake 2: Not considering CTEs for complex queries
Common Table Expressions (CTEs) often provide the same functionality as subqueries with better readability:
-- Nested subqueries become hard to read
SELECT *
FROM (
SELECT customer_id, total_spent
FROM (
SELECT customer_id, SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id
) customer_totals
WHERE total_spent > 1000
) high_value_customers;
-- CTE version is cleaner
WITH customer_totals AS (
SELECT customer_id, SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id
),
high_value_customers AS (
SELECT customer_id, total_spent
FROM customer_totals
WHERE total_spent > 1000
)
SELECT * FROM high_value_customers;
Indexing strategies: Subqueries benefit from the same indexes as regular queries. Ensure columns used in JOIN conditions, WHERE clauses, and GROUP BY expressions within subqueries are indexed. For correlated subqueries, the correlation column in the inner query is critical—without an index, performance degrades rapidly.
Practical Use Cases
Here’s a real-world scenario combining multiple subquery types—a monthly sales report showing each salesperson’s performance relative to team and company averages:
WITH monthly_sales AS (
SELECT
s.salesperson_id,
s.team_id,
DATE_TRUNC('month', o.order_date) AS sale_month,
SUM(o.total_amount) AS monthly_total
FROM salespeople s
INNER JOIN orders o ON s.salesperson_id = o.salesperson_id
WHERE o.order_date >= '2024-01-01'
GROUP BY s.salesperson_id, s.team_id, DATE_TRUNC('month', o.order_date)
)
SELECT
sp.salesperson_name,
ms.sale_month,
ms.monthly_total,
(SELECT AVG(ms2.monthly_total)
FROM monthly_sales ms2
WHERE ms2.team_id = ms.team_id
AND ms2.sale_month = ms.sale_month) AS team_average,
(SELECT AVG(ms3.monthly_total)
FROM monthly_sales ms3
WHERE ms3.sale_month = ms.sale_month) AS company_average
FROM monthly_sales ms
INNER JOIN salespeople sp ON ms.salesperson_id = sp.salesperson_id
WHERE ms.monthly_total > (
SELECT AVG(monthly_total) * 0.5
FROM monthly_sales
)
ORDER BY ms.sale_month, ms.monthly_total DESC;
This query uses a CTE for the base aggregation, scalar subqueries for comparative metrics, and a WHERE subquery for filtering. In production, you’d likely refactor the scalar subqueries into JOINs for performance, but this demonstrates how the different types work together.
Subqueries are a fundamental SQL skill. Master them, but don’t overuse them. The best query is often the simplest one that meets your performance requirements.