How to Use Correlated Subqueries in MySQL
A correlated subquery is a subquery that references columns from the outer query. Unlike regular (non-correlated) subqueries that execute once and return a result set, correlated subqueries execute...
Key Insights
- Correlated subqueries reference columns from the outer query and execute once per row, making them powerful for row-by-row comparisons but potentially expensive for large datasets
- Use EXISTS instead of IN for correlated subqueries when checking for record existence—it’s more efficient because it stops at the first match
- In MySQL 8.0+, consider LATERAL joins or window functions as modern alternatives that often outperform traditional correlated subqueries while maintaining readability
What Are Correlated Subqueries?
A correlated subquery is a subquery that references columns from the outer query. Unlike regular (non-correlated) subqueries that execute once and return a result set, correlated subqueries execute repeatedly—once for each row processed by the outer query.
Here’s the fundamental difference:
-- Non-correlated subquery (executes once)
SELECT employee_name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
-- Correlated subquery (executes for each row)
SELECT e1.employee_name, e1.salary, e1.department_id
FROM employees e1
WHERE e1.salary > (
SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.department_id = e1.department_id -- References outer query
);
The first query finds employees earning above the company-wide average. The second finds employees earning above their specific department’s average—notice how the inner query references e1.department_id from the outer query. That reference creates the correlation.
The performance implications are significant. If your employees table has 10,000 rows, the non-correlated subquery executes once. The correlated version potentially executes 10,000 times. This doesn’t automatically make correlated subqueries bad—sometimes they’re exactly what you need—but you must understand the cost.
Basic Syntax and Execution Flow
The structure of a correlated subquery follows this pattern:
SELECT outer_columns
FROM outer_table outer_alias
WHERE condition_with_subquery(
SELECT inner_columns
FROM inner_table inner_alias
WHERE inner_alias.column = outer_alias.column -- The correlation
);
MySQL’s execution flow works like this:
- Fetch a row from the outer query
- Pass correlated values to the inner query
- Execute the inner query with those values
- Use the inner query result to evaluate the outer query’s WHERE condition
- Repeat for the next row
Let’s see a practical example with employees and departments:
-- Find employees earning more than their department average
SELECT
e.employee_name,
e.department_id,
e.salary,
(SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.department_id = e.department_id) AS dept_avg_salary
FROM employees e
WHERE e.salary > (
SELECT AVG(e3.salary)
FROM employees e3
WHERE e3.department_id = e.department_id
);
This query uses correlated subqueries in both SELECT and WHERE clauses. For each employee, MySQL calculates their department’s average salary twice—once to display it, once to filter by it. We’ll optimize this later.
Common Use Cases
Correlated subqueries excel at several specific tasks.
Finding Records with Relative Conditions
When you need to compare each record against a subset of data related to that specific record:
-- Find customers whose lifetime order value exceeds
-- the average for their country
SELECT
c.customer_id,
c.customer_name,
c.country,
(SELECT SUM(o.order_total)
FROM orders o
WHERE o.customer_id = c.customer_id) AS lifetime_value
FROM customers c
WHERE (
SELECT SUM(o.order_total)
FROM orders o
WHERE o.customer_id = c.customer_id
) > (
SELECT AVG(country_avg)
FROM (
SELECT SUM(o2.order_total) AS country_avg
FROM orders o2
JOIN customers c2 ON o2.customer_id = c2.customer_id
WHERE c2.country = c.country
GROUP BY o2.customer_id
) AS country_averages
);
Existence Checks with EXISTS
The EXISTS operator is particularly efficient with correlated subqueries because it stops processing as soon as it finds a match:
-- Find customers who have never placed an order
SELECT c.customer_id, c.customer_name
FROM customers c
WHERE NOT EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
);
-- Find duplicate email addresses
SELECT e1.employee_id, e1.email
FROM employees e1
WHERE EXISTS (
SELECT 1
FROM employees e2
WHERE e2.email = e1.email
AND e2.employee_id != e1.employee_id
);
The SELECT 1 is a common pattern with EXISTS—since EXISTS only cares whether rows exist, not what they contain, selecting a constant is more efficient than selecting actual columns.
Correlated Subqueries in SELECT and HAVING Clauses
You can use correlated subqueries to add calculated columns that depend on the current row:
-- Add each product's sales rank within its category
SELECT
p.product_id,
p.product_name,
p.category_id,
p.units_sold,
(SELECT COUNT(*) + 1
FROM products p2
WHERE p2.category_id = p.category_id
AND p2.units_sold > p.units_sold) AS category_rank
FROM products p
ORDER BY p.category_id, category_rank;
This calculates each product’s rank by counting how many products in the same category sold more units. For each product, the subquery counts products with higher sales in the same category, then adds 1.
In HAVING clauses, correlated subqueries filter grouped results:
-- Find departments where the highest salary exceeds
-- 3x the company-wide average
SELECT
department_id,
MAX(salary) AS max_dept_salary
FROM employees
GROUP BY department_id
HAVING MAX(salary) > (
SELECT AVG(salary) * 3
FROM employees e2
WHERE e2.department_id = employees.department_id
);
Performance Optimization
Correlated subqueries can be slow. Here’s how to optimize them.
Rewrite as JOINs When Possible
Many correlated subqueries can be rewritten as JOINs with better performance:
-- Correlated subquery approach
SELECT e.employee_name, e.salary, e.department_id
FROM employees e
WHERE e.salary > (
SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.department_id = e.department_id
);
-- JOIN approach (faster)
SELECT e.employee_name, e.salary, e.department_id
FROM employees e
JOIN (
SELECT department_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id
) dept_avg ON e.department_id = dept_avg.department_id
WHERE e.salary > dept_avg.avg_salary;
Check the execution plan with EXPLAIN:
EXPLAIN SELECT e.employee_name, e.salary
FROM employees e
WHERE e.salary > (
SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.department_id = e.department_id
);
Look for “DEPENDENT SUBQUERY” in the select_type column—this indicates a correlated subquery that executes multiple times.
Proper Indexing
Index the columns used in correlation conditions:
-- If correlating on department_id, ensure it's indexed
CREATE INDEX idx_dept ON employees(department_id);
-- For multi-column correlations, consider composite indexes
CREATE INDEX idx_dept_salary ON employees(department_id, salary);
Use LATERAL Joins (MySQL 8.0+)
LATERAL joins provide a cleaner, often faster alternative:
-- Using LATERAL (MySQL 8.0.14+)
SELECT e.employee_name, e.salary, dept_stats.avg_salary
FROM employees e
JOIN LATERAL (
SELECT AVG(salary) AS avg_salary
FROM employees e2
WHERE e2.department_id = e.department_id
) AS dept_stats
WHERE e.salary > dept_stats.avg_salary;
LATERAL allows the derived table to reference columns from preceding tables in the FROM clause, similar to a correlated subquery but with potentially better optimization.
Common Pitfalls and Best Practices
NULL Handling
Correlated subqueries can produce unexpected results with NULLs:
-- This might not work as expected
SELECT p.product_name
FROM products p
WHERE p.price > (
SELECT AVG(price)
FROM products p2
WHERE p2.category_id = p.category_id
);
If a category has no products, AVG() returns NULL, and price > NULL is never true. Fix it:
SELECT p.product_name
FROM products p
WHERE p.price > COALESCE((
SELECT AVG(price)
FROM products p2
WHERE p2.category_id = p.category_id
), 0);
Avoid Unnecessary Correlation
Don’t correlate when you don’t need to:
-- Unnecessarily correlated
SELECT e.employee_name
FROM employees e
WHERE e.department_id IN (
SELECT d.department_id
FROM departments d
WHERE d.location = 'New York'
AND d.department_id = e.department_id -- Unnecessary!
);
-- Better: simple subquery
SELECT e.employee_name
FROM employees e
WHERE e.department_id IN (
SELECT department_id
FROM departments
WHERE location = 'New York'
);
When Correlated Subqueries ARE the Best Choice
Despite performance concerns, correlated subqueries are sometimes optimal:
- When you need row-by-row context that’s difficult to express with JOINs
- With EXISTS for semi-joins on large tables (often faster than IN)
- For complex conditional logic where the alternative would be multiple self-joins
- When readability matters and performance is acceptable
The key is measuring actual performance with your data, not assuming. Use EXPLAIN, test with realistic data volumes, and choose the approach that balances performance with maintainability for your specific use case.
Correlated subqueries are a powerful tool in MySQL. Master them, understand their costs, and know when to use alternatives. Your queries will be both more expressive and more efficient.