SQL Query Optimization: EXPLAIN and Query Plans

Key Insights

  • Query plans reveal exactly how your database executes SQL statements, exposing bottlenecks like full table scans and inefficient joins that can slow queries by orders of magnitude
  • EXPLAIN ANALYZE provides actual execution metrics versus estimates, helping you identify where the query planner’s assumptions diverge from reality and cause performance issues
  • Understanding index selectivity and join order through query plans enables targeted optimizations rather than blindly adding indexes that may never be used

Introduction to Query Performance Analysis

Database performance problems rarely announce themselves clearly. A query that runs fine with 1,000 rows suddenly takes 30 seconds with 100,000 rows. Your application slows to a crawl during peak hours. The culprit? Almost always inefficient query execution.

EXPLAIN is your diagnostic tool for understanding exactly how the database engine processes your SQL statements. Think of it as an X-ray for queries—it reveals the execution plan, showing you the database’s strategy for retrieving data. Instead of guessing why a query is slow, you can see precisely which operations consume resources.

Here’s a real-world example of the difference proper optimization makes:

-- Before optimization: 2.3 seconds
SELECT o.order_id, o.order_date, c.customer_name, c.email
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01'
  AND c.country = 'USA';

-- After adding composite index: 0.04 seconds
CREATE INDEX idx_orders_date_customer ON orders(order_date, customer_id);
CREATE INDEX idx_customers_country ON customers(country);

That’s a 57x performance improvement from understanding what the query plan revealed.

Understanding EXPLAIN Output

EXPLAIN output varies between database systems, but the core concepts remain consistent. Let’s examine PostgreSQL and MySQL approaches.

PostgreSQL EXPLAIN:

EXPLAIN SELECT * FROM users WHERE email = 'john@example.com';

                                    QUERY PLAN
-----------------------------------------------------------------------------------
Seq Scan on users  (cost=0.00..35.50 rows=1 width=524)
  Filter: (email = 'john@example.com'::text)

Key components:

  • Seq Scan: Sequential scan (reads entire table)
  • cost=0.00..35.50: Startup cost..total cost (arbitrary units)
  • rows=1: Estimated rows returned
  • width=524: Average row width in bytes

MySQL EXPLAIN:

EXPLAIN SELECT * FROM users WHERE email = 'john@example.com';

+----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
|  1 | SIMPLE      | users | ALL  | NULL          | NULL | NULL    | NULL | 50000 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+-------+-------------+

Critical fields:

  • type: Access method (ALL = full table scan, worst case)
  • possible_keys: Indexes that could be used
  • key: Index actually used (NULL means no index)
  • rows: Estimated rows examined

Both show the same problem: a full table scan with no index usage.

Common Performance Bottlenecks

Query plans expose specific performance killers. Here are the most common:

Full Table Scans:

-- Bad: Scans entire orders table
EXPLAIN SELECT * FROM orders WHERE YEAR(order_date) = 2024;

-- Good: Uses index
EXPLAIN SELECT * FROM orders WHERE order_date >= '2024-01-01' 
  AND order_date < '2025-01-01';

The first query applies a function to order_date, preventing index usage. The second query uses a range condition that can leverage an index on order_date.
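You can watch this difference directly in a query plan. Here's a minimal sketch using SQLite's in-memory database (EXPLAIN QUERY PLAN is SQLite's equivalent; strftime stands in for MySQL's YEAR(), and the table and index names are invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT)")
conn.execute("CREATE INDEX idx_orders_date ON orders(order_date)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail)
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][3]

# Function applied to the column: the index on order_date cannot be used,
# so the plan is a full scan (SCAN)
print(plan("SELECT * FROM orders WHERE strftime('%Y', order_date) = '2024'"))

# Sargable range condition: the plan becomes an index search (SEARCH)
print(plan("SELECT * FROM orders WHERE order_date >= '2024-01-01' "
           "AND order_date < '2025-01-01'"))
```

The exact wording of the plan text varies between SQLite versions, but the SCAN-versus-SEARCH distinction mirrors the Seq Scan versus Index Scan split in PostgreSQL.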

N+1 Query Problem:

-- This pattern generates N+1 queries in application code
SELECT * FROM orders;  -- 1 query
-- Then for each order:
SELECT * FROM customers WHERE id = ?;  -- N queries

-- Solution: Single JOIN query
EXPLAIN ANALYZE 
SELECT o.*, c.customer_name 
FROM orders o 
JOIN customers c ON o.customer_id = c.id;

The EXPLAIN output will show a single query plan with a hash join or nested loop, vastly more efficient than hundreds of separate queries.
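The cost gap is easy to see at the application level too. A minimal sketch with SQLite's in-memory database (schema and data invented for the demo) that counts round trips under each pattern:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bo');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
""")

# N+1 pattern: one query for orders, then one per order for its customer
orders = conn.execute("SELECT order_id, customer_id FROM orders").fetchall()
n_plus_1_queries = 1
for order_id, customer_id in orders:
    conn.execute("SELECT customer_name FROM customers WHERE id = ?",
                 (customer_id,)).fetchone()
    n_plus_1_queries += 1

# JOIN pattern: the same data in a single round trip
rows = conn.execute("""
    SELECT o.order_id, c.customer_name
    FROM orders o JOIN customers c ON o.customer_id = c.id
""").fetchall()

print(n_plus_1_queries)  # 4 queries for 3 orders
print(len(rows))         # same 3 rows from 1 query
```

With 3 orders the difference is trivial; with 10,000 it is 10,001 round trips versus one.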

Inefficient Join Order:

-- Database scans the large table first, then joins (inefficient)
EXPLAIN SELECT * 
FROM large_table l
JOIN small_table s ON l.id = s.large_table_id
WHERE s.status = 'active';

-- Execution plan shows:
-- 1. Scan large_table (1M rows)
-- 2. Join to small_table (100 rows)
-- 3. Filter by status

-- Better: Filter first, then join
EXPLAIN SELECT * 
FROM small_table s
JOIN large_table l ON s.large_table_id = l.id
WHERE s.status = 'active';

-- Now scans 100 rows instead of 1M

Modern query optimizers often handle this automatically, but EXPLAIN reveals when they don’t.

Index Strategy and EXPLAIN

Indexes are powerful but not magic. EXPLAIN shows whether your indexes are actually helping.

Before and After Index:

-- Before: Sequential scan
EXPLAIN ANALYZE SELECT * FROM products WHERE category = 'electronics';
-- Seq Scan on products (actual time=245.123..245.456 rows=1250)

CREATE INDEX idx_products_category ON products(category);

-- After: Index scan
EXPLAIN ANALYZE SELECT * FROM products WHERE category = 'electronics';
-- Index Scan using idx_products_category (actual time=0.234..2.456 rows=1250)

Composite Index Usage:

CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);

-- Uses index efficiently (both columns)
EXPLAIN SELECT * FROM orders 
WHERE customer_id = 123 AND order_date >= '2024-01-01';
-- Index Scan using idx_orders_customer_date

-- Uses index partially (first column only)
EXPLAIN SELECT * FROM orders WHERE customer_id = 123;
-- Index Scan using idx_orders_customer_date

-- Cannot use index (second column only)
EXPLAIN SELECT * FROM orders WHERE order_date >= '2024-01-01';
-- Seq Scan on orders
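The leading-column rule can be verified in any B-tree database. A sketch using SQLite's in-memory database (names match the snippet above; SQLite's SEARCH/SCAN corresponds to PostgreSQL's Index Scan/Seq Scan):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
             "customer_id INTEGER, order_date TEXT)")
conn.execute("CREATE INDEX idx_orders_customer_date "
             "ON orders(customer_id, order_date)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail)
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][3]

# Both columns: index search
print(plan("SELECT * FROM orders WHERE customer_id = 123 "
           "AND order_date >= '2024-01-01'"))

# Leading column alone: still an index search
print(plan("SELECT * FROM orders WHERE customer_id = 123"))

# Second column alone: a full scan, because the B-tree
# is ordered by customer_id first
print(plan("SELECT * FROM orders WHERE order_date >= '2024-01-01'"))
```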

Low Selectivity Index Ignored:

-- Boolean column with poor selectivity
CREATE INDEX idx_users_active ON users(is_active);

EXPLAIN SELECT * FROM users WHERE is_active = true;
-- Seq Scan on users (database ignores index)
-- Why? If 95% of users are active, scanning the index + table
-- is slower than just scanning the table
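Selectivity is simple to measure yourself. A sketch with SQLite (data invented to mirror the 95%-active scenario above); the same two COUNT queries work on any SQL database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, is_active INTEGER)")
# 95% active, 5% inactive -- mirrors the scenario above
conn.executemany("INSERT INTO users (is_active) VALUES (?)",
                 [(1 if i % 20 else 0,) for i in range(1000)])

# Selectivity: distinct values divided by total rows
distinct, total = conn.execute(
    "SELECT COUNT(DISTINCT is_active), COUNT(*) FROM users").fetchone()
print(distinct / total)  # 0.002 -- very low selectivity

# Fraction of rows matching the common value
active = conn.execute(
    "SELECT COUNT(*) FROM users WHERE is_active = 1").fetchone()[0]
print(active / total)  # 0.95 -- an index on is_active rarely pays off here
```

As a rule of thumb, the more rows a value matches, the less an index on that column helps.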

Advanced EXPLAIN Features

EXPLAIN ANALYZE executes the query and provides actual metrics, revealing where estimates diverge from reality.

EXPLAIN ANALYZE SELECT o.order_id, SUM(oi.quantity * oi.price)
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.order_date >= '2024-01-01'
GROUP BY o.order_id;

-- Output shows:
-- Hash Join (actual time=45.234..89.456 rows=15234 width=16)
--   (estimated rows=1000)  <- Big discrepancy!
--   Hash Cond: (oi.order_id = o.order_id)
--   -> Seq Scan on order_items oi (actual time=0.123..23.456 rows=76543)
--   -> Hash (actual time=12.345..12.345 rows=15234)
--     -> Seq Scan on orders o (actual time=0.234..10.123 rows=15234)

The estimated 1,000 rows versus actual 15,234 rows indicates outdated statistics. Run ANALYZE orders; to update.

PostgreSQL BUFFERS analysis:

EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM large_table WHERE indexed_column = 'value';

-- Shows I/O metrics:
-- Buffers: shared hit=234 read=12
-- "hit" = found in cache
-- "read" = read from disk

High “read” numbers indicate cache misses requiring disk I/O—a major performance bottleneck.
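Those two counters translate directly into a cache hit ratio. A quick sketch using the figures from the BUFFERS output above:

```python
# Figures from the BUFFERS line above
shared_hit, shared_read = 234, 12

# Fraction of page requests served from PostgreSQL's shared buffer cache
hit_ratio = shared_hit / (shared_hit + shared_read)
print(f"{hit_ratio:.1%}")  # 95.1%
```

For hot, frequently queried tables, a hit ratio well above 99% is a common target; this query sits below that, suggesting its working set doesn't fully fit in cache.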

Optimization Techniques Based on Query Plans

Rewriting Correlated Subqueries:

-- Slow: Correlated subquery executes for each row
EXPLAIN ANALYZE 
SELECT c.customer_name,
  (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id) as order_count
FROM customers c;
-- Execution time: 3.2 seconds

-- Fast: JOIN with aggregation
EXPLAIN ANALYZE
SELECT c.customer_name, COUNT(o.order_id) as order_count
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name;
-- Execution time: 0.3 seconds

CTE vs Subquery Performance:

-- CTE (may be materialized/optimized differently)
EXPLAIN ANALYZE
WITH recent_orders AS (
  SELECT customer_id, COUNT(*) as order_count
  FROM orders
  WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
  GROUP BY customer_id
)
SELECT c.*, ro.order_count
FROM customers c
JOIN recent_orders ro ON c.customer_id = ro.customer_id;

-- Compare with subquery version
-- PostgreSQL 12+ optimizes CTEs better

Complex Query Optimization:

-- Before: Multiple joins with filtering at the end
SELECT p.product_name, SUM(oi.quantity)
FROM products p
JOIN order_items oi ON p.product_id = oi.product_id
JOIN orders o ON oi.order_id = o.order_id
WHERE o.order_date >= '2024-01-01'
  AND p.category = 'electronics'
GROUP BY p.product_id, p.product_name;

-- After: Filter early with subquery
SELECT p.product_name, SUM(oi.quantity)
FROM (SELECT * FROM products WHERE category = 'electronics') p
JOIN order_items oi ON p.product_id = oi.product_id
JOIN (SELECT * FROM orders WHERE order_date >= '2024-01-01') o 
  ON oi.order_id = o.order_id
GROUP BY p.product_id, p.product_name;

Check EXPLAIN output to verify the optimizer applies filters early.

Best Practices and Monitoring

Make query plan analysis part of your development workflow:

PostgreSQL Slow Query Logging:

# postgresql.conf
log_min_duration_statement = 1000  # Log queries over 1 second
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '
log_statement = 'none'
log_duration = off
shared_preload_libraries = 'auto_explain'  # Required to load the module
auto_explain.log_min_duration = 1000       # Auto-EXPLAIN slow queries
auto_explain.log_analyze = true

MySQL Configuration:

# my.cnf
[mysqld]
slow_query_log = 1
long_query_time = 1
slow_query_log_file = /var/log/mysql/slow-query.log
log_queries_not_using_indexes = 1

Automated Analysis Script:

import re

import psycopg2

def parse_slow_log(f):
    """Minimal placeholder parser: yields single-line SELECT statements.
    Real slow-query logs need sturdier parsing (multi-line statements,
    timestamp and duration prefixes)."""
    for line in f:
        match = re.search(r'(SELECT\b.*?);', line, re.IGNORECASE)
        if match:
            yield match.group(1)

def analyze_slow_queries(log_file):
    """Parse slow query log and run EXPLAIN ANALYZE on each query"""
    with open(log_file) as f:
        queries = list(parse_slow_log(f))

    conn = psycopg2.connect("dbname=mydb")
    cur = conn.cursor()

    for query in queries:
        # Caution: ANALYZE executes the query; use plain EXPLAIN if
        # re-running logged queries against production is unsafe
        cur.execute("EXPLAIN ANALYZE " + query)
        plan = cur.fetchall()

        # Check for full table scans
        if any('Seq Scan' in str(row) for row in plan):
            print(f"WARNING: Full table scan in query: {query[:100]}")
            print(plan)

    cur.close()
    conn.close()
Regular query plan reviews prevent performance degradation. Set up alerts for queries exceeding thresholds, review EXPLAIN output during code reviews, and maintain a performance baseline for critical queries.

The database query planner is sophisticated, but it’s not omniscient. EXPLAIN gives you visibility into its decisions, empowering you to write queries that work with the optimizer rather than against it. Master this tool, and you’ll solve performance problems before they reach production.
