PostgreSQL CTEs and Recursive Queries

Common Table Expressions provide a way to write auxiliary statements within a larger query. Think of them as named subqueries that exist only for the duration of a single statement. They're defined...

Key Insights

  • CTEs (Common Table Expressions) improve query readability and enable recursive operations that would be impossible or extremely complex with standard SQL joins
  • Recursive CTEs require a base case and recursive term joined with UNION/UNION ALL, with PostgreSQL enforcing safeguards against infinite recursion
  • Performance-wise, CTEs in PostgreSQL act as optimization fences; use materialized CTEs for expensive operations you want to cache, or inline them when the optimizer should access them multiple times

Understanding Common Table Expressions

Common Table Expressions provide a way to write auxiliary statements within a larger query. Think of them as named subqueries that exist only for the duration of a single statement. They’re defined using the WITH clause and can be referenced multiple times within the main query.

WITH regional_sales AS (
    SELECT 
        region,
        SUM(amount) AS total_sales
    FROM orders
    WHERE order_date >= '2024-01-01'
    GROUP BY region
),
top_regions AS (
    SELECT region
    FROM regional_sales
    WHERE total_sales > (SELECT AVG(total_sales) FROM regional_sales)
)
SELECT 
    o.order_id,
    o.customer_id,
    o.amount,
    o.region
FROM orders o
INNER JOIN top_regions tr ON o.region = tr.region
WHERE o.order_date >= '2024-01-01'
ORDER BY o.amount DESC;

This query identifies regions with above-average sales, then retrieves all orders from those regions. Without CTEs, you’d need nested subqueries or temporary tables, both less readable.

Multiple CTEs and Chaining

You can define multiple CTEs in a single query, and later CTEs can reference earlier ones. This creates a logical flow that mirrors your thought process.

WITH customer_orders AS (
    SELECT 
        customer_id,
        COUNT(*) AS order_count,
        SUM(total_amount) AS lifetime_value
    FROM orders
    GROUP BY customer_id
),
customer_segments AS (
    SELECT 
        customer_id,
        order_count,
        lifetime_value,
        CASE 
            WHEN lifetime_value > 10000 THEN 'premium'
            WHEN lifetime_value > 5000 THEN 'standard'
            ELSE 'basic'
        END AS segment
    FROM customer_orders
),
segment_stats AS (
    SELECT 
        segment,
        COUNT(*) AS customer_count,
        AVG(lifetime_value) AS avg_value,
        MAX(lifetime_value) AS max_value
    FROM customer_segments
    GROUP BY segment
)
SELECT * FROM segment_stats
ORDER BY avg_value DESC;

Each CTE builds on the previous one, creating a clear data transformation pipeline.

Recursive CTEs: The Basics

Recursive CTEs solve problems involving hierarchical or graph-like data. They consist of two parts: a non-recursive term (base case) and a recursive term that references the CTE itself.

WITH RECURSIVE employee_hierarchy AS (
    -- Base case: start with top-level managers
    SELECT 
        employee_id,
        name,
        manager_id,
        1 AS level,
        name::TEXT AS path
    FROM employees
    WHERE manager_id IS NULL
    
    UNION ALL
    
    -- Recursive case: find direct reports
    SELECT 
        e.employee_id,
        e.name,
        e.manager_id,
        eh.level + 1,
        eh.path || ' > ' || e.name
    FROM employees e
    INNER JOIN employee_hierarchy eh ON e.manager_id = eh.employee_id
)
SELECT * FROM employee_hierarchy
ORDER BY level, name;

The base case selects employees without managers. The recursive term joins employees to the CTE, finding each level of the hierarchy. PostgreSQL executes this iteratively until no new rows are produced.

Recursive Queries for Tree Traversal

Recursive CTEs excel at traversing tree structures like organizational charts, category hierarchies, or bill-of-materials.

CREATE TABLE categories (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    parent_id INTEGER REFERENCES categories(id)
);

INSERT INTO categories (name, parent_id) VALUES
    ('Electronics', NULL),
    ('Computers', 1),
    ('Laptops', 2),
    ('Desktops', 2),
    ('Phones', 1),
    ('Smartphones', 5),
    ('Feature Phones', 5);

WITH RECURSIVE category_tree AS (
    -- Base: top-level categories
    SELECT 
        id,
        name,
        parent_id,
        0 AS depth,
        ARRAY[id] AS path,
        name AS full_path
    FROM categories
    WHERE parent_id IS NULL
    
    UNION ALL
    
    -- Recursive: child categories
    SELECT 
        c.id,
        c.name,
        c.parent_id,
        ct.depth + 1,
        ct.path || c.id,
        ct.full_path || ' / ' || c.name
    FROM categories c
    INNER JOIN category_tree ct ON c.parent_id = ct.id
)
SELECT 
    REPEAT('  ', depth) || name AS indented_name,
    depth,
    full_path
FROM category_tree
ORDER BY path;

This produces an indented hierarchy showing the complete path to each category. The path array prevents cycles by tracking visited nodes.

Graph Traversal and Cycle Detection

When dealing with graphs that might contain cycles, you need explicit cycle detection.

CREATE TABLE connections (
    from_node INTEGER,
    to_node INTEGER,
    PRIMARY KEY (from_node, to_node)
);

INSERT INTO connections VALUES
    (1, 2), (2, 3), (3, 4), (4, 5),
    (2, 6), (6, 7), (3, 7);

WITH RECURSIVE graph_traversal AS (
    -- Start from node 1
    SELECT 
        1 AS node,
        ARRAY[1] AS path,
        false AS is_cycle
    
    UNION ALL
    
    SELECT 
        c.to_node,
        gt.path || c.to_node,
        c.to_node = ANY(gt.path) AS is_cycle
    FROM graph_traversal gt
    INNER JOIN connections c ON c.from_node = gt.node
    WHERE NOT is_cycle  -- Stop traversing when cycle detected
)
SELECT DISTINCT
    node,
    path
FROM graph_traversal
WHERE NOT is_cycle
ORDER BY array_length(path, 1), path;

The is_cycle flag prevents infinite loops by checking if a node already exists in the current path.

Materialized vs Non-Materialized CTEs

By default, PostgreSQL can inline CTEs or materialize them. Since PostgreSQL 12, you can control this behavior explicitly.

-- Materialized: computed once, results cached
WITH regional_totals AS MATERIALIZED (
    SELECT 
        region,
        SUM(amount) AS total
    FROM large_orders_table
    GROUP BY region
)
SELECT * FROM regional_totals
WHERE total > 1000000
UNION ALL
SELECT * FROM regional_totals
WHERE total <= 1000000;

-- Not materialized: optimizer can inline
WITH filtered_orders AS NOT MATERIALIZED (
    SELECT * FROM orders WHERE status = 'completed'
)
SELECT customer_id, COUNT(*)
FROM filtered_orders
GROUP BY customer_id;

Use MATERIALIZED when you reference the CTE multiple times and want to avoid recomputation. Use NOT MATERIALIZED when you want the optimizer to push predicates into the CTE.

Practical Example: Running Totals with Gaps

Recursive CTEs can generate sequences and fill gaps in time-series data.

WITH RECURSIVE date_series AS (
    SELECT 
        MIN(order_date) AS date,
        MAX(order_date) AS max_date
    FROM orders
    
    UNION ALL
    
    SELECT 
        date + INTERVAL '1 day',
        max_date
    FROM date_series
    WHERE date < max_date
),
daily_sales AS (
    SELECT 
        order_date,
        SUM(amount) AS daily_total
    FROM orders
    GROUP BY order_date
)
SELECT 
    ds.date,
    COALESCE(daily_sales.daily_total, 0) AS daily_total,
    SUM(COALESCE(daily_sales.daily_total, 0)) OVER (ORDER BY ds.date) AS running_total
FROM date_series ds
LEFT JOIN daily_sales ON ds.date = daily_sales.order_date
ORDER BY ds.date;

This generates a complete date series and calculates running totals, even for days with no sales.

Performance Considerations

CTEs create optimization boundaries. The query planner won’t push predicates into a CTE unless it’s marked NOT MATERIALIZED. Profile your queries:

EXPLAIN ANALYZE
WITH expensive_calculation AS (
    SELECT 
        product_id,
        AVG(price) AS avg_price
    FROM price_history
    GROUP BY product_id
)
SELECT * FROM expensive_calculation
WHERE avg_price > 100;

For recursive queries, set max_stack_depth appropriately and consider adding a depth limit:

WITH RECURSIVE limited_recursion AS (
    SELECT id, parent_id, 0 AS depth
    FROM categories
    WHERE parent_id IS NULL
    
    UNION ALL
    
    SELECT c.id, c.parent_id, lr.depth + 1
    FROM categories c
    INNER JOIN limited_recursion lr ON c.parent_id = lr.id
    WHERE lr.depth < 10  -- Prevent excessive recursion
)
SELECT * FROM limited_recursion;

CTEs transform complex SQL into readable, maintainable code. Use them to break down complicated logic, traverse hierarchies, and generate sequences. Just remember their performance characteristics and choose materialization strategy based on your access patterns.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.