System Design: Database Indexing Strategies

Key Insights

  • Database indexes transform O(n) full table scans into O(log n) lookups, but each index adds write overhead—strategic indexing means choosing the right indexes, not the most indexes.
  • Column order in composite indexes follows the left-prefix rule: an index on (status, created_at) only helps queries that filter on status first, making selectivity and query patterns the primary design drivers.
  • Partial indexes and covering indexes are underutilized optimizations that can dramatically reduce both storage costs and query latency for specific access patterns.

The Cost of Full Table Scans

Every database query without an appropriate index becomes a full table scan. At 1,000 rows, nobody notices. At 1 million rows, queries slow to seconds. At 100 million rows, your application becomes unusable.

The mathematics are straightforward. A full table scan is O(n)—doubling your data doubles your query time. A B-tree index lookup is O(log n)—doubling your data adds one comparison. For a table with 100 million rows, that’s the difference between examining 100 million rows and roughly 27 comparisons (log₂ of 100 million).
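That back-of-the-envelope figure is easy to verify directly in PostgreSQL:

-- log base 2 of 100 million: the comparisons a balanced binary search
-- needs (real B-trees have high fan-out, so they're far shallower)
SELECT ceil(log(2, 100000000)) AS comparisons;
-- comparisons: 27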

But indexes aren’t free. Every INSERT, UPDATE, and DELETE must maintain every index on the table. Over-indexing kills write performance just as surely as under-indexing kills read performance. The goal isn’t maximum indexes—it’s the right indexes for your actual query patterns.
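A quick inventory of how many indexes each table already carries is a useful first step in that audit (a sketch; adjust the schema name to your own):

-- Count indexes per table to spot over-indexed hotspots
SELECT tablename, count(*) AS index_count
FROM pg_indexes
WHERE schemaname = 'public'
GROUP BY tablename
ORDER BY index_count DESC;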

Index Types and When to Use Each

PostgreSQL offers several index types, each optimized for different access patterns. Choosing the wrong type wastes storage and CPU cycles.

B-tree indexes are the default and handle 90% of use cases. They support equality and range queries, work with ORDER BY, and handle NULL values. Use B-tree unless you have a specific reason not to.

Hash indexes only support equality comparisons but use less space than B-tree for that specific case. Since PostgreSQL 10, they’re crash-safe and worth considering for exact-match lookups on large text fields.

GIN (Generalized Inverted Index) excels at indexing composite values—arrays, JSONB, and full-text search. When you need to find rows where an array contains a value, GIN is your answer.

GiST (Generalized Search Tree) handles geometric data, range types, and full-text search. It’s essential for PostGIS spatial queries and overlapping range searches.

-- B-tree: Default for most columns
CREATE INDEX idx_users_email ON users (email);

-- Hash: Equality-only lookups on large values
CREATE INDEX idx_sessions_token ON sessions USING hash (token);

-- GIN: Array and JSONB containment queries
CREATE INDEX idx_posts_tags ON posts USING gin (tags);

-- GiST: Range overlap queries
CREATE INDEX idx_reservations_during ON reservations USING gist (during);

-- Compare query plans
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'test@example.com';
-- Index Scan using idx_users_email on users (cost=0.42..8.44 rows=1 width=128)
--   Index Cond: (email = 'test@example.com')
-- Execution Time: 0.052 ms

EXPLAIN ANALYZE SELECT * FROM users_no_index WHERE email = 'test@example.com';
-- Seq Scan on users_no_index (cost=0.00..18334.00 rows=1 width=128)
--   Filter: (email = 'test@example.com')
-- Execution Time: 89.432 ms

The difference between 0.052ms and 89.432ms compounds across thousands of requests per second. That’s the difference between a responsive application and one that needs horizontal scaling to handle basic load.

Composite Indexes and Column Ordering

Composite indexes cover multiple columns, but column order determines which queries benefit. The left-prefix rule states that an index on (A, B, C) can satisfy queries on (A), (A, B), or (A, B, C)—but not (B), (C), or (B, C) alone.

This makes column ordering a critical design decision. Two factors guide the choice: query patterns and selectivity.

Query patterns come first. If your application always filters by status and sometimes adds a date range, put status first. If it always filters by date and sometimes adds status, reverse the order.

Selectivity matters when multiple orderings could work. High-selectivity columns (many distinct values) should generally come before low-selectivity columns (few distinct values) because they eliminate more rows earlier in the search.

-- Orders table with common query patterns
CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    user_id INTEGER,
    status VARCHAR(20),  -- Low selectivity: 5 distinct values
    created_at TIMESTAMP, -- High selectivity: millions of distinct values
    total_amount DECIMAL(10,2)
);

-- Index A: status first
CREATE INDEX idx_orders_status_created ON orders (status, created_at);

-- Index B: created_at first  
CREATE INDEX idx_orders_created_status ON orders (created_at, status);

-- Query 1: Filter by status, range on date
EXPLAIN ANALYZE 
SELECT * FROM orders 
WHERE status = 'pending' AND created_at > '2024-01-01';
-- Uses idx_orders_status_created effectively
-- Index Cond: ((status = 'pending') AND (created_at > '2024-01-01'))

-- Query 2: Only filter by date
EXPLAIN ANALYZE 
SELECT * FROM orders 
WHERE created_at > '2024-01-01';
-- Cannot use idx_orders_status_created (status not in WHERE)
-- CAN use idx_orders_created_status

-- Query 3: Only filter by status
EXPLAIN ANALYZE 
SELECT * FROM orders 
WHERE status = 'pending';
-- CAN use idx_orders_status_created
-- Cannot use idx_orders_created_status

If your application runs Query 1 and Query 3 frequently but never Query 2, choose idx_orders_status_created. If you need all three patterns, you might need both indexes—or you might reconsider your access patterns.
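When you’re unsure which ordering fits, measure selectivity instead of guessing (a sketch against the orders table above):

-- Ratio of distinct values to total rows: closer to 1 = more selective
SELECT
    count(DISTINCT status)::float / count(*) AS status_selectivity,
    count(DISTINCT created_at)::float / count(*) AS created_at_selectivity
FROM orders;

-- Or read the planner's own estimates from collected statistics
SELECT attname, n_distinct
FROM pg_stats
WHERE tablename = 'orders' AND attname IN ('status', 'created_at');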

Covering Indexes and Index-Only Scans

Standard index lookups follow a two-step process: find row locations in the index, then fetch full rows from the table (heap). Covering indexes eliminate the second step by including all columns the query needs.

PostgreSQL’s INCLUDE clause adds non-indexed columns to the index leaf pages. These columns aren’t searchable but are available for index-only scans.

-- Standard index: requires heap fetch for email
CREATE INDEX idx_users_status ON users (status);

EXPLAIN ANALYZE 
SELECT id, email FROM users WHERE status = 'active';
-- Index Scan using idx_users_status on users
--   Heap Fetches: 45023

-- Covering index: includes email in index
CREATE INDEX idx_users_status_covering ON users (status) INCLUDE (id, email);

EXPLAIN ANALYZE 
SELECT id, email FROM users WHERE status = 'active';
-- Index Only Scan using idx_users_status_covering on users
--   Heap Fetches: 0

-- The difference matters at scale
-- With heap fetches: random I/O to table pages
-- Index-only scan: sequential read of index only

The trade-off is index size. Including columns increases storage and memory requirements. Index-only scans also depend on the visibility map: rows modified since the last VACUUM still trigger heap fetches, so heavily updated tables see less benefit. Use covering indexes for frequently executed queries where the included columns are small and the performance gain justifies the storage cost.
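Before committing to a covering index, compare its size against the original (assumes the two indexes from the example above exist):

-- List index sizes for the users table, largest first
SELECT indexrelname AS index,
       pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE relname = 'users'
ORDER BY pg_relation_size(indexrelid) DESC;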

Partial and Filtered Indexes

Partial indexes cover a subset of rows, defined by a WHERE clause. They’re smaller, faster to maintain, and more targeted than full indexes.

Common use cases include indexing only active records, recent data, or specific status values that queries filter on frequently.

-- Full index on all orders: 10GB
CREATE INDEX idx_orders_created ON orders (created_at);

-- Partial index on pending orders only: 50MB
CREATE INDEX idx_orders_pending_created ON orders (created_at)
WHERE status = 'pending';

-- The planner uses a partial index only when it can prove the query's
-- WHERE clause implies the index predicate
EXPLAIN ANALYZE 
SELECT * FROM orders 
WHERE status = 'pending' AND created_at > '2024-01-01';
-- Uses idx_orders_pending_created

-- This query cannot use the partial index
EXPLAIN ANALYZE 
SELECT * FROM orders 
WHERE status = 'shipped' AND created_at > '2024-01-01';
-- Falls back to full table scan or other indexes

-- Partial index for soft deletes
CREATE INDEX idx_users_active_email ON users (email)
WHERE deleted_at IS NULL;

-- Partial index for hot data
-- Note: PostgreSQL rejects non-immutable functions such as CURRENT_DATE
-- in index predicates, so pin a fixed date and recreate the index
-- periodically for rolling windows
CREATE INDEX idx_logs_recent ON logs (created_at, level)
WHERE created_at > DATE '2024-01-01';

Partial indexes shine when your queries consistently filter on the same condition. If 95% of your order queries filter WHERE status = 'pending', a partial index on pending orders outperforms a full index in both storage and lookup speed.
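Partial indexes also combine well with UNIQUE to enforce constraints on the live subset only: for example, one active account per email while soft-deleted rows keep their old values (a sketch, assuming the users table above):

-- Uniqueness applies only to rows where deleted_at IS NULL
CREATE UNIQUE INDEX idx_users_active_email_unique ON users (email)
WHERE deleted_at IS NULL;

-- Re-registering a deleted user's email succeeds; duplicating an
-- active one fails with a unique-violation error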

Indexing Anti-Patterns

Over-indexing is as harmful as under-indexing. Each index consumes storage, memory, and write capacity. These anti-patterns appear in most production databases.

Indexing low-cardinality columns alone. An index on a boolean column or a status with five values rarely helps. The query planner often chooses a sequential scan because fetching 20% of the table through an index is slower than scanning the whole table.

Creating indexes speculatively. Adding indexes “just in case” wastes resources. Create indexes based on actual query patterns from production logs.

Ignoring unused indexes. Indexes created during development or for one-time queries persist forever, consuming write performance.

-- Find unused indexes
SELECT 
    schemaname || '.' || relname AS "table",
    indexrelname AS index,
    pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size,
    idx_scan AS index_scans
FROM pg_stat_user_indexes ui
JOIN pg_index i ON ui.indexrelid = i.indexrelid
WHERE NOT indisunique  -- Exclude primary keys and unique constraints
  AND idx_scan < 50     -- Fewer than 50 scans
  AND pg_relation_size(i.indexrelid) > 1024 * 1024  -- Larger than 1MB
ORDER BY pg_relation_size(i.indexrelid) DESC;

-- Find duplicate indexes (same columns, same order)
-- Note: this ignores opclasses, uniqueness, and partial predicates,
-- so verify each pair before dropping anything
SELECT 
    pg_size_pretty(sum(pg_relation_size(idx))::bigint) AS size,
    (array_agg(idx))[1] AS idx1, 
    (array_agg(idx))[2] AS idx2
FROM (
    SELECT indexrelid::regclass AS idx, 
           indrelid::regclass AS tbl, 
           indkey AS cols
    FROM pg_index
) sub
GROUP BY tbl, cols
HAVING count(*) > 1;

Run these queries monthly. Drop unused indexes after confirming they’re not needed for rare but critical operations.
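When you do drop one, DROP INDEX CONCURRENTLY avoids taking an exclusive lock on the table (the index name here is hypothetical):

-- Cannot run inside a transaction block; drop each index in its own statement
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_legacy_report;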

Monitoring and Maintenance Strategy

Indexes degrade over time. Updates and deletes leave dead tuples. B-tree pages become sparse. Index bloat slows queries and wastes storage.

PostgreSQL’s autovacuum handles routine cleanup, but heavily-updated tables may need manual intervention. Monitor index health and reindex when bloat exceeds acceptable thresholds.

-- Check index bloat estimates
SELECT 
    schemaname || '.' || relname AS "table",
    indexrelname AS index,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    idx_scan AS scans,
    idx_tup_read AS tuples_read,
    idx_tup_fetch AS tuples_fetched
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 20;

-- More detailed bloat estimation using pgstattuple extension
CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT 
    indexrelname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS size,
    avg_leaf_density,
    leaf_fragmentation
FROM pg_stat_user_indexes ui
JOIN pg_class c ON c.oid = ui.indexrelid
JOIN pg_am am ON am.oid = c.relam
    AND am.amname = 'btree'  -- pgstatindex works only on B-tree indexes
CROSS JOIN LATERAL pgstatindex(ui.indexrelid) AS pgi
WHERE pg_relation_size(ui.indexrelid) > 10 * 1024 * 1024  -- > 10MB
ORDER BY leaf_fragmentation DESC;

-- Rebuild bloated indexes without blocking writes
REINDEX INDEX CONCURRENTLY idx_orders_status_created;

-- Rebuild all indexes on a table
REINDEX TABLE CONCURRENTLY orders;

Schedule index health checks weekly. Reindex when fragmentation exceeds 30% or bloat significantly increases index size. Use CONCURRENTLY to avoid blocking production writes—it takes longer but doesn’t lock the table.

Build alerting around index scan efficiency. If idx_tup_read significantly exceeds idx_tup_fetch, queries are reading index entries that don’t return rows—a sign of bloat or poor index design.
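That check can be scripted as a simple ratio query (the thresholds here are illustrative; tune them to your workload):

-- Flag indexes whose reads far exceed fetches: candidates for bloat
-- investigation or redesign
SELECT indexrelname,
       idx_tup_read,
       idx_tup_fetch,
       round(idx_tup_read::numeric / NULLIF(idx_tup_fetch, 0), 1) AS read_fetch_ratio
FROM pg_stat_user_indexes
WHERE idx_tup_read > 10000
ORDER BY read_fetch_ratio DESC NULLS LAST
LIMIT 20;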

Indexing strategy isn’t a one-time decision. Query patterns evolve, data distributions shift, and yesterday’s optimal index becomes tomorrow’s bottleneck. Treat indexes as living infrastructure that requires ongoing attention.
