SQL Indexes: B-Tree, Hash, and Composite Indexes

Key Insights

B-Tree indexes handle 95% of use cases efficiently, supporting range queries, sorting, and equality comparisons through a balanced tree structure that guarantees O(log n) lookups
Hash indexes provide O(1) equality lookups but cannot support range queries or sorting—use them only when you exclusively query with exact matches on high-cardinality columns
Composite index column order matters critically: a (last_name, first_name) index works for queries filtering on last_name alone, but not for first_name alone due to the leftmost prefix rule

Understanding Database Indexes

Indexes are data structures that databases maintain separately from your tables to speed up data retrieval. Think of them like a book’s index—instead of reading every page to find mentions of “SQL optimization,” you check the index and jump directly to relevant pages.

The trade-off is straightforward: indexes dramatically improve read performance at the cost of write overhead and storage space. Every INSERT, UPDATE, or DELETE must update not just the table but all associated indexes. For read-heavy workloads, this trade-off is almost always worth it. For write-intensive systems, you need to be selective.

Let’s see the impact with a concrete example:

-- Create a users table with 1 million rows
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255),
    created_at TIMESTAMP,
    last_login TIMESTAMP
);

-- Query without index
EXPLAIN ANALYZE
SELECT * FROM users WHERE email = 'john@example.com';

-- Seq Scan on users (cost=0.00..20834.00 rows=1 width=48) (actual time=145.234..289.456 rows=1 loops=1)
-- Planning Time: 0.123 ms
-- Execution Time: 289.478 ms

-- Create index
CREATE INDEX idx_users_email ON users(email);

-- Same query with index
EXPLAIN ANALYZE
SELECT * FROM users WHERE email = 'john@example.com';

-- Index Scan using idx_users_email (cost=0.42..8.44 rows=1 width=48) (actual time=0.034..0.036 rows=1 loops=1)
-- Planning Time: 0.098 ms
-- Execution Time: 0.052 ms

That’s a 5,500x improvement. This is why indexes matter.

B-Tree Indexes: The Default Workhorse

B-Tree (Balanced Tree) indexes are the default in PostgreSQL, MySQL, Oracle, and most relational databases. They organize data in a sorted tree structure where each node contains keys and pointers to child nodes or actual data rows.

The “balanced” part is crucial—the tree maintains roughly equal path lengths from root to leaf, guaranteeing O(log n) performance for searches, insertions, and deletions. For a million rows, you’re looking at roughly 20 comparisons maximum to find any value.

B-Tree indexes excel at:

Equality comparisons (WHERE id = 100)
Range queries (WHERE age BETWEEN 25 AND 35)
Sorting operations (ORDER BY created_at)
Pattern matching with left-anchored wildcards (WHERE name LIKE 'John%')

Here’s how they perform with range queries:

CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    user_id INTEGER,
    total_amount DECIMAL(10,2),
    order_date DATE
);

-- Create B-Tree index on order_date
CREATE INDEX idx_orders_date ON orders(order_date);

-- Range query benefits from B-Tree
EXPLAIN ANALYZE
SELECT * FROM orders 
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';

-- Index Scan using idx_orders_date (cost=0.42..234.56 rows=1250 width=32)
--   Index Cond: ((order_date >= '2024-01-01') AND (order_date <= '2024-01-31'))

The database traverses the B-Tree to find the first matching value, then sequentially reads entries until it passes the upper bound. This is vastly more efficient than scanning the entire table.

Hash Indexes: Fast Equality Lookups

Hash indexes use a hash function to map keys to bucket locations, providing O(1) average-case lookups for exact matches. In PostgreSQL 10+, hash indexes are crash-safe and can be larger than memory, making them viable for production use.

The limitation is severe: hash indexes only support equality operators. No ranges, no sorting, no pattern matching.

-- Create hash index
CREATE INDEX idx_users_email_hash ON users USING HASH (email);

-- Equality query: hash index is ideal
EXPLAIN ANALYZE
SELECT * FROM users WHERE email = 'jane@example.com';

-- Index Scan using idx_users_email_hash (cost=0.00..8.02 rows=1 width=48)
--   Index Cond: (email = 'jane@example.com')

-- Range query: hash index is useless
EXPLAIN ANALYZE
SELECT * FROM users WHERE email > 'jane@example.com';

-- Seq Scan on users (cost=0.00..20834.00 rows=333333 width=48)
--   Filter: (email > 'jane@example.com')
-- (Hash index not used at all)

When should you use hash indexes? Rarely. B-Tree indexes perform comparably on equality checks and provide much more flexibility. Consider hash indexes only when:

You have extremely high-cardinality columns (UUIDs, long strings)
You only query with exact equality
Benchmarks show meaningful improvement over B-Tree

For most applications, stick with B-Tree.

Composite Indexes: Multiple Columns

Composite (multi-column) indexes are where index strategy gets interesting. They allow efficient filtering on multiple columns, but column order is critical.

The leftmost prefix rule states: a composite index on (A, B, C) can support queries filtering on:

(A)
(A, B)
(A, B, C)

But NOT:

(B)
(C)
(B, C)

Think of it like a phone book sorted by (last_name, first_name). You can efficiently find “Smith, John,” or all “Smiths,” but not all “Johns” regardless of last name.

CREATE TABLE employees (
    id SERIAL PRIMARY KEY,
    last_name VARCHAR(100),
    first_name VARCHAR(100),
    department VARCHAR(50),
    hire_date DATE
);

-- Create composite index
CREATE INDEX idx_employees_name ON employees(last_name, first_name, hire_date);

-- Query 1: Uses index efficiently (leftmost prefix)
EXPLAIN ANALYZE
SELECT * FROM employees WHERE last_name = 'Smith';
-- Index Scan using idx_employees_name

-- Query 2: Uses index efficiently (leftmost prefix)
EXPLAIN ANALYZE
SELECT * FROM employees 
WHERE last_name = 'Smith' AND first_name = 'John';
-- Index Scan using idx_employees_name

-- Query 3: Uses full index
EXPLAIN ANALYZE
SELECT * FROM employees 
WHERE last_name = 'Smith' 
  AND first_name = 'John' 
  AND hire_date > '2020-01-01';
-- Index Scan using idx_employees_name

-- Query 4: CANNOT use index (doesn't start with leftmost column)
EXPLAIN ANALYZE
SELECT * FROM employees WHERE first_name = 'John';
-- Seq Scan on employees (index not used)

-- Query 5: CANNOT use index
EXPLAIN ANALYZE
SELECT * FROM employees WHERE hire_date > '2020-01-01';
-- Seq Scan on employees (index not used)

Column order should reflect your query patterns. Put the most selective column first, or the column you filter on most frequently. For the employees table, if you frequently search by department first, create the index as (department, last_name, first_name).

Choosing the Right Index Type

Here’s a practical decision framework:

Use B-Tree (default) when:

You need range queries, sorting, or pattern matching
You’re unsure (it’s the safe choice)
You need index-only scans

Use Hash when:

You have proven B-Tree isn’t sufficient (benchmark first)
Queries are exclusively equality checks
Column has very high cardinality

Use Composite when:

Queries regularly filter on multiple columns together
You want to support multiple query patterns with one index

Consider this e-commerce scenario:

CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    user_id INTEGER,
    status VARCHAR(20),
    total_amount DECIMAL(10,2),
    created_at TIMESTAMP
);

-- Common queries:
-- 1. Find user's orders: WHERE user_id = ?
-- 2. Find pending orders: WHERE status = 'pending'
-- 3. Find user's pending orders: WHERE user_id = ? AND status = 'pending'
-- 4. Recent orders: WHERE created_at > ? ORDER BY created_at DESC

-- Recommended indexes:
CREATE INDEX idx_orders_user_status ON orders(user_id, status);
CREATE INDEX idx_orders_created ON orders(created_at DESC);

-- idx_orders_user_status handles queries 1 and 3
-- idx_orders_created handles query 4
-- Query 2 needs a separate index if it's frequent:
CREATE INDEX idx_orders_status ON orders(status);

Common anti-patterns to avoid:

Indexing low-cardinality columns (gender, boolean flags) unless part of a composite index
Creating redundant indexes: if you have (A, B, C), you don’t need (A) or (A, B)
Over-indexing: every index slows writes and consumes storage

Performance Monitoring and Optimization

Indexes aren’t “set and forget.” Monitor their effectiveness and adjust as query patterns evolve.

Find unused indexes (PostgreSQL):

SELECT 
    schemaname,
    tablename,
    indexname,
    idx_scan,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND indexrelname NOT LIKE 'pg_toast%'
ORDER BY pg_relation_size(indexrelid) DESC;

Identify missing indexes by analyzing slow queries:

-- Enable slow query logging (postgresql.conf)
-- log_min_duration_statement = 1000

-- Then analyze patterns in slow queries
SELECT 
    query,
    calls,
    total_time,
    mean_time,
    rows
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 20;

Check index bloat (PostgreSQL):

SELECT
    tablename,
    indexname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;

Before/after optimization example:

-- Before: slow query
EXPLAIN ANALYZE
SELECT * FROM orders 
WHERE user_id = 12345 
  AND status = 'pending' 
  AND created_at > NOW() - INTERVAL '30 days';
-- Execution Time: 1,234 ms

-- Add composite index
CREATE INDEX idx_orders_user_status_created 
ON orders(user_id, status, created_at);

-- After: fast query
EXPLAIN ANALYZE
SELECT * FROM orders 
WHERE user_id = 12345 
  AND status = 'pending' 
  AND created_at > NOW() - INTERVAL '30 days';
-- Execution Time: 12 ms

Regular index maintenance is essential. Run VACUUM and ANALYZE on PostgreSQL, or OPTIMIZE TABLE on MySQL to keep statistics current and prevent bloat.

The bottom line: start with B-Tree indexes on columns you filter or sort by frequently. Add composite indexes for multi-column queries. Monitor performance, drop unused indexes, and adjust based on actual query patterns. Database performance is empirical—measure, don’t guess.