B-Tree: Disk-Optimized Search Tree

Key Insights

  • B-Trees minimize disk I/O by maximizing branching factor—a tree with 1000-way branching stores billions of keys in just 3-4 levels, turning what would be 30+ disk reads into 3-4.
  • Node splitting during insertion propagates upward, meaning B-Trees grow from the root rather than the leaves—this guarantees perfect balance without expensive rebalancing operations.
  • B+Trees extend the B-Tree concept by storing all data in leaf nodes and linking them together, making range queries dramatically faster and explaining why every major database uses them for indexes.

Why Binary Trees Fail on Disk

Binary search trees are elegant in memory. With O(log₂ n) height, they provide efficient search for in-memory data. But databases don’t live in memory—they live on disk.

Consider searching for a key in a binary tree containing one billion records. That’s roughly 30 levels (log₂ 10⁹ ≈ 30). Each level requires reading a node from disk. Even with modern SSDs offering 0.1ms random read latency, that’s 3ms per search. With spinning disks at 10ms per random read, you’re looking at 300ms—completely unacceptable for a database handling thousands of queries per second.

The fundamental problem is that disk I/O operates on blocks, typically 4KB to 16KB. Reading a single 64-byte tree node wastes 99% of each disk read. B-Trees solve this by packing hundreds or thousands of keys into each node, matching the disk’s natural block size and dramatically reducing tree height.

A B-Tree with a branching factor of 1000 stores one billion keys in just 3 levels (1000³ = 10⁹). Three disk reads instead of thirty. That’s the difference between a usable database and an unusable one.

B-Tree Structure and Properties

A B-Tree of order m has these properties:

  1. Every node contains at most m-1 keys and at most m children
  2. Every non-root node contains at least ⌈m/2⌉-1 keys
  3. All leaves appear at the same depth
  4. Keys within each node are sorted
  5. For any key k in a node, all keys in the child subtree to its left are less than k, and all keys in the child subtree to its right are greater

The order m is chosen to maximize node size while fitting within a disk block. For a 4KB block with 8-byte keys and 8-byte child pointers, you can fit roughly 250 keys per node.

#define MAX_KEYS 255
#define MIN_KEYS 127  // ceil(256/2) - 1

typedef struct BTreeNode {
    int num_keys;                      // Current number of keys
    int keys[MAX_KEYS];                // Sorted array of keys
    void* values[MAX_KEYS];            // Associated values (or record pointers)
    struct BTreeNode* children[MAX_KEYS + 1];  // Child pointers
    bool is_leaf;                      // True if this is a leaf node
} BTreeNode;

typedef struct BTree {
    BTreeNode* root;
    int order;  // Maximum children per node
} BTree;

The children array has one more slot than keys because child pointers sit between keys. If a node has keys [10, 20, 30], it has four children: one for keys < 10, one for keys between 10 and 20, one for keys between 20 and 30, and one for keys > 30.

Search Operation

Search in a B-Tree combines two operations: binary search within a node to find the correct position, and following child pointers to descend the tree.

void* btree_search(BTree* tree, int key) {
    BTreeNode* node = tree->root;
    
    while (node != NULL) {
        // Binary search within the node
        int left = 0;
        int right = node->num_keys - 1;
        int pos = node->num_keys;  // Default: key is larger than all keys
        
        while (left <= right) {
            int mid = left + (right - left) / 2;
            
            if (node->keys[mid] == key) {
                // Found the key
                return node->values[mid];
            } else if (node->keys[mid] < key) {
                left = mid + 1;
            } else {
                pos = mid;
                right = mid - 1;
            }
        }
        
        // If we're at a leaf and didn't find it, key doesn't exist
        if (node->is_leaf) {
            return NULL;
        }
        
        // Descend to the appropriate child
        // pos now contains the index of the first key greater than our search key
        node = node->children[pos];
    }
    
    return NULL;
}

The key insight is that we perform O(log m) comparisons within each node (binary search) and visit O(log_m n) nodes. Total comparisons: O(log m × log_m n) = O(log n). But disk I/O is only O(log_m n), which is what matters.

Insertion and Node Splitting

Insertion always happens at a leaf node. We search for the correct leaf, insert the key in sorted order, and handle overflow if the node exceeds capacity.

When a node overflows (contains m keys), we split it:

  1. Create a new node
  2. Move the upper half of keys to the new node
  3. Push the median key up to the parent
  4. If the parent overflows, recursively split it

This upward propagation can reach the root. When the root splits, we create a new root with a single key—this is the only way a B-Tree grows taller.

BTreeNode* create_node(bool is_leaf) {
    BTreeNode* node = malloc(sizeof(BTreeNode));
    node->num_keys = 0;
    node->is_leaf = is_leaf;
    for (int i = 0; i <= MAX_KEYS; i++) {
        node->children[i] = NULL;
    }
    return node;
}

void split_child(BTreeNode* parent, int child_index) {
    BTreeNode* full_child = parent->children[child_index];
    BTreeNode* new_node = create_node(full_child->is_leaf);
    
    int mid = MAX_KEYS / 2;
    new_node->num_keys = MAX_KEYS - mid - 1;
    
    // Copy upper half of keys to new node
    for (int i = 0; i < new_node->num_keys; i++) {
        new_node->keys[i] = full_child->keys[mid + 1 + i];
        new_node->values[i] = full_child->values[mid + 1 + i];
    }
    
    // Copy children if not a leaf
    if (!full_child->is_leaf) {
        for (int i = 0; i <= new_node->num_keys; i++) {
            new_node->children[i] = full_child->children[mid + 1 + i];
        }
    }
    
    full_child->num_keys = mid;
    
    // Make room in parent for new child pointer and promoted key
    for (int i = parent->num_keys; i > child_index; i--) {
        parent->children[i + 1] = parent->children[i];
        parent->keys[i] = parent->keys[i - 1];
        parent->values[i] = parent->values[i - 1];
    }
    
    // Insert new child and promoted key into parent
    parent->children[child_index + 1] = new_node;
    parent->keys[child_index] = full_child->keys[mid];
    parent->values[child_index] = full_child->values[mid];
    parent->num_keys++;
}

void insert_non_full(BTreeNode* node, int key, void* value) {
    int i = node->num_keys - 1;
    
    if (node->is_leaf) {
        // Shift keys to make room
        while (i >= 0 && key < node->keys[i]) {
            node->keys[i + 1] = node->keys[i];
            node->values[i + 1] = node->values[i];
            i--;
        }
        node->keys[i + 1] = key;
        node->values[i + 1] = value;
        node->num_keys++;
    } else {
        // Find child to descend into
        while (i >= 0 && key < node->keys[i]) {
            i--;
        }
        i++;
        
        // Split child if full before descending
        if (node->children[i]->num_keys == MAX_KEYS) {
            split_child(node, i);
            if (key > node->keys[i]) {
                i++;
            }
        }
        insert_non_full(node->children[i], key, value);
    }
}

void btree_insert(BTree* tree, int key, void* value) {
    if (tree->root == NULL) {
        // Empty tree: start with a single leaf as the root
        tree->root = create_node(true);
    }
    if (tree->root->num_keys == MAX_KEYS) {
        // Root is full, create new root
        BTreeNode* new_root = create_node(false);
        new_root->children[0] = tree->root;
        tree->root = new_root;
        split_child(new_root, 0);
    }
    insert_non_full(tree->root, key, value);
}

The proactive splitting approach (splitting full children before descending) ensures we never need to backtrack up the tree, making the algorithm simpler and more cache-friendly.

Deletion and Rebalancing

Deletion is more complex than insertion because we must maintain the minimum key count invariant. Three cases arise:

Case 1: Key in leaf with sufficient keys — Simply remove the key.

Case 2: Key in internal node — Replace with predecessor (largest key in left subtree) or successor (smallest in right subtree), then delete that key from the leaf.

Case 3: Underflow after deletion — If a node falls below minimum keys:

  • Borrow from a sibling if it has extra keys
  • Merge with a sibling if borrowing isn’t possible, pulling down a key from the parent

// Pseudocode for deletion complexity illustration
void btree_delete(BTreeNode* node, int key) {
    int idx = find_key_index(node, key);
    
    if (idx < node->num_keys && node->keys[idx] == key) {
        if (node->is_leaf) {
            // Case 1: Remove from leaf
            remove_from_leaf(node, idx);
        } else {
            // Case 2: Replace with predecessor/successor
            if (node->children[idx]->num_keys >= MIN_KEYS + 1) {
                int pred = get_predecessor(node, idx);
                node->keys[idx] = pred;
                btree_delete(node->children[idx], pred);
            } else if (node->children[idx + 1]->num_keys >= MIN_KEYS + 1) {
                int succ = get_successor(node, idx);
                node->keys[idx] = succ;
                btree_delete(node->children[idx + 1], succ);
            } else {
                merge_children(node, idx);
                btree_delete(node->children[idx], key);
            }
        }
    } else {
        // Key not in this node; if we're at a leaf, it doesn't exist
        if (node->is_leaf) {
            return;
        }
        ensure_child_has_enough_keys(node, idx);
        btree_delete(node->children[idx], key);
    }
}

The complexity of deletion explains why some databases use “lazy deletion”—marking records as deleted without physically removing them, then cleaning up during compaction.

B-Tree vs. B+Tree

B+Trees modify the B-Tree design in two critical ways:

  1. All data lives in leaves — Internal nodes contain only keys for navigation
  2. Leaves are linked — A pointer chain connects all leaf nodes

typedef struct BPlusTreeNode {
    int num_keys;
    int keys[MAX_KEYS];
    bool is_leaf;
    
    union {
        struct BPlusTreeNode* children[MAX_KEYS + 1];  // Internal nodes
        struct {
            void* values[MAX_KEYS];                     // Leaf nodes
            struct BPlusTreeNode* next_leaf;            // Leaf chain
        };
    };
} BPlusTreeNode;

This design offers significant advantages for databases:

Higher fanout: Internal nodes don’t store values, so they fit more keys per block. More keys per node means shorter trees.

Efficient range queries: To find all keys between 100 and 200, locate 100, then follow the leaf chain. No tree traversal needed for the range itself.

Predictable performance: Every search traverses the same number of levels (root to leaf), making query planning more reliable.

Simpler caching: Internal nodes are accessed frequently and cached effectively. Leaf nodes can be evicted without affecting navigation.

This is why MySQL’s InnoDB, PostgreSQL, SQLite, and virtually every production database uses B+Trees for indexes.

Real-World Applications

Database Indexes: InnoDB uses 16KB pages with B+Trees. A typical index on a bigint column fits roughly 1200 keys per internal node. Three levels handle 1.7 billion rows.

File Systems: NTFS, ext4, and HFS+ all use B-Tree variants for directory indexing and extent tracking. The file system’s block size directly influences B-Tree order.

Practical Considerations:

  • Fill factor: Databases often leave 10-30% free space in nodes to reduce splits during inserts
  • Bulk loading: Building a B-Tree by sorting data first and constructing bottom-up is dramatically faster than individual inserts
  • Compression: Modern databases compress B-Tree pages, trading CPU for I/O reduction

The B-Tree remains the backbone of data storage after 50 years because it perfectly matches the characteristics of block-based storage devices. Understanding its mechanics helps you make better decisions about indexing strategies, query optimization, and database configuration.
