B-Tree: Balanced Search Tree for Storage

Every time you query a database, search a file system directory, or look up a key in a production key-value store, you're almost certainly traversing a B-Tree. This data structure, invented by Rudolf...

Key Insights

  • B-Trees minimize disk I/O by maximizing branching factor—a tree with order 1000 can index billions of records in just 3-4 levels, meaning 3-4 disk reads per lookup
  • The self-balancing property guarantees O(log n) operations without the pathological cases that plague binary search trees, making B-Trees predictable under all workloads
  • B+ Trees, which store all data in linked leaf nodes, dominate database implementations because they excel at range queries and sequential scans

Why B-Trees Matter for Storage

Every time you query a database, search a file system directory, or look up a key in a production key-value store, you’re almost certainly traversing a B-Tree. This data structure, invented by Rudolf Bayer and Edward McCreight in 1970, remains the dominant indexing structure for disk-based storage systems over fifty years later.

Binary search trees seem like the obvious choice for ordered data—O(log n) search, insert, and delete. But they fail catastrophically for disk-based storage. A balanced BST with one billion records requires about 30 levels (log₂ 10⁹ ≈ 30). Each level means a potential disk seek, and disk seeks are expensive—roughly 10ms for spinning disks, 0.1ms for SSDs. That’s 300ms or 3ms per lookup, respectively. Unacceptable.

B-Trees solve this by dramatically increasing the branching factor. Instead of two children per node, a B-Tree node might have hundreds or thousands. A B-Tree with branching factor 1000 stores one billion records in just 3 levels. Three disk reads instead of thirty. That’s the entire value proposition.

B-Tree Structure and Properties

A B-Tree of order m (also called minimum degree t in some literature) has these properties:

  1. Every node has at most 2t - 1 keys and 2t children
  2. Every non-root node has at least t - 1 keys
  3. The root has at least 1 key (unless the tree is empty)
  4. All leaves appear at the same level
  5. A non-leaf node with k keys has k + 1 children

The node size is chosen to match the disk page size—typically 4KB or 16KB. This ensures each node read is a single disk I/O operation.

#define MAX_KEYS 255  // For order t=128, max keys = 2t-1 = 255
#define MIN_KEYS 127  // Minimum keys in non-root = t-1

typedef struct BTreeNode {
    int num_keys;                      // Current number of keys
    int keys[MAX_KEYS];                // Keys stored in sorted order
    void* values[MAX_KEYS];            // Associated values (or record pointers)
    struct BTreeNode* children[MAX_KEYS + 1];  // Child pointers
    bool is_leaf;                      // True if this is a leaf node
} BTreeNode;

typedef struct BTree {
    BTreeNode* root;
    int order;  // Minimum degree t
} BTree;

In practice, the children array holds disk page numbers rather than memory pointers. The node structure is serialized directly to disk, and child “pointers” are offsets or page IDs that get resolved through the buffer pool.

Search Operation

Search in a B-Tree is straightforward: start at the root, find the appropriate child pointer based on key comparisons, and descend. Repeat until you find the key or reach a leaf.

void* btree_search(BTree* tree, int key) {
    BTreeNode* node = tree->root;
    
    while (node != NULL) {
        // Binary search within the node
        int i = 0;
        while (i < node->num_keys && key > node->keys[i]) {
            i++;
        }
        
        // Found the key
        if (i < node->num_keys && key == node->keys[i]) {
            return node->values[i];
        }
        
        // Key not in this node
        if (node->is_leaf) {
            return NULL;  // Key doesn't exist
        }
        
        // Descend to appropriate child
        node = node->children[i];
    }
    
    return NULL;
}

The complexity is O(log_t n) where t is the minimum degree. With t=500 and n=1 billion, that’s log₅₀₀(10⁹) ≈ 3.3 node accesses. The binary search within each node adds O(log t) comparisons, but those are in-memory operations—essentially free compared to disk I/O.

Insertion and Node Splitting

Insertion always happens at a leaf. The challenge is maintaining balance when nodes overflow. When a node exceeds its maximum capacity, it splits into two nodes, and the median key promotes to the parent.

The elegant approach is “split on the way down”—proactively split any full node encountered during descent. This guarantees the parent always has room for a promoted key.

void btree_insert(BTree* tree, int key, void* value) {
    BTreeNode* root = tree->root;
    
    // If root is full, split it first
    if (root->num_keys == MAX_KEYS) {
        BTreeNode* new_root = create_node(false);
        new_root->children[0] = root;
        split_child(new_root, 0);
        tree->root = new_root;
        insert_non_full(new_root, key, value);
    } else {
        insert_non_full(root, key, value);
    }
}

void split_child(BTreeNode* parent, int child_index) {
    BTreeNode* full_child = parent->children[child_index];
    BTreeNode* new_sibling = create_node(full_child->is_leaf);
    
    int mid = MAX_KEYS / 2;
    
    // Move upper half of keys to new sibling
    new_sibling->num_keys = MAX_KEYS - mid - 1;
    for (int i = 0; i < new_sibling->num_keys; i++) {
        new_sibling->keys[i] = full_child->keys[mid + 1 + i];
        new_sibling->values[i] = full_child->values[mid + 1 + i];
    }
    
    // Move children if not a leaf
    if (!full_child->is_leaf) {
        for (int i = 0; i <= new_sibling->num_keys; i++) {
            new_sibling->children[i] = full_child->children[mid + 1 + i];
        }
    }
    
    full_child->num_keys = mid;
    
    // Insert median key into parent
    for (int i = parent->num_keys; i > child_index; i--) {
        parent->keys[i] = parent->keys[i - 1];
        parent->values[i] = parent->values[i - 1];
        parent->children[i + 1] = parent->children[i];
    }
    
    parent->keys[child_index] = full_child->keys[mid];
    parent->values[child_index] = full_child->values[mid];
    parent->children[child_index + 1] = new_sibling;
    parent->num_keys++;
}

void insert_non_full(BTreeNode* node, int key, void* value) {
    int i = node->num_keys - 1;
    
    if (node->is_leaf) {
        // Shift keys and insert
        while (i >= 0 && key < node->keys[i]) {
            node->keys[i + 1] = node->keys[i];
            node->values[i + 1] = node->values[i];
            i--;
        }
        node->keys[i + 1] = key;
        node->values[i + 1] = value;
        node->num_keys++;
    } else {
        // Find child to descend into
        while (i >= 0 && key < node->keys[i]) {
            i--;
        }
        i++;
        
        // Split child if full (proactive splitting)
        if (node->children[i]->num_keys == MAX_KEYS) {
            split_child(node, i);
            if (key > node->keys[i]) {
                i++;
            }
        }
        insert_non_full(node->children[i], key, value);
    }
}

Deletion and Rebalancing

Deletion is the most complex operation. When removing a key causes a node to underflow (fewer than t-1 keys), you must rebalance by borrowing from a sibling or merging nodes.

void btree_delete(BTree* tree, int key) {
    delete_recursive(tree, tree->root, key);
    
    // If root becomes empty, shrink tree
    if (tree->root->num_keys == 0 && !tree->root->is_leaf) {
        BTreeNode* old_root = tree->root;
        tree->root = tree->root->children[0];
        free(old_root);
    }
}

void ensure_min_keys(BTreeNode* parent, int child_index) {
    BTreeNode* child = parent->children[child_index];
    
    if (child->num_keys >= MIN_KEYS) {
        return;  // Already has enough keys
    }
    
    // Try borrowing from left sibling
    if (child_index > 0) {
        BTreeNode* left_sibling = parent->children[child_index - 1];
        if (left_sibling->num_keys > MIN_KEYS) {
            borrow_from_left(parent, child_index);
            return;
        }
    }
    
    // Try borrowing from right sibling
    if (child_index < parent->num_keys) {
        BTreeNode* right_sibling = parent->children[child_index + 1];
        if (right_sibling->num_keys > MIN_KEYS) {
            borrow_from_right(parent, child_index);
            return;
        }
    }
    
    // Must merge - prefer merging with left sibling
    if (child_index > 0) {
        merge_children(parent, child_index - 1);
    } else {
        merge_children(parent, child_index);
    }
}

B-Tree Variants: B+ Trees and B* Trees

B+ Trees are the variant you’ll encounter in virtually every database. The key difference: internal nodes only store keys for navigation; all actual data lives in leaf nodes. Leaf nodes are linked together, forming a sorted linked list.

This design has major advantages for databases:

  • Range queries traverse the leaf list without touching internal nodes
  • Internal nodes are smaller (no data), so more keys fit per page, reducing tree height
  • Sequential scans are cache-friendly
typedef struct BPlusLeaf {
    int num_keys;
    int keys[MAX_LEAF_KEYS];
    void* values[MAX_LEAF_KEYS];
    struct BPlusLeaf* next;  // Link to next leaf
    struct BPlusLeaf* prev;  // Link to previous leaf
} BPlusLeaf;

// Range query: find all values where low <= key <= high
void bplus_range_query(BPlusTree* tree, int low, int high, 
                       void** results, int* count) {
    // Find leaf containing 'low'
    BPlusLeaf* leaf = find_leaf(tree, low);
    *count = 0;
    
    // Scan leaves until we exceed 'high'
    while (leaf != NULL) {
        for (int i = 0; i < leaf->num_keys; i++) {
            if (leaf->keys[i] > high) {
                return;  // Done
            }
            if (leaf->keys[i] >= low) {
                results[(*count)++] = leaf->values[i];
            }
        }
        leaf = leaf->next;  // Follow leaf chain
    }
}

B* Trees maintain a higher fill factor (2/3 full instead of 1/2) by redistributing keys between siblings before splitting. This reduces space overhead but complicates the implementation.

Real-World Applications

MySQL InnoDB uses B+ Trees for both primary indexes (clustered, storing actual row data in leaves) and secondary indexes (storing primary key references in leaves). The default page size is 16KB, yielding branching factors of several hundred.

PostgreSQL uses B+ Trees for its standard indexes. The implementation includes sophisticated concurrency control (Lehman-Yao algorithm) allowing concurrent reads and writes without global locks.

File systems like NTFS, ext4, and Btrfs use B-Trees to index directory entries and file extents. Btrfs uses a copy-on-write B-Tree variant that enables snapshots and checksumming.

SQLite uses B-Trees for all tables and indexes, with a page size of 4KB by default. The entire database is a single file containing a forest of B-Trees.

When tuning B-Tree-based systems, consider:

  • Page size: Larger pages mean higher branching factor but more wasted space for sparse nodes
  • Fill factor: Lower fill factors leave room for inserts but waste space
  • Key size: Shorter keys mean more keys per node; consider prefix compression

B-Trees aren’t glamorous, but they’re the workhorse of persistent storage. Understanding their mechanics helps you reason about database performance, choose appropriate indexes, and debug mysterious slowdowns. When your query is slow, it’s often because you’re traversing more B-Tree levels than necessary—or worse, scanning leaves sequentially when an index could help.

Liked this? There's more.

Every week: one practical technique, explained simply, with code you can use immediately.