LSM Tree: Log-Structured Merge Tree

Key Insights

  • LSM trees trade read performance for dramatically better write throughput by converting random writes into sequential I/O through an append-only architecture
  • The compaction strategy you choose (leveled vs. size-tiered) fundamentally determines your read/write amplification trade-offs and should match your workload characteristics
  • Bloom filters are non-negotiable in production LSM implementations—without them, reads degrade to O(n) disk seeks across all SSTables

The Write Optimization Problem

B-trees have dominated database indexing for decades, but they carry a fundamental limitation: random I/O on writes. Every insert or update potentially requires reading a page, modifying it, and writing it back to a random location on disk. For write-heavy workloads—think time-series data, event logging, or messaging systems—this becomes a severe bottleneck.

LSM trees flip this model entirely. Instead of updating data in place, they batch writes in memory and flush them sequentially to disk. This approach powers some of the most write-intensive systems in production today: LevelDB (the storage engine behind Chrome’s IndexedDB), RocksDB (Facebook’s fork that runs much of their infrastructure), Apache Cassandra, and HBase.

The trade-off is explicit: you’re optimizing for write throughput at the cost of read complexity. Understanding when this trade-off makes sense—and how to tune it—separates effective system design from cargo-cult architecture decisions.

Core Architecture: Memtable and SSTables

An LSM tree consists of two primary components: an in-memory buffer called the memtable and a series of immutable on-disk files called Sorted String Tables (SSTables).

The write path is straightforward. When a write arrives, it goes directly to the memtable—typically implemented as a red-black tree, AVL tree, or skip list to maintain sorted order. Writes are fast because they’re purely in-memory operations. When the memtable reaches a configured size threshold, it’s frozen and flushed to disk as a new SSTable.

SSTables are immutable, sorted files. Each contains key-value pairs in sorted order, along with an index for efficient lookups. Immutability is crucial—it eliminates the need for complex locking and enables straightforward crash recovery.

Here’s a minimal memtable implementation:

import threading
from sortedcontainers import SortedDict

class Memtable:
    def __init__(self, size_threshold_bytes=4 * 1024 * 1024):
        self._data = SortedDict()
        self._size = 0
        self._threshold = size_threshold_bytes
        self._lock = threading.RLock()
    
    def put(self, key: bytes, value: bytes) -> bool:
        """Insert key-value pair. Returns True if memtable should be flushed."""
        with self._lock:
            old_value = self._data.get(key)
            if old_value is not None:  # `if old_value:` would miss empty values
                self._size -= len(key) + len(old_value)
            
            self._data[key] = value
            self._size += len(key) + len(value)
            
            return self._size >= self._threshold
    
    def get(self, key: bytes) -> bytes | None:
        with self._lock:
            # May return the tombstone marker; the read path must treat
            # that as "deleted", not as a miss.
            return self._data.get(key)
    
    def delete(self, key: bytes) -> bool:
        """Deletes are tombstones, not actual removals.

        A magic byte string is fragile (a user could store the same
        bytes); real systems use a type flag in the record encoding,
        but the sentinel keeps this sketch simple.
        """
        return self.put(key, b"__TOMBSTONE__")
    
    def flush_to_sstable(self, path: str):
        """Serialize sorted data to disk."""
        with self._lock:
            with open(path, 'wb') as f:
                for key, value in self._data.items():
                    # Simple format: key_len | key | value_len | value
                    f.write(len(key).to_bytes(4, 'big'))
                    f.write(key)
                    f.write(len(value).to_bytes(4, 'big'))
                    f.write(value)

Notice that deletes don’t actually remove data—they insert tombstone markers. The actual deletion happens during compaction, which we’ll cover next.

The Compaction Process

As writes continue, SSTables accumulate on disk. Without intervention, reads would need to check every SSTable to find a key, degrading performance linearly with the number of files. Compaction solves this by periodically merging SSTables together.

Two primary compaction strategies exist:

Size-tiered compaction groups SSTables of similar sizes and merges them when a threshold count is reached. It’s write-optimized—fewer compaction operations mean less write amplification—but can temporarily double space usage and creates larger files that take longer to compact.

Leveled compaction organizes SSTables into levels, where each level is 10x larger than the previous. Level 0 contains recently flushed memtables; higher levels contain merged, non-overlapping SSTables. This approach bounds space amplification and provides more predictable read performance, but increases write amplification.
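To make the growth factor concrete, here is a sketch of per-level size targets. The `level_capacity_bytes` helper and the 10 MB level-1 target are illustrative assumptions; only the 10x fanout comes from the text above.

```python
def level_capacity_bytes(level: int, l1_bytes: int = 10 * 1024 * 1024,
                         fanout: int = 10) -> int:
    """Target byte capacity of level `level` (>= 1) under leveled
    compaction. Level 0 is typically bounded by file count instead
    of bytes. Defaults are illustrative, not from any one engine."""
    return l1_bytes * fanout ** (level - 1)
```

With these defaults, level 1 holds 10 MB, level 2 holds 100 MB, level 3 holds 1 GB, and so on: total capacity is dominated by the last level.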

The merge algorithm itself is a standard sorted merge:

def merge_sstables(sstable_paths: list[str], output_path: str,
                   drop_tombstones: bool = False):
    """Merge multiple sorted SSTables into one.

    sstable_paths must be ordered newest-first. Tombstones may only be
    dropped when merging into the oldest level; otherwise they must
    propagate so they keep shadowing older versions of the key.
    """
    import heapq

    readers = []
    heap = []

    # Open all SSTables and seed the heap
    for idx, path in enumerate(sstable_paths):
        reader = SSTableReader(path)
        readers.append(reader)  # Track every reader so each gets closed
        entry = reader.next()
        if entry:
            # Heap entries: (key, sstable_index, value, reader)
            # Lower index = newer SSTable = higher priority
            heapq.heappush(heap, (entry.key, idx, entry.value, reader))

    with open(output_path, 'wb') as out:
        last_key = None

        while heap:
            key, idx, value, reader = heapq.heappop(heap)

            # Skip duplicate keys (keep newest, which has lowest idx)
            if key != last_key:
                # Drop tombstones only when compacting into the oldest level
                if not (drop_tombstones and value == b"__TOMBSTONE__"):
                    # write_entry serializes in the same
                    # key_len | key | value_len | value format as the flush
                    write_entry(out, key, value)
                last_key = key

            # Advance this reader
            entry = reader.next()
            if entry:
                heapq.heappush(heap, (entry.key, idx, entry.value, reader))

    # Cleanup
    for reader in readers:
        reader.close()

The key insight here: newer SSTables take precedence. When the same key appears in multiple files, we keep only the most recent version. Tombstones propagate through compaction until they reach the oldest level, at which point they can be safely discarded.

Read Path and Bloom Filters

Reads in an LSM tree follow a specific order: check the memtable first (it contains the newest data), then search SSTables from newest to oldest. The moment you find the key, you’re done—no need to check older files.

The problem is obvious: if a key doesn’t exist, you’ve just checked every SSTable on disk. For workloads with frequent misses, this is catastrophic.

Bloom filters solve this elegantly. A Bloom filter is a probabilistic data structure that can definitively say “this key is NOT in this SSTable” or “this key MIGHT be in this SSTable.” False positives are possible; false negatives are not.

import mmh3  # MurmurHash3
from bitarray import bitarray

class BloomFilter:
    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Calculate optimal size and hash count
        self._size = self._optimal_size(expected_items, false_positive_rate)
        self._hash_count = self._optimal_hash_count(self._size, expected_items)
        self._bits = bitarray(self._size)
        self._bits.setall(0)
    
    def _optimal_size(self, n: int, p: float) -> int:
        import math
        return int(-n * math.log(p) / (math.log(2) ** 2))
    
    def _optimal_hash_count(self, m: int, n: int) -> int:
        import math
        # Round to the nearest integer and never use fewer than one hash
        return max(1, round((m / n) * math.log(2)))
    
    def add(self, key: bytes):
        for i in range(self._hash_count):
            idx = mmh3.hash(key, i) % self._size
            self._bits[idx] = 1
    
    def might_contain(self, key: bytes) -> bool:
        for i in range(self._hash_count):
            idx = mmh3.hash(key, i) % self._size
            if not self._bits[idx]:
                return False  # Definitely not present
        return True  # Possibly present

Each SSTable gets its own Bloom filter, built during flush and stored alongside the data. Before reading any SSTable, check its Bloom filter. With a 1% false positive rate, you eliminate 99% of unnecessary disk reads for missing keys.

Write-Ahead Logging (WAL)

The memtable is volatile—a crash loses all unflushed data. Write-ahead logging provides durability by appending every write to a log file before acknowledging it to the client.

The recovery process is simple: on startup, replay the WAL to reconstruct the memtable state. Once the memtable is flushed to an SSTable, its corresponding WAL segment can be deleted.

import struct
import os

class WriteAheadLog:
    def __init__(self, path: str):
        self._path = path
        self._file = open(path, 'ab', buffering=0)  # No Python-level buffering; fsync below provides durability
    
    def append(self, key: bytes, value: bytes) -> int:
        """Append entry; returns the end-of-record byte offset,
        which doubles as a log sequence number."""
        entry = struct.pack('>I', len(key)) + key + struct.pack('>I', len(value)) + value
        checksum = self._crc32(entry)
        
        record = struct.pack('>I', checksum) + entry
        self._file.write(record)
        self._file.flush()
        os.fsync(self._file.fileno())  # Force to disk
        
        return self._file.tell()
    
    def _crc32(self, data: bytes) -> int:
        import zlib
        return zlib.crc32(data) & 0xffffffff
    
    @classmethod
    def replay(cls, path: str, memtable: Memtable):
        """Replay WAL entries into memtable, verifying each record's
        checksum and stopping at the first corrupt or truncated record
        (e.g. a torn write from a crash mid-append)."""
        import zlib
        with open(path, 'rb') as f:
            while True:
                header = f.read(4)
                if len(header) < 4:
                    break
                checksum = struct.unpack('>I', header)[0]

                try:
                    key_len = struct.unpack('>I', f.read(4))[0]
                    key = f.read(key_len)
                    value_len = struct.unpack('>I', f.read(4))[0]
                    value = f.read(value_len)
                except struct.error:
                    break  # Truncated tail record

                entry = (struct.pack('>I', key_len) + key +
                         struct.pack('>I', value_len) + value)
                if zlib.crc32(entry) & 0xffffffff != checksum:
                    break  # Checksum mismatch: stop before corrupt data

                memtable.put(key, value)

The fsync call is critical—without it, data might sit in the OS buffer cache during a crash. This is also the primary source of write latency in LSM trees, which is why many implementations offer configurable durability (sync every write vs. sync periodically).
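A common middle ground is periodic or group sync. Here is a sketch of an every-N-writes policy; the `PeriodicSyncWAL` class and its `sync_every` knob are assumptions of this sketch, not a real LevelDB or RocksDB option name.

```python
import os

class PeriodicSyncWAL:
    """WAL variant that fsyncs every `sync_every` appends instead of
    on every write -- trading a bounded window of potential data loss
    for much lower write latency."""

    def __init__(self, path: str, sync_every: int = 100):
        self._file = open(path, 'ab')
        self._sync_every = sync_every
        self._pending = 0  # Appends since the last fsync

    def append(self, record: bytes):
        self._file.write(record)
        self._pending += 1
        if self._pending >= self._sync_every:
            self.sync()

    def sync(self):
        self._file.flush()
        os.fsync(self._file.fileno())
        self._pending = 0
```

On a crash, at most `sync_every - 1` acknowledged writes can be lost, which is acceptable for many telemetry and logging workloads but not for financial ones.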

Trade-offs and Tuning Considerations

LSM trees involve three types of amplification:

Write amplification: Data is written multiple times—once to WAL, once to memtable flush, and multiple times during compaction. Leveled compaction can reach 10-30x write amplification.

Read amplification: A point lookup might check multiple SSTables. Bloom filters mitigate this, but range scans still suffer.

Space amplification: Obsolete data persists until compaction. Size-tiered compaction can temporarily require 2x the logical data size.
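As a rough back-of-the-envelope for the 10-30x leveled figure above: pushing a byte down one level merges it with roughly a fanout's worth of overlapping data, so total write amplification scales with fanout times the number of levels. The `leveled_write_amp_estimate` model below is a deliberate simplification, not how any engine reports the metric.

```python
def leveled_write_amp_estimate(num_levels: int, fanout: int = 10) -> int:
    """Crude model: each level transition rewrites a byte alongside
    about `fanout` bytes of existing data in the target level, so
    total amplification is roughly fanout * num_levels."""
    return fanout * num_levels
```

With a fanout of 10, one to three levels beyond L0 lands in the 10-30x range quoted above.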

Tuning depends on your workload. For write-heavy, read-light workloads (logging, time-series), use size-tiered compaction with larger memtables. For read-heavy workloads with updates, use leveled compaction with aggressive Bloom filter sizing.

Compared to B-trees, LSM trees win on write throughput (10-100x for random writes) but lose on point reads (2-5x slower) and range scans (variable, depends on compaction state).

Real-World Implementations

LevelDB: Google’s foundational implementation. Single-threaded compaction, leveled strategy only. Clean codebase, excellent for learning.

RocksDB: Facebook’s production-hardened fork. Adds multi-threaded compaction, column families, transactions, and dozens of tuning knobs. The de facto standard for embedded LSM storage.

Cassandra: Uses size-tiered compaction by default, optimized for append-heavy distributed workloads. Offers leveled compaction for read-heavy tables.

HBase: Built on HDFS, uses LSM architecture for its MemStore and HFiles. Optimized for wide-column access patterns.

Each implementation makes different trade-offs, but the core architecture remains consistent. Understanding these fundamentals lets you make informed decisions about storage engines rather than treating them as black boxes.
