LSM Tree: Log-Structured Merge Tree
Key Insights
- LSM trees trade read performance for dramatically better write throughput by converting random writes into sequential I/O through an append-only architecture
- The compaction strategy you choose (leveled vs. size-tiered) fundamentally determines your read/write amplification trade-offs and should match your workload characteristics
- Bloom filters are non-negotiable in production LSM implementations—without them, reads degrade to O(n) disk seeks across all SSTables
The Write Optimization Problem
B-trees have dominated database indexing for decades, but they carry a fundamental limitation: random I/O on writes. Every insert or update potentially requires reading a page, modifying it, and writing it back to a random location on disk. For write-heavy workloads—think time-series data, event logging, or messaging systems—this becomes a severe bottleneck.
LSM trees flip this model entirely. Instead of updating data in place, they batch writes in memory and flush them sequentially to disk. This approach powers some of the most write-intensive systems in production today: LevelDB (the storage engine behind Chrome’s IndexedDB), RocksDB (Facebook’s fork that runs much of their infrastructure), Apache Cassandra, and HBase.
The trade-off is explicit: you’re optimizing for write throughput at the cost of read complexity. Understanding when this trade-off makes sense—and how to tune it—separates effective system design from cargo-cult architecture decisions.
Core Architecture: Memtable and SSTables
An LSM tree consists of two primary components: an in-memory buffer called the memtable and a series of immutable on-disk files called Sorted String Tables (SSTables).
The write path is straightforward. When a write arrives, it goes directly to the memtable—typically implemented as a red-black tree, AVL tree, or skip list to maintain sorted order. Writes are fast because they’re purely in-memory operations. When the memtable reaches a configured size threshold, it’s frozen and flushed to disk as a new SSTable.
SSTables are immutable, sorted files. Each contains key-value pairs in sorted order, along with an index for efficient lookups. Immutability is crucial—it eliminates the need for complex locking and enables straightforward crash recovery.
Here’s a minimal memtable implementation:
```python
import threading

from sortedcontainers import SortedDict


class Memtable:
    def __init__(self, size_threshold_bytes=4 * 1024 * 1024):
        self._data = SortedDict()
        self._size = 0
        self._threshold = size_threshold_bytes
        self._lock = threading.RLock()

    def put(self, key: bytes, value: bytes) -> bool:
        """Insert key-value pair. Returns True if memtable should be flushed."""
        with self._lock:
            old_value = self._data.get(key)
            if old_value is not None:  # empty values are falsy but still count
                self._size -= len(key) + len(old_value)
            self._data[key] = value
            self._size += len(key) + len(value)
            return self._size >= self._threshold

    def get(self, key: bytes) -> bytes | None:
        with self._lock:
            return self._data.get(key)

    def delete(self, key: bytes) -> bool:
        """Deletes are tombstones, not actual removals."""
        return self.put(key, b"__TOMBSTONE__")

    def flush_to_sstable(self, path: str):
        """Serialize sorted data to disk."""
        with self._lock:
            with open(path, 'wb') as f:
                for key, value in self._data.items():
                    # Simple format: key_len | key | value_len | value
                    f.write(len(key).to_bytes(4, 'big'))
                    f.write(key)
                    f.write(len(value).to_bytes(4, 'big'))
                    f.write(value)
```
Notice that deletes don’t actually remove data—they insert tombstone markers. The actual deletion happens during compaction, which we’ll cover next.
The Compaction Process
As writes continue, SSTables accumulate on disk. Without intervention, reads would need to check every SSTable to find a key, degrading performance linearly with the number of files. Compaction solves this by periodically merging SSTables together.
Two primary compaction strategies exist:
Size-tiered compaction groups SSTables of similar sizes and merges them when a threshold count is reached. It’s write-optimized—fewer compaction operations mean less write amplification—but can temporarily double space usage and creates larger files that take longer to compact.
Leveled compaction organizes SSTables into levels, where each level is 10x larger than the previous. Level 0 contains recently flushed memtables; higher levels contain merged, non-overlapping SSTables. This approach bounds space amplification and provides more predictable read performance, but increases write amplification.
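The 10x growth factor makes level capacities easy to compute. A rough sketch, where the 10 MB base size and the fanout of 10 are illustrative assumptions rather than fixed constants:

```python
def level_capacity_bytes(level: int,
                         base_bytes: int = 10 * 1024 * 1024,
                         fanout: int = 10) -> int:
    """Capacity of a level under leveled compaction: base * fanout^(level-1).
    Level 0 is special (it holds overlapping memtable flushes), so we just
    treat it as base-sized here."""
    if level <= 0:
        return base_bytes
    return base_bytes * fanout ** (level - 1)
```

With these assumptions, level 1 holds 10 MB, level 2 holds 100 MB, and level 3 holds about 1 GB, which is why a handful of levels suffices for most datasets.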
The merge algorithm itself is a standard sorted merge:
```python
import heapq


def merge_sstables(sstable_paths: list[str], output_path: str):
    """Merge multiple sorted SSTables into one."""
    readers = []
    heap = []

    # Open all SSTables and seed the heap
    for idx, path in enumerate(sstable_paths):
        reader = SSTableReader(path)
        entry = reader.next()
        if entry:
            # Heap entries: (key, sstable_index, value, reader)
            # Lower index = newer SSTable = higher priority
            heapq.heappush(heap, (entry.key, idx, entry.value, reader))
        readers.append(reader)

    with open(output_path, 'wb') as out:
        last_key = None
        while heap:
            key, idx, value, reader = heapq.heappop(heap)
            # Skip duplicate keys (keep newest, which has lowest idx)
            if key != last_key:
                # Drop tombstones; safe only when merging into the oldest level
                if value != b"__TOMBSTONE__":
                    write_entry(out, key, value)
                last_key = key
            # Advance this reader
            entry = reader.next()
            if entry:
                heapq.heappush(heap, (entry.key, idx, entry.value, reader))

    # Cleanup
    for reader in readers:
        reader.close()
```
The key insight here: newer SSTables take precedence. When the same key appears in multiple files, we keep only the most recent version. Tombstones propagate through compaction until they reach the oldest level, at which point they can be safely discarded.
Read Path and Bloom Filters
Reads in an LSM tree follow a specific order: check the memtable first (it contains the newest data), then search SSTables from newest to oldest. The moment you find the key, you're done; there is no need to check older files. Finding a tombstone also ends the search: it means the key was deleted, so any older versions must be ignored.
The problem is obvious: if a key doesn’t exist, you’ve just checked every SSTable on disk. For workloads with frequent misses, this is catastrophic.
Bloom filters solve this elegantly. A Bloom filter is a probabilistic data structure that can definitively say “this key is NOT in this SSTable” or “this key MIGHT be in this SSTable.” False positives are possible; false negatives are not.
```python
import math

import mmh3  # MurmurHash3
from bitarray import bitarray


class BloomFilter:
    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Calculate optimal size and hash count
        self._size = self._optimal_size(expected_items, false_positive_rate)
        self._hash_count = self._optimal_hash_count(self._size, expected_items)
        self._bits = bitarray(self._size)
        self._bits.setall(0)

    def _optimal_size(self, n: int, p: float) -> int:
        # m = -n * ln(p) / (ln 2)^2
        return int(-n * math.log(p) / (math.log(2) ** 2))

    def _optimal_hash_count(self, m: int, n: int) -> int:
        # k = (m / n) * ln 2
        return int((m / n) * math.log(2))

    def add(self, key: bytes):
        for i in range(self._hash_count):
            idx = mmh3.hash(key, i) % self._size
            self._bits[idx] = 1

    def might_contain(self, key: bytes) -> bool:
        for i in range(self._hash_count):
            idx = mmh3.hash(key, i) % self._size
            if not self._bits[idx]:
                return False  # Definitely not present
        return True  # Possibly present
```
Each SSTable gets its own Bloom filter, built during flush and stored alongside the data. Before reading any SSTable, check its Bloom filter. With a 1% false positive rate, you eliminate 99% of unnecessary disk reads for missing keys.
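Putting the pieces together, a point lookup might look like the following sketch. The pairing of each SSTable with its Bloom filter and the `search(key)` method on the SSTable side are assumptions for illustration; tombstone handling follows the semantics described above:

```python
TOMBSTONE = b"__TOMBSTONE__"


def lsm_get(key: bytes, memtable, sstables_newest_first):
    """Point lookup: memtable first, then SSTables newest to oldest.
    `sstables_newest_first` is a list of (bloom_filter, sstable) pairs;
    each sstable is assumed to expose search(key) -> bytes | None."""
    value = memtable.get(key)
    if value is not None:
        # A tombstone shadows any older versions of the key
        return None if value == TOMBSTONE else value
    for bloom, sstable in sstables_newest_first:
        if not bloom.might_contain(key):
            continue  # Definitely absent from this SSTable; skip the disk read
        value = sstable.search(key)
        if value is not None:
            return None if value == TOMBSTONE else value
    return None  # Key not found anywhere
```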
Write-Ahead Logging (WAL)
The memtable is volatile—a crash loses all unflushed data. Write-ahead logging provides durability by appending every write to a log file before acknowledging it to the client.
The recovery process is simple: on startup, replay the WAL to reconstruct the memtable state. Once the memtable is flushed to an SSTable, its corresponding WAL segment can be deleted.
```python
import os
import struct
import zlib


class WriteAheadLog:
    def __init__(self, path: str):
        self._path = path
        self._file = open(path, 'ab', buffering=0)  # Unbuffered for durability

    def append(self, key: bytes, value: bytes) -> int:
        """Append entry and return log sequence number."""
        entry = struct.pack('>I', len(key)) + key + struct.pack('>I', len(value)) + value
        checksum = self._crc32(entry)
        record = struct.pack('>I', checksum) + entry
        self._file.write(record)
        self._file.flush()
        os.fsync(self._file.fileno())  # Force to disk
        return self._file.tell()

    @staticmethod
    def _crc32(data: bytes) -> int:
        return zlib.crc32(data) & 0xffffffff

    @classmethod
    def replay(cls, path: str, memtable: Memtable):
        """Replay WAL entries into memtable, stopping at a torn or corrupt tail."""
        with open(path, 'rb') as f:
            while True:
                header = f.read(4)
                if len(header) < 4:
                    break  # Clean end of log
                checksum = struct.unpack('>I', header)[0]
                key_header = f.read(4)
                if len(key_header) < 4:
                    break  # Truncated record from a mid-write crash
                key_len = struct.unpack('>I', key_header)[0]
                key = f.read(key_len)
                value_header = f.read(4)
                if len(key) < key_len or len(value_header) < 4:
                    break
                value_len = struct.unpack('>I', value_header)[0]
                value = f.read(value_len)
                entry = key_header + key + value_header + value
                if len(value) < value_len or cls._crc32(entry) != checksum:
                    break  # Corrupt or torn record; discard the tail
                memtable.put(key, value)
```
The fsync call is critical—without it, data might sit in the OS buffer cache during a crash. This is also the primary source of write latency in LSM trees, which is why many implementations offer configurable durability (sync every write vs. sync periodically).
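One way to sketch that configurability is to fsync at most once per interval, accepting a bounded window of data loss in exchange for lower write latency. The class and parameter names here are illustrative, not taken from any particular library:

```python
import os
import time


class BatchedWAL:
    """WAL variant that fsyncs at most once per sync_interval_ms,
    trading a bounded window of potential data loss for lower latency."""

    def __init__(self, path: str, sync_interval_ms: float = 10.0):
        self._file = open(path, 'ab')
        self._interval = sync_interval_ms / 1000.0
        self._last_sync = 0.0

    def append(self, record: bytes):
        self._file.write(record)
        self._file.flush()  # Push to the OS buffer cache
        now = time.monotonic()
        if now - self._last_sync >= self._interval:
            os.fsync(self._file.fileno())  # Force to stable storage
            self._last_sync = now
```

With `sync_interval_ms=0` this degenerates to sync-every-write; larger values amortize the fsync cost across many appends, which is essentially what "periodic sync" modes in production engines do.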
Trade-offs and Tuning Considerations
LSM trees involve three types of amplification:
Write amplification: Data is written multiple times—once to WAL, once to memtable flush, and multiple times during compaction. Leveled compaction can reach 10-30x write amplification.
Read amplification: A point lookup might check multiple SSTables. Bloom filters mitigate this, but range scans still suffer.
Space amplification: Obsolete data persists until compaction. Size-tiered compaction can temporarily require 2x the logical data size.
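A back-of-the-envelope estimate of leveled write amplification, assuming each byte costs one write to the WAL, one for the memtable flush, and roughly `fanout` bytes of rewriting per level it passes through. This is a common approximation, not an exact model:

```python
def leveled_write_amplification(num_levels: int, fanout: int = 10) -> int:
    """Rough upper bound on leveled write amplification:
    1x for the WAL + 1x for the flush, then compacting a byte into level N
    rewrites up to `fanout` bytes of existing level-N data alongside it."""
    wal_and_flush = 2
    compaction = fanout * (num_levels - 1)  # each of levels 1..N-1 costs ~fanout
    return wal_and_flush + compaction
```

With a fanout of 10 and four levels this gives roughly 2 + 30 = 32x, in line with the 10-30x range cited above.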
Tuning depends on your workload. For write-heavy, read-light workloads (logging, time-series), use size-tiered compaction with larger memtables. For read-heavy workloads with updates, use leveled compaction with aggressive Bloom filter sizing.
Compared to B-trees, LSM trees win on write throughput (10-100x for random writes) but lose on point reads (2-5x slower) and range scans (variable, depends on compaction state).
Real-World Implementations
LevelDB: Google’s foundational implementation. Single-threaded compaction, leveled strategy only. Clean codebase, excellent for learning.
RocksDB: Facebook’s production-hardened fork. Adds multi-threaded compaction, column families, transactions, and dozens of tuning knobs. The de facto standard for embedded LSM storage.
Cassandra: Uses size-tiered compaction by default, optimized for append-heavy distributed workloads. Offers leveled compaction for read-heavy tables.
HBase: Built on HDFS, uses LSM architecture for its MemStore and HFiles. Optimized for wide-column access patterns.
Each implementation makes different trade-offs, but the core architecture remains consistent. Understanding these fundamentals lets you make informed decisions about storage engines rather than treating them as black boxes.