Compaction: LSM Tree Maintenance
Key Insights
- Compaction is the garbage collector of LSM trees—without it, read performance degrades linearly with write volume as queries must scan ever-growing numbers of SSTables
- The choice between size-tiered and leveled compaction represents a fundamental tradeoff: write amplification versus read/space amplification, with no universally correct answer
- Compaction scheduling is as important as the algorithm itself; falling behind on compaction creates a debt spiral that’s difficult to recover from under load
Introduction to LSM Tree Compaction
LSM trees trade immediate write costs for deferred maintenance. Every write goes to an in-memory buffer, which periodically flushes to disk as an immutable SSTable. This design gives you excellent write throughput, but there’s a catch: you’re accumulating technical debt with every flush.
Compaction is how you pay that debt. It merges multiple SSTables into fewer, larger ones—eliminating duplicate keys, applying tombstones, and reclaiming space. Without compaction, your LSM tree becomes a write-only data structure that’s increasingly expensive to read.
This isn’t optional maintenance. It’s a core part of the system that determines whether your database remains performant or slowly suffocates under its own weight.
The Compaction Problem
Consider what happens to reads as SSTables accumulate. A point lookup must search SSTables newest-to-oldest until it finds the key—in the worst case, checking every one. Even with bloom filters letting you skip most SSTables that don't contain the key (at the cost of occasional false positives), the overhead grows with the table count.
```python
import random

class SimulatedLSMTree:
    """Demonstrates read degradation without compaction."""

    def __init__(self):
        self.sstables = []  # list of dicts, each representing an SSTable
        self.bloom_filter_fp_rate = 0.01  # 1% false-positive rate

    def flush_memtable(self, data: dict):
        """Simulate flushing a memtable to a new SSTable."""
        self.sstables.append(data.copy())

    def get(self, key: str) -> tuple[str | None, int]:
        """
        Look up a key, returning (value, sstables_checked).
        Searches newest to oldest, stopping when the key is found.
        """
        sstables_checked = 0
        for sstable in reversed(self.sstables):
            # Bloom filter check -- may produce false positives
            key_might_exist = key in sstable or random.random() < self.bloom_filter_fp_rate
            if key_might_exist:
                sstables_checked += 1
                if key in sstable:
                    return sstable[key], sstables_checked
        return None, sstables_checked

# Simulate a write-heavy workload without compaction. Each flush
# writes a distinct key range, so a lookup may have to search back
# through many older SSTables before finding its key.
lsm = SimulatedLSMTree()
for i in range(100):
    data = {f"key_{i * 1000 + j}": f"value_{i}_{j}" for j in range(1000)}
    lsm.flush_memtable(data)

# Measure read cost as the SSTable count grows. Slice a copy rather
# than truncating lsm.sstables in place, which would break later rounds.
for sstable_count in [10, 25, 50, 100]:
    probe = SimulatedLSMTree()
    probe.sstables = lsm.sstables[:sstable_count]
    lookups = 1000
    total_checked = 0
    for _ in range(lookups):
        key = f"key_{random.randint(0, sstable_count * 1000 - 1)}"
        _, checked = probe.get(key)
        total_checked += checked
    print(f"SSTables: {sstable_count:3d} | "
          f"Avg SSTables read per lookup: {total_checked / lookups:.2f}")
```

Output from a typical run shows the problem clearly (exact figures vary with the random seed):

```
SSTables:  10 | Avg SSTables read per lookup: 1.05
SSTables:  25 | Avg SSTables read per lookup: 1.12
SSTables:  50 | Avg SSTables read per lookup: 1.25
SSTables: 100 | Avg SSTables read per lookup: 1.50
```
The degradation is sublinear thanks to bloom filters, but it’s still unbounded. Beyond read amplification, you’re also dealing with space amplification—deleted keys and overwritten values consume disk until compaction removes them.
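To make the space cost concrete, here's a back-of-envelope model; the function and its parameters are illustrative, not drawn from any real engine:

```python
def space_amplification(sstable_count: int, overwrite_fraction: float) -> float:
    """
    Toy model: each of `sstable_count` un-compacted SSTables overwrites
    `overwrite_fraction` of the keyspace, so dead versions accumulate
    until compaction reclaims them.
    """
    live = 1.0  # logical data size, normalized
    dead = (sstable_count - 1) * overwrite_fraction
    return (live + dead) / live

# 50 un-compacted SSTables, each overwriting half the keyspace
print(space_amplification(50, 0.5))  # → 25.5
```

Under this deliberately crude model, disk usage grows linearly with the number of un-compacted SSTables—exactly the debt compaction exists to pay down.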
Compaction Strategies
The strategy you choose determines how you balance three competing concerns: read amplification (how many SSTables you check), write amplification (how many times data gets rewritten), and space amplification (how much extra disk you need).
Size-Tiered Compaction
Group SSTables by size, merge when you have enough similarly-sized tables. Simple and write-efficient, but allows significant space and read amplification.
```python
from dataclasses import dataclass
from typing import List

@dataclass
class SSTable:
    id: int
    size_mb: float
    level: int = 0

class SizeTieredCompaction:
    """
    Size-tiered compaction strategy.
    Merges SSTables when enough similar-sized tables accumulate.
    """

    def __init__(self, min_threshold: int = 4, max_threshold: int = 32,
                 bucket_low: float = 0.5, bucket_high: float = 1.5):
        self.min_threshold = min_threshold  # min SSTables to trigger compaction
        self.max_threshold = max_threshold  # max SSTables merged at once
        self.bucket_low = bucket_low        # size-ratio bounds for bucket membership
        self.bucket_high = bucket_high

    def get_compaction_candidates(self, sstables: List[SSTable]) -> List[SSTable]:
        """Find SSTables that should be compacted together."""
        if len(sstables) < self.min_threshold:
            return []
        # Group by similar size
        buckets = self._create_size_buckets(sstables)
        # Find a bucket that exceeds the threshold
        for bucket in buckets:
            if len(bucket) >= self.min_threshold:
                # Compact the oldest tables in this bucket
                return sorted(bucket, key=lambda s: s.id)[:self.max_threshold]
        # Force compaction if the total count is too high
        if len(sstables) > self.max_threshold:
            return sorted(sstables, key=lambda s: s.size_mb)[:self.min_threshold]
        return []

    def _create_size_buckets(self, sstables: List[SSTable]) -> List[List[SSTable]]:
        """Group SSTables into size-based buckets."""
        if not sstables:
            return []
        sorted_tables = sorted(sstables, key=lambda s: s.size_mb)
        buckets = []
        current_bucket = [sorted_tables[0]]
        for table in sorted_tables[1:]:
            avg_size = sum(t.size_mb for t in current_bucket) / len(current_bucket)
            if avg_size * self.bucket_low <= table.size_mb <= avg_size * self.bucket_high:
                current_bucket.append(table)
            else:
                buckets.append(current_bucket)
                current_bucket = [table]
        buckets.append(current_bucket)
        return buckets

# Example usage
compactor = SizeTieredCompaction(min_threshold=4)
sstables = [
    SSTable(1, 64), SSTable(2, 68), SSTable(3, 62), SSTable(4, 70),  # similar sizes
    SSTable(5, 256), SSTable(6, 512),  # larger, won't be grouped with the above
]
candidates = compactor.get_compaction_candidates(sstables)
print(f"Compaction candidates: {[s.id for s in candidates]}")
# Output: Compaction candidates: [1, 2, 3, 4]
```
Leveled Compaction
Organize SSTables into levels with exponentially increasing size limits. Each level (except L0) maintains non-overlapping key ranges. This gives predictable read performance at the cost of higher write amplification.
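A minimal sketch of the selection step, assuming a 10x level multiplier and a naive oldest-table-first victim policy (real engines such as RocksDB use more elaborate scoring; all names here are illustrative):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LeveledSSTable:
    id: int
    level: int
    min_key: str
    max_key: str
    size_mb: float

def level_target_mb(level: int, base_mb: float = 256, multiplier: int = 10) -> float:
    """L1's target is base_mb; each deeper level is `multiplier` times larger."""
    return base_mb * multiplier ** (level - 1)

def pick_leveled_compaction(tables: List[LeveledSSTable], level: int) -> List[LeveledSSTable]:
    """
    If `level` exceeds its target size, pick its oldest table plus every
    table in level+1 whose key range overlaps it.
    """
    in_level = [t for t in tables if t.level == level]
    if sum(t.size_mb for t in in_level) <= level_target_mb(level):
        return []  # level is within budget; nothing to do
    victim = min(in_level, key=lambda t: t.id)  # naive: oldest table first
    overlapping = [
        t for t in tables
        if t.level == level + 1
        and not (t.max_key < victim.min_key or t.min_key > victim.max_key)
    ]
    return [victim] + overlapping

tables = [
    LeveledSSTable(1, 1, "a", "m", 200),
    LeveledSSTable(2, 1, "n", "z", 100),  # L1 totals 300 MB, over its 256 MB target
    LeveledSSTable(3, 2, "a", "f", 400),
    LeveledSSTable(4, 2, "g", "p", 400),
    LeveledSSTable(5, 2, "q", "z", 400),
]
print([t.id for t in pick_leveled_compaction(tables, 1)])  # → [1, 3, 4]
```

Because L2's tables have disjoint key ranges, only the two that overlap the victim's a-m span get rewritten; that bounded overlap is what makes leveled compaction's read path predictable.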
FIFO/Time-Window Compaction
For time-series workloads where old data expires, simply drop entire SSTables when they age out. No merge required—just delete. This only works when your data has a natural TTL.
The Merge Process
Regardless of strategy, the actual merge is a k-way merge of sorted sequences. You’re combining multiple sorted SSTables into one (or more) new sorted SSTables.
```python
import heapq
from dataclasses import dataclass, field
from typing import Iterator, Optional, Tuple

@dataclass(order=True)
class HeapEntry:
    """Entry in the merge heap, ordered by key, then by negated sequence number."""
    key: str
    neg_sequence: int = field(compare=True)  # stores -sequence so newer sorts first
    value: Optional[str] = field(compare=False)
    source_id: int = field(compare=False)
    is_tombstone: bool = field(compare=False, default=False)

def merge_sstables(
    sstable_iterators: list[Iterator[Tuple[str, Optional[str], int, bool]]]
) -> Iterator[Tuple[str, str]]:
    """
    K-way merge of sorted SSTable iterators.
    Each iterator yields (key, value, sequence_number, is_tombstone).
    Outputs deduplicated (key, value) pairs, applying tombstones.
    """
    heap = []
    # Initialize the heap with the first entry from each SSTable
    for source_id, iterator in enumerate(sstable_iterators):
        try:
            key, value, seq, is_tomb = next(iterator)
            heapq.heappush(heap, HeapEntry(key, -seq, value, source_id, is_tomb))
        except StopIteration:
            continue
    current_key = None
    current_entry = None
    while heap:
        entry = heapq.heappop(heap)
        # Advance the source iterator
        try:
            key, value, seq, is_tomb = next(sstable_iterators[entry.source_id])
            heapq.heappush(heap, HeapEntry(key, -seq, value, entry.source_id, is_tomb))
        except StopIteration:
            pass
        # Deduplicate: keep only the newest version of each key
        if entry.key != current_key:
            # Emit the previous key unless it was a tombstone
            if current_entry is not None and not current_entry.is_tombstone:
                yield (current_entry.key, current_entry.value)
            current_key = entry.key
            current_entry = entry
        # If it's the same key, current_entry already holds the newer
        # version (it popped first because its negated sequence is smaller)
    # Don't forget the last key
    if current_entry is not None and not current_entry.is_tombstone:
        yield (current_entry.key, current_entry.value)

# Example: merge three SSTables
sstable1 = [("a", "v1", 100, False), ("c", "v1", 100, False)]
sstable2 = [("a", "v2", 200, False), ("b", "v1", 150, False)]  # newer 'a'
sstable3 = [("b", None, 300, True), ("d", "v1", 100, False)]   # tombstone for 'b'

iterators = [iter(s) for s in (sstable1, sstable2, sstable3)]
print(list(merge_sstables(iterators)))
# Output: [('a', 'v2'), ('c', 'v1'), ('d', 'v1')]
# 'b' is gone (tombstone applied); 'a' kept its newer value
```
The heap-based approach gives O(n log k) complexity where n is total entries and k is the number of SSTables being merged.
Compaction Scheduling and Throttling
Compaction competes with foreground operations for I/O bandwidth. Unthrottled compaction can starve reads and writes. Too little compaction and you accumulate debt.
```python
import time
from threading import Lock

class TokenBucketRateLimiter:
    """
    Rate limiter for compaction I/O using the token bucket algorithm.
    Allows bursting while maintaining an average rate.
    """

    def __init__(self, bytes_per_second: int, burst_bytes: int):
        self.rate = bytes_per_second
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last_update = time.monotonic()
        self.lock = Lock()

    def acquire(self, bytes_requested: int) -> float:
        """
        Request permission to perform I/O.
        Returns seconds to wait before proceeding.
        """
        with self.lock:
            now = time.monotonic()
            elapsed = now - self.last_update
            self.last_update = now
            # Refill tokens based on elapsed time
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            if self.tokens >= bytes_requested:
                self.tokens -= bytes_requested
                return 0.0
            # Not enough tokens: go into debt and tell the caller how long
            # to wait. (Resetting tokens to zero here instead would count
            # the refill that accrues during the wait twice.)
            deficit = bytes_requested - self.tokens
            self.tokens -= bytes_requested  # goes negative; refill repays the debt
            return deficit / self.rate

    def set_rate(self, bytes_per_second: int):
        """Dynamically adjust the rate based on system load."""
        with self.lock:
            self.rate = bytes_per_second

# Usage in a compaction thread
limiter = TokenBucketRateLimiter(
    bytes_per_second=50 * 1024 * 1024,  # 50 MB/s baseline
    burst_bytes=100 * 1024 * 1024,      # allow 100 MB bursts
)

def compaction_write(data: bytes):
    wait_time = limiter.acquire(len(data))
    if wait_time > 0:
        time.sleep(wait_time)
    # Perform the actual write here
    return len(data)
```
Smart systems adjust compaction rate based on pending compaction debt and current system load. When writes are heavy, you may need to temporarily increase compaction bandwidth to avoid falling behind.
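One simple policy along those lines—the thresholds and the linear ramp are illustrative, not taken from any particular system—scales the limiter's rate with the backlog:

```python
def adjust_compaction_rate(limiter, pending_bytes: int,
                           base_rate: int = 50 * 1024 * 1024,
                           soft_limit: int = 4 * 1024**3,
                           hard_limit: int = 16 * 1024**3) -> int:
    """
    Scale compaction bandwidth with pending debt: run at base_rate below
    soft_limit, ramp linearly up to 4x between the limits, and go flat
    out at 4x above hard_limit.
    """
    if pending_bytes <= soft_limit:
        rate = base_rate
    elif pending_bytes >= hard_limit:
        rate = base_rate * 4
    else:
        frac = (pending_bytes - soft_limit) / (hard_limit - soft_limit)
        rate = base_rate * (1 + 3 * frac)
    limiter.set_rate(int(rate))
    return int(rate)
```

Called periodically with the current pending-compaction byte count, this would drive a rate limiter like the token bucket above: throttled when the tree is healthy, aggressive when debt builds.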
Monitoring and Tuning
Track these metrics religiously:
Pending compaction bytes: How much work is queued. Steadily increasing means you’re falling behind.
Write amplification: Total bytes written to disk divided by bytes written by the application. Size-tiered typically stays in the single digits; leveled commonly lands in the 10-30x range.
Space amplification: Total disk used divided by logical data size. Should stay under 2x for leveled, can hit 3-4x for size-tiered.
Read amplification: SSTables touched per read. Monitor P99, not just average.
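These measurements can be sanity-checked against a rough analytical bound. For leveled compaction, every byte is rewritten roughly once per level transition, times the fanout; a pessimistic estimator (the constants and names are illustrative):

```python
import math

def leveled_write_amp(data_gb: float, l1_gb: float = 0.25, fanout: int = 10) -> int:
    """
    Pessimistic write-amplification bound for leveled compaction:
    one write for the WAL, one for the memtable flush, and up to
    `fanout` rewrites each time a byte moves down a level.
    """
    levels = max(1, math.ceil(math.log(data_gb / l1_gb, fanout)) + 1)
    return 2 + fanout * (levels - 1)

print(leveled_write_amp(500))  # 500 GB of data → 42
```

Measured write amplification usually lands well below this bound, since real compactions rarely overlap the full fanout; if your observed numbers exceed it, something is misconfigured.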
Warning signs that demand attention:
- Pending compaction growing faster than it’s being processed
- L0 SSTable count exceeding threshold (causes write stalls in leveled)
- Read latency P99 increasing without corresponding traffic increase
Common tuning parameters include compaction throughput limits, level size multipliers (for leveled), and minimum SSTable count thresholds (for size-tiered).
Conclusion
Compaction strategy selection isn’t a one-time decision—it’s a tradeoff that should match your workload:
Choose size-tiered when write throughput matters most and you can tolerate higher space usage and occasional read latency spikes. Good for write-heavy workloads with infrequent reads.
Choose leveled when read latency predictability matters and you have the I/O budget for higher write amplification. Good for read-heavy workloads and when disk space is constrained.
Choose FIFO only for pure time-series data with TTL-based expiration.
The best compaction is the one you don’t notice—it runs quietly in the background, keeping your LSM tree healthy without impacting foreground operations. Achieving that requires understanding the tradeoffs and monitoring the right metrics.