Compaction: LSM Tree Maintenance

Key Insights

  • Compaction is the garbage collector of LSM trees: without it, read cost grows without bound as queries must consult an ever-growing number of SSTables
  • The choice between size-tiered and leveled compaction represents a fundamental tradeoff: write amplification versus read/space amplification, with no universally correct answer
  • Compaction scheduling is as important as the algorithm itself; falling behind on compaction creates a debt spiral that’s difficult to recover from under load

Introduction to LSM Tree Compaction

LSM trees trade immediate write costs for deferred maintenance. Every write goes to an in-memory buffer, which periodically flushes to disk as an immutable SSTable. This design gives you excellent write throughput, but there’s a catch: you’re accumulating technical debt with every flush.

Compaction is how you pay that debt. It merges multiple SSTables into fewer, larger ones—eliminating duplicate keys, applying tombstones, and reclaiming space. Without compaction, your LSM tree becomes a write-only data structure that’s increasingly expensive to read.

This isn’t optional maintenance. It’s a core part of the system that determines whether your database remains performant or slowly suffocates under its own weight.

The Compaction Problem

Consider what happens to reads as SSTables accumulate. A point lookup might find its key in any SSTable, so you must check them newest-to-oldest until you find it (or exhaust them all). Even with bloom filters screening out SSTables that cannot contain the key, each filter has a false-positive rate, and the overhead grows with the table count.

import random

class SimulatedLSMTree:
    """Demonstrates read degradation without compaction."""
    
    def __init__(self):
        self.sstables = []  # List of dicts, each representing an SSTable
        self.bloom_filter_fp_rate = 0.01  # 1% false positive rate
    
    def flush_memtable(self, data: dict):
        """Simulate flushing a memtable to a new SSTable."""
        self.sstables.append(data.copy())
    
    def get(self, key: str) -> tuple[str | None, int]:
        """
        Look up a key, returning (value, sstables_checked).
        Searches newest to oldest, stopping when key is found.
        """
        sstables_checked = 0
        
        for sstable in reversed(self.sstables):
            # Bloom filter check - might have false positives
            key_might_exist = key in sstable or random.random() < self.bloom_filter_fp_rate
            
            if key_might_exist:
                sstables_checked += 1
                if key in sstable:
                    return sstable[key], sstables_checked
        
        return None, sstables_checked

# Simulate a write-heavy workload without compaction.
# Each flush covers a distinct key range, so lookups must consult older SSTables too.
lsm = SimulatedLSMTree()

for i in range(100):
    # Each flush creates a new SSTable with 1000 unique keys
    data = {f"key_{i * 1000 + j}": f"value_{i}_{j}" for j in range(1000)}
    lsm.flush_memtable(data)

# Measure read cost as the SSTable count grows
all_sstables = lsm.sstables
for sstable_count in [10, 25, 50, 100]:
    lsm.sstables = all_sstables[:sstable_count]
    
    total_checked = 0
    lookups = 1000
    for _ in range(lookups):
        _, checked = lsm.get(f"key_{random.randint(0, sstable_count * 1000 - 1)}")
        total_checked += checked
    
    avg_checked = total_checked / lookups
    print(f"SSTables: {sstable_count:3d} | Avg SSTables checked per read: {avg_checked:.2f}")

Representative output (exact values vary with the random draws):

SSTables:  10 | Avg SSTables checked per read: 1.04
SSTables:  25 | Avg SSTables checked per read: 1.12
SSTables:  50 | Avg SSTables checked per read: 1.25
SSTables: 100 | Avg SSTables checked per read: 1.50

Bloom filters keep the growth shallow, but it is still unbounded. Beyond read amplification, you’re also dealing with space amplification: deleted keys and overwritten values consume disk until compaction removes them.

Compaction Strategies

The strategy you choose determines how you balance three competing concerns: read amplification (how many SSTables you check), write amplification (how many times data gets rewritten), and space amplification (how much extra disk you need).
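
You can put rough numbers on this tradeoff from the tree’s shape alone. A back-of-the-envelope sketch using the standard approximations (write amplification ≈ fanout × levels for leveled, ≈ number of tiers for size-tiered; the function names and defaults are illustrative):

```python
import math

def leveled_write_amp(data_mb: float, memtable_mb: float, fanout: int = 10) -> int:
    """Leveled: a key is rewritten roughly `fanout` times at each of the
    logarithmically-many levels it eventually sinks through."""
    levels = max(1, math.ceil(math.log(data_mb / memtable_mb, fanout)))
    return levels * fanout

def tiered_write_amp(data_mb: float, memtable_mb: float, fanout: int = 4) -> int:
    """Size-tiered: a key is rewritten about once per tier it moves through."""
    return max(1, math.ceil(math.log(data_mb / memtable_mb, fanout)))

# Roughly 1 TB of data flushed through 64 MB memtables:
print(leveled_write_amp(1_000_000, 64))  # 50
print(tiered_write_amp(1_000_000, 64))   # 7
```

The gap between those two numbers is the price leveled compaction pays for its predictable reads.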

Size-Tiered Compaction

Group SSTables by size, merge when you have enough similarly-sized tables. Simple and write-efficient, but allows significant space and read amplification.

from dataclasses import dataclass
from typing import List

@dataclass
class SSTable:
    id: int
    size_mb: float
    level: int = 0

class SizeTieredCompaction:
    """
    Size-tiered compaction strategy.
    Merges SSTables when enough similar-sized tables accumulate.
    """
    
    def __init__(self, min_threshold: int = 4, max_threshold: int = 32, 
                 bucket_low: float = 0.5, bucket_high: float = 1.5):
        self.min_threshold = min_threshold  # Min SSTables to trigger compaction
        self.max_threshold = max_threshold  # Max SSTables before forced compaction
        self.bucket_low = bucket_low        # Size ratio for bucket membership
        self.bucket_high = bucket_high
    
    def get_compaction_candidates(self, sstables: List[SSTable]) -> List[SSTable]:
        """Find SSTables that should be compacted together."""
        if len(sstables) < self.min_threshold:
            return []
        
        # Group by similar size
        buckets = self._create_size_buckets(sstables)
        
        # Find buckets that exceed threshold
        for bucket in buckets:
            if len(bucket) >= self.min_threshold:
                # Return the oldest tables in this bucket
                return sorted(bucket, key=lambda s: s.id)[:self.max_threshold]
        
        # Force compaction if total count is too high
        if len(sstables) > self.max_threshold:
            return sorted(sstables, key=lambda s: s.size_mb)[:self.min_threshold]
        
        return []
    
    def _create_size_buckets(self, sstables: List[SSTable]) -> List[List[SSTable]]:
        """Group SSTables into size-based buckets."""
        if not sstables:
            return []
        
        sorted_tables = sorted(sstables, key=lambda s: s.size_mb)
        buckets = []
        current_bucket = [sorted_tables[0]]
        
        for table in sorted_tables[1:]:
            avg_size = sum(t.size_mb for t in current_bucket) / len(current_bucket)
            
            if (avg_size * self.bucket_low <= table.size_mb <= avg_size * self.bucket_high):
                current_bucket.append(table)
            else:
                buckets.append(current_bucket)
                current_bucket = [table]
        
        buckets.append(current_bucket)
        return buckets

# Example usage
compactor = SizeTieredCompaction(min_threshold=4)
sstables = [
    SSTable(1, 64), SSTable(2, 68), SSTable(3, 62), SSTable(4, 70),  # Similar sizes
    SSTable(5, 256), SSTable(6, 512),  # Larger, won't be grouped with above
]

candidates = compactor.get_compaction_candidates(sstables)
print(f"Compaction candidates: {[s.id for s in candidates]}")
# Output: Compaction candidates: [1, 2, 3, 4]

Leveled Compaction

Organize SSTables into levels with exponentially increasing size limits. Each level (except L0) maintains non-overlapping key ranges. This gives predictable read performance at the cost of higher write amplification.

FIFO/Time-Window Compaction

For time-series workloads where old data expires, simply drop entire SSTables when they age out. No merge required—just delete. This only works when your data has a natural TTL.

The Merge Process

Regardless of strategy, the actual merge is a k-way merge of sorted sequences. You’re combining multiple sorted SSTables into one (or more) new sorted SSTables.

import heapq
from typing import Iterator, Tuple, Optional
from dataclasses import dataclass, field

@dataclass(order=True)
class HeapEntry:
    """Entry in the merge heap, ordered by key, then by sequence number.
    The sequence is stored negated so that, for equal keys, the newest
    version sorts (and pops) first."""
    key: str
    sequence: int = field(compare=True)  # stored negated: smaller = newer
    value: Optional[str] = field(compare=False)
    source_id: int = field(compare=False)
    is_tombstone: bool = field(compare=False, default=False)

def merge_sstables(sstable_iterators: list[Iterator[Tuple[str, str, int, bool]]]) -> Iterator[Tuple[str, str]]:
    """
    K-way merge of sorted SSTable iterators.
    
    Each iterator yields (key, value, sequence_number, is_tombstone).
    Outputs deduplicated (key, value) pairs, applying tombstones.
    """
    heap = []
    
    # Initialize heap with first entry from each SSTable
    for source_id, iterator in enumerate(sstable_iterators):
        try:
            key, value, seq, is_tomb = next(iterator)
            heapq.heappush(heap, HeapEntry(key, -seq, value, source_id, is_tomb))
        except StopIteration:
            continue
    
    current_key = None
    current_entry = None
    
    while heap:
        entry = heapq.heappop(heap)
        
        # Advance the source iterator
        try:
            key, value, seq, is_tomb = next(sstable_iterators[entry.source_id])
            heapq.heappush(heap, HeapEntry(key, -seq, value, entry.source_id, is_tomb))
        except StopIteration:
            pass
        
        # Deduplicate: keep only the newest version of each key
        if entry.key != current_key:
            # Emit previous key if it wasn't a tombstone
            if current_entry is not None and not current_entry.is_tombstone:
                yield (current_entry.key, current_entry.value)
            
            current_key = entry.key
            current_entry = entry
        # If same key, we already have the newer version (lower sequence = newer due to negation)
    
    # Don't forget the last key
    if current_entry is not None and not current_entry.is_tombstone:
        yield (current_entry.key, current_entry.value)

# Example: merge three SSTables
def make_iterator(data):
    for item in data:
        yield item

sstable1 = [("a", "v1", 100, False), ("c", "v1", 100, False)]
sstable2 = [("a", "v2", 200, False), ("b", "v1", 150, False)]  # Newer 'a'
sstable3 = [("b", None, 300, True), ("d", "v1", 100, False)]   # Tombstone for 'b'

iterators = [make_iterator(s) for s in [sstable1, sstable2, sstable3]]
result = list(merge_sstables(iterators))
print(result)
# Output: [('a', 'v2'), ('c', 'v1'), ('d', 'v1')]
# Note: 'b' is gone (tombstone applied), 'a' has newer value

The heap-based approach gives O(n log k) complexity, where n is the total number of entries and k is the number of SSTables being merged. One caveat: dropping tombstones during the merge, as above, is only safe when the compaction includes the oldest data for that key range; otherwise the tombstone must be carried forward so it can still shadow older versions in SSTables not participating in the merge.

Compaction Scheduling and Throttling

Compaction competes with foreground operations for I/O bandwidth. Unthrottled compaction can starve reads and writes. Too little compaction and you accumulate debt.

import time
from threading import Lock

class TokenBucketRateLimiter:
    """
    Rate limiter for compaction I/O using token bucket algorithm.
    Allows bursting while maintaining average rate.
    """
    
    def __init__(self, bytes_per_second: int, burst_bytes: int):
        self.rate = bytes_per_second
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last_update = time.monotonic()
        self.lock = Lock()
    
    def acquire(self, bytes_requested: int) -> float:
        """
        Request permission to perform I/O.
        Returns seconds to wait before proceeding.
        """
        with self.lock:
            now = time.monotonic()
            elapsed = now - self.last_update
            self.last_update = now
            
            # Refill tokens based on elapsed time
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            
            if self.tokens >= bytes_requested:
                self.tokens -= bytes_requested
                return 0.0
            else:
                # Calculate wait time for enough tokens
                deficit = bytes_requested - self.tokens
                wait_time = deficit / self.rate
                self.tokens = 0
                return wait_time
    
    def set_rate(self, bytes_per_second: int):
        """Dynamically adjust rate based on system load."""
        with self.lock:
            self.rate = bytes_per_second

# Usage in compaction thread
limiter = TokenBucketRateLimiter(
    bytes_per_second=50 * 1024 * 1024,  # 50 MB/s baseline
    burst_bytes=100 * 1024 * 1024       # Allow 100 MB bursts
)

def compaction_write(data: bytes):
    wait_time = limiter.acquire(len(data))
    if wait_time > 0:
        time.sleep(wait_time)
    # Perform actual write
    return len(data)

Smart systems adjust compaction rate based on pending compaction debt and current system load. When writes are heavy, you may need to temporarily increase compaction bandwidth to avoid falling behind.
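
One simple policy on top of the TokenBucketRateLimiter above: scale the I/O budget linearly with the backlog. The thresholds and the scaling rule here are illustrative, not any engine’s defaults.

```python
def compute_compaction_rate(pending_bytes: int,
                            base_rate: int = 50 * 1024 * 1024,
                            max_rate: int = 200 * 1024 * 1024,
                            debt_threshold: int = 1024 ** 3) -> int:
    """Grow the compaction I/O budget linearly with backlog past a threshold."""
    if pending_bytes <= debt_threshold:
        return base_rate
    multiplier = 1 + (pending_bytes - debt_threshold) / debt_threshold
    return min(max_rate, int(base_rate * multiplier))

# Wire it to the limiter periodically from the compaction scheduler:
# limiter.set_rate(compute_compaction_rate(pending_compaction_bytes))
print(compute_compaction_rate(0) // (1024 * 1024))            # 50 (MB/s, no debt)
print(compute_compaction_rate(2 * 1024**3) // (1024 * 1024))  # 100 (MB/s, 2 GiB backlog)
```

The cap matters: without it, a large backlog would claim all disk bandwidth and stall the foreground writes you were trying to protect.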

Monitoring and Tuning

Track these metrics religiously:

Pending compaction bytes: How much work is queued. Steadily increasing means you’re falling behind.

Write amplification: Total bytes written to disk divided by bytes written by the application. Leveled compaction commonly runs 10-30x; size-tiered is typically much lower, often under 10x.

Space amplification: Total disk used divided by logical data size. Leveled usually stays close to 1.1x; size-tiered can transiently need 2x or more while its largest tables merge.

Read amplification: SSTables touched per read. Monitor P99, not just average.
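
The amplification ratios fall straight out of a few raw counters; a minimal sketch with illustrative counter names:

```python
from dataclasses import dataclass

@dataclass
class CompactionStats:
    app_bytes_written: int    # logical bytes the application wrote
    disk_bytes_written: int   # all disk writes, including compaction rewrites
    disk_bytes_used: int      # current on-disk footprint
    logical_data_bytes: int   # live data after dedup and tombstones
    sstables_touched: int     # summed over reads
    reads: int

    @property
    def write_amplification(self) -> float:
        return self.disk_bytes_written / self.app_bytes_written

    @property
    def space_amplification(self) -> float:
        return self.disk_bytes_used / self.logical_data_bytes

    @property
    def read_amplification(self) -> float:  # average; track P99 separately
        return self.sstables_touched / self.reads

stats = CompactionStats(
    app_bytes_written=10 * 1024**3, disk_bytes_written=150 * 1024**3,
    disk_bytes_used=18 * 1024**3, logical_data_bytes=10 * 1024**3,
    sstables_touched=5200, reads=4000,
)
print(f"WA={stats.write_amplification:.0f}x "
      f"SA={stats.space_amplification:.1f}x "
      f"RA={stats.read_amplification:.2f}")
# WA=15x SA=1.8x RA=1.30
```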

Warning signs that demand attention:

  • Pending compaction growing faster than it’s being processed
  • L0 SSTable count exceeding threshold (causes write stalls in leveled)
  • Read latency P99 increasing without corresponding traffic increase

Common tuning parameters include compaction throughput limits, level size multipliers (for leveled), and minimum SSTable count thresholds (for size-tiered).
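
As a sketch, a starting configuration might look like this. The knob names are illustrative; real engines such as RocksDB and Cassandra spell and scope these differently.

```python
# Illustrative knob names only, grouped by the concerns discussed above.
compaction_config = {
    "strategy": "leveled",                             # or "size_tiered", "fifo"
    "max_compaction_bytes_per_sec": 50 * 1024 * 1024,  # background I/O throttle
    "level_size_multiplier": 10,                       # leveled: each level 10x the last
    "l0_stall_threshold": 20,                          # stall writes past this many L0 tables
    "min_compaction_threshold": 4,                     # size-tiered: tables per bucket merge
}
```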

Conclusion

Compaction strategy selection isn’t a one-time decision—it’s a tradeoff that should match your workload:

Choose size-tiered when write throughput matters most and you can tolerate higher space usage and occasional read latency spikes. Good for write-heavy workloads with infrequent reads.

Choose leveled when read latency predictability matters and you have the I/O budget for higher write amplification. Good for read-heavy workloads and when disk space is constrained.

Choose FIFO only for pure time-series data with TTL-based expiration.

The best compaction is the one you don’t notice—it runs quietly in the background, keeping your LSM tree healthy without impacting foreground operations. Achieving that requires understanding the tradeoffs and monitoring the right metrics.
