Write-Ahead Log: Crash Recovery Technique
Key Insights
- Write-ahead logging guarantees durability by recording changes to a sequential log before modifying actual data pages, enabling complete recovery after any crash
- The WAL protocol enforces strict ordering: log records must hit disk before data pages, and commit records must be flushed before acknowledging success to clients
- Recovery follows three phases (Analysis, Redo, Undo) that reconstruct exactly which transactions committed and which must be rolled back, regardless of when the crash occurred
The Durability Problem
Databases lie to you. When your application receives a “commit successful” response, the data might only exist in volatile memory. A power failure milliseconds later could erase that transaction completely.
This isn’t a bug—it’s physics. Modern systems use memory hierarchies where writes first land in CPU caches, then DRAM, eventually reaching disk. Each layer is faster but less durable than the next. Without explicit intervention, “committed” data can vanish.
The naive solution—flushing every write directly to disk—destroys performance. Random disk I/O is catastrophically slow compared to memory operations. A database doing synchronous random writes might handle hundreds of transactions per second instead of hundreds of thousands.
Write-ahead logging solves this by converting random writes into sequential ones. Instead of immediately updating scattered data pages, the database appends changes to a sequential log file. Sequential writes are dramatically faster than random writes, even on SSDs. More importantly, the log provides a complete record for reconstructing state after crashes.
How Write-Ahead Logging Works
The core principle is simple: before modifying any data page, write a log record describing the change. If the system crashes, replay the log to recover.
Each log record needs enough information to both redo the operation (if the transaction committed but data pages weren’t flushed) and undo it (if the transaction was incomplete). This requires capturing the before and after states.
```c
typedef struct WALRecord {
    uint64_t lsn;            // Log Sequence Number - unique, monotonic
    uint32_t transaction_id;
    uint32_t page_id;
    uint16_t offset;
    uint16_t length;
    uint8_t  operation;      // INSERT, UPDATE, DELETE, COMMIT, ABORT
    // For physiological logging
    uint8_t  before_image[MAX_RECORD_SIZE];
    uint8_t  after_image[MAX_RECORD_SIZE];
    uint64_t prev_lsn;       // Previous LSN for this transaction (undo chain)
    uint32_t checksum;
} WALRecord;
```
The Log Sequence Number (LSN) is critical. It’s a monotonically increasing identifier that establishes total ordering of all changes. Every data page also stores the LSN of the most recent log record that modified it. During recovery, comparing page LSNs to log LSNs tells us exactly which changes need replaying.
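As a toy illustration of that comparison (the `Page` type and the values here are hypothetical), redo applies a log record only when the page's stored LSN is older than the record's:

```python
from dataclasses import dataclass

@dataclass
class Page:
    page_id: int
    page_lsn: int  # LSN of the last log record applied to this page

def needs_redo(page: Page, record_lsn: int) -> bool:
    # If page_lsn >= record_lsn, this change reached disk before the
    # crash and must not be applied twice.
    return page.page_lsn < record_lsn

stale_page = Page(page_id=7, page_lsn=90)   # flush hadn't caught up
fresh_page = Page(page_id=7, page_lsn=120)  # change already on disk

print(needs_redo(stale_page, 100))  # True  -> replay the record
print(needs_redo(fresh_page, 100))  # False -> skip it
```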
Sequential writes matter because disk heads (on HDDs) or write buffers (on SSDs) handle consecutive writes far more efficiently than scattered ones. A single 10MB sequential write is orders of magnitude faster than 10,000 scattered 1KB writes.
The WAL Protocol Rules
Three invariants make WAL work. Violate any of them and you risk data loss or corruption.
Rule 1: Log before data. A log record describing a change must reach stable storage before the modified data page can be written. This ensures we can always redo lost changes.
Rule 2: Flush before commit acknowledgment. The commit record must be on stable storage before telling the client the transaction succeeded. Otherwise, a crash could lose a “committed” transaction.
Rule 3: Flush log before dirty page. Before writing any dirty data page to disk, all log records affecting that page must be flushed. This ensures we can undo incomplete changes.
```python
class WALManager:
    def __init__(self, buffer_pool, disk):
        self.log_buffer = []
        self.current_lsn = 0
        self.flushed_lsn = 0
        self.buffer_pool = buffer_pool
        self.disk = disk

    def write_and_commit(self, transaction_id, page_id, old_value, new_value):
        # Step 1: Create log record for the change
        change_lsn = self._append_log_record(
            transaction_id, page_id, 'UPDATE', old_value, new_value
        )
        # Step 2: Modify the page in memory (buffer pool)
        page = self.buffer_pool.get_page(page_id)
        page.data = new_value
        page.page_lsn = change_lsn  # Track which log record last modified this
        page.dirty = True
        # Step 3: Create commit record
        commit_lsn = self._append_log_record(
            transaction_id, None, 'COMMIT', None, None
        )
        # Step 4: Flush log through commit record BEFORE acknowledging
        self._flush_log_to(commit_lsn)
        # Step 5: NOW we can tell the client it's committed
        return "COMMIT OK"

    def flush_dirty_page(self, page_id):
        page = self.buffer_pool.get_page(page_id)
        # CRITICAL: Ensure all log records for this page are flushed first
        if page.page_lsn > self.flushed_lsn:
            self._flush_log_to(page.page_lsn)
        # Now safe to write the data page
        self.disk.write_page(page_id, page.data)
        page.dirty = False
```
The ordering is non-negotiable. If you flush a dirty page before its log records, a crash could leave you with a partially modified page and no way to fix it.
Checkpointing: Bounding Recovery Time
Without intervention, the log grows forever. Recovery would need to replay the entire history from the beginning of time. Checkpoints solve this by creating known-good starting points.
A checkpoint records: which transactions are active, which pages are dirty, and the current LSN. After a checkpoint, recovery only needs to process log records from that point forward.
Sharp checkpoints stop all activity, flush all dirty pages, then record the checkpoint. Simple but terrible for availability.
Fuzzy checkpoints are more practical. They record the current state without stopping transactions, accepting that some dirty pages won’t be flushed yet. Recovery handles this by starting from the checkpoint but potentially redoing work.
```python
class CheckpointManager:
    def fuzzy_checkpoint(self, wal_manager, buffer_pool):
        # Record checkpoint start
        checkpoint_lsn = wal_manager.current_lsn
        # Snapshot current state (without stopping transactions)
        active_transactions = self._get_active_transaction_ids()
        dirty_pages = buffer_pool.get_dirty_pages()
        # Find the oldest LSN we might need for recovery
        # (oldest of: first log record of active transactions, or
        # oldest unflushed log record for any dirty page)
        oldest_needed_lsn = self._compute_oldest_needed_lsn(
            active_transactions, dirty_pages
        )
        # Write checkpoint record to log
        checkpoint_record = {
            'type': 'CHECKPOINT',
            'lsn': checkpoint_lsn,
            'active_txns': active_transactions,
            'dirty_pages': [(p.page_id, p.page_lsn) for p in dirty_pages],
            'oldest_needed_lsn': oldest_needed_lsn,
        }
        wal_manager.write_checkpoint_record(checkpoint_record)
        # Flush log including checkpoint record
        wal_manager.flush_log()
        # Background: gradually flush dirty pages
        # (not blocking - transactions continue)
        self._schedule_background_flush(dirty_pages)
        # Log records before oldest_needed_lsn can now be archived/deleted
        return oldest_needed_lsn
```
The checkpoint’s dirty page list tells recovery which pages might have unflushed changes. The active transaction list identifies work that might need undoing.
Crash Recovery: The ARIES Algorithm
ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) is the gold standard for WAL recovery. It proceeds in three phases.
Analysis phase: Scan forward from the last checkpoint. Reconstruct which transactions were active at crash time and which pages were dirty. Build the transaction table and dirty page table.
Redo phase: Scan forward again, replaying all logged changes. This restores the database to its exact pre-crash state, including uncommitted changes. We redo everything because we don’t know which pages actually made it to disk.
Undo phase: Roll back all transactions that were active at crash time. Scan backward through each incomplete transaction’s log records, applying compensation (undo) operations.
```python
class ARIESRecovery:
    def recover(self, wal, last_checkpoint):
        # Phase 1: Analysis
        active_txns = set(last_checkpoint.active_transactions)
        dirty_pages = dict(last_checkpoint.dirty_pages)  # page_id -> recovery_lsn
        for record in wal.scan_forward_from(last_checkpoint.lsn):
            if record.type == 'BEGIN':
                active_txns.add(record.transaction_id)
            elif record.type in ('COMMIT', 'ABORT'):
                active_txns.discard(record.transaction_id)
            elif record.type in ('UPDATE', 'INSERT', 'DELETE'):
                if record.page_id not in dirty_pages:
                    dirty_pages[record.page_id] = record.lsn

        # Phase 2: Redo (repeat history)
        redo_lsn = min(dirty_pages.values()) if dirty_pages else last_checkpoint.lsn
        for record in wal.scan_forward_from(redo_lsn):
            if record.type not in ('UPDATE', 'INSERT', 'DELETE'):
                continue
            page = self.buffer_pool.get_page(record.page_id)
            # Only redo if page LSN < log record LSN
            # (page might already have this change)
            if page.page_lsn < record.lsn:
                self._apply_redo(page, record)
                page.page_lsn = record.lsn

        # Phase 3: Undo (rollback losers)
        for txn_id in active_txns:
            self._rollback_transaction(txn_id, wal)

    def _rollback_transaction(self, txn_id, wal):
        # Follow the prev_lsn chain backward
        current_lsn = wal.get_last_lsn_for_txn(txn_id)
        while current_lsn is not None:
            record = wal.read_record(current_lsn)
            if record.type in ('UPDATE', 'INSERT', 'DELETE'):
                # Apply compensation (undo) and log it
                self._apply_undo(record)
                wal.write_clr(record)  # Compensation Log Record
            current_lsn = record.prev_lsn
```
The redo phase uses LSN comparison to avoid reapplying changes that already reached disk. The undo phase logs compensation records so that a crash during recovery doesn’t re-undo already-undone work.
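One way to sketch a compensation log record (the field names here are illustrative, not a specific engine's format): each CLR carries an `undo_next_lsn` pointing at the `prev_lsn` of the record it compensated, so a crash during rollback resumes past already-undone work.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UpdateRecord:
    lsn: int
    transaction_id: int
    page_id: int
    prev_lsn: Optional[int]  # undo chain for this transaction

@dataclass
class CLR:
    lsn: int
    transaction_id: int
    page_id: int
    undone_lsn: int               # which update this CLR compensates
    undo_next_lsn: Optional[int]  # where to resume undo after a re-crash

def make_clr(clr_lsn: int, undone: UpdateRecord) -> CLR:
    # Pointing undo_next_lsn at the undone record's prev_lsn means
    # recovery restarted mid-rollback skips this record instead of
    # undoing it a second time.
    return CLR(
        lsn=clr_lsn,
        transaction_id=undone.transaction_id,
        page_id=undone.page_id,
        undone_lsn=undone.lsn,
        undo_next_lsn=undone.prev_lsn,
    )
```

CLRs themselves are redo-only: recovery never undoes a CLR, it just follows `undo_next_lsn` to find the next record that still needs compensating.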
Implementation Considerations
Group commit batches multiple transactions into a single fsync, dramatically improving throughput:
```python
import threading
import time

class CommitFuture:
    """Completed once the commit's log record is durable."""
    def __init__(self):
        self._done = threading.Event()
    def complete(self):
        self._done.set()
    def wait(self):
        self._done.wait()

class GroupCommitManager:
    def __init__(self, wal, flush_interval_ms=10, max_batch_size=100):
        self.pending_commits = []
        self.wal = wal
        self.lock = threading.Lock()
        self.flush_interval = flush_interval_ms / 1000.0
        self.max_batch = max_batch_size
        # Timer thread flushes partial batches so a lone commit never
        # waits longer than flush_interval for its fsync.
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def request_commit(self, transaction_id, commit_lsn):
        future = CommitFuture()
        with self.lock:
            self.pending_commits.append((commit_lsn, future))
            if len(self.pending_commits) >= self.max_batch:
                self._flush_batch()
        return future  # Caller waits on this

    def _flush_loop(self):
        while True:
            time.sleep(self.flush_interval)
            with self.lock:
                self._flush_batch()

    def _flush_batch(self):
        # Caller must hold self.lock
        if not self.pending_commits:
            return
        # Single fsync covers all pending commits
        max_lsn = max(lsn for lsn, _ in self.pending_commits)
        self.wal.flush_to(max_lsn)
        # Wake up all waiting transactions
        for _, future in self.pending_commits:
            future.complete()
        self.pending_commits.clear()
```
Buffer management must coordinate with WAL. The buffer pool can’t evict a dirty page until its log records are flushed. This creates backpressure when the log writer falls behind.
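A sketch of that coordination (class and method names are illustrative, not from any particular engine): eviction of a dirty page first forces the log forward to the page's LSN, and that forced flush is where the backpressure appears.

```python
from dataclasses import dataclass

@dataclass
class Page:
    page_id: int
    page_lsn: int
    dirty: bool = False

class FakeWAL:
    """Stand-in for a log writer; tracks the highest durable LSN."""
    def __init__(self):
        self.flushed_lsn = 0
    def flush_to(self, lsn):
        self.flushed_lsn = max(self.flushed_lsn, lsn)

class BufferPool:
    def __init__(self, wal):
        self.wal = wal
        self.pages = {}       # page_id -> Page
        self.disk_writes = [] # page_ids written out, for illustration

    def evict(self, page_id):
        page = self.pages.pop(page_id)
        if page.dirty:
            # WAL rule: log records covering this page must be durable
            # before its bytes replace the on-disk copy. If the log
            # writer is behind, this call blocks -- the backpressure.
            if page.page_lsn > self.wal.flushed_lsn:
                self.wal.flush_to(page.page_lsn)
            self.disk_writes.append(page.page_id)
            page.dirty = False
```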
Log rotation archives old log files once checkpoints guarantee they’re no longer needed for recovery. These archived logs remain valuable for point-in-time recovery and replication.
WAL in Practice
PostgreSQL’s WAL is a textbook implementation. It uses 16MB segment files, supports synchronous and asynchronous replication, and provides pg_waldump for debugging. The WAL is also the foundation for logical replication and point-in-time recovery.
SQLite’s WAL mode offers an alternative to its default rollback journal. WAL mode enables concurrent readers during writes and generally provides better performance for write-heavy workloads.
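Switching modes is a one-line pragma; a quick sketch using Python's built-in sqlite3 module:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)

# Switch from the default rollback journal to WAL mode.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # wal

conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
conn.execute("INSERT INTO kv VALUES ('a', '1')")
conn.commit()

# Committed changes land in demo.db-wal first; a later checkpoint
# folds them back into the main database file.
wal_file_exists = os.path.exists(path + "-wal")
conn.close()
```

The mode is persistent: once set, subsequent connections to the same file open in WAL mode automatically.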
Distributed consensus protocols like Raft use WAL principles at a different level. Each node maintains a replicated log, and entries are only considered committed when a majority of nodes have persisted them. The log is the source of truth; state machines are derived from it.
WAL alone doesn’t provide complete durability. You still need replication for hardware failures and backups for catastrophic events. But WAL is the foundation—without it, even single-node durability is impossible to achieve efficiently.