Database Write-Ahead Log: Crash Recovery Mechanism

Key Insights

  • Write-Ahead Logging (WAL) guarantees durability by forcing log records to disk before data pages, enabling databases to recover committed transactions after crashes while maintaining performance through deferred writes
  • The recovery process consists of three phases—analysis, redo, and undo—that rebuild database state by replaying committed operations and rolling back incomplete transactions from checkpoint to crash
  • WAL configuration directly impacts your application’s durability guarantees and performance: synchronous commits provide maximum safety at the cost of throughput, while group commits and asynchronous modes trade recovery guarantees for speed

The Durability Problem

When your application commits a transaction, you expect that data to survive a crash. This is the “D” in ACID—durability. But here’s the challenge: writing every change directly to disk is prohibitively slow. Random disk I/O can handle only hundreds of operations per second, while modern applications need thousands of transactions per second.

The naive approach—flush every modified data page to disk on commit—creates a performance bottleneck. Database pages are typically 8KB or 16KB, scattered across the disk. Modifying a single row might require writing an entire page to a random location. This doesn’t scale.

Write-Ahead Logging solves this by transforming random writes into sequential ones. Instead of writing data pages, databases write small log records sequentially to a dedicated log file. This simple architectural decision enables both durability and performance.

Write-Ahead Logging Fundamentals

The WAL protocol has one core rule: log records must reach stable storage before the corresponding data pages. This means before marking a transaction as committed, the database must flush all its log records to disk. The actual data pages can be written lazily, later.

Every log record contains a Log Sequence Number (LSN)—a monotonically increasing identifier that establishes ordering. Each data page also tracks the LSN of the last log record that modified it. This relationship enables recovery.

Here’s a basic log record structure:

type LogRecord struct {
    LSN           uint64
    TransactionID uint64
    Type          OperationType
    TableID       uint32
    PageID        uint32
    Offset        uint16
    BeforeImage   []byte
    AfterImage    []byte
    PrevLSN       uint64  // Previous LSN for this transaction
}

type OperationType int

const (
    Insert OperationType = iota
    Update
    Delete
    Commit
    Abort
    Checkpoint
)

This structure captures everything needed for recovery. The BeforeImage and AfterImage fields contain the old and new values—enabling both redo (reapply changes) and undo (rollback changes).

WAL enables a “no-force, steal” buffer management policy. “No-force” means dirty pages don’t need to be flushed at commit time. “Steal” means the buffer manager can evict dirty pages from uncommitted transactions to disk. These policies improve performance but require WAL for correctness.

WAL Implementation Mechanics

When your application executes an UPDATE statement, here’s what happens:

  1. The database modifies the page in the buffer pool (in-memory cache)
  2. It generates a log record describing the change
  3. It appends the log record to an in-memory log buffer
  4. On commit, it flushes the log buffer to disk with fsync()
  5. Only after the log is safely on disk does it acknowledge the commit

The actual data page might not be written to disk for minutes or hours. It’s marked as “dirty” in the buffer pool and eventually flushed during checkpointing or when the buffer pool needs space.

Here’s pseudocode showing the WAL write flow:

import os

LOG_BUFFER_SIZE = 1024  # records to buffer before forcing a flush

class WALManager:
    def __init__(self):
        self.log_buffer = []
        self.current_lsn = 0
        self.log_file = open("wal.log", "ab")
        self.last_checkpoint_lsn = 0
    
    def write_log_record(self, record):
        # Assign LSN and add to buffer
        self.current_lsn += 1
        record.lsn = self.current_lsn
        self.log_buffer.append(record)
        
        # Flush if buffer is full
        if len(self.log_buffer) >= LOG_BUFFER_SIZE:
            self.flush_log()
        
        return record.lsn
    
    def flush_log(self):
        if not self.log_buffer:
            return
        
        # Write all buffered records
        for record in self.log_buffer:
            serialized = self.serialize(record)
            self.log_file.write(serialized)
        
        # Force to stable storage
        self.log_file.flush()
        os.fsync(self.log_file.fileno())
        
        self.log_buffer.clear()
    
    def commit_transaction(self, txn_id):
        # Write commit record
        commit_record = LogRecord(
            txn_id=txn_id,
            type=OperationType.COMMIT
        )
        lsn = self.write_log_record(commit_record)
        
        # Must flush before returning success
        self.flush_log()
        
        return lsn

The fsync() call is critical—it forces the operating system to write buffered data to physical storage. Without it, data might sit in the OS page cache and be lost on crash.

Checkpointing periodically flushes dirty pages to reduce recovery time. During a checkpoint, the database writes all dirty pages to disk and records a checkpoint log record containing the list of active transactions. This establishes a recovery starting point—the system only needs to replay logs from the last checkpoint forward.
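The checkpoint step can be sketched in the same style as the pseudocode above. This is a toy model, not a real database API: FakeBufferPool and take_checkpoint are invented names, and "flushing" a page here just clears its dirty flag.

```python
import os
import tempfile

class FakeBufferPool:
    """Toy buffer pool: tracks dirty page IDs; 'flushing' just clears them."""
    def __init__(self):
        self.dirty = set()

    def flush_all(self):
        flushed = sorted(self.dirty)
        self.dirty.clear()
        return flushed

def take_checkpoint(log_file, buffer_pool, active_txns, next_lsn):
    """Flush all dirty pages, then durably log a checkpoint record that
    names the still-active transactions. Recovery scans forward from here."""
    flushed = buffer_pool.flush_all()
    record = f"CHECKPOINT lsn={next_lsn} active={sorted(active_txns)}\n"
    log_file.write(record.encode())
    log_file.flush()
    os.fsync(log_file.fileno())
    return flushed

pool = FakeBufferPool()
pool.dirty.update({3, 1, 7})
with open(os.path.join(tempfile.mkdtemp(), "wal.log"), "ab") as log:
    flushed = take_checkpoint(log, pool, active_txns={42}, next_lsn=100)
# All dirty pages are flushed before the checkpoint record hits the log
```

Note the ordering: pages first, checkpoint record last. If the crash happens mid-checkpoint, the previous checkpoint record is still valid and recovery simply starts from there.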

Crash Recovery Process

When a database restarts after a crash, it performs three-phase recovery:

Phase 1: Analysis - Scan the log from the last checkpoint forward to identify which pages were dirty and which transactions were active at crash time.

Phase 2: Redo - Replay all logged operations from the checkpoint forward, reapplying changes to restore the database to its state at the crash moment. This includes operations from uncommitted transactions.

Phase 3: Undo - Roll back all transactions that were still uncommitted at crash time by applying their before-images in reverse order, writing compensation log records so that a crash during undo is itself recoverable.

Here’s a simplified recovery implementation:

class RecoveryManager:
    def recover(self):
        checkpoint_lsn = self.find_last_checkpoint()
        dirty_pages = set()
        active_txns = {}
        
        # Phase 1: Analysis
        for record in self.scan_log(from_lsn=checkpoint_lsn):
            if record.type in (INSERT, UPDATE, DELETE):
                dirty_pages.add(record.page_id)
                active_txns[record.txn_id] = record.lsn
            elif record.type in (COMMIT, ABORT):
                active_txns.pop(record.txn_id, None)
        
        # Phase 2: Redo
        for record in self.scan_log(from_lsn=checkpoint_lsn):
            if record.type in (INSERT, UPDATE, DELETE):
                page = self.buffer_pool.get_page(record.page_id)
                
                # Only redo if page LSN is less than record LSN
                if page.lsn < record.lsn:
                    self.apply_after_image(page, record)
                    page.lsn = record.lsn
        
        # Phase 3: Undo
        for txn_id in active_txns:
            self.undo_transaction(txn_id)
    
    def undo_transaction(self, txn_id):
        # Scan backwards through this transaction's records
        records = self.get_transaction_records(txn_id, reverse=True)
        
        for record in records:
            if record.type in (INSERT, UPDATE, DELETE):
                page = self.buffer_pool.get_page(record.page_id)
                self.apply_before_image(page, record)
                
                # Write compensation log record (CLR)
                clr = self.create_clr(record)
                self.wal.write_log_record(clr)

The redo phase is idempotent—replaying it multiple times produces the same result. This is crucial because a crash during recovery might require restarting recovery itself.
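The page-LSN guard is what provides that idempotence. Here is a toy illustration (the Page class and (lsn, after_image) tuples are invented for the example):

```python
class Page:
    def __init__(self):
        self.lsn = 0
        self.value = None

def redo(page, records):
    """Reapply after-images, skipping records the page already reflects."""
    for lsn, after_image in records:
        if page.lsn < lsn:   # the guard that makes redo idempotent
            page.value = after_image
            page.lsn = lsn

page = Page()
log = [(1, "a"), (2, "b"), (3, "c")]
redo(page, log)   # first recovery pass
redo(page, log)   # crash during recovery -> run recovery again
# Second pass is a no-op: every record's LSN <= page.lsn
```

Running redo twice leaves the page identical to running it once, which is exactly why an interrupted recovery can simply start over.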

Optimizations and Trade-offs

Real databases implement several WAL optimizations:

Group Commit - Instead of flushing the log for every transaction, batch multiple commits into a single fsync(). This dramatically improves throughput but adds milliseconds of latency.

Asynchronous Commit - Return success before the log hits disk. This maximizes performance but risks losing recent commits on crash.

Log Compression - Store only logical operations rather than full before/after images for common operations.
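The group-commit idea can be sketched as a log that parks commit records and acknowledges a whole batch with a single fsync(). This is an illustrative model with invented names (GroupCommitLog, flush_group), not any database's actual implementation:

```python
import os
import tempfile

class GroupCommitLog:
    """Illustrative sketch: batch many commit records into one fsync()."""
    def __init__(self, path):
        self.log_file = open(path, "ab")
        self.pending = []     # commit records waiting for the next flush
        self.fsync_count = 0  # physical syncs actually performed

    def commit(self, txn_id):
        # The transaction parks here instead of forcing its own fsync.
        self.pending.append(f"COMMIT txn={txn_id}\n".encode())

    def flush_group(self):
        # One sequential write + one fsync acknowledges every waiter at once.
        for rec in self.pending:
            self.log_file.write(rec)
        self.log_file.flush()
        os.fsync(self.log_file.fileno())
        self.fsync_count += 1
        acked, self.pending = len(self.pending), []
        return acked

log = GroupCommitLog(os.path.join(tempfile.mkdtemp(), "wal.log"))
for txn in (1, 2, 3):
    log.commit(txn)
acked = log.flush_group()
# Three commits become durable with a single disk sync
```

A real implementation would run flush_group on a background thread and block each committer until its record's LSN has been synced, but the throughput win is the same: the fsync() cost is amortized across the whole batch.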

PostgreSQL’s WAL configuration exposes these trade-offs:

-- Synchronous commit: maximum durability, lower throughput
ALTER SYSTEM SET synchronous_commit = 'on';

-- Asynchronous commit: risk losing recent commits, 10x+ throughput
ALTER SYSTEM SET synchronous_commit = 'off';

-- Group commit delay: batch commits for better throughput
ALTER SYSTEM SET commit_delay = 10; -- microseconds
ALTER SYSTEM SET commit_siblings = 5; -- minimum concurrent transactions

-- Checkpoint frequency: recovery time vs. I/O overhead
ALTER SYSTEM SET checkpoint_timeout = '5min';
ALTER SYSTEM SET max_wal_size = '1GB';

MySQL/InnoDB provides similar controls:

[mysqld]
# Flush log to disk on every commit (safest)
innodb_flush_log_at_trx_commit = 1

# Flush every second (faster, less durable)
innodb_flush_log_at_trx_commit = 2

# Control log file size
innodb_log_file_size = 512M
innodb_log_files_in_group = 2

The innodb_flush_log_at_trx_commit setting is critical. Setting it to 1 guarantees durability but limits throughput to your disk’s fsync() rate. Setting it to 2 can lose up to one second of transactions on OS crash.

Real-World Implementations

PostgreSQL uses a sophisticated WAL implementation with full-page writes after checkpoints to handle torn pages. It stores WAL in 16MB segment files and supports streaming replication by shipping WAL to replicas.

MySQL/InnoDB implements redo logs (similar to WAL) and separate undo logs. Its doublewrite buffer provides additional protection against partial page writes. InnoDB uses a circular log file structure that requires periodic checkpointing.

SQLite offers multiple journaling modes. WAL mode (PRAGMA journal_mode=WAL) provides better concurrency than its default rollback journal, allowing readers to access the database while a writer is active. SQLite’s WAL is simpler than enterprise databases but follows the same principles.
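You can observe SQLite's mode switch directly from Python's built-in sqlite3 module; the pragma returns the journal mode actually in effect:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)

# A fresh file database starts in the default rollback-journal mode.
mode = conn.execute("PRAGMA journal_mode").fetchone()[0]

# Switch to write-ahead logging; SQLite echoes back the new mode.
wal_mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]

conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")
conn.commit()
conn.close()
# mode == "delete" before the switch, wal_mode == "wal" after
```

The WAL mode setting is persistent: it is stored in the database file, so subsequent connections open in WAL mode automatically.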

Each implementation makes different trade-offs. PostgreSQL prioritizes correctness and features. MySQL/InnoDB optimizes for high-concurrency workloads. SQLite minimizes complexity for embedded use cases.

When and Why WAL Matters

Understanding WAL helps you make informed architectural decisions:

Performance Tuning - If your application can tolerate losing a few seconds of data, asynchronous commits can dramatically improve throughput. For financial systems, synchronous commits are non-negotiable.

Backup Strategies - WAL enables point-in-time recovery. By archiving WAL segments, you can restore the database to any moment between backups.
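A typical PostgreSQL archiving setup, in the same ALTER SYSTEM style as above (the archive destination path is just a placeholder for your own storage):

-- Archive each completed WAL segment for point-in-time recovery
ALTER SYSTEM SET archive_mode = 'on';
ALTER SYSTEM SET archive_command = 'cp %p /mnt/wal_archive/%f';

PostgreSQL substitutes %p with the segment's path and %f with its file name; restoring to a point in time then replays archived segments on top of a base backup.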

Replication - Streaming replication works by shipping WAL to replicas. Understanding WAL helps you configure replication lag and failover behavior.

Capacity Planning - WAL generation rate determines disk I/O requirements. Write-heavy workloads need fast storage for WAL files.

The durability guarantees you configure directly impact your recovery point objective (RPO). Synchronous commits give you RPO=0 (no data loss). Asynchronous commits might give you RPO=1 second. Choose based on your business requirements, not default settings.

WAL is the foundation of database durability. It’s not just an implementation detail—it’s a fundamental architectural pattern that transforms the impossibly slow problem of random disk writes into the tractable problem of sequential logging.
