Database Write-Ahead Log: Crash Recovery Mechanism
Key Insights
- Write-Ahead Logging (WAL) guarantees durability by forcing log records to disk before data pages, enabling databases to recover committed transactions after crashes while maintaining performance through deferred writes
- The recovery process consists of three phases—analysis, redo, and undo—that rebuild database state by replaying committed operations and rolling back incomplete transactions from checkpoint to crash
- WAL configuration directly impacts your application’s durability guarantees and performance: synchronous commits provide maximum safety at the cost of throughput, while group commits and asynchronous modes trade recovery guarantees for speed
The Durability Problem
When your application commits a transaction, you expect that data to survive a crash. This is the “D” in ACID—durability. But here’s the challenge: writing every change directly to disk is prohibitively slow. Random disk I/O can handle only hundreds of operations per second, while modern applications need thousands of transactions per second.
The naive approach—flush every modified data page to disk on commit—creates a performance bottleneck. Database pages are typically 8KB or 16KB, scattered across the disk. Modifying a single row might require writing an entire page to a random location. This doesn’t scale.
Write-Ahead Logging solves this by transforming random writes into sequential ones. Instead of writing data pages, databases write small log records sequentially to a dedicated log file. This simple architectural decision enables both durability and performance.
Write-Ahead Logging Fundamentals
The WAL protocol has one core rule: log records must reach stable storage before the corresponding data pages. This means before marking a transaction as committed, the database must flush all its log records to disk. The actual data pages can be written lazily, later.
Every log record contains a Log Sequence Number (LSN)—a monotonically increasing identifier that establishes ordering. Each data page also tracks the LSN of the last log record that modified it. This relationship enables recovery.
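To make the LSN-to-page relationship concrete, here is a minimal sketch of the comparison that drives recovery. The `Page` and `LogRecord` shapes are hypothetical, not taken from any particular engine:

```python
from dataclasses import dataclass

@dataclass
class Page:
    page_id: int
    lsn: int       # LSN of the last log record applied to this page
    data: bytes

@dataclass
class LogRecord:
    lsn: int
    page_id: int
    after_image: bytes

def needs_redo(page: Page, record: LogRecord) -> bool:
    # If the page's LSN is >= the record's LSN, the change already
    # reached disk before the crash and must not be applied again.
    return page.lsn < record.lsn

# A page last written at LSN 17 does not need record 15 replayed,
# but does need record 18
page = Page(page_id=4, lsn=17, data=b"")
assert not needs_redo(page, LogRecord(lsn=15, page_id=4, after_image=b"x"))
assert needs_redo(page, LogRecord(lsn=18, page_id=4, after_image=b"y"))
```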
Here’s a basic log record structure:
type LogRecord struct {
    LSN           uint64
    TransactionID uint64
    Type          OperationType
    TableID       uint32
    PageID        uint32
    Offset        uint16
    BeforeImage   []byte
    AfterImage    []byte
    PrevLSN       uint64 // Previous LSN for this transaction
}

type OperationType int

const (
    Insert OperationType = iota
    Update
    Delete
    Commit
    Abort
    Checkpoint
)
This structure captures everything needed for recovery. The BeforeImage and AfterImage fields contain the old and new values—enabling both redo (reapply changes) and undo (rollback changes).
WAL enables a “no-force, steal” buffer management policy. “No-force” means dirty pages don’t need to be flushed at commit time. “Steal” means the buffer manager can evict dirty pages from uncommitted transactions to disk. These policies improve performance but require WAL for correctness.
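The "steal" policy is only safe because of the WAL rule: before a dirty page can be written out, every log record up to that page's LSN must already be durable. A minimal sketch of that check, with hypothetical buffer-manager names:

```python
class BufferManager:
    """Illustrative sketch of the WAL rule during page eviction."""

    def __init__(self, wal_flushed_lsn=0):
        # Highest LSN known to be durable in the log
        self.wal_flushed_lsn = wal_flushed_lsn

    def flush_wal_through(self, lsn):
        # Stand-in for flushing and fsync()ing the log up to `lsn`
        self.wal_flushed_lsn = max(self.wal_flushed_lsn, lsn)

    def evict(self, page):
        # WAL rule: the log must cover the page before the page hits
        # disk, even if the transaction that dirtied it has not yet
        # committed ("steal")
        if page.lsn > self.wal_flushed_lsn:
            self.flush_wal_through(page.lsn)
        self.write_page_to_disk(page)

    def write_page_to_disk(self, page):
        pass  # placeholder for the actual page write
```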
WAL Implementation Mechanics
When your application executes an UPDATE statement, here’s what happens:
- The database modifies the page in the buffer pool (in-memory cache)
- It generates a log record describing the change
- It appends the log record to an in-memory log buffer
- On commit, it flushes the log buffer to disk with fsync()
- Only after the log is safely on disk does it acknowledge the commit
The actual data page might not be written to disk for minutes or hours. It’s marked as “dirty” in the buffer pool and eventually flushed during checkpointing or when the buffer pool needs space.
Here’s pseudocode showing the WAL write flow:
import os

LOG_BUFFER_SIZE = 256  # records to buffer before forcing a flush

class WALManager:
    def __init__(self):
        self.log_buffer = []
        self.current_lsn = 0
        self.log_file = open("wal.log", "ab")
        self.last_checkpoint_lsn = 0

    def write_log_record(self, record):
        # Assign LSN and add to buffer
        self.current_lsn += 1
        record.lsn = self.current_lsn
        self.log_buffer.append(record)
        # Flush if buffer is full
        if len(self.log_buffer) >= LOG_BUFFER_SIZE:
            self.flush_log()
        return record.lsn

    def flush_log(self):
        if not self.log_buffer:
            return
        # Write all buffered records
        for record in self.log_buffer:
            serialized = self.serialize(record)
            self.log_file.write(serialized)
        # Force to stable storage
        self.log_file.flush()
        os.fsync(self.log_file.fileno())
        self.log_buffer.clear()

    def commit_transaction(self, txn_id):
        # Write commit record
        commit_record = LogRecord(
            txn_id=txn_id,
            type=OperationType.COMMIT,
        )
        lsn = self.write_log_record(commit_record)
        # Must flush before returning success
        self.flush_log()
        return lsn
The fsync() call is critical—it forces the operating system to write buffered data to physical storage. Without it, data might sit in the OS page cache and be lost on crash.
Checkpointing periodically flushes dirty pages to reduce recovery time. During a checkpoint, the database writes all dirty pages to disk and records a checkpoint log record containing the list of active transactions. This establishes a recovery starting point—the system only needs to replay logs from the last checkpoint forward.
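A simplified checkpoint can be sketched as the three steps above: flush the log, flush dirty pages, then log a checkpoint record. The `StubWAL` and `StubBufferPool` classes here are tiny illustrative stand-ins, not real database components (real systems such as PostgreSQL also spread the page writes out over time):

```python
class StubWAL:
    """Tiny in-memory stand-in for a WAL manager (illustrative only)."""
    def __init__(self):
        self.records, self.flushed = [], 0
    def append(self, rec):
        self.records.append(rec)
        return len(self.records)          # record index doubles as LSN
    def flush_log(self):
        self.flushed = len(self.records)  # pretend everything is on disk

class StubBufferPool:
    def __init__(self):
        self.dirty, self.on_disk = [], []
    def write_to_disk(self, page):
        self.on_disk.append(page)
        self.dirty.remove(page)

def checkpoint(wal, pool, active_txns):
    # 1. Log records first (WAL rule), then the data pages they describe
    wal.flush_log()
    for page in list(pool.dirty):
        pool.write_to_disk(page)
    # 2. Log a checkpoint record naming the in-flight transactions, so
    #    recovery knows where to start scanning and what may need undo
    lsn = wal.append({"type": "CHECKPOINT", "active_txns": sorted(active_txns)})
    wal.flush_log()
    return lsn
```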
Crash Recovery Process
When a database restarts after a crash, it performs three-phase recovery:
Phase 1: Analysis - Scan the log from the last checkpoint forward to identify which pages were dirty and which transactions were active at crash time.
Phase 2: Redo - Replay all logged operations from the checkpoint forward, reapplying changes to restore the database to its state at the crash moment. This includes operations from uncommitted transactions.
Phase 3: Undo - Roll back all uncommitted transactions by applying their before-images in reverse order.
Here’s a simplified recovery implementation:
class RecoveryManager:
    def recover(self):
        checkpoint_lsn = self.find_last_checkpoint()
        dirty_pages = set()  # full ARIES uses this to scope redo; unused in this sketch
        active_txns = {}

        # Phase 1: Analysis
        for record in self.scan_log(from_lsn=checkpoint_lsn):
            if record.type in (INSERT, UPDATE, DELETE):
                dirty_pages.add(record.page_id)
                active_txns[record.txn_id] = record.lsn
            elif record.type in (COMMIT, ABORT):
                active_txns.pop(record.txn_id, None)

        # Phase 2: Redo
        for record in self.scan_log(from_lsn=checkpoint_lsn):
            if record.type in (INSERT, UPDATE, DELETE):
                page = self.buffer_pool.get_page(record.page_id)
                # Only redo if the page LSN is less than the record LSN
                if page.lsn < record.lsn:
                    self.apply_after_image(page, record)
                    page.lsn = record.lsn

        # Phase 3: Undo
        for txn_id in active_txns:
            self.undo_transaction(txn_id)

    def undo_transaction(self, txn_id):
        # Scan backwards through this transaction's records
        records = self.get_transaction_records(txn_id, reverse=True)
        for record in records:
            if record.type in (INSERT, UPDATE, DELETE):
                page = self.buffer_pool.get_page(record.page_id)
                self.apply_before_image(page, record)
                # Write a compensation log record (CLR)
                clr = self.create_clr(record)
                self.wal.write_log_record(clr)
The redo phase is idempotent—replaying it multiple times produces the same result. This is crucial because a crash during recovery might require restarting recovery itself.
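The LSN comparison is exactly what makes replay idempotent. A toy demonstration with assumed dict-based structures, not real database code:

```python
def redo(page, records):
    """Apply after-images to a page, skipping records it already reflects."""
    for rec in records:
        if page["lsn"] < rec["lsn"]:   # skip already-applied records
            page["value"] = rec["after"]
            page["lsn"] = rec["lsn"]
    return page

log = [{"lsn": 1, "after": "A"}, {"lsn": 2, "after": "B"}]
page = {"lsn": 0, "value": None}

once = redo(dict(page), log)
twice = redo(redo(dict(page), log), log)   # replaying again changes nothing
assert once == twice == {"lsn": 2, "value": "B"}
```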
Optimizations and Trade-offs
Real databases implement several WAL optimizations:
Group Commit - Instead of flushing the log for every transaction, batch multiple commits into a single fsync(). This dramatically improves throughput but adds milliseconds of latency.
Asynchronous Commit - Return success before the log hits disk. This maximizes performance but risks losing recent commits on crash.
Log Compression - Store only logical operations rather than full before/after images for common operations.
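Group commit can be sketched as a leader/follower pattern: committers record the LSN they need durable, one "leader" performs the flush for the whole batch, and the rest wait. This is an illustrative threading sketch, not any particular engine's implementation:

```python
import threading

class GroupCommitLog:
    """Amortize one fsync() across many concurrent commits (sketch)."""

    def __init__(self, fsync_fn):
        self.fsync_fn = fsync_fn       # stand-in for flushing the log file
        self.cond = threading.Condition()
        self.requested_lsn = 0         # highest LSN any committer needs
        self.flushed_lsn = 0           # highest LSN known durable
        self.flushing = False
        self.fsync_count = 0

    def commit(self, lsn):
        with self.cond:
            self.requested_lsn = max(self.requested_lsn, lsn)
            while self.flushed_lsn < lsn:
                if self.flushing:
                    self.cond.wait()   # follower: wait for leader's flush
                else:
                    # Leader: one fsync covers everything requested so far
                    self.flushing = True
                    target = self.requested_lsn
                    self.cond.release()
                    try:
                        self.fsync_fn()
                    finally:
                        self.cond.acquire()
                    self.fsync_count += 1
                    self.flushed_lsn = max(self.flushed_lsn, target)
                    self.flushing = False
                    self.cond.notify_all()
```

A commit whose LSN is already covered by a previous flush returns immediately without another fsync(), which is where the throughput win comes from under concurrency.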
PostgreSQL’s WAL configuration exposes these trade-offs:
-- Synchronous commit: maximum durability, lower throughput
ALTER SYSTEM SET synchronous_commit = 'on';
-- Asynchronous commit: risk losing recent commits, 10x+ throughput
ALTER SYSTEM SET synchronous_commit = 'off';
-- Group commit delay: batch commits for better throughput
ALTER SYSTEM SET commit_delay = 10; -- microseconds
ALTER SYSTEM SET commit_siblings = 5; -- minimum concurrent transactions
-- Checkpoint frequency: recovery time vs. I/O overhead
ALTER SYSTEM SET checkpoint_timeout = '5min';
ALTER SYSTEM SET max_wal_size = '1GB';
MySQL/InnoDB provides similar controls:
[mysqld]
# Flush log to disk on every commit (safest)
innodb_flush_log_at_trx_commit = 1
# Write to the OS cache at commit, fsync roughly once per second (faster, less durable)
innodb_flush_log_at_trx_commit = 2
# Control log file size
innodb_log_file_size = 512M
innodb_log_files_in_group = 2
The innodb_flush_log_at_trx_commit setting is critical. Setting it to 1 guarantees durability but limits throughput to your disk’s fsync() rate. Setting it to 2 can lose up to one second of transactions on OS crash.
Real-World Implementations
PostgreSQL uses a sophisticated WAL implementation with full-page writes after checkpoints to handle torn pages. It stores WAL in 16MB segment files and supports streaming replication by shipping WAL to replicas.
MySQL/InnoDB implements redo logs (similar to WAL) and separate undo logs. Its doublewrite buffer provides additional protection against partial page writes. InnoDB uses a circular log file structure that requires periodic checkpointing.
SQLite offers multiple journaling modes. WAL mode (PRAGMA journal_mode=WAL) provides better concurrency than its default rollback journal, allowing readers to access the database while a writer is active. SQLite’s WAL is simpler than enterprise databases but follows the same principles.
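SQLite's WAL mode is easy to observe from Python's standard sqlite3 module; the PRAGMA returns the journal mode now in effect, and writes land in a `-wal` side file next to the database:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)

# Switch from the default rollback journal to WAL mode
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
assert mode == "wal"

conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")
conn.commit()

# Committed changes sit in the WAL side file until they are
# checkpointed back into the main database file
assert os.path.exists(path + "-wal")
conn.close()
```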
Each implementation makes different trade-offs. PostgreSQL prioritizes correctness and features. MySQL/InnoDB optimizes for high-concurrency workloads. SQLite minimizes complexity for embedded use cases.
When and Why WAL Matters
Understanding WAL helps you make informed architectural decisions:
Performance Tuning - If your application can tolerate losing a few seconds of data, asynchronous commits can dramatically improve throughput. For financial systems, synchronous commits are non-negotiable.
Backup Strategies - WAL enables point-in-time recovery. By archiving WAL segments, you can restore the database to any moment between backups.
Replication - Streaming replication works by shipping WAL to replicas. Understanding WAL helps you configure replication lag and failover behavior.
Capacity Planning - WAL generation rate determines disk I/O requirements. Write-heavy workloads need fast storage for WAL files.
The durability guarantees you configure directly impact your recovery point objective (RPO). Synchronous commits give you RPO=0 (no data loss). Asynchronous commits might give you RPO=1 second. Choose based on your business requirements, not default settings.
WAL is the foundation of database durability. It’s not just an implementation detail—it’s a fundamental architectural pattern that transforms the impossibly slow problem of random disk writes into the tractable problem of sequential logging.