# System Design: Caching Strategies (Write-Through, Write-Back, Write-Around)
## Key Insights
- Write-through guarantees consistency but adds latency; write-back optimizes for speed but risks data loss; write-around prevents cache pollution but causes initial read misses
- Your caching strategy should vary by data type—use write-through for financial transactions, write-back for session data, and write-around for analytics
- Most production systems need hybrid approaches, not a single strategy applied uniformly across all data
## The Sync Problem Every System Faces
Every caching layer introduces a fundamental challenge: how do you keep two data stores in sync when writes happen? Get this wrong and you’ll face stale reads, lost writes, or both. Get it right and you’ll have a system that’s both fast and reliable.
The three main write strategies—write-through, write-back, and write-around—represent different trade-offs along the consistency-performance spectrum. There’s no universally correct answer. The right choice depends on your data’s characteristics, your tolerance for inconsistency, and how much complexity you’re willing to manage.
Let’s break down each strategy with real implementation code.
## Write-Through Caching
Write-through is the conservative choice. Every write goes to both the cache and the database synchronously. The operation only succeeds when both writes complete.
This gives you strong consistency. The cache always reflects the database state. Reads are fast and trustworthy. The downside is latency—every write operation now has the overhead of two storage operations.
```python
import redis
import psycopg2
from contextlib import contextmanager
from typing import Any, Optional


class WriteThroughCache:
    def __init__(self, redis_client: redis.Redis, db_connection):
        self.cache = redis_client
        self.db = db_connection

    @contextmanager
    def _transaction(self):
        cursor = self.db.cursor()
        try:
            yield cursor
            self.db.commit()
        except Exception:
            self.db.rollback()
            raise
        finally:
            cursor.close()

    def set(self, key: str, value: Any, table: str = "cache_data") -> bool:
        """
        Write-through: write to both cache and DB atomically.
        Cache write only happens if the DB write succeeds.
        """
        try:
            with self._transaction() as cursor:
                # Write to database first; the table name comes from trusted
                # code, values are bound as parameters
                cursor.execute(
                    f"""
                    INSERT INTO {table} (key, value, updated_at)
                    VALUES (%s, %s, NOW())
                    ON CONFLICT (key) DO UPDATE SET value = %s, updated_at = NOW()
                    """,
                    (key, value, value)
                )
            # Only update cache after the DB commit succeeds
            self.cache.set(key, value)
            return True
        except Exception:
            # On any failure, invalidate the entry so the cache can never
            # hold a value the database doesn't - worst case is a miss
            self.cache.delete(key)
            raise

    def get(self, key: str) -> Optional[Any]:
        """Read from cache first, fall back to DB."""
        value = self.cache.get(key)
        if value is not None:
            return value

        # Cache miss - fetch from DB and populate cache
        with self._transaction() as cursor:
            cursor.execute(
                "SELECT value FROM cache_data WHERE key = %s",
                (key,)
            )
            row = cursor.fetchone()
            if row:
                self.cache.set(key, row[0])
                return row[0]
        return None
```
The critical detail here is the order of operations. Write to the database first. If that fails, the cache stays unchanged. If the cache write fails after the database succeeds, invalidate the cache entry. This ensures you never have stale data in the cache—at worst, you have a cache miss.
Write-through works best for read-heavy workloads where consistency matters more than write latency. Think user profile data, configuration settings, or any data where serving stale values causes real problems.
## Write-Back (Write-Behind) Caching
Write-back flips the trade-off. Writes go to the cache immediately and return to the caller. The database update happens asynchronously, usually batched for efficiency.
This gives you excellent write performance. The caller doesn’t wait for the slow database operation. But you’re now accepting durability risk—if the cache node dies before the async flush, you lose data.
```python
import redis
import threading
import time
from queue import Queue, Empty
from typing import Any, List
from dataclasses import dataclass
from datetime import datetime


@dataclass
class DirtyEntry:
    key: str
    value: Any
    timestamp: datetime


class WriteBackCache:
    def __init__(self, redis_client: redis.Redis, db_connection,
                 flush_interval: int = 5, batch_size: int = 100):
        self.cache = redis_client
        self.db = db_connection
        self.dirty_queue: Queue[DirtyEntry] = Queue()
        self.flush_interval = flush_interval
        self.batch_size = batch_size
        self._start_flush_worker()

    def _start_flush_worker(self):
        """Background worker that flushes dirty entries to DB."""
        def flush_loop():
            while True:
                time.sleep(self.flush_interval)
                self._flush_batch()

        worker = threading.Thread(target=flush_loop, daemon=True)
        worker.start()

    def _flush_batch(self):
        """Batch flush dirty entries to database."""
        batch: List[DirtyEntry] = []
        while not self.dirty_queue.empty() and len(batch) < self.batch_size:
            try:
                batch.append(self.dirty_queue.get_nowait())
            except Empty:
                break

        if not batch:
            return

        cursor = self.db.cursor()
        try:
            # Batch insert/update for efficiency
            values = [(e.key, e.value, e.timestamp) for e in batch]
            cursor.executemany(
                """
                INSERT INTO cache_data (key, value, updated_at)
                VALUES (%s, %s, %s)
                ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value,
                                                updated_at = EXCLUDED.updated_at
                """,
                values
            )
            self.db.commit()
            # Mark entries as clean in cache metadata
            for entry in batch:
                self.cache.srem("dirty_keys", entry.key)
        except Exception:
            self.db.rollback()
            # Re-queue failed entries
            for entry in batch:
                self.dirty_queue.put(entry)
            raise
        finally:
            cursor.close()

    def set(self, key: str, value: Any) -> bool:
        """Write to cache immediately, queue for async DB write."""
        self.cache.set(key, value)
        # Track dirty state so unflushed keys are discoverable after a crash
        self.cache.sadd("dirty_keys", key)
        # Queue for async persistence
        self.dirty_queue.put(DirtyEntry(key=key, value=value,
                                        timestamp=datetime.now()))
        return True

    def get(self, key: str) -> Any:
        """Read from cache - it's always the source of truth."""
        return self.cache.get(key)

    def force_flush(self):
        """Force immediate flush of all dirty entries."""
        while not self.dirty_queue.empty():
            self._flush_batch()
```
The dirty key tracking is essential. You need to know which cache entries haven’t been persisted yet. If a cache node fails, you can use this set to understand what data might be at risk.
Write-back shines for high-throughput write scenarios where you can tolerate some data loss. Session data, analytics events, and real-time metrics are good candidates. Financial transactions are not.
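When the process comes back after a crash, the `dirty_keys` set is what makes recovery possible: replay any key still marked dirty whose value survived in Redis. The sketch below is a minimal, hypothetical helper (not part of `WriteBackCache`); the `read_cache` and `persist` callables stand in for a Redis `GET` and the database upsert, so the logic stays self-contained.

```python
from typing import Any, Callable, Iterable, List, Optional

def recover_dirty_keys(dirty_keys: Iterable[str],
                       read_cache: Callable[[str], Optional[Any]],
                       persist: Callable[[str, Any], None]) -> List[str]:
    """Replay unflushed writes after a crash. `dirty_keys` would come from
    SMEMBERS dirty_keys; returns the keys whose values were evicted before
    they were persisted - those writes are lost for good."""
    lost: List[str] = []
    for key in dirty_keys:
        value = read_cache(key)
        if value is None:
            lost.append(key)  # evicted before flush: unrecoverable
            continue
        persist(key, value)  # re-run the DB upsert for this key
    return lost
```

Keys that come back in the `lost` list are exactly the durability cost of write-back; they belong in an alert, not a log line.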
## Write-Around Caching
Write-around takes a different approach entirely. Writes go directly to the database, bypassing the cache completely. The cache only gets populated on subsequent reads.
This prevents cache pollution—data that’s written but rarely read never consumes cache space. The trade-off is that immediately after a write, the next read will be a cache miss.
```python
import redis
from typing import Any, Optional


class WriteAroundCache:
    def __init__(self, redis_client: redis.Redis, db_connection,
                 cache_ttl: int = 3600):
        self.cache = redis_client
        self.db = db_connection
        self.cache_ttl = cache_ttl

    def set(self, key: str, value: Any, table: str = "cache_data") -> bool:
        """Write directly to database, skip cache entirely."""
        cursor = self.db.cursor()
        try:
            cursor.execute(
                f"""
                INSERT INTO {table} (key, value, updated_at)
                VALUES (%s, %s, NOW())
                ON CONFLICT (key) DO UPDATE SET value = %s, updated_at = NOW()
                """,
                (key, value, value)
            )
            self.db.commit()
            # Invalidate any existing cache entry
            # This prevents serving stale data
            self.cache.delete(key)
            return True
        except Exception:
            self.db.rollback()
            raise
        finally:
            cursor.close()

    def get(self, key: str) -> Optional[Any]:
        """
        Read from cache first. On miss, fetch from DB and populate cache.
        This lazy loading is the only place the cache gets populated.
        """
        # Try cache first
        value = self.cache.get(key)
        if value is not None:
            return value

        # Cache miss - fetch from database
        cursor = self.db.cursor()
        try:
            cursor.execute(
                "SELECT value FROM cache_data WHERE key = %s",
                (key,)
            )
            row = cursor.fetchone()
            if row:
                # Populate cache with a TTL for future reads
                self.cache.setex(key, self.cache_ttl, row[0])
                return row[0]
            return None
        finally:
            cursor.close()
```
The cache invalidation on write is crucial. Without it, subsequent reads would return stale cached data until the TTL expires.
Write-around is ideal for write-heavy workloads where most data is written once and rarely read. Log ingestion, audit trails, and bulk data imports are perfect use cases.
## Comparison Matrix and Decision Framework
| Strategy | Consistency | Write Latency | Read Latency | Durability | Complexity |
|---|---|---|---|---|---|
| Write-Through | Strong | High | Low | High | Low |
| Write-Back | Eventual | Low | Low | Medium | High |
| Write-Around | Strong | Medium | Variable | High | Low |
Choose write-through when: Data consistency is critical, write volume is moderate, and you can tolerate higher write latency.
Choose write-back when: Write performance is critical, you can tolerate eventual consistency, and you have infrastructure to handle cache failures gracefully.
Choose write-around when: Write volume is high, most data is rarely re-read, and you want to prevent cache pollution.
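The three rules above can be collapsed into a toy decision function. This is a deliberately simplified sketch of the framework, not a production heuristic; real systems weigh more dimensions (durability SLOs, operational maturity, team experience).

```python
def choose_strategy(consistency_critical: bool,
                    write_heavy: bool,
                    data_reread_often: bool) -> str:
    """Map workload traits to a caching write strategy."""
    if consistency_critical:
        return "write-through"   # correctness beats latency
    if write_heavy and not data_reread_often:
        return "write-around"    # keep rarely-read data out of the cache
    if write_heavy:
        return "write-back"      # absorb write bursts in the cache
    return "write-through"       # safe default for everything else
```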
## Hybrid Approaches for Production Systems
Real systems rarely use a single strategy. Different data types have different requirements.
```python
from typing import Any


class HybridCache:
    """
    Uses different strategies based on data classification.
    - Critical data: write-through (user balances, orders)
    - Session data: write-back (user preferences, cart)
    - Analytics: write-around (page views, click events)
    """

    def __init__(self, redis_client, db_connection):
        self.write_through = WriteThroughCache(redis_client, db_connection)
        self.write_back = WriteBackCache(redis_client, db_connection)
        self.write_around = WriteAroundCache(redis_client, db_connection)
        self.strategy_map = {
            "user_balance": "write_through",
            "order": "write_through",
            "session": "write_back",
            "preferences": "write_back",
            "page_view": "write_around",
            "analytics": "write_around",
        }

    def _get_strategy(self, data_type: str):
        # Unknown data types fall back to write-through, the safest default
        strategy_name = self.strategy_map.get(data_type, "write_through")
        return getattr(self, strategy_name)

    def set(self, key: str, value: Any, data_type: str) -> bool:
        return self._get_strategy(data_type).set(key, value)

    def get(self, key: str, data_type: str) -> Any:
        return self._get_strategy(data_type).get(key)
```
## Key Takeaways
The thundering herd problem hits write-around hardest. When a popular item gets written, every subsequent read becomes a cache miss until one request populates the cache. Use request coalescing or cache warming to mitigate this.
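One way to implement request coalescing is a per-key lock in front of the loader: the first thread through loads from the backing store, and concurrent callers for the same key block briefly and then hit the freshly populated cache. The sketch below uses a plain in-memory dict instead of Redis to stay self-contained, and `CoalescingLoader` is a hypothetical name, not a library class.

```python
import threading
from typing import Any, Callable, Dict

class CoalescingLoader:
    """On a cache miss, only one thread per key hits the backing store;
    concurrent callers for the same key wait and share the result."""

    def __init__(self, load_fn: Callable[[str], Any]):
        self.load_fn = load_fn
        self.cache: Dict[str, Any] = {}
        self._locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()  # protects the _locks dict itself

    def _lock_for(self, key: str) -> threading.Lock:
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key: str) -> Any:
        value = self.cache.get(key)
        if value is not None:
            return value
        with self._lock_for(key):
            # Re-check: another thread may have loaded it while we waited
            value = self.cache.get(key)
            if value is None:
                value = self.load_fn(key)  # single backing-store round-trip
                self.cache[key] = value
        return value
```

In a distributed setup the per-key lock would become a short-TTL Redis lock, but the double-checked pattern stays the same.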
Write-back requires robust failure handling. Your async flush worker needs retry logic, dead letter queues, and monitoring for queue depth. If the queue backs up, you’re accumulating durability risk.
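A minimal shape for that retry logic, sketched without a real queue system: `RetryingFlusher` and `FlushJob` are hypothetical names, and in production the dead-letter list would be a durable queue with alerting rather than an in-memory list.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class FlushJob:
    key: str
    value: Any
    attempts: int = 0

class RetryingFlusher:
    """Wraps a flush function with bounded retries; entries that keep
    failing land in a dead-letter list for inspection instead of being
    retried forever (which would hide a growing durability gap)."""

    def __init__(self, flush_fn: Callable[[FlushJob], None],
                 max_attempts: int = 3):
        self.flush_fn = flush_fn
        self.max_attempts = max_attempts
        self.dead_letter: List[FlushJob] = []

    def flush(self, jobs: List[FlushJob]) -> List[FlushJob]:
        """Returns the jobs that should be re-queued for the next cycle."""
        requeue: List[FlushJob] = []
        for job in jobs:
            try:
                self.flush_fn(job)
            except Exception:
                job.attempts += 1
                if job.attempts >= self.max_attempts:
                    self.dead_letter.append(job)  # stop retrying; alert here
                else:
                    requeue.append(job)
        return requeue
```

Queue depth and dead-letter size are the two metrics worth paging on: both measure unpersisted writes.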
Don’t optimize prematurely. Start with write-through for everything. It’s the simplest to reason about and debug. Only move to write-back or write-around when you have measured evidence that write latency or cache pollution is actually a problem.
Monitor cache-database drift. Even with write-through, network partitions and partial failures can cause inconsistencies. Implement periodic reconciliation jobs that compare cache and database state for critical data.
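A reconciliation pass reduces to a diff between two snapshots of the same key set. The sketch below works on plain dicts for clarity; in a real job the snapshots would come from a Redis `MGET` and a bulk `SELECT`, and each reported key would be invalidated or alerted on.

```python
from typing import Any, Dict, List, Tuple

def reconcile(cache_snapshot: Dict[str, Any],
              db_snapshot: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Compare cache and DB state for critical keys and report drift."""
    drift: List[Tuple[str, str]] = []
    for key, db_value in db_snapshot.items():
        cached = cache_snapshot.get(key)
        if cached is None:
            continue  # a cache miss is safe - the next read repopulates
        if cached != db_value:
            drift.append((key, "stale_cache"))  # invalidate this key
    for key in cache_snapshot:
        if key not in db_snapshot:
            drift.append((key, "orphaned_cache_entry"))  # no DB row backs it
    return drift
```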