# System Design: Replication Patterns (Master-Slave, Multi-Master)
## Key Insights
- Master-slave replication excels at read scaling with simple conflict resolution, but creates a single point of failure for writes—use it when your read-to-write ratio exceeds 10:1 and you can tolerate brief write unavailability during failover.
- Multi-master replication enables geo-distributed writes and eliminates write bottlenecks, but forces you to solve conflict resolution—a problem with no perfect solution, only trade-offs between consistency, latency, and complexity.
- Replication lag is inevitable in any distributed system; design your application to handle it explicitly through patterns like read-your-writes consistency rather than hoping it won’t matter.
## Introduction to Database Replication
Database replication copies data across multiple servers to achieve goals that a single database instance cannot: surviving hardware failures, scaling read capacity, and serving users across geographic regions with acceptable latency.
Before diving into patterns, let’s establish terminology. A primary (or master) is a node that accepts writes. A replica (or slave) receives copied data from another node. Synchronous replication waits for replicas to confirm writes before acknowledging to clients—stronger consistency, higher latency. Asynchronous replication acknowledges immediately and copies data in the background—lower latency, potential data loss.
The pattern you choose shapes your entire system architecture. Get it wrong, and you’ll either over-engineer a simple problem or discover your “highly available” system has a single point of failure during your first real outage.
## Master-Slave (Primary-Replica) Architecture
Master-slave is the workhorse of database replication. One node handles all writes; replicas receive a stream of changes and serve read traffic.
The model works because most applications read far more than they write. An e-commerce product catalog might see 1,000 reads per write. Spreading those reads across five replicas means each replica handles only 200 requests while the primary absorbs the single write.
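As a back-of-envelope sketch, the read-scaling arithmetic from the paragraph above looks like this (the 1,000:1 numbers are the illustrative figures, not a benchmark):

```python
def per_replica_reads(total_reads: int, replica_count: int) -> float:
    """Reads each replica serves, assuming even load balancing."""
    return total_reads / replica_count

# 1,000 reads spread over 5 replicas -> 200 reads per replica
per_replica_reads(1000, 5)
```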
Here’s a PostgreSQL streaming replication configuration for the primary:
```ini
# postgresql.conf on primary
wal_level = replica
max_wal_senders = 5
wal_keep_size = 1GB
# the next two lines make replication synchronous; omit them for asynchronous
synchronous_commit = on
synchronous_standby_names = 'replica1'
```
And the replica’s recovery configuration:
```ini
# postgresql.conf on replica
primary_conninfo = 'host=primary-host port=5432 user=replicator password=secret'
primary_slot_name = 'replica1_slot'
hot_standby = on
```
Your application needs to route queries appropriately. Here’s a Python implementation using SQLAlchemy:
```python
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from contextlib import contextmanager
import random

class ReplicatedDatabase:
    def __init__(self, primary_url: str, replica_urls: list[str]):
        self.primary = create_engine(primary_url, pool_size=10)
        self.replicas = [create_engine(url, pool_size=20) for url in replica_urls]

    @contextmanager
    def write_session(self):
        Session = sessionmaker(bind=self.primary)
        session = Session()
        try:
            yield session
            session.commit()
        except Exception:
            session.rollback()
            raise
        finally:
            session.close()

    @contextmanager
    def read_session(self):
        replica = random.choice(self.replicas)
        Session = sessionmaker(bind=replica)
        session = Session()
        try:
            yield session
        finally:
            session.close()
```
The limitation is obvious: your primary is a single point of failure for writes. When it dies, you need a promotion process—either manual or automated—that takes time. During that window, writes fail.
## Multi-Master (Active-Active) Architecture
Multi-master replication allows writes on any node. Each node replicates changes to others, creating a mesh of data flow.
This pattern shines for geo-distributed deployments. Users in Europe write to European nodes; users in Asia write to Asian nodes. Everyone gets low-latency writes instead of crossing oceans for every INSERT.
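As a sketch, write routing in such a deployment can be as simple as a region-to-endpoint lookup; the region names and hostnames below are hypothetical, not from any particular deployment:

```python
# Hypothetical mapping from user region to the nearest multi-master node.
WRITE_ENDPOINTS = {
    "eu": "db-eu.example.internal",
    "asia": "db-asia.example.internal",
    "us": "db-us.example.internal",
}

def pick_write_endpoint(user_region: str, default_region: str = "us") -> str:
    """Route a write to the nearest node, falling back to a default region."""
    return WRITE_ENDPOINTS.get(user_region, WRITE_ENDPOINTS[default_region])
```

Real deployments usually push this decision into DNS (geo-routing) or a proxy layer rather than application code, but the logic is the same.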
MySQL Group Replication provides multi-master capability:
```ini
# my.cnf for each node in the group
[mysqld]
server_id=1                  # unique per node
gtid_mode=ON
enforce_gtid_consistency=ON
binlog_checksum=NONE

# Group Replication settings
plugin_load_add='group_replication.so'
group_replication_group_name="aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
group_replication_start_on_boot=OFF
group_replication_local_address="node1:33061"
group_replication_group_seeds="node1:33061,node2:33061,node3:33061"
group_replication_single_primary_mode=OFF   # enables multi-master
```
CockroachDB takes a different approach, building multi-master into its core architecture:
```yaml
# docker-compose.yml for local CockroachDB cluster
version: '3.8'
services:
  roach1:
    image: cockroachdb/cockroach:latest
    command: start --insecure --join=roach1,roach2,roach3
    ports:
      - "26257:26257"
      - "8080:8080"
  roach2:
    image: cockroachdb/cockroach:latest
    command: start --insecure --join=roach1,roach2,roach3
  roach3:
    image: cockroachdb/cockroach:latest
    command: start --insecure --join=roach1,roach2,roach3
```
The complexity cost is real. When two nodes modify the same row simultaneously, you have a conflict. Someone has to lose, and deciding who—and how—is the hard part.
## Conflict Resolution Strategies
Conflicts in multi-master systems are inevitable. Your strategy for handling them determines whether your system behaves predictably or creates subtle data corruption.
Last-write-wins (LWW) is the simplest approach: attach a timestamp to each write, and the highest timestamp wins. It’s easy to implement but can silently discard valid updates.
```python
from dataclasses import dataclass
from typing import Any
import time

@dataclass
class VersionedValue:
    value: Any
    timestamp: float
    node_id: str

class LWWRegister:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.state: VersionedValue | None = None

    def write(self, value: Any) -> VersionedValue:
        new_state = VersionedValue(
            value=value,
            timestamp=time.time(),
            node_id=self.node_id
        )
        self._merge(new_state)
        return new_state

    def merge_remote(self, remote: VersionedValue):
        self._merge(remote)

    def _merge(self, incoming: VersionedValue):
        if self.state is None:
            self.state = incoming
        elif incoming.timestamp > self.state.timestamp:
            self.state = incoming
        elif incoming.timestamp == self.state.timestamp:
            # Tie-breaker: higher node_id wins
            if incoming.node_id > self.state.node_id:
                self.state = incoming

    def read(self) -> Any:
        return self.state.value if self.state else None
```
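To see LWW convergence in miniature, the same merge rule can be expressed over (timestamp, node_id, value) tuples; the sketch below mirrors the `_merge` logic without depending on the class:

```python
def lww_merge(a: tuple, b: tuple) -> tuple:
    """Winner has the higher timestamp; ties break on the higher node_id."""
    return max(a, b, key=lambda t: (t[0], t[1]))

# Two concurrent writes to the same register on different nodes.
w1 = (100.0, "node-a", "alice@old.example")
w2 = (100.5, "node-b", "alice@new.example")

# Merge order does not matter: both replicas converge on the same winner,
# and w1's update is silently discarded -- the LWW hazard in action.
assert lww_merge(w1, w2) == lww_merge(w2, w1) == w2
```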
CRDTs (Conflict-free Replicated Data Types) guarantee convergence without coordination. A G-Counter (grow-only counter) is a classic example:
```python
class GCounter:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts: dict[str, int] = {}

    def increment(self, amount: int = 1):
        current = self.counts.get(self.node_id, 0)
        self.counts[self.node_id] = current + amount

    def merge(self, other: 'GCounter'):
        all_nodes = set(self.counts.keys()) | set(other.counts.keys())
        for node in all_nodes:
            self.counts[node] = max(
                self.counts.get(node, 0),
                other.counts.get(node, 0)
            )

    def value(self) -> int:
        return sum(self.counts.values())
```
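A two-node exchange shows why the per-node max converges; the dict-based sketch below mirrors `GCounter.merge` without the class:

```python
def gcounter_merge(a: dict[str, int], b: dict[str, int]) -> dict[str, int]:
    """Element-wise max over all known nodes, as in GCounter.merge."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

node_a = {"a": 3}   # node "a" incremented 3 times
node_b = {"b": 2}   # node "b" incremented 2 times, concurrently
merged = gcounter_merge(node_a, node_b)

assert sum(merged.values()) == 5                  # no increments lost
assert gcounter_merge(merged, node_b) == merged   # idempotent: re-merging is safe
```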
Application-level resolution pushes conflict handling to business logic. Shopping cart merges might union items; document edits might require manual review. This approach is more work but produces semantically correct results.
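As a sketch of the shopping-cart case, a merge that unions items might look like the following. Summing quantities can double-count an item both replicas inherited from a common ancestor, which is why production carts track causality (e.g. with version vectors); the SKU names here are illustrative:

```python
def merge_carts(local: dict[str, int], remote: dict[str, int]) -> dict[str, int]:
    """Union of SKUs; concurrent adds of the same SKU sum their quantities."""
    merged = dict(local)
    for sku, qty in remote.items():
        merged[sku] = merged.get(sku, 0) + qty
    return merged

# Two replicas saw different adds; the merge keeps both.
merge_carts({"book": 1}, {"book": 1, "pen": 2})
```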
## Replication Lag and Consistency Patterns
Asynchronous replication means replicas lag behind the primary. A user updates their profile, immediately reads it back, and sees stale data. This breaks user expectations and causes support tickets.
Read-your-writes consistency ensures users see their own updates. Implement it by tracking the last write timestamp and routing reads appropriately:
```python
from fastapi import FastAPI, Request, Response
from typing import Optional
import time

app = FastAPI()

class ConsistencyManager:
    def __init__(self, db: ReplicatedDatabase):
        self.db = db

    def get_session(self, request: Request, is_write: bool):
        if is_write:
            return self.db.write_session()
        # Check if the user recently wrote
        last_write = request.cookies.get('last_write_ts')
        if last_write:
            last_write_time = float(last_write)
            # If the write was recent, read from the primary
            if time.time() - last_write_time < 5.0:  # 5 second window
                return self.db.write_session()  # Use primary for read
        return self.db.read_session()

    def mark_write(self, response: Response):
        response.set_cookie('last_write_ts', str(time.time()))

# db is a ReplicatedDatabase instance from the earlier example
consistency = ConsistencyManager(db)

@app.post("/users/{user_id}")
async def update_user(user_id: int, request: Request, response: Response):
    with consistency.get_session(request, is_write=True) as session:
        # perform update
        pass
    consistency.mark_write(response)
    return {"status": "updated"}

@app.get("/users/{user_id}")
async def get_user(user_id: int, request: Request):
    with consistency.get_session(request, is_write=False) as session:
        # perform read
        pass
    return {"status": "ok"}
```
An alternative approach uses version tokens instead of timestamps:
```python
from typing import Optional
from sqlalchemy import text

@app.post("/users/{user_id}")
async def update_user(user_id: int):
    with db.write_session() as session:
        result = session.execute(
            text("UPDATE users SET ... RETURNING pg_current_wal_lsn()")
        )
        lsn = result.scalar()
    return {"status": "updated", "version": str(lsn)}

@app.get("/users/{user_id}")
async def get_user(user_id: int, min_version: Optional[str] = None):
    if min_version:
        # Wait for the replica to catch up, or fall back to the primary.
        # LSNs are not safely comparable as strings, so let Postgres do it:
        with db.read_session() as session:
            behind = session.execute(
                text("SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), :lsn)"),
                {"lsn": min_version},
            ).scalar()
            if behind < 0:
                # Replica has not replayed min_version yet; fall back to primary
                pass
```
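One way to act on that check is a small polling helper. The sketch below takes a callable that returns `pg_wal_lsn_diff(pg_last_wal_replay_lsn(), target_lsn)` so it can be exercised without a database; a negative diff means the replica is still behind:

```python
import time
from typing import Callable

def wait_until_replayed(get_replay_diff: Callable[[], float],
                        timeout: float = 2.0, interval: float = 0.05) -> bool:
    """Poll until the replica has replayed past the target LSN.

    get_replay_diff is assumed to query
    pg_wal_lsn_diff(pg_last_wal_replay_lsn(), target_lsn) on the replica.
    Returns True once the replica catches up, False on timeout (at which
    point the caller should fall back to the primary).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_replay_diff() >= 0:
            return True
        time.sleep(interval)
    return False
```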
## Choosing the Right Pattern
| Factor | Master-Slave | Multi-Master |
|---|---|---|
| Read/write ratio | High (10:1+) | Any |
| Write availability | Tolerates brief outages | Critical |
| Geographic distribution | Single region | Multi-region |
| Consistency requirements | Strong preferred | Eventual acceptable |
| Operational complexity | Lower | Higher |
| Conflict handling | None needed | Required |
Choose master-slave when you need simplicity and can tolerate a promotion window during primary failures. Most applications fit this category.
Choose multi-master when writes must succeed in multiple regions simultaneously or when you cannot accept any write downtime. Be prepared to invest in conflict resolution.
## Practical Implementation Considerations
Monitor replication lag continuously. Here’s a health check endpoint:
```python
@app.get("/health/replication")
async def replication_health():
    lag_seconds = get_replica_lag()  # Query pg_stat_replication
    status = "healthy" if lag_seconds < 5 else "degraded"
    return {
        "status": status,
        "lag_seconds": lag_seconds,
        "threshold_seconds": 5
    }
```
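One possible `get_replica_lag` implementation, run against the replica, is sketched below; it assumes a session whose `execute` accepts a raw SQL string (SQLAlchemy 2.x would want it wrapped in `text()`):

```python
def get_replica_lag(session) -> float:
    """Seconds since the replica last replayed a transaction.

    pg_last_xact_replay_timestamp() is NULL before any WAL is replayed,
    reported here as 0.0. Caveat: on an idle primary this metric keeps
    growing even though the replica is fully caught up.
    """
    lag = session.execute(
        "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))"
    ).scalar()
    return float(lag) if lag is not None else 0.0
```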
A basic failover script outline for PostgreSQL:
```bash
#!/bin/bash
# Simplified failover - production needs more checks
REPLICA_HOST=$1

# Promote replica to primary
ssh "$REPLICA_HOST" "sudo -u postgres pg_ctl promote -D /var/lib/postgresql/data"

# Update connection strings (via config management, DNS, or load balancer)
update_primary_endpoint "$REPLICA_HOST"

# Notify on-call
send_alert "Primary failover completed to $REPLICA_HOST"
```
Test failovers regularly. A disaster recovery plan that’s never been executed is a hypothesis, not a plan. Schedule quarterly chaos exercises where you actually kill the primary and verify recovery works.
Common pitfalls: assuming replication lag won’t affect your application (it will), not monitoring replica health until an outage, and underestimating the operational burden of multi-master conflict resolution. Start simple with master-slave, and add complexity only when you’ve proven you need it.