System Design: Replication Patterns (Master-Slave, Multi-Master)

Key Insights

  • Master-slave replication excels at read scaling with simple conflict resolution, but creates a single point of failure for writes—use it when your read-to-write ratio exceeds 10:1 and you can tolerate brief write unavailability during failover.
  • Multi-master replication enables geo-distributed writes and eliminates write bottlenecks, but forces you to solve conflict resolution—a problem with no perfect solution, only trade-offs between consistency, latency, and complexity.
  • Replication lag is inevitable in any distributed system; design your application to handle it explicitly through patterns like read-your-writes consistency rather than hoping it won’t matter.

Introduction to Database Replication

Database replication copies data across multiple servers to achieve goals that a single database instance cannot: surviving hardware failures, scaling read capacity, and serving users across geographic regions with acceptable latency.

Before diving into patterns, let’s establish terminology. A primary (or master) is a node that accepts writes. A replica (or slave) receives copied data from another node. Synchronous replication waits for replicas to confirm writes before acknowledging to clients—stronger consistency, higher latency. Asynchronous replication acknowledges immediately and copies data in the background—lower latency, potential data loss.
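The acknowledgement-order difference is easier to see in a toy model. This sketch models only when the client gets its "ack" relative to the copy, not real replication:

```python
class Replica:
    def __init__(self):
        self.data = {}

class Primary:
    def __init__(self, replicas, synchronous: bool):
        self.data = {}
        self.replicas = replicas
        self.synchronous = synchronous
        self.pending = []  # acknowledged but not yet shipped (async mode only)

    def write(self, key, value) -> str:
        self.data[key] = value
        if self.synchronous:
            # Apply to every replica before acknowledging the client
            for r in self.replicas:
                r.data[key] = value
        else:
            # Acknowledge immediately; ship in the background
            self.pending.append((key, value))
        return "ack"

    def flush(self):
        # Background shipping; if the primary dies first, these writes are lost
        for key, value in self.pending:
            for r in self.replicas:
                r.data[key] = value
        self.pending.clear()
```

In synchronous mode, a primary crash right after "ack" loses nothing; in asynchronous mode, anything still in `pending` disappears with the primary.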

The pattern you choose shapes your entire system architecture. Get it wrong, and you’ll either over-engineer a simple problem or discover your “highly available” system has a single point of failure during your first real outage.

Master-Slave (Primary-Replica) Architecture

Master-slave is the workhorse of database replication. One node handles all writes; replicas receive a stream of changes and serve read traffic.

The model works because most applications read far more than they write. An e-commerce product catalog might see 1,000 reads per write. Spreading those reads across five replicas means each handles only 200 requests while the primary handles the single write.

Here’s a PostgreSQL streaming replication configuration for the primary:

# postgresql.conf on primary
wal_level = replica
max_wal_senders = 5
wal_keep_size = 1GB
synchronous_commit = on
synchronous_standby_names = 'replica1'  # omit this line for fully asynchronous replication

And the replica’s recovery configuration:

# postgresql.conf on replica
primary_conninfo = 'host=primary-host port=5432 user=replicator password=secret'
primary_slot_name = 'replica1_slot'
hot_standby = on
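To verify the link is actually streaming, query pg_stat_replication on the primary and pg_is_in_recovery() on the replica. A sketch, assuming plain DB-API cursors for both connections:

```python
def replication_status(primary_cur, replica_cur) -> dict:
    """Summarize streaming replication health from both ends of the link."""
    primary_cur.execute(
        "SELECT application_name, state, sync_state FROM pg_stat_replication"
    )
    standbys = primary_cur.fetchall()  # one row per connected standby

    replica_cur.execute("SELECT pg_is_in_recovery()")
    in_recovery = replica_cur.fetchone()[0]  # True while following a primary

    return {
        "connected_standbys": len(standbys),
        "standbys": standbys,
        "replica_in_recovery": in_recovery,
    }
```

A healthy setup shows one `streaming` row per replica on the primary and `replica_in_recovery=True` on each standby.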

Your application needs to route queries appropriately. Here’s a Python implementation using SQLAlchemy:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from contextlib import contextmanager
import random

class ReplicatedDatabase:
    def __init__(self, primary_url: str, replica_urls: list[str]):
        self.primary = create_engine(primary_url, pool_size=10)
        self.replicas = [create_engine(url, pool_size=20) for url in replica_urls]
        # Build sessionmakers once; constructing one per request is wasteful
        self._write_sessions = sessionmaker(bind=self.primary)
        self._read_sessions = [sessionmaker(bind=r) for r in self.replicas]
    
    @contextmanager
    def write_session(self):
        session = self._write_sessions()
        try:
            yield session
            session.commit()
        except Exception:
            session.rollback()
            raise
        finally:
            session.close()
    
    @contextmanager
    def read_session(self):
        # Naive load balancing: pick a replica at random
        session = random.choice(self._read_sessions)()
        try:
            yield session
        finally:
            session.close()

The limitation is obvious: your primary is a single point of failure for writes. When it dies, you need a promotion process—either manual or automated—that takes time. During that window, writes fail.

Multi-Master (Active-Active) Architecture

Multi-master replication allows writes on any node. Each node replicates changes to others, creating a mesh of data flow.

This pattern shines for geo-distributed deployments. Users in Europe write to European nodes; users in Asia write to Asian nodes. Everyone gets low-latency writes instead of crossing oceans for every INSERT.

MySQL Group Replication provides multi-master capability:

# my.cnf for each node in the group
[mysqld]
server_id=1  # unique per node
gtid_mode=ON
enforce_gtid_consistency=ON
binlog_checksum=NONE

# Group Replication settings
plugin_load_add='group_replication.so'
group_replication_group_name="aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
group_replication_start_on_boot=OFF
group_replication_local_address="node1:33061"
group_replication_group_seeds="node1:33061,node2:33061,node3:33061"
group_replication_single_primary_mode=OFF  # enables multi-master

CockroachDB takes a different approach, building multi-master into its core architecture:

# docker-compose.yml for local CockroachDB cluster
# (after startup, run `cockroach init --insecure --host=localhost:26257` once
# to bootstrap the cluster)
version: '3.8'
services:
  roach1:
    image: cockroachdb/cockroach:latest
    command: start --insecure --join=roach1,roach2,roach3
    ports:
      - "26257:26257"
      - "8080:8080"
  
  roach2:
    image: cockroachdb/cockroach:latest
    command: start --insecure --join=roach1,roach2,roach3
  
  roach3:
    image: cockroachdb/cockroach:latest
    command: start --insecure --join=roach1,roach2,roach3

The complexity cost is real. When two nodes modify the same row simultaneously, you have a conflict. Someone has to lose, and deciding who—and how—is the hard part.

Conflict Resolution Strategies

Conflicts in multi-master systems are inevitable. Your strategy for handling them determines whether your system behaves predictably or creates subtle data corruption.

Last-write-wins (LWW) is the simplest approach: attach a timestamp to each write, and the highest timestamp wins. It’s easy to implement but can silently discard valid updates.

from dataclasses import dataclass
from typing import Any
import time

@dataclass
class VersionedValue:
    value: Any
    timestamp: float
    node_id: str

class LWWRegister:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.state: VersionedValue | None = None
    
    def write(self, value: Any) -> VersionedValue:
        new_state = VersionedValue(
            value=value,
            timestamp=time.time(),
            node_id=self.node_id
        )
        self._merge(new_state)
        return new_state
    
    def merge_remote(self, remote: VersionedValue):
        self._merge(remote)
    
    def _merge(self, incoming: VersionedValue):
        if self.state is None:
            self.state = incoming
        elif incoming.timestamp > self.state.timestamp:
            self.state = incoming
        elif incoming.timestamp == self.state.timestamp:
            # Tie-breaker: higher node_id wins
            if incoming.node_id > self.state.node_id:
                self.state = incoming
    
    def read(self) -> Any:
        return self.state.value if self.state else None
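To see the danger concretely, consider two nodes whose clocks are skewed by two seconds. A hypothetical scenario using plain tuples rather than the register class above:

```python
import time

# Node B makes the *later* edit, but its clock runs two seconds behind
write_a = ("cart=[book]", time.time(), "node-a")
write_b = ("cart=[book, pen]", time.time() - 2.0, "node-b")  # genuinely newer

# LWW merge: highest timestamp wins, node_id breaks ties
winner = max([write_a, write_b], key=lambda w: (w[1], w[2]))
print(winner[0])  # cart=[book] — node B's newer update is silently discarded
```

This is why production LWW systems care so much about clock synchronization, and why some use logical clocks instead of wall-clock time.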

CRDTs (Conflict-free Replicated Data Types) guarantee convergence without coordination. A G-Counter (grow-only counter) is a classic example:

class GCounter:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts: dict[str, int] = {}
    
    def increment(self, amount: int = 1):
        current = self.counts.get(self.node_id, 0)
        self.counts[self.node_id] = current + amount
    
    def merge(self, other: 'GCounter'):
        all_nodes = set(self.counts.keys()) | set(other.counts.keys())
        for node in all_nodes:
            self.counts[node] = max(
                self.counts.get(node, 0),
                other.counts.get(node, 0)
            )
    
    def value(self) -> int:
        return sum(self.counts.values())
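The merge rule is just an element-wise max over per-node counts, which is why merge order never matters. A quick illustration with plain dicts:

```python
# Two replicas' views after independent increments
view_a = {"node-a": 3, "node-b": 1}
view_b = {"node-a": 2, "node-b": 4}

# Element-wise max: commutative, associative, and idempotent
merged = {n: max(view_a.get(n, 0), view_b.get(n, 0))
          for n in view_a.keys() | view_b.keys()}

print(sum(merged.values()))  # 7 — every increment from both nodes is counted
```

Because max is idempotent, replaying the same merge twice changes nothing, so nodes can gossip state freely without deduplication.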

Application-level resolution pushes conflict handling to business logic. Shopping cart merges might union items; document edits might require manual review. This approach is more work but produces semantically correct results.
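A sketch of a shopping-cart merge; the "union the items, keep the larger quantity" rule here is an assumed business decision, not a universal one:

```python
def merge_carts(local: dict[str, int], remote: dict[str, int]) -> dict[str, int]:
    """Union the item sets; for items in both carts, keep the larger quantity."""
    merged = dict(local)
    for item, qty in remote.items():
        merged[item] = max(merged.get(item, 0), qty)
    return merged

# Two replicas diverged; the merge keeps every item from both
print(merge_carts({"book": 1, "pen": 2}, {"pen": 1, "mug": 1}))
# {'book': 1, 'pen': 2, 'mug': 1}
```

Note the trade-off: this rule can resurrect an item a user deliberately removed on one replica, which is exactly the kind of semantic question only the business can answer.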

Replication Lag and Consistency Patterns

Asynchronous replication means replicas lag behind the primary. A user updates their profile, immediately reads it back, and sees stale data. This breaks user expectations and causes support tickets.

Read-your-writes consistency ensures users see their own updates. Implement it by tracking the last write timestamp and routing reads appropriately:

from fastapi import FastAPI, Request, Response
from typing import Optional
import time

app = FastAPI()

class ConsistencyManager:
    def __init__(self, db: ReplicatedDatabase):
        self.db = db
    
    def get_session(self, request: Request, is_write: bool):
        if is_write:
            return self.db.write_session()
        
        # Check if user recently wrote
        last_write = request.cookies.get('last_write_ts')
        if last_write:
            last_write_time = float(last_write)
            # If write was recent, read from primary
            if time.time() - last_write_time < 5.0:  # 5 second window
                return self.db.write_session()  # Use primary for read
        
        return self.db.read_session()
    
    def mark_write(self, response: Response):
        response.set_cookie('last_write_ts', str(time.time()))

# Placeholder URLs; `db` is the ReplicatedDatabase from the earlier example
db = ReplicatedDatabase("postgresql://primary/app", ["postgresql://replica1/app"])
consistency = ConsistencyManager(db)

@app.post("/users/{user_id}")
async def update_user(user_id: int, request: Request, response: Response):
    with consistency.get_session(request, is_write=True) as session:
        # perform update
        pass
    consistency.mark_write(response)
    return {"status": "updated"}

@app.get("/users/{user_id}")
async def get_user(user_id: int, request: Request):
    with consistency.get_session(request, is_write=False) as session:
        # perform read
        pass
    return {"status": "ok"}
An alternative approach uses version tokens instead of timestamps:

@app.post("/users/{user_id}")
async def update_user(user_id: int):
    with db.write_session() as session:
        result = session.execute(
            "UPDATE users SET ... RETURNING pg_current_wal_lsn()"
        )
        lsn = result.scalar()
    return {"status": "updated", "version": str(lsn)}

@app.get("/users/{user_id}")
async def get_user(user_id: int, min_version: Optional[str] = None):
    if min_version:
        # Serve from a replica only if it has replayed past the client's version.
        # LSNs must be compared as pg_lsn values, not as strings.
        with db.read_session() as session:
            caught_up = session.execute(
                "SELECT pg_last_wal_replay_lsn() >= CAST(:v AS pg_lsn)",
                {"v": min_version}
            ).scalar()
            if not caught_up:
                # Fall back to primary
                pass
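One caveat with LSN version tokens: they are strings like '0/3000060' and must never be compared lexically ('9/0' sorts above '10/0' as text, but is far smaller as an LSN). A client-side helper, assuming the standard two-part hex format:

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN ('hi/lo' in hex) to a comparable integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

# Lexical comparison gets this wrong; numeric comparison does not
assert "9/0" > "10/0"                        # string order: wrong answer
assert parse_lsn("9/0") < parse_lsn("10/0")  # numeric order: correct
```

Alternatively, push the comparison to the server by casting both sides to pg_lsn, which understands the format natively.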

Choosing the Right Pattern

Factor                     Master-Slave              Multi-Master
Read/write ratio           High (10:1+)              Any
Write availability         Tolerates brief outages   Critical
Geographic distribution    Single region             Multi-region
Consistency requirements   Strong preferred          Eventual acceptable
Operational complexity     Lower                     Higher
Conflict handling          None needed               Required

Choose master-slave when you need simplicity and can tolerate a promotion window during primary failures. Most applications fit this category.

Choose multi-master when writes must succeed in multiple regions simultaneously or when you cannot accept any write downtime. Be prepared to invest in conflict resolution.

Practical Implementation Considerations

Monitor replication lag continuously. Here’s a health check endpoint:

@app.get("/health/replication")
async def replication_health():
    lag_seconds = get_replica_lag()  # Query pg_stat_replication
    status = "healthy" if lag_seconds < 5 else "degraded"
    
    return {
        "status": status,
        "lag_seconds": lag_seconds,
        "threshold_seconds": 5
    }
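One way to implement `get_replica_lag`, assuming a DB-API cursor connected to the primary and PostgreSQL 10+ (where pg_stat_replication exposes a `replay_lag` column):

```python
def get_replica_lag(cur) -> float:
    """Worst-case replay lag in seconds across all connected standbys."""
    cur.execute(
        "SELECT COALESCE(MAX(EXTRACT(EPOCH FROM replay_lag)), 0) "
        "FROM pg_stat_replication"
    )
    return float(cur.fetchone()[0])
```

`replay_lag` is NULL on an idle system, hence the COALESCE; alert on sustained non-zero lag, not brief spikes.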

A basic failover script outline for PostgreSQL:

#!/bin/bash
# Simplified failover - production needs more checks
set -euo pipefail

REPLICA_HOST="$1"

# Promote replica to primary
ssh "$REPLICA_HOST" "sudo -u postgres pg_ctl promote -D /var/lib/postgresql/data"

# Update connection strings (via config management, DNS, or load balancer)
update_primary_endpoint "$REPLICA_HOST"

# Notify on-call
send_alert "Primary failover completed to $REPLICA_HOST"

Test failovers regularly. A disaster recovery plan that’s never been executed is a hypothesis, not a plan. Schedule quarterly chaos exercises where you actually kill the primary and verify recovery works.

Common pitfalls: assuming replication lag won’t affect your application (it will), not monitoring replica health until an outage, and underestimating the operational burden of multi-master conflict resolution. Start simple with master-slave, and add complexity only when you’ve proven you need it.
