System Design: CAP Theorem and Trade-offs
Key Insights
- CAP theorem forces a choice between consistency and availability only during network partitions—understanding when partitions occur helps you make informed trade-offs rather than blanket decisions.
- Most modern distributed databases offer tunable consistency, letting you adjust the CAP trade-off per operation rather than being locked into a single configuration.
- The PACELC theorem extends CAP by acknowledging that even during normal operation, you’re trading latency for consistency—a trade-off that affects every request, not just failure scenarios.
Introduction to CAP Theorem
In 2000, Eric Brewer presented a conjecture at the ACM Symposium on Principles of Distributed Computing that would fundamentally shape how we think about distributed systems. Two years later, Seth Gilbert and Nancy Lynch formally proved it. The CAP theorem states that a distributed data store can provide at most two of three guarantees: Consistency, Availability, and Partition Tolerance.
Here’s the critical insight most developers miss: this isn’t about choosing two properties for your system design. It’s about what happens during a network partition. Since network partitions are inevitable in distributed systems, you’re really choosing between consistency and availability when things go wrong.
Understanding CAP isn’t academic—it directly impacts how you design systems, choose databases, and configure replication. Get it wrong, and you’ll either lose data or serve stale results when your system is under stress.
Understanding Each Property
Consistency means every read receives the most recent write or an error. All nodes in the system see the same data at the same time. Think of it like a bank account: if you deposit $100, every ATM should immediately show your updated balance.
Availability means every request receives a non-error response, without guarantee that it contains the most recent write. The system always responds, even if the data might be slightly stale. Think of a social media feed—you’d rather see posts from five minutes ago than get an error page.
Partition Tolerance means the system continues operating despite network failures between nodes. Messages between nodes can be dropped or delayed. In any real distributed system, partitions will happen—network switches fail, cables get cut, cloud regions lose connectivity.
Here’s a simple distributed key-value store that illustrates these concepts:
class NetworkTimeout(Exception):
    """Raised when a peer does not respond in time."""

class UnavailableError(Exception):
    """Raised when a quorum cannot be reached."""

class DistributedKVStore:
    def __init__(self, node_id, peers, consistency_mode="CP"):
        self.node_id = node_id
        self.peers = peers  # List of peer node connections
        self.data = {}
        self.consistency_mode = consistency_mode

    def write(self, key, value):
        if self.consistency_mode == "CP":
            # Wait for majority acknowledgment before confirming
            acks = 1  # Count self
            for peer in self.peers:
                try:
                    peer.replicate(key, value, timeout=5)
                    acks += 1
                except NetworkTimeout:
                    continue
            if acks < (len(self.peers) + 1) // 2 + 1:
                raise UnavailableError("Cannot reach quorum")
            self.data[key] = value
            return "OK"
        else:  # AP mode
            # Write locally, replicate asynchronously
            self.data[key] = value
            for peer in self.peers:
                self._async_replicate(peer, key, value)
            return "OK"  # Always responds

    def read(self, key):
        if self.consistency_mode == "CP":
            # Read from quorum to ensure consistency
            values = [self.data.get(key)]
            for peer in self.peers:
                try:
                    values.append(peer.get(key, timeout=5))
                except NetworkTimeout:
                    continue
            if len(values) < (len(self.peers) + 1) // 2 + 1:
                raise UnavailableError("Cannot reach quorum")
            return self._resolve_latest(values)
        else:  # AP mode
            # Return local value immediately
            return self.data.get(key)

    def _async_replicate(self, peer, key, value):
        # Fire-and-forget; a real system would use a background queue
        try:
            peer.replicate(key, value, timeout=0)
        except NetworkTimeout:
            pass

    def _resolve_latest(self, values):
        # Placeholder: a real system would compare versions or timestamps
        return next((v for v in values if v is not None), None)
The Three Trade-off Combinations
CP Systems: Consistency Over Availability
CP systems refuse to respond rather than return potentially stale data. When a partition occurs, nodes that can’t confirm they have the latest data will return errors.
Examples: ZooKeeper, etcd, Consul, HBase
Use when: Correctness matters more than uptime. Configuration management, leader election, distributed locks.
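The quorum arithmetic behind CP systems is worth making explicit. A minimal sketch (function names are mine):

```python
def majority_quorum(n_nodes: int) -> int:
    """Smallest group size that guarantees any two quorums overlap."""
    return n_nodes // 2 + 1

def tolerated_failures(n_nodes: int) -> int:
    """Nodes that can fail while a majority quorum is still reachable."""
    return n_nodes - majority_quorum(n_nodes)

for n in (3, 5, 7):
    print(n, majority_quorum(n), tolerated_failures(n))
# A 5-node cluster needs 3 acks and keeps working through 2 failures.
```

This is why CP clusters almost always have an odd number of nodes: going from 3 to 4 nodes raises the quorum size without tolerating any additional failures.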
AP Systems: Availability Over Consistency
AP systems always respond, accepting that different nodes might temporarily have different views of the data. They rely on eventual consistency—given enough time without new updates, all nodes will converge.
Examples: Cassandra, DynamoDB, CouchDB, Riak
Use when: Uptime matters more than immediate consistency. User sessions, product catalogs, social media feeds.
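Eventual consistency can be sketched as a toy anti-entropy loop: each replica accepts writes locally and periodically exchanges state with a peer, converging once updates stop. This is a simplified model of my own (timestamps stand in for a real conflict-resolution scheme):

```python
class APNode:
    """Toy AP replica: accepts writes locally, syncs with peers later."""
    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, ts):
        self.data[key] = (ts, value)  # always succeeds locally

    def sync_with(self, other):
        # Last-writer-wins anti-entropy: the newer timestamp replaces the older
        for key in set(self.data) | set(other.data):
            a = self.data.get(key, (-1, None))
            b = other.data.get(key, (-1, None))
            winner = max(a, b)
            self.data[key] = winner
            other.data[key] = winner

n1, n2 = APNode(), APNode()
n1.write("cart", ["book"], ts=1)         # both nodes accept writes
n2.write("cart", ["book", "pen"], ts=2)  # during a partition
n1.sync_with(n2)                         # partition heals
assert n1.data == n2.data                # replicas have converged
```

Real AP stores replace the timestamps with vector clocks or CRDTs, but the shape is the same: accept now, reconcile later.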
CA Systems: The Myth
You’ll sometimes see CA listed as an option. It’s not. A CA system would be consistent and available but not partition tolerant—meaning it would simply stop working during network issues. That’s not a distributed system; that’s a single-node database. The moment you add a second node, you must handle partitions.
Here’s how Redis behaves differently based on cluster configuration:
import redis

# CP-style configuration: Redis Cluster with WAIT
def cp_write(client, key, value):
    """
    Blocks until the write is replicated to the specified number of replicas.
    May time out and fail if replicas are unreachable.
    """
    pipe = client.pipeline()
    pipe.set(key, value)
    pipe.wait(2, 5000)  # WAIT for 2 replicas, 5s (5000ms) timeout
    results = pipe.execute()
    replicas_confirmed = results[1]
    if replicas_confirmed < 2:
        raise Exception(f"Only {replicas_confirmed} replicas confirmed")
    return True

# AP-style configuration: Fire and forget
def ap_write(client, key, value):
    """
    Returns immediately after the primary acknowledges.
    Replication happens asynchronously.
    """
    client.set(key, value)
    return True  # Always succeeds if the primary is up

# Reading with different consistency requirements
def cp_read(client, key):
    """
    Read from the primary only to ensure the latest value.
    Fails if the primary is unreachable.
    """
    return client.get(key)  # Redis reads from the primary by default

def ap_read(client, key):
    """
    Allow reading from replicas for higher availability.
    May return stale data.
    """
    client.readonly()  # Enable replica reads (cluster mode)
    return client.get(key)
Real-World System Analysis
MongoDB
MongoDB's classic w: 1 acknowledgment gives AP-leaning behavior (note that since version 5.0 the default write concern is "majority"), and you can push it toward either side through write concerns:
// AP configuration: Fast, potentially inconsistent
db.orders.insertOne(
{ orderId: "12345", status: "pending" },
{ writeConcern: { w: 1 } } // Acknowledge after primary write
)
// CP configuration: Slower, strongly consistent
db.orders.insertOne(
{ orderId: "12345", status: "pending" },
{ writeConcern: { w: "majority", j: true } } // Wait for majority + journal
)
// Read concern affects consistency too
db.orders.find({ orderId: "12345" }).readConcern("majority")
// Only returns data acknowledged by majority of nodes
PostgreSQL with Streaming Replication
PostgreSQL offers synchronous replication for CP semantics:
-- On primary: configure synchronous replication
ALTER SYSTEM SET synchronous_commit = 'on';
ALTER SYSTEM SET synchronous_standby_names = 'replica1, replica2';
SELECT pg_reload_conf();  -- Apply the changed settings

-- Transactions wait for replica acknowledgment
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
COMMIT;  -- Blocks until a replica confirms
Apache Kafka
Kafka lets you tune consistency per producer:
// CP configuration
Properties cpProps = new Properties();
cpProps.put("acks", "all");  // Wait for all in-sync replicas
cpProps.put("retries", Integer.MAX_VALUE);
// Note: min.insync.replicas=2 is a broker/topic-level setting, not a
// producer setting; combined with acks=all it enforces the write quorum.

// AP configuration
Properties apProps = new Properties();
apProps.put("acks", "1");  // Only wait for the leader
apProps.put("retries", 0);
Beyond CAP: PACELC Theorem
CAP only describes behavior during partitions. But what about normal operation? The PACELC theorem extends CAP: if there’s a Partition, choose Availability or Consistency; Else, choose Latency or Consistency.
This matters because even when your network is healthy, you’re constantly trading latency for consistency. A synchronous write to three replicas takes longer than a write to one. A quorum read across data centers adds latency.
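The latency side of PACELC can be made concrete: given per-replica round-trip times, write latency is set by the slowest replica you choose to wait for. A sketch with illustrative numbers:

```python
def write_latency(replica_rtts_ms, acks_needed):
    """Latency is the ack from the k-th fastest replica, k = acks_needed."""
    return sorted(replica_rtts_ms)[acks_needed - 1]

rtts = [2, 5, 120]  # e.g. two local replicas and one in a remote region

print(write_latency(rtts, 1))  # fastest replica ack:  2 ms
print(write_latency(rtts, 2))  # majority of three:    5 ms
print(write_latency(rtts, 3))  # all replicas:       120 ms
```

Waiting for a majority shields you from one slow replica, but waiting for all replicas means every write pays for the slowest link, which is the latency cost the ELC half of PACELC describes.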
Consider this breakdown:
| System | During Partition (PAC) | Normal Operation (ELC) |
|---|---|---|
| DynamoDB | AP | EL (low latency, eventual consistency) |
| MongoDB (w: 1) | AP | EL |
| MongoDB (w:majority) | CP | EC |
| Spanner | CP | EC (pays latency cost) |
| Cassandra | AP | EL (tunable) |
Making Trade-off Decisions
Here’s my decision framework:
Choose CP when:
- Financial transactions where inconsistency means lost money
- Inventory systems where overselling is unacceptable
- Distributed locks and leader election
- Configuration that affects system behavior
Choose AP when:
- User-facing features where errors hurt engagement
- Caching layers where stale data is acceptable
- Analytics and logging where some loss is tolerable
- Shopping carts (with conflict resolution)
For AP systems, you need a conflict resolution strategy:
class ShoppingCart:
    """
    AP shopping cart with vector clock conflict resolution.
    Multiple nodes can accept concurrent updates.
    """
    def __init__(self):
        self.items = {}
        self.vector_clock = {}  # {node_id: counter}

    def add_item(self, node_id, item_id, quantity):
        # Increment this node's clock
        self.vector_clock[node_id] = self.vector_clock.get(node_id, 0) + 1
        # Add item with clock timestamp
        self.items[item_id] = {
            "quantity": quantity,
            "clock": self.vector_clock.copy()
        }

    def merge(self, other_cart):
        """
        Merge concurrent updates from another node.
        Uses vector clocks to detect and resolve conflicts.
        """
        for item_id, other_item in other_cart.items.items():
            if item_id not in self.items:
                self.items[item_id] = other_item
            else:
                my_item = self.items[item_id]
                if self._clock_greater(other_item["clock"], my_item["clock"]):
                    # Other is newer, take it
                    self.items[item_id] = other_item
                elif self._clock_concurrent(other_item["clock"], my_item["clock"]):
                    # Concurrent updates: merge by taking max quantity
                    self.items[item_id] = {
                        "quantity": max(my_item["quantity"], other_item["quantity"]),
                        "clock": self._merge_clocks(my_item["clock"], other_item["clock"])
                    }
                # else: my item is newer, keep it

    def _clock_greater(self, a, b):
        # a strictly dominates b: every counter in a >= its counter in b, a != b
        keys = set(a) | set(b)
        return a != b and all(a.get(k, 0) >= b.get(k, 0) for k in keys)

    def _clock_concurrent(self, a, b):
        # Neither clock dominates: the updates happened independently
        return a != b and not self._clock_greater(a, b) and not self._clock_greater(b, a)

    def _merge_clocks(self, a, b):
        # Element-wise maximum of the two clocks
        return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}
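To see the clock comparisons in isolation: a vector clock dominates another when it has seen every update the other has, and two clocks are concurrent when neither dominates. A standalone sketch of that logic (function names are mine):

```python
def clock_greater(a, b):
    """True if vector clock a strictly dominates b."""
    keys = set(a) | set(b)
    return a != b and all(a.get(k, 0) >= b.get(k, 0) for k in keys)

def clock_concurrent(a, b):
    """True if neither clock dominates: a genuine conflict."""
    return a != b and not clock_greater(a, b) and not clock_greater(b, a)

# Node A and node B each incremented only their own counter: conflict.
assert clock_concurrent({"A": 1}, {"B": 1})

# {"A": 2, "B": 1} has seen everything in {"A": 1, "B": 1}: it is newer.
assert clock_greater({"A": 2, "B": 1}, {"A": 1, "B": 1})
```

Concurrent updates are exactly the case where the cart must apply a domain-specific rule, such as taking the larger quantity above; the clocks only tell you that a conflict exists, not how to resolve it.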
Conclusion
CAP theorem isn’t about permanently sacrificing one property—it’s about understanding what happens during the inevitable network partition. Most modern databases give you knobs to tune this trade-off per operation, letting you choose strong consistency for critical writes and high availability for reads.
When designing systems, ask these questions: What’s the cost of serving stale data? What’s the cost of being unavailable? How often do partitions actually occur in your infrastructure? Your answers should drive your database choice and configuration, not theoretical purity.
Remember that PACELC extends this thinking to normal operation. Even when everything is working, you’re trading latency for consistency on every request. Design for your actual access patterns and failure modes, not worst-case scenarios you’ll never encounter.