# System Design: Idempotency in Distributed Systems
Idempotency means that performing an operation multiple times produces the same result as performing it once. In distributed systems, this property isn't a nice-to-have—it's essential for correctness.
## Key Insights
- Idempotency keys combined with atomic storage operations are the foundation for preventing duplicate processing in distributed systems—without them, network retries will eventually cause data corruption.
- The hardest part isn’t storing idempotency state; it’s handling concurrent duplicate requests that arrive within milliseconds of each other, requiring distributed locking patterns.
- Idempotency must be designed into your operations from the start—retrofitting it onto non-idempotent workflows requires state machines and careful consideration of partial failure scenarios.
## Why Idempotency Matters
Consider what happens when a payment API times out. Did the payment succeed? The client doesn’t know. The rational response is to retry, but without idempotency, that retry might charge the customer twice. I’ve seen this happen in production systems, and the resulting customer support tickets and refund processes are painful.
The fundamental problem is that networks are unreliable. TCP connections drop. Load balancers time out. Services restart mid-request. When an acknowledgment is lost, the sender cannot know whether the operation actually ran, which is why true “exactly-once” delivery is impossible in practice. What we can achieve is “at-least-once” delivery with idempotent processing—which gives us the same practical outcome.
## Idempotency Keys: The Foundation
An idempotency key is a unique identifier that clients attach to requests, allowing servers to recognize and deduplicate retries. The key should be unique per logical operation, not per HTTP request.
There are two approaches to generating these keys:
Client-generated keys (typically UUIDs) give clients full control and work well for user-initiated actions. The client generates a key when the user clicks “Submit” and reuses it for all retries of that specific action.
Deterministic keys are computed from request content using hashing. This works when the request payload itself defines uniqueness—for example, hashing the combination of user ID, amount, and merchant ID for a payment.
Here’s a basic API endpoint that validates idempotency keys:
```python
from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel
from typing import Optional
import hashlib

app = FastAPI()

class PaymentRequest(BaseModel):
    user_id: str
    amount: int
    currency: str
    merchant_id: str

def generate_deterministic_key(request: PaymentRequest) -> str:
    """Generate idempotency key from request content."""
    content = f"{request.user_id}:{request.amount}:{request.currency}:{request.merchant_id}"
    return hashlib.sha256(content.encode()).hexdigest()[:32]

@app.post("/payments")
async def create_payment(
    request: PaymentRequest,
    idempotency_key: Optional[str] = Header(None, alias="Idempotency-Key")
):
    if not idempotency_key:
        # Fall back to deterministic key generation
        idempotency_key = generate_deterministic_key(request)
    if len(idempotency_key) < 16 or len(idempotency_key) > 64:
        raise HTTPException(
            status_code=400,
            detail="Idempotency-Key must be between 16 and 64 characters"
        )
    # Process with idempotency key...
    return {"idempotency_key": idempotency_key, "status": "processing"}
```
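On the client side, the key must be generated once per logical action and then reused for every retry of that action. A minimal sketch of that retry discipline (the `send` callable and its timeout behavior are illustrative assumptions, not part of the API above):

```python
import uuid

def submit_with_retries(send, payload, max_attempts=3):
    """Generate ONE idempotency key per logical action and reuse it
    for every retry, so the server can deduplicate."""
    key = str(uuid.uuid4())  # generated once, when the user clicks "Submit"
    for _ in range(max_attempts):
        try:
            return send(payload, headers={"Idempotency-Key": key})
        except TimeoutError:
            continue  # retry with the SAME key
    raise RuntimeError("all attempts timed out")

# Simulated transport (assumption for illustration): times out once,
# then succeeds, recording every key it saw.
seen_keys = []
attempts = {"n": 0}

def fake_send(payload, headers):
    seen_keys.append(headers["Idempotency-Key"])
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise TimeoutError
    return {"status": "ok"}

result = submit_with_retries(fake_send, {"amount": 100})
```

The server sees two requests carrying the same key, so the retry is recognized as a duplicate rather than a second charge.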
## Storage Strategies for Idempotency State
You need to store idempotency state somewhere. The choice depends on your durability requirements and scale.
In-memory storage is fast but doesn’t survive restarts and doesn’t work across multiple server instances. Only use this for development or single-instance deployments.
Redis is the sweet spot for most applications. It’s fast, supports atomic operations, handles TTL expiration automatically, and is shared across all your application instances.
Database storage provides durability and transactional guarantees. Use this when idempotency state must survive Redis failures or when you need to query historical idempotency data.
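With a database, a common pattern is to put a UNIQUE constraint on the key so the insert itself is the atomic claim. A sketch using SQLite (table name and columns are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE idempotency_records (
        idempotency_key TEXT PRIMARY KEY,  -- uniqueness IS the claim
        status TEXT NOT NULL,
        response TEXT
    )
""")

def claim(key: str) -> bool:
    """True if this request claimed the key; False if a duplicate already did."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO idempotency_records (idempotency_key, status) "
        "VALUES (?, 'processing')",
        (key,),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows inserted means the key already existed

first = claim("payment-123")   # claims the key
second = claim("payment-123")  # duplicate; the insert is ignored
```

Because the constraint is enforced by the database, this stays correct even when two application instances race on the same key.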
Here’s a Redis-based middleware implementation:
```python
import redis
import json
import time
from typing import Any

class IdempotencyMiddleware:
    def __init__(self, redis_client: redis.Redis, ttl_seconds: int = 86400):
        self.redis = redis_client
        self.ttl = ttl_seconds

    def _key(self, idempotency_key: str) -> str:
        return f"idempotency:{idempotency_key}"

    def check_and_set(self, idempotency_key: str) -> tuple[bool, Any]:
        """
        Returns (is_duplicate, cached_response).
        Uses atomic operations to prevent race conditions.
        """
        key = self._key(idempotency_key)
        # Try to get existing response
        cached = self.redis.get(key)
        if cached:
            data = json.loads(cached)
            if data.get("status") == "completed":
                return True, data.get("response")
            elif data.get("status") == "processing":
                # Request is in flight - caller should wait or retry
                return True, None
        # Atomically set "processing" status
        # NX = only set if not exists
        processing_data = json.dumps({
            "status": "processing",
            "started_at": time.time()
        })
        was_set = self.redis.set(key, processing_data, nx=True, ex=self.ttl)
        if not was_set:
            # Another request beat us - recheck
            return self.check_and_set(idempotency_key)
        return False, None

    def store_response(self, idempotency_key: str, response: Any):
        """Store the completed response."""
        key = self._key(idempotency_key)
        completed_data = json.dumps({
            "status": "completed",
            "response": response,
            "completed_at": time.time()
        })
        self.redis.set(key, completed_data, ex=self.ttl)
```
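The correctness of `check_and_set` hinges on `SET NX` being a single atomic claim. A stripped-down, in-memory analogue makes the state transitions visible (a plain dict stands in for Redis here; that substitution is an assumption for illustration only):

```python
import json

store = {}  # stands in for Redis

def set_nx(key, value):
    """Atomic 'set if not exists', like Redis SET ... NX."""
    if key in store:
        return False
    store[key] = value
    return True

def check_and_set(key):
    cached = store.get(key)
    if cached:
        data = json.loads(cached)
        if data["status"] == "completed":
            return True, data["response"]
        return True, None  # request still in flight
    if not set_nx(key, json.dumps({"status": "processing"})):
        return check_and_set(key)  # lost the race; re-read the state
    return False, None

first = check_and_set("k1")    # claims the key
second = check_and_set("k1")   # duplicate while in flight
store["k1"] = json.dumps({"status": "completed", "response": {"ok": True}})
third = check_and_set("k1")    # returns the cached response
```

The three return values trace the lifecycle: claim, in-flight duplicate, then cached replay after completion.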
## Handling Concurrent Duplicate Requests
The trickiest scenario is when two identical requests arrive within milliseconds of each other—before either has completed processing. Without proper handling, both might execute the operation.
Distributed locking solves this. The first request acquires a lock, processes the operation, and stores the result. Concurrent requests either wait for the lock or return immediately with a “request in progress” response.
```python
import redis
import time
from contextlib import contextmanager
from typing import Any, Callable

class DistributedLock:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    @contextmanager
    def acquire(
        self,
        lock_key: str,
        timeout_seconds: int = 30,
        retry_interval: float = 0.1,
        max_retries: int = 50
    ):
        """
        Acquire a single-key Redis lock with automatic expiration.
        Uses SET NX plus a token-checked release for safety.
        """
        lock_value = f"{time.time()}:{id(self)}"
        full_key = f"lock:{lock_key}"
        acquired = False
        for _ in range(max_retries):
            # SET NX with expiration - atomic acquire
            acquired = self.redis.set(
                full_key,
                lock_value,
                nx=True,
                ex=timeout_seconds
            )
            if acquired:
                break
            time.sleep(retry_interval)
        if not acquired:
            raise TimeoutError(f"Could not acquire lock for {lock_key}")
        try:
            yield
        finally:
            # Only release if we still own the lock
            # Use Lua script for atomic check-and-delete
            release_script = """
            if redis.call("get", KEYS[1]) == ARGV[1] then
                return redis.call("del", KEYS[1])
            else
                return 0
            end
            """
            self.redis.eval(release_script, 1, full_key, lock_value)

class IdempotentProcessor:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client
        self.lock = DistributedLock(redis_client)
        self.middleware = IdempotencyMiddleware(redis_client)

    def process(self, idempotency_key: str, operation: Callable) -> Any:
        # First, quick check without locking
        is_duplicate, cached = self.middleware.check_and_set(idempotency_key)
        if is_duplicate and cached:
            return cached
        # Acquire lock for processing
        with self.lock.acquire(idempotency_key):
            # Double-check after acquiring lock
            is_duplicate, cached = self.middleware.check_and_set(idempotency_key)
            if is_duplicate and cached:
                return cached
            # Execute the actual operation
            result = operation()
            # Store result
            self.middleware.store_response(idempotency_key, result)
            return result
```
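The same first-check / lock / double-check shape can be exercised in a single process, with a `threading.Lock` standing in for the Redis lock. This is an in-process analogue for demonstration, not the distributed version:

```python
import threading

class InMemoryIdempotentProcessor:
    """Single-process analogue of the check/lock/double-check pattern.
    (Assumption: a dict and threading.Lock stand in for Redis state
    and the distributed lock, to make the pattern testable locally.)"""
    def __init__(self):
        self._results = {}
        self._lock = threading.Lock()

    def process(self, key, operation):
        if key in self._results:          # fast path, no lock
            return self._results[key]
        with self._lock:
            if key in self._results:      # double-check after acquiring lock
                return self._results[key]
            result = operation()          # runs at most once per key
            self._results[key] = result
            return result

count = 0
def charge():
    global count
    count += 1
    return {"charged": True}

p = InMemoryIdempotentProcessor()
threads = [threading.Thread(target=p.process, args=("k1", charge)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# ten concurrent duplicates, but charge() executed exactly once
```

The double-check after acquiring the lock is the essential step: without it, two requests that both miss the fast path would each execute the operation once they take turns holding the lock.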
## Designing Idempotent Operations
Some operations are naturally idempotent. HTTP PUT (replace resource) and DELETE (remove resource) can be called multiple times safely. POST (create resource) is not—calling it twice creates two resources.
For non-idempotent operations, use state machines with conditional updates:
```python
from enum import Enum
from dataclasses import dataclass
from typing import Optional
import uuid

class PaymentFailedError(Exception):
    def __init__(self, payment):
        self.payment = payment

class PaymentGatewayError(Exception):
    pass

class PaymentStatus(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"

@dataclass
class Payment:
    id: str
    idempotency_key: str
    amount: int
    status: PaymentStatus
    external_transaction_id: Optional[str] = None

class PaymentService:
    def __init__(self, db, payment_gateway):
        self.db = db
        self.gateway = payment_gateway

    def process_payment(self, idempotency_key: str, amount: int) -> Payment:
        # Check for existing payment with this idempotency key
        existing = self.db.find_payment_by_idempotency_key(idempotency_key)
        if existing:
            if existing.status == PaymentStatus.COMPLETED:
                return existing  # Already done, return cached result
            elif existing.status == PaymentStatus.FAILED:
                # Could allow retry of failed payments
                raise PaymentFailedError(existing)
            # Status is PENDING or PROCESSING - continue below
            payment = existing
        else:
            # Create new payment record
            payment = Payment(
                id=str(uuid.uuid4()),
                idempotency_key=idempotency_key,
                amount=amount,
                status=PaymentStatus.PENDING
            )
            self.db.insert_payment(payment)
        # Atomic status transition: PENDING -> PROCESSING
        updated = self.db.update_payment_status(
            payment_id=payment.id,
            expected_status=PaymentStatus.PENDING,
            new_status=PaymentStatus.PROCESSING
        )
        if not updated:
            # Another process is handling this - fetch and return
            return self.db.get_payment(payment.id)
        try:
            # Call external payment gateway
            result = self.gateway.charge(amount, reference=payment.id)
            # Atomic transition: PROCESSING -> COMPLETED
            self.db.update_payment_completed(
                payment_id=payment.id,
                external_transaction_id=result.transaction_id
            )
            payment.status = PaymentStatus.COMPLETED
            payment.external_transaction_id = result.transaction_id
        except PaymentGatewayError:
            self.db.update_payment_status(
                payment_id=payment.id,
                expected_status=PaymentStatus.PROCESSING,
                new_status=PaymentStatus.FAILED
            )
            raise
        return payment
```
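A conditional-update method like `update_payment_status` is typically implemented as a compare-and-swap: the expected state goes into the WHERE clause, and the row count tells you whether you won the transition. A SQLite sketch (the schema here is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO payments VALUES ('p1', 'pending')")
conn.commit()

def transition(payment_id: str, expected: str, new: str) -> bool:
    """Atomic status transition: succeeds only if the row is still in the
    expected state (compare-and-swap via the WHERE clause)."""
    cur = conn.execute(
        "UPDATE payments SET status = ? WHERE id = ? AND status = ?",
        (new, payment_id, expected),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows updated means someone else transitioned first

won = transition("p1", "pending", "processing")   # first caller wins
lost = transition("p1", "pending", "processing")  # no matching row remains
```

Because the state check and the write happen in one statement, two processes can never both move the same payment from PENDING to PROCESSING.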
## Idempotency Across Service Boundaries
In microservice architectures, a single user request might trigger calls to multiple downstream services. Each service needs to handle idempotency, and you need to propagate idempotency keys through the chain.
```python
import httpx
from typing import Any, Dict

class ServiceClient:
    def __init__(self, base_url: str):
        self.base_url = base_url
        self.client = httpx.Client(timeout=30.0)

    def call_with_idempotency(
        self,
        method: str,
        path: str,
        parent_idempotency_key: str,
        operation_name: str,
        **kwargs
    ) -> Dict[Any, Any]:
        """
        Propagate idempotency through service chain.
        Derives child key from parent to maintain request lineage.
        """
        # Derive a unique key for this specific downstream call
        child_key = f"{parent_idempotency_key}:{operation_name}"
        headers = kwargs.pop("headers", {})
        headers["Idempotency-Key"] = child_key
        headers["X-Parent-Idempotency-Key"] = parent_idempotency_key
        response = self.client.request(
            method,
            f"{self.base_url}{path}",
            headers=headers,
            **kwargs
        )
        response.raise_for_status()
        return response.json()

class OrderService:
    def __init__(self, payment_client: ServiceClient, inventory_client: ServiceClient):
        self.payments = payment_client
        self.inventory = inventory_client

    def create_order(self, idempotency_key: str, order_data: dict) -> dict:
        # Reserve inventory - idempotent with derived key
        inventory_result = self.inventory.call_with_idempotency(
            "POST",
            "/reservations",
            parent_idempotency_key=idempotency_key,
            operation_name="reserve_inventory",
            json={"items": order_data["items"]}
        )
        # Process payment - idempotent with derived key
        payment_result = self.payments.call_with_idempotency(
            "POST",
            "/charges",
            parent_idempotency_key=idempotency_key,
            operation_name="charge_payment",
            json={"amount": order_data["total"]}
        )
        return {
            "order_id": idempotency_key,
            "inventory": inventory_result,
            "payment": payment_result
        }
```
For message queues, most platforms provide built-in deduplication. SQS content-based deduplication uses message body hashing. Kafka supports idempotent producers. Use these features—don’t reinvent them.
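Content-based deduplication boils down to hashing the message body into a dedup ID, in the spirit of what SQS FIFO queues do server-side. A minimal consumer-side sketch of the same idea (the `consume` function and its return values are illustrative):

```python
import hashlib

def dedup_id(body: str) -> str:
    # Identical bodies collapse to the same ID, so redeliveries are detectable
    return hashlib.sha256(body.encode()).hexdigest()

seen: set[str] = set()

def consume(body: str) -> str:
    d = dedup_id(body)
    if d in seen:
        return "duplicate"  # redelivery of a message we already processed
    seen.add(d)
    return "processed"

first = consume('{"order": 42}')
second = consume('{"order": 42}')  # same body redelivered
```

Note the limitation this inherits: two genuinely distinct messages with identical bodies also collapse, which is why explicit deduplication IDs are preferable when payloads can legitimately repeat.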
## Production Considerations
Key expiration: Set TTLs based on your retry window. 24 hours is reasonable for most APIs. Too short, and legitimate retries fail. Too long, and you waste storage.
Monitoring: Track duplicate detection rates. A sudden spike might indicate client bugs, network issues, or an attack. Alert on anomalies.
Testing: Use chaos engineering to verify idempotency works under failure. Kill services mid-request. Introduce network partitions. Replay requests from logs.
Common pitfalls: Non-deterministic responses break idempotency if clients expect identical responses. Timestamps, random IDs in responses, or different error messages for the same cached request will confuse clients. Partial failures are harder—if step 2 of 3 fails, ensure retries don’t re-execute step 1.
Idempotency isn’t optional in distributed systems. Build it in from the start, test it aggressively, and monitor it in production. Your future self—and your customers—will thank you.