System Design: Saga Pattern for Distributed Transactions

Key Insights

  • Sagas replace distributed ACID transactions with a sequence of local transactions and compensating actions, trading strong consistency for availability and partition tolerance in microservices architectures.
  • Choose choreography for simple, loosely-coupled flows and orchestration when you need explicit control, complex branching logic, or better observability into transaction state.
  • Compensating transactions must be idempotent and semantically correct—you’re not rolling back state, you’re applying a new transaction that reverses the business effect.

The Distributed Transaction Problem

When you split a monolith into microservices, you inherit a fundamental problem: transactions that once lived in a single database now span multiple services with their own data stores. The classic solution—two-phase commit (2PC)—doesn’t work well in this world.

2PC requires a coordinator that locks resources across all participants, waits for everyone to vote “ready,” then commits. This creates several problems: it’s synchronous and slow, any participant failure blocks the entire transaction, and it doesn’t tolerate network partitions. In a microservices environment where services are deployed independently and networks are unreliable, 2PC becomes an availability bottleneck.
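
To make the blocking behaviour concrete, here is a minimal in-process sketch of the two phases. The `Participant` class and `two_phase_commit` function are hypothetical names for illustration only; a real coordinator would also need durable logs, timeouts, and recovery, all of which are left out here.

```python
from typing import List


class Participant:
    """Hypothetical resource manager that can vote on a transaction."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self) -> bool:
        # Phase 1: vote. A "yes" vote means resources stay locked
        # until the coordinator sends commit or abort.
        if self.can_commit:
            self.state = "prepared"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: List[Participant]) -> bool:
    # Phase 1: the coordinator waits for every vote before it can decide.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if all voted yes, otherwise abort everyone.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

Note that a single "no" vote (or, in a real system, a single unreachable participant) decides the outcome for everyone, which is the coupling the rest of this article tries to avoid.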

The CAP theorem forces a choice. Most distributed systems choose availability and partition tolerance over strong consistency. This means accepting eventual consistency—and that requires a different approach to transactions.

What is the Saga Pattern?

A saga is a sequence of local transactions where each transaction updates data within a single service and publishes an event or message to trigger the next step. If any step fails, the saga executes compensating transactions to undo the changes made by preceding steps.

Consider an e-commerce order flow: reserve inventory, charge payment, create shipment. In a saga, each step is a local ACID transaction. If payment fails after inventory is reserved, a compensating transaction releases that inventory.

The key insight is that you’re not rolling back—you’re moving forward with corrective actions. The inventory was reserved (that transaction committed), and now you’re applying a new transaction that unreserves it. This distinction matters because it changes how you think about failure handling.
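
One way to picture "moving forward" is an append-only ledger in which the compensation is just another committed entry. This is a sketch under assumptions: `InventoryLedger` and its entry format are invented for illustration and are not part of any service shown in this article.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class InventoryLedger:
    """Hypothetical append-only record of committed local transactions."""
    stock: int
    entries: List[Tuple[str, str, int]] = field(default_factory=list)

    def _total(self, kind: str, order_id: str) -> int:
        return sum(q for k, oid, q in self.entries if k == kind and oid == order_id)

    def reserve(self, order_id: str, qty: int) -> None:
        if qty > self.stock:
            raise ValueError("insufficient stock")
        self.stock -= qty
        self.entries.append(("RESERVED", order_id, qty))

    def release(self, order_id: str) -> None:
        # Compensation: append a new entry; the RESERVED entry is never erased.
        outstanding = self._total("RESERVED", order_id) - self._total("RELEASED", order_id)
        if outstanding > 0:
            self.stock += outstanding
            self.entries.append(("RELEASED", order_id, outstanding))


ledger = InventoryLedger(stock=10)
ledger.reserve("order-1", 3)  # step 1 commits
ledger.release("order-1")     # payment failed, so compensate
```

After both calls the stock is back to 10, but the ledger holds two entries: the history shows that the reservation happened and was later reversed.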

Choreography vs. Orchestration

There are two ways to coordinate saga steps: let services figure it out themselves (choreography) or have a central coordinator manage the flow (orchestration).

Choreography works through events. Each service listens for events it cares about, performs its local transaction, and emits new events. No service knows the full workflow—they just react to what happens.

# Order Service - initiates the saga
class OrderService:
    def __init__(self, event_bus):
        self.event_bus = event_bus
    
    def create_order(self, order_data):
        order = Order.create(order_data)
        order.status = "PENDING"
        order.save()
        
        self.event_bus.publish("order.created", {
            "order_id": order.id,
            "items": order.items,
            "customer_id": order.customer_id,
            "total_amount": order.total
        })
        return order

# Inventory Service - reacts to order creation
class InventoryService:
    def __init__(self, event_bus):
        self.event_bus = event_bus
        event_bus.subscribe("order.created", self.handle_order_created)
        event_bus.subscribe("payment.failed", self.handle_payment_failed)
    
    def handle_order_created(self, event):
        try:
            reservation = self.reserve_items(event["order_id"], event["items"])
            self.event_bus.publish("inventory.reserved", {
                "order_id": event["order_id"],
                "reservation_id": reservation.id
            })
        except InsufficientStockError:
            self.event_bus.publish("inventory.reservation_failed", {
                "order_id": event["order_id"],
                "reason": "insufficient_stock"
            })
    
    def handle_payment_failed(self, event):
        # Compensating transaction
        self.release_reservation(event["order_id"])

# Payment Service - reacts to inventory reservation
class PaymentService:
    def __init__(self, event_bus):
        self.event_bus = event_bus
        event_bus.subscribe("inventory.reserved", self.handle_inventory_reserved)
    
    def handle_inventory_reserved(self, event):
        try:
            payment = self.charge_customer(event["order_id"])
            self.event_bus.publish("payment.completed", {
                "order_id": event["order_id"],
                "payment_id": payment.id
            })
        except PaymentDeclinedError:
            self.event_bus.publish("payment.failed", {
                "order_id": event["order_id"]
            })

Choreography keeps services decoupled but makes the overall flow hard to understand. You have to trace events across multiple services to see what’s happening.
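
The services above assume an `event_bus` object with `publish` and `subscribe`. A minimal in-memory stand-in, sketched below, is enough to run the flow locally and also shows why tracing matters: the only place the whole saga is visible is the recorded event history. `InMemoryEventBus` and the lambda handlers are assumptions for illustration; a production system would use a real broker with asynchronous delivery.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class InMemoryEventBus:
    """Synchronous stand-in for a real broker; records every topic it sees."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Callable[[dict], None]]] = defaultdict(list)
        self.history: List[str] = []

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, payload: dict) -> None:
        self.history.append(topic)
        for handler in list(self._handlers[topic]):
            handler(payload)


bus = InMemoryEventBus()
released: List[str] = []

# Hypothetical handlers that mimic the services above without any real I/O.
bus.subscribe("order.created",
              lambda e: bus.publish("inventory.reserved", {"order_id": e["order_id"]}))
bus.subscribe("inventory.reserved",
              lambda e: bus.publish("payment.failed", {"order_id": e["order_id"]}))
bus.subscribe("payment.failed", lambda e: released.append(e["order_id"]))

bus.publish("order.created", {"order_id": "o-1"})
print(bus.history)  # ['order.created', 'inventory.reserved', 'payment.failed']
print(released)     # ['o-1']
```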

Orchestration centralizes the workflow logic. A dedicated orchestrator service knows the full saga definition and explicitly tells each service what to do.

Designing Compensating Transactions

Compensating transactions are where sagas get tricky. You’re not undoing a transaction—you’re applying a new one that semantically reverses the effect. This has important implications.

First, compensations must be idempotent. Network failures mean messages might be delivered multiple times. If your compensation runs twice, the result should be the same as running once.

class PaymentService:
    def charge_customer(self, order_id, amount, idempotency_key):
        # Check if we've already processed this charge
        existing = Payment.find_by_idempotency_key(idempotency_key)
        if existing:
            return existing
        
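        # Record the attempt as PENDING first and mark it COMPLETED only
        # after the gateway call succeeds, so a crash in between never
        # leaves a payment that claims to be completed but was not charged.
        payment = Payment.create(
            order_id=order_id,
            amount=amount,
            idempotency_key=idempotency_key,
            status="PENDING"
        )
        self.payment_gateway.charge(payment)
        payment.status = "COMPLETED"
        payment.save()
        return payment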
    
    def refund_payment(self, order_id, idempotency_key):
        refund_key = f"refund_{idempotency_key}"
        
        # Idempotent refund - check if already processed
        existing_refund = Refund.find_by_idempotency_key(refund_key)
        if existing_refund:
            return existing_refund
        
        payment = Payment.find_by_order(order_id)
        if not payment:
            # Nothing to refund - this is fine, not an error
            return None
        
        if payment.status == "REFUNDED":
            # Already refunded - idempotent success
            return payment.refund
        
        refund = Refund.create(
            payment_id=payment.id,
            amount=payment.amount,
            idempotency_key=refund_key
        )
        self.payment_gateway.refund(refund)
        payment.status = "REFUNDED"
        payment.save()
        
        return refund

Second, understand the difference between semantic and technical rollbacks. A technical rollback would somehow undo the credit card charge. A semantic rollback issues a refund—a new transaction that achieves the same business outcome. Sometimes semantic rollbacks aren’t perfect mirrors: a refund might take days to process, or you might not be able to un-send an email notification.

Third, some actions can’t be compensated. Once you’ve shipped a physical package, you can’t un-ship it. These are called pivot transactions—the point of no return. Design your saga so retriable, compensatable steps happen before pivot transactions.
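
A saga definition can be checked for this ordering before it ever runs. The sketch below assumes a hypothetical `PlannedStep` type with an `is_pivot` flag, and it makes two assumptions of its own: a saga has at most one pivot, and every step before the pivot needs a compensation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PlannedStep:
    name: str
    compensation: Optional[Callable[[dict], None]] = None
    is_pivot: bool = False


def validate_plan(steps: List[PlannedStep]) -> None:
    """Raise ValueError if a step before the pivot cannot be compensated."""
    pivot_indexes = [i for i, s in enumerate(steps) if s.is_pivot]
    if len(pivot_indexes) > 1:
        raise ValueError("this sketch assumes at most one pivot step")
    # With no pivot, every step is treated as "before the pivot".
    boundary = pivot_indexes[0] if pivot_indexes else len(steps)
    for step in steps[:boundary]:
        if step.compensation is None:
            raise ValueError(
                f"step '{step.name}' runs before the pivot but has no compensation"
            )
```

Running a check like this when the saga is constructed turns a design rule into a startup failure instead of a stuck order in production.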

Orchestrator Implementation

An orchestrator explicitly manages saga state and coordinates steps. This approach trades some coupling for better visibility and control.

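import logging
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List

logger = logging.getLogger(__name__)

class RetriableError(Exception):
    """Transient failure that is safe to retry (for example, a timeout)."""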

class SagaStatus(Enum):
    RUNNING = "running"
    COMPLETED = "completed"
    COMPENSATING = "compensating"
    FAILED = "failed"

@dataclass
class SagaStep:
    name: str
    action: Callable
    compensation: Callable

class SagaOrchestrator:
    def __init__(self, saga_id: str, steps: List[SagaStep], state_store):
        self.saga_id = saga_id
        self.steps = steps
        self.state_store = state_store
        self.current_step = 0
        self.status = SagaStatus.RUNNING
        self.context = {}
        self.completed_steps = []
    
    def execute(self, initial_context: dict):
        self.context = initial_context
        self._persist_state()
        
        while self.current_step < len(self.steps):
            step = self.steps[self.current_step]
            
            try:
                result = self._execute_with_retry(step.action, self.context)
                self.context.update(result or {})
                self.completed_steps.append(step.name)
                self.current_step += 1
                self._persist_state()
                
            except Exception as e:
                self._handle_failure(e)
                return False
        
        self.status = SagaStatus.COMPLETED
        self._persist_state()
        return True
    
    def _execute_with_retry(self, action: Callable, context: dict, max_retries=3):
        for attempt in range(max_retries):
            try:
                return action(context)
            except RetriableError:
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)  # Exponential backoff
    
    def _handle_failure(self, error: Exception):
        self.status = SagaStatus.COMPENSATING
        self._persist_state()
        
        # Compensate in reverse order
        for step_name in reversed(self.completed_steps):
            step = next(s for s in self.steps if s.name == step_name)
            try:
                step.compensation(self.context)
            except Exception as comp_error:
                # Log but continue - compensations should be idempotent
                # and can be retried later
                logger.error(f"Compensation failed for {step_name}: {comp_error}")
        
        self.status = SagaStatus.FAILED
        self._persist_state()
    
    def _persist_state(self):
        state = {
            "saga_id": self.saga_id,
            "status": self.status.value,
            "current_step": self.current_step,
            "completed_steps": self.completed_steps,
            "context": self.context
        }
        self.state_store.save(self.saga_id, state)

# Usage
def create_order_saga(order_data):
    steps = [
        SagaStep(
            name="reserve_inventory",
            action=lambda ctx: inventory_client.reserve(ctx["items"]),
            compensation=lambda ctx: inventory_client.release(ctx["reservation_id"])
        ),
        SagaStep(
            name="process_payment",
            action=lambda ctx: payment_client.charge(ctx["customer_id"], ctx["total"]),
            compensation=lambda ctx: payment_client.refund(ctx["payment_id"])
        ),
        SagaStep(
            name="create_shipment",
            action=lambda ctx: shipping_client.create(ctx["order_id"]),
            compensation=lambda ctx: shipping_client.cancel(ctx["shipment_id"])
        ),
    ]
    
    orchestrator = SagaOrchestrator(
        saga_id=f"order_{order_data['order_id']}",
        steps=steps,
        state_store=redis_state_store
    )
    return orchestrator.execute(order_data)

The orchestrator persists state after each step, enabling recovery if the orchestrator itself crashes. On restart, it can load state and resume from where it left off—either continuing forward or completing compensations.
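
The class above does not include the recovery path itself, so here is one possible shape for it. This is a sketch, assuming the persisted state has the same keys that `_persist_state` writes; `resume_saga` and the tuple-based `Step` alias are hypothetical simplifications, and error handling during recovery is left out.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (name, action, compensation): a simplified stand-in for SagaStep
Step = Tuple[str, Callable[[dict], Optional[dict]], Callable[[dict], None]]


def resume_saga(state: dict, steps: List[Step]) -> dict:
    """Continue a saga from persisted state after an orchestrator restart."""
    compensations: Dict[str, Callable[[dict], None]] = {
        name: comp for name, _, comp in steps
    }

    if state["status"] == "running":
        # Re-run from the first step not recorded as complete. That step may
        # have executed just before the crash, so its action must be idempotent.
        for name, action, _ in steps[state["current_step"]:]:
            state["context"].update(action(state["context"]) or {})
            state["completed_steps"].append(name)
            state["current_step"] += 1
        state["status"] = "completed"
    elif state["status"] == "compensating":
        # The state does not record which compensations already ran,
        # so all of them run again, in reverse order.
        for name in reversed(state["completed_steps"]):
            compensations[name](state["context"])
        state["status"] = "failed"
    return state
```

Both branches can repeat work that already happened before the crash, which is why the idempotency requirements discussed earlier apply to actions as well as compensations.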

Failure Handling and Edge Cases

Real-world sagas face challenges beyond the happy path.

Timeouts require careful handling. If a service doesn’t respond, has it failed or is it just slow? Track an explicit deadline for each step and treat the operation as failed if no response arrives in time. Compensation then handles cleanup, and idempotent operations limit the harm if the original request eventually succeeds.
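
A deadline can be enforced around any step with the standard library. The sketch below is one option, assuming steps are plain callables that take a context dict; `StepTimedOut` and `call_with_deadline` are hypothetical names. The worker thread is not cancelled, so the slow call may still finish later, which is exactly the situation described above.

```python
import concurrent.futures
import time
from typing import Callable

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


class StepTimedOut(Exception):
    """Raised when a step produces no result before its deadline."""


def call_with_deadline(action: Callable[[dict], dict], context: dict,
                       deadline_seconds: float) -> dict:
    future = _pool.submit(action, context)
    try:
        return future.result(timeout=deadline_seconds)
    except concurrent.futures.TimeoutError:
        # The action keeps running in the background and may still succeed.
        raise StepTimedOut(f"no response within {deadline_seconds}s") from None


def slow_reserve(ctx: dict) -> dict:
    time.sleep(0.2)  # simulates a slow downstream service
    return {"reservation_id": "r-1"}
```

An orchestrator would catch `StepTimedOut` the same way it catches any other step failure and start compensating.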

Duplicate messages are inevitable in distributed systems. Every action and compensation must be idempotent. Use idempotency keys, check for existing records before creating new ones, and design operations to be safely repeatable.
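
A common way to apply this on the consuming side is to remember which message IDs have been handled. The wrapper below is a minimal sketch: `IdempotentConsumer` is a hypothetical name, and the in-memory dict stands in for what would need to be a durable table updated in the same local transaction as the handler's work.

```python
from typing import Any, Callable, Dict


class IdempotentConsumer:
    """Runs a handler at most once per message ID and replays the result."""

    def __init__(self, handler: Callable[[dict], Any]) -> None:
        self._handler = handler
        self._results: Dict[str, Any] = {}  # message_id -> stored result

    def handle(self, message_id: str, payload: dict) -> Any:
        if message_id in self._results:
            # Duplicate delivery: return the earlier result, do no new work.
            return self._results[message_id]
        result = self._handler(payload)
        self._results[message_id] = result
        return result


charges = []
consumer = IdempotentConsumer(lambda p: charges.append(p["order_id"]) or "charged")
consumer.handle("msg-1", {"order_id": "o-1"})
consumer.handle("msg-1", {"order_id": "o-1"})  # redelivered by the broker
print(charges)  # ['o-1']
```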

Observability is critical. Log saga state transitions with correlation IDs. Emit metrics on saga duration, failure rates, and compensation frequency. Build dashboards that show in-flight sagas and their current states. When something goes wrong at 3 AM, you need to quickly understand which sagas are stuck and why.
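
As a starting point, the standard `logging` module can carry the saga ID on every line through the `extra` argument. This is a sketch, assuming the saga ID doubles as the correlation ID; `build_saga_logger`, `log_transition`, and the `transition_counts` counter are hypothetical helpers, and the counter stands in for a real metrics client.

```python
import logging
from collections import Counter
from typing import TextIO

transition_counts: Counter = Counter()  # stand-in for a real metrics client


def build_saga_logger(stream: TextIO) -> logging.Logger:
    logger = logging.getLogger("saga.transitions")
    handler = logging.StreamHandler(stream)
    # saga_id is supplied per call through the `extra` argument below
    handler.setFormatter(
        logging.Formatter("%(levelname)s saga_id=%(saga_id)s %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_transition(logger: logging.Logger, saga_id: str, old: str, new: str) -> None:
    transition_counts[f"saga.transition.{new}"] += 1
    logger.info("status %s -> %s", old, new, extra={"saga_id": saga_id})
```

Calling `log_transition(logger, "order_42", "running", "compensating")` writes a line such as `INFO saga_id=order_42 status running -> compensating`, which can then be searched by saga ID across services that log the same field.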

When to Use (and Avoid) Sagas

Sagas add complexity. Use them when you genuinely need distributed transactions across services with separate data stores.

Use sagas when:

  • Operations span multiple services that own their data
  • You can tolerate eventual consistency
  • You can define meaningful compensating actions
  • The business process naturally fits a sequential or parallel flow

Avoid sagas when:

  • A single database transaction would work (don’t over-engineer)
  • You need strong consistency guarantees (consider a different architecture)
  • Compensations would be too complex or impossible
  • The workflow has too many branches and conditions

Consider alternatives: sometimes you can restructure service boundaries to keep transactions local. Event sourcing with projections can provide consistency without distributed transactions. For simple cases, outbox patterns with eventual consistency might suffice.
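
For reference, the outbox idea mentioned above can be sketched with `sqlite3`: the business row and the event row are written in one local transaction, and a separate relay publishes pending events later. Table names, column names, and the `relay_once` helper are assumptions for illustration, not a prescribed schema.

```python
import json
import sqlite3
from typing import Callable


def init_db(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "topic TEXT NOT NULL, payload TEXT NOT NULL, "
        "published INTEGER NOT NULL DEFAULT 0)"
    )


def create_order(conn: sqlite3.Connection, order_id: str) -> None:
    # `with conn` commits both inserts together or rolls both back.
    with conn:
        conn.execute("INSERT INTO orders (id, status) VALUES (?, ?)",
                     (order_id, "PENDING"))
        conn.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
                     ("order.created", json.dumps({"order_id": order_id})))


def relay_once(conn: sqlite3.Connection, publish: Callable[[str, dict], None]) -> int:
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))
        # A crash before this update means the event is published again
        # on the next run, so consumers still need to be idempotent.
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

The relay gives at-least-once delivery, so it complements the idempotency techniques covered earlier and does not replace them.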

Sagas are a powerful tool, but they’re not free. The complexity of managing compensations, ensuring idempotency, and debugging distributed flows is real. Make sure the problem justifies the solution.
