Saga Pattern: Long-Running Transaction Coordination

Key Insights

  • Sagas replace distributed ACID transactions with a sequence of local transactions, each paired with a compensating action that semantically undoes its effects when failures occur downstream.
  • Choreography-based sagas offer loose coupling through event-driven communication, while orchestration-based sagas provide explicit flow control and easier debugging at the cost of a central coordinator.
  • Idempotency is non-negotiable for saga steps and compensations—without it, retries and failure recovery will corrupt your system state.

The Distributed Transaction Problem

Traditional ACID transactions work beautifully within a single database. You start a transaction, make changes across multiple tables, and either commit everything or roll it all back. The database handles the complexity.

Distribute your system across multiple services with their own databases, and this model falls apart. Two-phase commit (2PC) can coordinate a commit across those boundaries, but it is fragile: participants hold locks while they wait on network round trips, and if the coordinator or any participant is slow or unavailable, everyone else blocks until it recovers. In practice, 2PC scales poorly across independently deployed services.

The Saga pattern, introduced by Hector Garcia-Molina and Kenneth Salem in their 1987 paper, offers an alternative. Instead of one atomic transaction, you execute a sequence of local transactions. Each transaction commits immediately. If something fails partway through, you execute compensating transactions to undo the work.

You need sagas when business processes span multiple services: e-commerce orders that touch inventory, payments, and shipping; travel bookings coordinating flights, hotels, and car rentals; financial transfers moving money between accounts at different institutions.

Saga Fundamentals: Compensating Transactions

A saga consists of transaction/compensation pairs. For each step Ti, you define a compensating transaction Ci that semantically reverses Ti’s effects. “Semantically” matters here—you can’t always technically undo something. A sent email can’t be unsent. A shipped package can’t be unshipped. But you can send a cancellation email or arrange a return.

Consider an order processing saga:

from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional

class SagaStepStatus(Enum):
    PENDING = "pending"
    COMPLETED = "completed"
    COMPENSATED = "compensated"
    FAILED = "failed"

@dataclass
class SagaStep:
    name: str
    action: Callable[[dict], dict]
    compensation: Callable[[dict], None]
    status: SagaStepStatus = SagaStepStatus.PENDING

class OrderSaga:
    def __init__(self, order_id: str):
        self.order_id = order_id
        self.context = {"order_id": order_id}
        self.steps: List[SagaStep] = [
            SagaStep(
                name="create_order",
                action=self._create_order,
                compensation=self._cancel_order
            ),
            SagaStep(
                name="reserve_inventory",
                action=self._reserve_inventory,
                compensation=self._release_inventory
            ),
            SagaStep(
                name="process_payment",
                action=self._process_payment,
                compensation=self._refund_payment
            ),
            SagaStep(
                name="confirm_order",
                action=self._confirm_order,
                compensation=self._void_confirmation
            ),
        ]
    
    def execute(self) -> bool:
        completed_steps = []
        
        for step in self.steps:
            try:
                self.context = step.action(self.context)
                step.status = SagaStepStatus.COMPLETED
                completed_steps.append(step)
            except Exception as e:
                print(f"Step {step.name} failed: {e}")
                step.status = SagaStepStatus.FAILED
                self._compensate(completed_steps)
                return False
        
        return True
    
    def _compensate(self, completed_steps: List[SagaStep]):
        for step in reversed(completed_steps):
            try:
                step.compensation(self.context)
                step.status = SagaStepStatus.COMPENSATED
            except Exception as e:
                print(f"Compensation {step.name} failed: {e}")
                # Alert operations team for manual intervention
                self._alert_manual_intervention(step, e)

    # Not shown: the step methods wired up in __init__ (_create_order,
    # _cancel_order, and so on) and _alert_manual_intervention. Each action
    # takes the context dict and returns the updated dict.

This illustrates backward recovery: when a step fails, we compensate the completed steps in reverse order. Forward recovery is the alternative: you retry the failed step until it succeeds. It fits steps that are idempotent and fail only transiently, so that success is expected eventually.
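Forward recovery can be sketched as a bounded retry loop with exponential backoff. This is a minimal illustration, not part of the OrderSaga class above: the names run_with_forward_recovery and TransientError, the attempt limit, and the delays are all assumptions, and the step being retried is assumed to be idempotent.

```python
import time
from typing import Callable


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def run_with_forward_recovery(
    action: Callable[[dict], dict],
    context: dict,
    max_attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Retry an idempotent step until it succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action(context)
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: the caller can fall back to compensation
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Only errors marked as transient are retried; anything else propagates immediately so that the saga can switch to backward recovery.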

Choreography-Based Sagas

In choreography, no central coordinator exists. Services publish domain events, and other services react to them. The saga emerges from the collective behavior of independent services.

# order_service.py
from kafka import KafkaProducer, KafkaConsumer
import json

class OrderService:
    def __init__(self):
        self.producer = KafkaProducer(
            bootstrap_servers=['localhost:9092'],
            value_serializer=lambda v: json.dumps(v).encode('utf-8')
        )
        self.consumer = KafkaConsumer(
            'payment-events', 'inventory-events',
            bootstrap_servers=['localhost:9092'],
            value_deserializer=lambda m: json.loads(m.decode('utf-8'))
        )
    
    def create_order(self, order_data: dict) -> str:
        order_id = self._persist_order(order_data, status="PENDING")
        
        self.producer.send('order-events', {
            'type': 'OrderCreated',
            'order_id': order_id,
            'items': order_data['items'],
            'customer_id': order_data['customer_id'],
            'total': order_data['total']
        })
        
        return order_id
    
    def handle_events(self):
        for message in self.consumer:
            event = message.value
            
            if event['type'] == 'InventoryReserved':
                self._update_order_status(event['order_id'], "INVENTORY_RESERVED")
                
            elif event['type'] == 'InventoryReservationFailed':
                self._update_order_status(event['order_id'], "CANCELLED")
                self.producer.send('order-events', {
                    'type': 'OrderCancelled',
                    'order_id': event['order_id'],
                    'reason': 'inventory_unavailable'
                })
                
            elif event['type'] == 'PaymentProcessed':
                self._update_order_status(event['order_id'], "CONFIRMED")
                self.producer.send('order-events', {
                    'type': 'OrderConfirmed',
                    'order_id': event['order_id']
                })
                
            elif event['type'] == 'PaymentFailed':
                self._update_order_status(event['order_id'], "CANCELLED")
                self.producer.send('order-events', {
                    'type': 'OrderCancelled',
                    'order_id': event['order_id'],
                    'reason': 'payment_failed'
                })
                # Inventory service will react to OrderCancelled

Choreography shines when you have genuinely independent services that shouldn’t know about each other. The inventory service doesn’t need to know about payments—it just reacts to order events.
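The inventory side of this choreography might look like the sketch below. It is an assumption-laden illustration: the class name InventoryEventHandler, the item shape (sku and qty keys, one line per SKU), and the publish callback (which could wrap producer.send) are hypothetical and do not come from the order service code above. Only the event names match.

```python
from typing import Callable, Dict, List


class InventoryEventHandler:
    """Hypothetical inventory-side participant in the choreography."""

    def __init__(self, stock: Dict[str, int],
                 publish: Callable[[str, dict], None]):
        self.stock = stock
        self.publish = publish  # e.g. a thin wrapper around producer.send
        self.reservations: Dict[str, List[dict]] = {}  # keyed by order_id

    def handle(self, event: dict) -> None:
        if event["type"] == "OrderCreated":
            self._reserve(event)
        elif event["type"] == "OrderCancelled":
            self._release(event["order_id"])

    def _reserve(self, event: dict) -> None:
        order_id, items = event["order_id"], event["items"]
        if order_id in self.reservations:
            return  # duplicate delivery: this order is already reserved
        # Assumes one line per SKU within an order
        if any(self.stock.get(i["sku"], 0) < i["qty"] for i in items):
            self.publish("inventory-events", {
                "type": "InventoryReservationFailed", "order_id": order_id})
            return
        for i in items:
            self.stock[i["sku"]] -= i["qty"]
        self.reservations[order_id] = items
        self.publish("inventory-events", {
            "type": "InventoryReserved", "order_id": order_id})

    def _release(self, order_id: str) -> None:
        # pop() makes the compensation idempotent: a second call is a no-op
        for i in self.reservations.pop(order_id, []):
            self.stock[i["sku"]] += i["qty"]
```

The handler never mentions payments. It reserves on OrderCreated and releases on OrderCancelled, whatever the reason for the cancellation.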

The downsides are real though. Tracing a saga’s progress requires aggregating events across services. Cyclic event dependencies can create infinite loops. Adding a new step means modifying multiple services. Debugging production issues becomes archaeology.

Orchestration-Based Sagas

Orchestration centralizes saga logic in a coordinator. The orchestrator knows the complete flow and tells each service what to do.

from enum import Enum
from typing import Dict, Any, Optional
import asyncio

class SagaState(Enum):
    STARTED = "started"
    ORDER_CREATED = "order_created"
    INVENTORY_RESERVED = "inventory_reserved"
    PAYMENT_PROCESSED = "payment_processed"
    COMPLETED = "completed"
    COMPENSATING = "compensating"
    FAILED = "failed"

class OrderSagaOrchestrator:
    def __init__(self, order_service, inventory_service, payment_service):
        self.order_service = order_service
        self.inventory_service = inventory_service
        self.payment_service = payment_service
        self.state_store = {}  # In production: persistent store
    
    async def execute(self, saga_id: str, order_data: dict) -> bool:
        self._save_state(saga_id, SagaState.STARTED, order_data)
        
        try:
            # Step 1: Create order
            order_id = await self.order_service.create(order_data)
            self._save_state(saga_id, SagaState.ORDER_CREATED, 
                           {"order_id": order_id, **order_data})
            
            # Step 2: Reserve inventory
            reservation_id = await self.inventory_service.reserve(
                order_id, order_data['items']
            )
            self._save_state(saga_id, SagaState.INVENTORY_RESERVED,
                           {"reservation_id": reservation_id})
            
            # Step 3: Process payment
            payment_id = await self.payment_service.charge(
                f"{saga_id}:process_payment",  # idempotency key: saga ID plus step name
                order_data['customer_id'], order_data['total']
            )
            self._save_state(saga_id, SagaState.PAYMENT_PROCESSED,
                           {"payment_id": payment_id})
            
            # Step 4: Confirm everything
            await self.order_service.confirm(order_id)
            self._save_state(saga_id, SagaState.COMPLETED, {})
            
            return True
            
        except Exception as e:
            await self._compensate(saga_id)
            return False
    
    def _save_state(self, saga_id: str, state: SagaState, context: dict):
        # Merge new keys into the stored context so that IDs recorded by
        # earlier steps (order_id, reservation_id) survive later saves
        previous = self.state_store.get(saga_id, {}).get('context', {})
        self.state_store[saga_id] = {
            'state': state,
            'context': {**previous, **context},
        }

    async def _compensate(self, saga_id: str):
        context = self.state_store.get(saga_id, {}).get('context', {})
        self._save_state(saga_id, SagaState.COMPENSATING, context)

        # Compensate in reverse order. Each ID is in the context only if
        # its step completed, so the context shows how far we got.
        if 'payment_id' in context:
            await self.payment_service.refund(context['payment_id'])

        if 'reservation_id' in context:
            await self.inventory_service.release(context['reservation_id'])

        if 'order_id' in context:
            await self.order_service.cancel(context['order_id'])

        self._save_state(saga_id, SagaState.FAILED, context)

Orchestration makes the saga flow explicit. You can look at one class and understand the entire process. Debugging is straightforward—check the orchestrator’s state. Adding steps means modifying one place.

The trade-off is coupling. The orchestrator knows about all participating services. It’s also a potential single point of failure, though you mitigate this with proper persistence and recovery.
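Recovery after an orchestrator crash can be sketched as a startup scan over the persisted saga records. The function names, the record layout, and the choice to roll back (rather than resume) unfinished sagas are assumptions made for illustration; the sketch also assumes each record stores the state as its string value.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# String values of the terminal SagaState members
TERMINAL_STATES = {"completed", "failed"}


def find_unfinished(state_store: Dict[str, dict]) -> List[str]:
    """Return the IDs of sagas whose last persisted state is not terminal."""
    return sorted(
        saga_id for saga_id, record in state_store.items()
        if record["state"] not in TERMINAL_STATES
    )


async def recover_on_startup(
    state_store: Dict[str, dict],
    compensate: Callable[[str], Awaitable[None]],
) -> List[str]:
    """Roll back every saga that a crashed orchestrator left half-done."""
    unfinished = find_unfinished(state_store)
    for saga_id in unfinished:
        # The compensation may already have started before the crash,
        # so it has to be idempotent.
        await compensate(saga_id)
    return unfinished
```

A new orchestrator instance would call recover_on_startup once, before accepting new sagas, passing its own compensation routine.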

Handling Failures and Edge Cases

Idempotency is the foundation of reliable sagas. Network failures mean you’ll retry operations. Services might process the same request multiple times. Without idempotency, you’ll double-charge customers or reserve inventory twice.

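class PaymentGatewayError(Exception):
    """Placeholder: substitute the error type your gateway client raises."""

class PaymentService: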
    def __init__(self, db, payment_gateway):
        self.db = db
        self.gateway = payment_gateway
    
    async def charge(self, idempotency_key: str, customer_id: str, 
                     amount: float) -> str:
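        # Return the earlier result only if that attempt actually completed
        payment_record = await self.db.get_payment_by_idempotency_key(idempotency_key)
        if payment_record and payment_record.status == "COMPLETED":
            return payment_record.id

        # Record intent before calling the external service. A PENDING or
        # FAILED record from an earlier attempt is retried under the same key.
        if payment_record is None:
            payment_record = await self.db.create_payment_record(
                idempotency_key=idempotency_key,
                customer_id=customer_id,
                amount=amount,
                status="PENDING"
            )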
        
        try:
            # Call payment gateway with our idempotency key
            result = await self.gateway.charge(
                idempotency_key=idempotency_key,
                customer_id=customer_id,
                amount=amount
            )
            
            await self.db.update_payment_status(
                payment_record.id, "COMPLETED", 
                gateway_id=result.transaction_id
            )
            
            return payment_record.id
            
        except PaymentGatewayError as e:
            await self.db.update_payment_status(
                payment_record.id, "FAILED",
                error=str(e)
            )
            raise

The idempotency key (often the saga ID plus step name) ensures that retrying a charge doesn’t create duplicate transactions. Many payment gateways, Stripe among them, accept idempotency keys natively. Use them.
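One way to derive such a key is to hash the saga ID together with the step name. The helper below is a hypothetical sketch; the name make_idempotency_key and the choice of SHA-256 are assumptions, and a plain concatenated string would work just as well where key length is not a concern.

```python
import hashlib


def make_idempotency_key(saga_id: str, step_name: str) -> str:
    """Derive a stable, fixed-length key for one step of one saga."""
    raw = f"{saga_id}:{step_name}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


# The same saga and step always map to the same key, so a retry reuses it
first_try = make_idempotency_key("saga-42", "process_payment")
retry = make_idempotency_key("saga-42", "process_payment")
refund = make_idempotency_key("saga-42", "refund_payment")
```

Because the key is derived rather than generated randomly, a restarted orchestrator can recompute it without having stored it.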

Compensation failures require human intervention. You can’t always automatically recover. Build alerting into your compensation logic and provide operations teams with tools to manually complete compensations.
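A minimal sketch of that escalation path follows, assuming a bounded number of automatic attempts before a human takes over. ManualInterventionQueue and compensate_or_escalate are hypothetical names; in practice the queue would be a ticketing or paging integration rather than an in-memory list.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ManualInterventionQueue:
    """Hypothetical stand-in for a ticketing or paging integration."""
    items: List[dict] = field(default_factory=list)

    def add(self, saga_id: str, step_name: str, error: str) -> None:
        self.items.append(
            {"saga_id": saga_id, "step": step_name, "error": error}
        )


def compensate_or_escalate(
    saga_id: str,
    step_name: str,
    compensation: Callable[[dict], None],
    context: dict,
    queue: ManualInterventionQueue,
    max_attempts: int = 3,
) -> bool:
    """Try a compensation a few times, then hand it to a human."""
    last_error = "not attempted"
    for _ in range(max_attempts):
        try:
            compensation(context)
            return True
        except Exception as exc:  # broad on purpose: any failure escalates
            last_error = str(exc)
    # Keep enough detail for an operator to finish the compensation by hand
    queue.add(saga_id, step_name, last_error)
    return False
```

Returning False instead of raising lets the caller continue compensating the remaining steps while the failed one waits for an operator.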

Implementation Patterns and Frameworks

Building sagas from scratch teaches you the concepts, but production systems benefit from frameworks that handle the infrastructure.

Temporal (a fork of Uber’s Cadence, built by Cadence’s original creators) models sagas as workflows with automatic state persistence and retry handling:

from temporalio import workflow, activity
from datetime import timedelta

@activity.defn
async def create_order(order_data: dict) -> str:
    # Implementation
    pass

@activity.defn  
async def reserve_inventory(order_id: str, items: list) -> str:
    pass

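@activity.defn
async def release_inventory(reservation_id: str) -> None:
    pass

@activity.defn
async def cancel_order(order_id: str) -> None:
    # Compensation for create_order, called by the workflow below
    pass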

@workflow.defn
class OrderSagaWorkflow:
    @workflow.run
    async def run(self, order_data: dict) -> str:
        order_id = await workflow.execute_activity(
            create_order,
            order_data,
            start_to_close_timeout=timedelta(seconds=30),
        )
        
        try:
            reservation_id = await workflow.execute_activity(
                reserve_inventory,
                args=[order_id, order_data['items']],
                start_to_close_timeout=timedelta(seconds=30),
            )
        except Exception:
            await workflow.execute_activity(
                cancel_order, order_id,
                start_to_close_timeout=timedelta(seconds=30),
            )
            raise
        
        # Continue with payment, confirmation...
        return order_id

Temporal persists workflow state and applies the retry and timeout policies you configure; the compensation logic is still yours to write. Your workflow code looks like normal async Python, but it survives process restarts and infrastructure failures.

Practical Considerations and Trade-offs

Sagas mean eventual consistency. Between saga start and completion, your system is in an intermediate state. Users might see an order as “processing” for seconds or minutes. Design your UX around this reality—show pending states, send confirmation emails only after completion, and handle the “what if I refresh the page?” scenario.

Testing sagas requires simulating failures at each step. Unit test individual steps and compensations. Integration test the happy path. Chaos test by injecting failures randomly. Property-based testing can verify that any failure sequence results in a consistent final state.
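The failure-injection idea can be sketched with a tiny in-memory saga runner and a test that fails each step in turn. Everything here (run_saga, build_steps, the log format) is a simplified stand-in for illustration rather than the OrderSaga class shown earlier. The property it checks is that only the completed steps are compensated, and in reverse order.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], None], Callable[[], None]]

NAMES = ["create_order", "reserve_inventory", "process_payment", "confirm_order"]


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    done: List[Step] = []
    for step in steps:
        try:
            step[1]()
            done.append(step)
        except Exception:
            for _, _, compensate in reversed(done):
                compensate()
            return False
    return True


def build_steps(names: List[str], fail_at: Optional[str], log: List[str]) -> List[Step]:
    """Build steps that record what they do and fail at the chosen step."""
    def make(name: str) -> Step:
        def action() -> None:
            if name == fail_at:
                raise RuntimeError(f"injected failure in {name}")
            log.append(f"do:{name}")

        def compensate() -> None:
            log.append(f"undo:{name}")

        return (name, action, compensate)

    return [make(n) for n in names]


def test_failure_at_every_step() -> None:
    for index, fail_at in enumerate(NAMES):
        log: List[str] = []
        assert run_saga(build_steps(NAMES, fail_at, log)) is False
        completed = NAMES[:index]
        expected = [f"do:{n}" for n in completed]
        expected += [f"undo:{n}" for n in reversed(completed)]
        assert log == expected
```

The same loop extends naturally to compensations that fail, or to a property-based test that picks the failure point at random.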

Observability is critical. Assign correlation IDs to sagas and propagate them through all service calls. Build dashboards showing saga states, completion rates, and compensation frequencies. Alert on sagas stuck in intermediate states.
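Detecting stuck sagas can be as simple as a periodic query over the saga records. The sketch below assumes each record carries a saga_id, a state string, and a timezone-aware updated_at timestamp; the record layout, the function name, and the five-minute threshold are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List

TERMINAL_STATES = {"completed", "failed"}


def find_stuck_sagas(
    records: Iterable[dict],
    now: datetime,
    max_age: timedelta = timedelta(minutes=5),
) -> List[str]:
    """Return IDs of sagas that sat in a non-terminal state for too long."""
    return [
        r["saga_id"] for r in records
        if r["state"] not in TERMINAL_STATES
        and now - r["updated_at"] > max_age
    ]


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
records = [
    {"saga_id": "s1", "state": "inventory_reserved",
     "updated_at": now - timedelta(minutes=10)},
    {"saga_id": "s2", "state": "completed",
     "updated_at": now - timedelta(hours=1)},
    {"saga_id": "s3", "state": "started",
     "updated_at": now - timedelta(minutes=2)},
]
stuck = find_stuck_sagas(records, now)
print(stuck)  # ['s1']
```

A scheduler would run this every minute or so and raise an alert for each ID returned.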

Sometimes you don’t need sagas. If operations naturally fit in one service, use a local transaction. If you can tolerate eventual consistency without compensations, simple event-driven updates might suffice. If failures are rare and manual correction is acceptable, maybe you just need good monitoring and a runbook. Sagas add complexity—use them when that complexity is justified.
